Krotos Modules 3
krotos::BM25 Class Reference

ATIRE BM25 ranking function. This variant does not produce negative IDF values. Reference: Trotman, A., Jia, X., & Crane, M. (2012, August). Towards an Efficient and Effective Search Engine. In OSIR@SIGIR (pp. 40-47).

#include <BM25.h>

Public Member Functions

 BM25 (const std::vector< String > &corpus)
 Construct a BM25 ranker from a corpus of text documents.
 
std::vector< float > getScores (String query)
 Calculate BM25 scores between query and all documents in corpus.
 
std::vector< float > getBatchScores (String query, const std::vector< std::size_t > &ids)
 Calculate BM25 scores between query and a subset of documents in corpus.
 

Private Member Functions

void initialise (const std::vector< String > &corpus)
 Initialise the BM25 ranking algorithm with a corpus.
 
StringArray tokenize (const String &text)
 Apply tokenization to a text.
 
String stemmer (String word)
 Apply stemming to a word.
 

Private Attributes

const float m_k1 {1.5f}
 
const float m_b {0.75f}
 
const float m_epsilon {0.25f}
 
float m_averageDocumentLength {1.0f}
 
int m_corpusSize {0}
 
float m_averageInverseDocumentFrequency {0.0f}
 
std::unordered_map< String, std::unordered_map< int, int > > m_termToDocument
 
std::unordered_map< String, float > m_inverseDocumentFrequency
 
std::vector< int > m_documentLength
 
StringArray m_stopwords
 

Detailed Description

ATIRE BM25 ranking function. This variant does not produce negative IDF values. Reference: Trotman, A., Jia, X., & Crane, M. (2012, August). Towards an Efficient and Effective Search Engine. In OSIR@SIGIR (pp. 40-47).
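
As an illustration of why this variant avoids negative IDF values, the sketch below writes out the ATIRE BM25 term score as two free functions. The function names are hypothetical and are not part of krotos::BM25; the defaults for k1 and b are taken from m_k1 and m_b. How m_epsilon and m_averageInverseDocumentFrequency enter the class's own computation is not documented on this page, so they are left out here.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helpers for illustration only; not the krotos implementation.

// ATIRE IDF: log(N / df). Because 1 <= df <= N, the ratio is >= 1
// and the logarithm can never be negative.
inline float atireIdf(std::size_t corpusSize, std::size_t documentFrequency)
{
    return std::log(static_cast<float>(corpusSize) /
                    static_cast<float>(documentFrequency));
}

// Contribution of a single query term to a single document's score:
//   idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len / avgLen))
inline float atireTermScore(float idf, float termFrequency,
                            float documentLength, float averageDocumentLength,
                            float k1 = 1.5f, float b = 0.75f)
{
    const float lengthNorm =
        1.0f - b + b * (documentLength / averageDocumentLength);
    return idf * (termFrequency * (k1 + 1.0f)) /
           (termFrequency + k1 * lengthNorm);
}
```

A document's score for a query would then be the sum of atireTermScore over the query terms that occur in that document.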

Constructor & Destructor Documentation

◆ BM25()

krotos::BM25::BM25 ( const std::vector< String > & corpus)

Construct a BM25 ranker from a corpus of text documents.

Parameters
corpus    A vector of text 'documents'

Member Function Documentation

◆ getBatchScores()

std::vector< float > krotos::BM25::getBatchScores ( String query,
const std::vector< std::size_t > & ids )

Calculate BM25 scores between query and a subset of documents in corpus.

Parameters
query    The text query used to score documents
ids      The integer ids of the documents to get scores for
Returns
BM25 scores
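
The sketch below shows one way scoring a subset of documents could work. It is not the krotos implementation: IndexSketch and batchScoresSketch are hypothetical names, the query is assumed to be tokenized already, and the result is assumed to hold one score per entry of ids, in the same order (this page does not state how the returned vector is laid out).

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical index layout, loosely mirroring the private attributes on this page.
struct IndexSketch
{
    // term -> (document id -> occurrences in that document); assumed meaning
    std::unordered_map<std::string, std::unordered_map<int, int>> termToDocument;
    std::vector<int> documentLength; // tokens per document
    float averageDocumentLength = 1.0f;
};

std::vector<float> batchScoresSketch(const IndexSketch& index,
                                     const std::vector<std::string>& queryTerms,
                                     const std::vector<std::size_t>& ids,
                                     float k1 = 1.5f, float b = 0.75f)
{
    std::vector<float> scores(ids.size(), 0.0f);
    const float corpusSize = static_cast<float>(index.documentLength.size());

    for (const auto& term : queryTerms)
    {
        const auto postings = index.termToDocument.find(term);
        if (postings == index.termToDocument.end() || postings->second.empty())
            continue; // term does not occur anywhere in the corpus

        const float idf =
            std::log(corpusSize / static_cast<float>(postings->second.size()));

        for (std::size_t i = 0; i < ids.size(); ++i)
        {
            if (ids[i] >= index.documentLength.size())
                continue; // unknown id keeps a score of 0

            const auto hit = postings->second.find(static_cast<int>(ids[i]));
            if (hit == postings->second.end())
                continue; // term does not occur in this document

            const float tf = static_cast<float>(hit->second);
            const float len = static_cast<float>(index.documentLength[ids[i]]);
            const float norm = 1.0f - b + b * (len / index.averageDocumentLength);
            scores[i] += idf * (tf * (k1 + 1.0f)) / (tf + k1 * norm);
        }
    }
    return scores;
}
```

Only the requested ids are visited for each query term, so the cost grows with the size of the subset rather than with the size of the corpus.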

◆ getScores()

std::vector< float > krotos::BM25::getScores ( String query)

Calculate BM25 scores between query and all documents in corpus.

Parameters
query    The text query used to score documents
Returns
BM25 scores
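
A typical next step is to turn the returned scores into a ranking. The helper below is a hypothetical sketch, not part of krotos::BM25, and it assumes that the score at index i belongs to the i-th document of the corpus passed to the constructor, which this page does not state explicitly.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: indices of the k highest scores, best first.
std::vector<std::size_t> topK(const std::vector<float>& scores, std::size_t k)
{
    std::vector<std::size_t> order(scores.size());
    std::iota(order.begin(), order.end(), std::size_t{0}); // 0, 1, 2, ...

    k = std::min(k, order.size());
    // Only the first k positions need to be fully sorted.
    std::partial_sort(order.begin(),
                      order.begin() + static_cast<std::ptrdiff_t>(k),
                      order.end(),
                      [&scores](std::size_t a, std::size_t b)
                      { return scores[a] > scores[b]; });
    order.resize(k);
    return order;
}
```

With scores obtained from getScores, topK(scores, 10) would give the positions of the ten best-matching documents under the alignment assumption above.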

◆ initialise()

void krotos::BM25::initialise ( const std::vector< String > & corpus)
private

Initialise the BM25 ranking algorithm with a corpus.

Parameters
corpus    A vector of text 'documents'

◆ stemmer()

String krotos::BM25::stemmer ( String word)
private

Apply stemming to a word.

Parameters
word    The word to stem
Returns
The stemmed word

◆ tokenize()

StringArray krotos::BM25::tokenize ( const String & text)
private

Apply tokenization to a text.

Parameters
text    The text to tokenize
Returns
The tokenized text
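
The tokenization rules used by the class are not described on this page. As a rough illustration only, the sketch below lowercases the text, splits on any character that is neither alphanumeric nor an apostrophe, and drops tokens found in a stopword set. The function name, the splitting rule, and the use of std::string instead of String are all assumptions.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical tokenizer sketch; not the krotos implementation.
std::vector<std::string> tokenizeSketch(const std::string& text,
                                        const std::unordered_set<std::string>& stopwords)
{
    std::vector<std::string> tokens;
    std::string current;

    // Emit the token collected so far unless it is empty or a stopword.
    auto flush = [&]()
    {
        if (!current.empty() && stopwords.count(current) == 0)
            tokens.push_back(current);
        current.clear();
    };

    for (unsigned char c : text)
    {
        if (std::isalnum(c) || c == '\'')
            current += static_cast<char>(std::tolower(c)); // normalise case
        else
            flush(); // any other character ends the current token
    }
    flush(); // last token, if the text did not end with a separator
    return tokens;
}
```

Keeping apostrophes inside tokens lets contractions such as "doesn't" match the corresponding entries in a stopword list like m_stopwords.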

Member Data Documentation

◆ m_averageDocumentLength

float krotos::BM25::m_averageDocumentLength {1.0f}
private

◆ m_averageInverseDocumentFrequency

float krotos::BM25::m_averageInverseDocumentFrequency {0.0f}
private

◆ m_b

const float krotos::BM25::m_b {0.75f}
private

◆ m_corpusSize

int krotos::BM25::m_corpusSize {0}
private

◆ m_documentLength

std::vector<int> krotos::BM25::m_documentLength
private

◆ m_epsilon

const float krotos::BM25::m_epsilon {0.25f}
private

◆ m_inverseDocumentFrequency

std::unordered_map<String, float> krotos::BM25::m_inverseDocumentFrequency
private

◆ m_k1

const float krotos::BM25::m_k1 {1.5f}
private

◆ m_stopwords

StringArray krotos::BM25::m_stopwords
private
Initial value:
= {
"i", "me", "my", "myself", "we", "our", "ours", "ourselves",
"you", "you\'re", "you\'ve", "you\'ll", "you\'d", "your", "yours", "yourself",
"yourselves", "he", "him", "his", "himself", "she", "she\'s", "her",
"hers", "herself", "it", "it\'s", "its", "itself", "they", "them",
"their", "theirs", "themselves", "what", "which", "who", "whom", "this",
"that", "that\'ll", "these", "those", "am", "is", "are", "was",
"were", "be", "been", "being", "have", "has", "had", "having",
"do", "does", "did", "doing", "a", "an", "the", "and",
"but", "if", "or", "because", "as", "until", "while", "of",
"at", "by", "for", "with", "about", "against", "between", "into",
"through", "during", "before", "after", "above", "below", "to", "from",
"up", "down", "in", "out", "on", "off", "over", "under",
"again", "further", "then", "once", "here", "there", "when", "where",
"why", "how", "all", "any", "both", "each", "few", "more",
"most", "other", "some", "such", "no", "nor", "not", "only",
"own", "same", "so", "than", "too", "very", "s", "t",
"can", "will", "just", "don", "don\'t", "should", "should\'ve", "now",
"d", "ll", "m", "o", "re", "ve", "y", "ain",
"aren", "aren\'t", "couldn", "couldn\'t", "didn", "didn\'t", "doesn", "doesn\'t",
"hadn", "hadn\'t", "hasn", "hasn\'t", "haven", "haven\'t", "isn", "isn\'t",
"ma", "mightn", "mightn\'t", "mustn", "mustn\'t", "needn", "needn\'t", "shan",
"shan\'t", "shouldn", "shouldn\'t", "wasn", "wasn\'t", "weren", "weren\'t", "won",
"won\'t", "wouldn", "wouldn\'t"}

◆ m_termToDocument

std::unordered_map<String, std::unordered_map<int, int> > krotos::BM25::m_termToDocument
private
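
The type of m_termToDocument suggests an inverted index that maps each term to the documents containing it. The sketch below builds such a structure under the assumption that the inner map goes from document id to the number of occurrences of the term in that document; the page does not confirm this, and buildTermToDocument is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// term -> (document id -> term count); assumed meaning of the nested map
using TermToDocument =
    std::unordered_map<std::string, std::unordered_map<int, int>>;

// Hypothetical builder working on documents that are already tokenized.
TermToDocument buildTermToDocument(const std::vector<std::vector<std::string>>& documents)
{
    TermToDocument index;
    for (std::size_t id = 0; id < documents.size(); ++id)
        for (const auto& term : documents[id])
            ++index[term][static_cast<int>(id)]; // creates entries on first use
    return index;
}
```

With this layout, the document frequency of a term is the size of its inner map, and a term frequency is a single lookup by document id.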

The documentation for this class was generated from the following files: