A scorer that implements the BM25 ranking formula.
Warning: the default values
BM25Scorer.DEFAULT_K1 and
BM25Scorer.DEFAULT_B have changed in MG4J 1.1.2 (see below).
BM25 is the name of a formula derived from the probabilistic model. The essential feature
of the formula is that it assigns to each term appearing in a given document a weight depending
on the count (the number of occurrences of the term in the document), on the frequency (the
number of documents in which the term appears), and on the document length (in words).
There are a number
of incarnations of the formula, with small variations. Here, the weight
assigned to a term which appears in f documents out of a collection of N documents,
with respect to a document of length l in which the term appears c times, is
log( (N − f + 1/2) / (f + 1/2) ) · ( k1 + 1 ) c ⁄ ( c + k1 ( (1 − b) + b l / L ) ),
where L is the average document length, and k1 and b are
parameters that default to
BM25Scorer.DEFAULT_K1 and
BM25Scorer.DEFAULT_B; these values were chosen
following the suggestions given in
“Efficiency vs. effectiveness in Terabyte-scale information retrieval”, by Stefan Büttcher and Charles L. A. Clarke,
in Proceedings of the 14th Text REtrieval
Conference (TREC 2005). Gaithersburg, USA, November 2005. The logarithmic part (also known as
the idf, or inverse document frequency, part) is actually
replaced by the maximum of its value and
BM25Scorer.EPSILON_SCORE, so it is never negative (the net effect being that terms appearing
in more than half of the documents have almost no weight).
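As a rough illustration of the formula above, the following standalone sketch computes the weight of a single term. It is not the code of this class: the class name Bm25WeightSketch, the method names and the parameter names are hypothetical, and k1, b and the lower bound epsilon are passed in explicitly rather than read from BM25Scorer.DEFAULT_K1, BM25Scorer.DEFAULT_B and BM25Scorer.EPSILON_SCORE.

```java
/**
 * Illustrative sketch of the BM25 term weight described above.
 * All names here are hypothetical; this is not the BM25Scorer implementation.
 */
final class Bm25WeightSketch {
    private Bm25WeightSketch() {}

    /**
     * The logarithmic (idf) part, bounded below by epsilon so that it is never negative.
     *
     * @param numDocs   N, the number of documents in the collection
     * @param frequency f, the number of documents containing the term
     * @param epsilon   the lower bound (plays the role of EPSILON_SCORE)
     */
    static double idf(long numDocs, long frequency, double epsilon) {
        double raw = Math.log((numDocs - frequency + 0.5) / (frequency + 0.5));
        return Math.max(epsilon, raw);
    }

    /**
     * The full weight: idf part times the length-normalised count part.
     *
     * @param count        c, occurrences of the term in the document
     * @param docLength    l, length of the document in words
     * @param avgDocLength L, average document length in the collection
     * @param k1           the k1 parameter of the formula
     * @param b            the b parameter of the formula
     */
    static double weight(long numDocs, long frequency, int count, int docLength,
                         double avgDocLength, double k1, double b, double epsilon) {
        // (k1 + 1) c / (c + k1 ((1 - b) + b l / L))
        double countPart = (k1 + 1) * count
                / (count + k1 * ((1 - b) + b * docLength / avgDocLength));
        return idf(numDocs, frequency, epsilon) * countPart;
    }
}
```

For a document whose length equals the average (l = L) and a term occurring once (c = 1), the count part equals 1 and the weight reduces to the idf part. For a term occurring in more than half of the documents, the raw logarithm is negative, so idf returns epsilon and the weight is close to zero.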
This class uses a
it.unimi.dsi.mg4j.search.visitor.CounterCollectionVisitor and related classes to take into consideration only terms that are actually involved
in the current document.
author: Mauro Mereu
author: Sebastiano Vigna