Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with scoring #129

Closed
rolftimmermans opened this issue Feb 18, 2022 · 15 comments
Closed

Issues with scoring #129

rolftimmermans opened this issue Feb 18, 2022 · 15 comments

Comments

@rolftimmermans
Copy link
Contributor

rolftimmermans commented Feb 18, 2022

Hi! First of all, v4 seems to be give slightly better search ranking than v3.

However, there is a crucial issue currently with the scoring of documents in our application for some search terms. I have tried to recreate this with a synthetic example. For that purpose I've collected 5 movies about sheep.

const ms = new MiniSearch({
  fields: ['title', 'description'],
  storeFields: ['title']
})

ms.add({
  id: 1,
  title: 'Rams',
  description: 'A feud between two sheep farmers.'
})

ms.add({
  id: 2,
  title: 'Shaun the Sheep',
  description: 'Shaun is a cheeky and mischievous sheep at Mossy Bottom farm who\'s the leader of the flock and always plays slapstick jokes, pranks and causes trouble especially on Farmer X and his grumpy guide dog, Bitzer.'
})

ms.add({
  id: 3,
  title: 'Silence of the Lambs',
  description: 'F.B.I. trainee Clarice Starling (Jodie Foster) works hard to advance her career, while trying to hide or put behind her West Virginia roots, of which if some knew, would automatically classify her as being backward or white trash. After graduation, she aspires to work in the agency\'s Behavioral Science Unit under the leadership of Jack Crawford (Scott Glenn). While she is still a trainee, Crawford asks her to question Dr. Hannibal Lecter (Sir Anthony Hopkins), a psychiatrist imprisoned, thus far, for eight years in maximum security isolation for being a serial killer who cannibalized his victims. Clarice is able to figure out the assignment is to pick Lecter\'s brains to help them solve another serial murder case, that of someone coined by the media as "Buffalo Bill" (Ted Levine), who has so far killed five victims, all located in the eastern U.S., all young women, who are slightly overweight (especially around the hips), all who were drowned in natural bodies of water, and all who were stripped of large swaths of skin. She also figures that Crawford chose her, as a woman, to be able to trigger some emotional response from Lecter. After speaking to Lecter for the first time, she realizes that everything with him will be a psychological game, with her often having to read between the very cryptic lines he provides. She has to decide how much she will play along, as his request in return for talking to him is to expose herself emotionally to him. The case takes a more dire turn when a sixth victim is discovered, this one from who they are able to retrieve a key piece of evidence, if Lecter is being forthright as to its meaning. A potential seventh victim is high profile Catherine Martin (Brooke Smith), the daughter of Senator Ruth Martin (Diane Baker), which places greater scrutiny on the case as they search for a hopefully still alive Catherine. Who may factor into what happens is Dr. Frederick Chilton (Anthony Heald), the warden at the prison, an opportunist who sees the higher profile with Catherine, meaning a higher profile for himself if he can insert himself successfully into the proceedings.'
})

ms.add({
  id: 4,
  title: 'Lamb',
  description: 'Haunted by the indelible mark of loss and silent grief, sad-eyed María and her taciturn husband, Ingvar, seek solace in back-breaking work and the demanding schedule at their sheep farm in the remote, harsh, wind-swept landscapes of mountainous Iceland. Then, with their relationship hanging on by a thread, something unexplainable happens, and just like that, happiness blesses the couple\'s grim household once more. Now, as a painful ending gives birth to a new beginning, Ingvar\'s troubled brother, Pétur, arrives at the farmhouse, threatening María and Ingvar\'s delicate, newfound bliss. But, nature\'s gifts demand sacrifice. How far are ecstatic María and Ingvar willing to go in the name of love?'
})

ms.add({
  id: 5,
  title: 'Ringing Bell',
  description: 'A baby lamb named Chirin is living an idyllic life on a farm with many other sheep. Chirin is very adventurous and tends to get lost, so he wears a bell around his neck so that his mother can always find him. His mother warns Chirin that he must never venture beyond the fence surrounding the farm, because a huge black wolf lives in the mountains and loves to eat sheep. Chirin is too young and naive to take the advice to heart, until one night the wolf enters the barn and is prepared to kill Chirin, but at the last moment the lamb\'s mother throws herself in the way and is killed instead. The wolf leaves, and Chirin is horrified to see his mother\'s body. Unable to understand why his mother was killed, he becomes very angry and swears that he will go into the mountains and kill the wolf.'
})

ms.search('sheep', { boost: { title: 2 } })

The following are the results:

[
  {
    id: 1,
    terms: [ 'sheep' ],
    score: 4.360862545683414,
    match: { sheep: [Array] },
    title: 'Rams'
  },
  {
    id: 2,
    terms: [ 'sheep' ],
    score: 3.163825722967836,
    match: { sheep: [Array] },
    title: 'Shaun the Sheep'
  },
  {
    id: 5,
    terms: [ 'sheep' ],
    score: 0.3964420496075831,
    match: { sheep: [Array] },
    title: 'Ringing Bell'
  },
  {
    id: 4,
    terms: [ 'sheep' ],
    score: 0.26090630615199917,
    match: { sheep: [Array] },
    title: 'Lamb'
  }
]

The issue is the following. I expect, without any doubt, that 'Shaun the Sheep' should be the top result. Why?

  • Because it is the only movie with 'sheep' in the title field and in the description field.
  • The subjective score of 'sheep' within a 3 word title is higher than 'sheep' in a 6 word description.
  • The subjective score of 'sheep' in 1 title out of 5 movies is much better than 4 descriptions out of 5 movies.
  • I have even boosted the title by a factor of 2. In our actual application, I don't really want to boost one field too much, because it can lead to other scoring problems.

So what goes wrong?

Fields with a high variance in length obscure fields with a low variance in length

The issue is that many other movies have very long descriptions, but 'Rams' only has a 6-word description. The relative scoring for field length is fieldLength / averageFieldLength. This heavily disadvantages the description of 'Shaun the Sheep', which is only of "average" length. This essentially means that if there is a high variance in a field's length, the documents with a short field get a very large boost. Regardless of matches in other fields!

A match in two distinct fields in the same document has no bonus

I would expect that 'Shaun the Sheep' is a great match for the query 'sheep' because it is the only document that has a match in both fields. I think it would be good to give a boost in those cases, similarly to how a document that matches two words in an OR query receives a boost.

So what are the options?

I think we could take a cue from Lucene, which uses 1 / sqrt(numFieldTerms) as the length normalisation factor.

https://www.compose.com/articles/how-scoring-works-in-elasticsearch/
https://theaidigest.in/how-does-elasticsearch-scoring-work/

Just as a quick test, if I take 1 / sqrt(fieldLength), I get the following results:

[
  {
    id: 2,
    terms: [ 'sheep' ],
    score: 1.8946174879859907,
    match: { sheep: [Array] },
    title: 'Shaun the Sheep'
  },
  {
    id: 1,
    terms: [ 'sheep' ],
    score: 0.08434033477788275,
    match: { sheep: [Array] },
    title: 'Rams'
  },
  {
    id: 5,
    terms: [ 'sheep' ],
    score: 0.03596283958463321,
    match: { sheep: [Array] },
    title: 'Ringing Bell'
  },
  {
    id: 4,
    terms: [ 'sheep' ],
    score: 0.020629628616731104,
    match: { sheep: [Array] },
    title: 'Lamb'
  }
]

I get the same results even if I drop the title boosting factor. That's actually exactly what I personally expect: the shorter fields should count more if they match unless I disadvantage them explicitly.

Problem solved?! Well, not really. What if I search for a highly specific sheep?

ms.search('chirin the sheep')
[
  {
    id: 2,
    terms: [ 'the', 'sheep' ],
    score: 4.537584326120562,
    match: { the: [Array], sheep: [Array] },
    title: 'Shaun the Sheep'
  },
  {
    id: 5,
    terms: [ 'chirin', 'the', 'sheep' ],
    score: 2.2902873329363285,
    match: { chirin: [Array], the: [Array], sheep: [Array] },
    title: 'Ringing Bell'
  },
  {
    id: 3,
    terms: [ 'the' ],
    score: 1.09077315757252,
    match: { the: [Array] },
    title: 'Silence of the Lambs'
  },
  {
    id: 4,
    terms: [ 'the', 'sheep' ],
    score: 0.2166111004756766,
    match: { the: [Array], sheep: [Array] },
    title: 'Lamb'
  },
  {
    id: 1,
    terms: [ 'sheep' ],
    score: 0.08434033477788275,
    match: { sheep: [Array] },
    title: 'Rams'
  }
]

I definitely wasn't looking for Shaun! 'Ringing Bell' should be the top result here, because it is the only match for 'chirin'. So what can we do? Taking cues from Lucene, it scores terms in query with a coordination mechanism. It effectively means the more term matches there are, the better the score should be. It uses matching terms / total terms as a weight factor for each document. This can also replace the 1.5 boost for OR queries. Hacking that into MiniSearch I get this:

[
  {
    id: 2,
    terms: [ 'the', 'sheep' ],
    score: 1.0445507364815925,
    match: { the: [Array], sheep: [Array] },
    title: 'Shaun the Sheep'
  },
  {
    id: 5,
    terms: [ 'chirin', 'the', 'sheep' ],
    score: 1.0298930944999127,
    match: { chirin: [Array], the: [Array], sheep: [Array] },
    title: 'Ringing Bell'
  },
  {
    id: 3,
    terms: [ 'the' ],
    score: 0.21087593054514742,
    match: { the: [Array] },
    title: 'Silence of the Lambs'
  },
  {
    id: 4,
    terms: [ 'the', 'sheep' ],
    score: 0.09627160021141183,
    match: { the: [Array], sheep: [Array] },
    title: 'Lamb'
  },
  {
    id: 1,
    terms: [ 'sheep' ],
    score: 0.028113444925960917,
    match: { sheep: [Array] },
    title: 'Rams'
  }
]

Almost there (1.04 vs 1.03), but not quite yet...

Lucene also uses the inverse document frequency of each term in the query as a factor for determining how unique a term is. I have not tested this (it touches more code in MiniSearch), but my guess is this would raise the score of 'Ringing Bell' to the top position because of the uniqueness of the term 'chirin'.

So, my question to you is this: would you be open to revising the scoring mechanism to be closer to what Lucene uses? I believe it could solve some practical issues with the current document scoring.

If you do, maybe we should collect some test sets which are realistic enough, but also small enough to be able to judge the scoring from the outside.

Looking forward to any thoughts you may have on this!

@lucaong
Copy link
Owner

lucaong commented Feb 18, 2022

I agree 100% with your analysis. And yes, I would be willing to revise the scoring mechanism, taking inspiration from other proven solutions like Lucene.

If you do, maybe we should collect some test sets which are realistic enough, but also small enough to be able to judge the scoring from the outside.

Sounds like a plan :)

@lucaong
Copy link
Owner

lucaong commented Feb 19, 2022

For more examples of the same issues, on the demo app, searching for “witch queen” scores the result “The Witch Queen of New Orleans” way too low compared with many songs from the band “Queen” that do not contain the word “witch” at all (my personal taste actually agrees with the current scoring, but MiniSearch should not be biased by musical taste 🙃).

It would make sense to specify the desired behavior and write tests for it. My feeling is that the following list of statements should each hold true, all else being equal, for exact match. They are also, in a more relaxed sense, ordered by importance, in the sense that their effect on scoring should be in decreasing order: if two statements apply, and would score in different directions, the one higher in the list should generally win (within a reasonable range, which is hard to define, but single “uncontroversial” examples like the ones you mentioned in your comment can be added to the test suite).

  1. A document matching more search terms should score higher than one matching fewer terms
  2. A document matching a term in more fields should score higher than a document matching the same term in fewer fields
  3. A document matching a less common search term should score higher than one matching a more common term
  4. A document matching a term in a shorter field should score higher than a document matching the same term in a longer field

All gets more complicated when considering fuzzy/prefix match and boosting, but even in those cases the general tendency should be the same. With the current scoring, the relative importance of those effects is different from the desired one, like in your examples.

Additions to the list, as well as discussion on their desired relative importance, is welcome, as they are a matter of perceived quality of results.

@rolftimmermans
Copy link
Contributor Author

rolftimmermans commented Feb 19, 2022

I have done some more research and reading and it seems Lucene has moved to scoring based on BM25. So my references above are outdated.

https://en.wikipedia.org/wiki/Okapi_BM25
https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java

I have been experimenting with implementing BM25 in MiniSearch and so far it looks promising. However, I just checked your suggested query with the demo app and 'Killer Queen' still scores first, just above 'The Witch Queen Of New Orleans'.

@lucaong
Copy link
Owner

lucaong commented Feb 20, 2022

Arguably the “Killer Queen” vs “The Witch Queen of New Orleans” is not as clear cut as the rest: more than one of the rules above are at play, pulling in different directions, so I don’t think it’s strongly defined which one should be first, as long as they are both at the top. I do feel that “The Witch Queen of New Orleans” is more likely to be what one is searching for when inputting “witch queen”, but second place is a much better outcome than the original one, buried below many other results.

@rolftimmermans
Copy link
Contributor Author

rolftimmermans commented Feb 20, 2022

Here's a summary of what I found out so far.

BM25 improves on classic TF-IDF scoring in two ways:

  • It adds diminishing returns for term frequency: a document matching a term twice is always better than a document matching it once, but never twice as good.
  • Search terms occurring in a short document are more important than those occurring in a longer document. For longer text to reach relevance, it needs to contain the term more frequently. But contrary to other models, this effect is non-linear.

BM25 is a really great model, but it is designed for "bags of words", not structured documents (documents with multiple fields). Extending such a model to work with structured documents can be done in at least two ways:

  1. Treat each field as a distinct document. This means that frequencies, length, etc. all are counted separately for each field. The total score is a linear combination (a "weighted sum") of the scores for individual fields.
  2. Combine all searched fields into one "bag of words". Searching in title and description would be identical to searching in documents with just one field, the concatenation title + description.

There are trade-offs between both approaches:

  • Model 1 removes some of the nice non-linear properties of BM25. That can give confusing results. For example: for a query of 2 terms, 1 term matching both fields is generally a worse result than both terms matching 1 field (the 'Witch Queen' issue), but this model does not make a distinction between the two.
  • Model 2 does not suffer from the linearity issue described above, but has another problem. The relative difference in field length is completely lost. For example; the fact that a title is really short, and a description much longer is important information, but this is thrown away. This model doesn't care if a term occurs in the title or in the description, they're treated equally. That leads to surprising results for datasets with different field lengths (such as the one I described above).

There are approaches to solving the issues. I've read two papers that attempt this.

  • BM25F (paper) tries to solve the issue for model 2 by applying field weights. But the field weights need to be chosen manually, which is a usability issue IMO.
  • BM25-FIC (paper, author's explanation) uses a hybrid of model 1 and model 2 and also applies field weights. In theory this looks promising, but it requires field weights to be computed for entire queries, for every document. But MiniSearch deconstructs queries into terms before matching them with documents, which seems to make this unusable for the current implementation of MiniSearch.

I have now implemented model 1, with the addition of applying a boosting score at the final stage. The boost depends on how many terms were matched. This is a blunt way to reward queries that match more terms. It seems to work reasonably well from what I observed so far; it seems to solve the 'Witch Queen' query issue. This solution is actually quite similar to the 1.5 boosting factor for OR queries that MiniSearch has now, except that it is applied uniformly to all query terms (the current implementation unintentionally (?) rewards terms that occur earliest, if the term count is 3+). It's not without precedent either, Lucene used to have this as well.

I have pushed the implementation to my bm25 branch. Please test it and let me know what you think!

@rolftimmermans
Copy link
Contributor Author

rolftimmermans commented Feb 20, 2022

I'll summarise how the current implementation of BM25 combined with result boosting implements the rules you mentioned above:

  1. A document matching more search terms should score higher than one matching fewer terms
    This is addressed by applying a factor to all results that depends on how many terms were matched. This is the "final" adjustment factor and therefore carries a lot of weight.

  2. A document matching a term in more fields should score higher than a document matching the same term in fewer fields
    This is addressed because Model 1 described above uses a simple (weighted) sum of the BM25 scores for each individual field. This is a linear effect, which means it it carries more effect than the next two items...

  3. A document matching a less common search term should score higher than one matching a more common term
    This is addressed by BM25. But frequencies are counted per field and not across different fields. Whether that is an issue might depend on the dataset. I'm not sure.

  4. A document matching a term in a shorter field should score higher than a document matching the same term in a longer field
    This is addressed because we use Model 1 described above, which keeps the field lengths and uses this information for scoring terms in individual fields.

@lucaong
Copy link
Owner

lucaong commented Feb 21, 2022

I researched a bit BM25 to get on the same page, and it definitely sounds like the way to go, also given it’s adoption in reference implementations of full-text search engines.

Thanks for the detailed explanation, model 1 plus the custom boosting indeed sounds like the best trade off. I am testing out your branch, if all looks good I’d first do a bugfix release of v4, then we can discuss if releasing bm25 as a v5, or v4.1. I am leaning for releasing a new major, maybe with an initial release candidate so we can test BM25 in the wild a bit. Once it’s stable (it looks like it is already) I am happy to test the release candidate on my production applications.

@rolftimmermans
Copy link
Contributor Author

rolftimmermans commented Feb 21, 2022

Sounds like a great plan. I support the idea of a beta version or release candidate so it is easier to test in real applications before releasing it as a stable version. No need to rush this...

If adopted, I also think it should be a major version bump. One important reason is that any boost factors a user has implemented need to be reconsidered.

@lucaong
Copy link
Owner

lucaong commented Feb 21, 2022

In my tests with your branch on the demo app, the bm25 scoring shows noticeable improvements to the perceived quality of the results, especially in the kind of cases we discussed above. So far, I did not find critical cases where bm25 feels really off, and it rather seem more balanced, closer to the priorities listed above, and less negatively affected by outliers when it comes to field length or repetition of terms.

Some quick example queries that show the improvement:

  • "great big" ranks Say Something by A Great Big World And Christina Aguilera as the top result with bm25, whereas the current algorithm ranks it 5th, after several results that do not contain both terms
  • Searching for "rollin stone" vs "rolling stone" vs "rolling stones" shows much better results with bm25 compared to the current implementation, both for the top result and for the following ones.
  • "danger" ranks Danger Been So Long higher than Dangerous with bm25, while the current implementation penalizes the former too much due to being longer
  • "lady gaga" shows all results from the artist Lady Gaga first with b25, including Lady Gaga featuring someone else. The current implementation instead heavily penalizes the results where the artist is Lady Gaga Featuring ...

@rolftimmermans
Copy link
Contributor Author

rolftimmermans commented Feb 21, 2022

That's great! Excellent that you tested and compared the ranking of both implementations. Do you feel there are any great test cases that would be worthwhile adding to a test set such as in https://github.com/rolftimmermans/minisearch/blob/bm25/src/MiniSearch.test.js#L940?

@lucaong
Copy link
Owner

lucaong commented Feb 21, 2022

I think the test cases you added are enough, and they’re great because it’s hopefully easy to agree on the desired result. The comments help a lot in explaining why the expected ranking should be that. If during the beta test we find something to fine-tune, we can add the relevant examples too.

One of my production uses of MiniSearch is in a very different domain. I expect that also there the results will improve, but it will be interesting to try on the field with a very different and non-trivial corpus.

@lucaong
Copy link
Owner

lucaong commented Feb 23, 2022

I released v5.0.0-beta1

@rolftimmermans
Copy link
Contributor Author

For our application v5.0.0-beta1 shows very noticeable improvements.

Maybe we could open an issue requesting user feedback, and pointing them to it from the change log?

@lucaong
Copy link
Owner

lucaong commented Feb 28, 2022

That makes perfect sense, I will. Our apps will start using the beta today, tests look promising.

@lucaong
Copy link
Owner

lucaong commented Mar 11, 2022

@rolftimmermans I created #142 to collect feedback on v5.0.0 beta. So far, the applications that my teams maintain show very noticeable improvements too, and no issue was reported. I did adjust boosting in one case to be a bit less aggressive.

I will close this issue in favor of the newly created one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants