Issues with scoring #129
I agree 100% with your analysis. And yes, I would be willing to revise the scoring mechanism, taking inspiration from other proven solutions like Lucene.
Sounds like a plan :)
For more examples of the same issue: on the demo app, searching for “witch queen” scores the result “The Witch Queen of New Orleans” far too low compared with many songs by the band “Queen” that do not contain the word “witch” at all (my personal taste actually agrees with the current scoring, but MiniSearch should not be biased by musical taste 🙃).
It would make sense to specify the desired behavior and write tests for it. My feeling is that each of the following statements should hold true, all else being equal, for exact matches. They are also loosely ordered by importance, meaning that their effect on scoring should decrease down the list: if two statements apply and would score in different directions, the one higher in the list should generally win (within a reasonable range, which is hard to define, but single “uncontroversial” examples like the ones you mentioned in your comment can be added to the test suite).
Everything gets more complicated when considering fuzzy/prefix match and boosting, but even in those cases the general tendency should be the same. With the current scoring, the relative importance of those effects differs from the desired one, as in your examples. Additions to the list, as well as discussion of their desired relative importance, are welcome, as they are a matter of perceived quality of results.
I have done some more research and reading, and it seems Lucene has moved to scoring based on BM25 (https://en.wikipedia.org/wiki/Okapi_BM25), so my references above are outdated. I have been experimenting with implementing BM25 in MiniSearch and so far it looks promising. However, I just checked your suggested query with the demo app and 'Killer Queen' still scores first, just above 'The Witch Queen Of New Orleans'.
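For readers who want to see what BM25 computes, here is a minimal sketch of the textbook Okapi BM25 score for a single term in a single document, following the formula on the Wikipedia page linked above. It is not the code from the branch discussed in this thread; the function name `bm25TermScore`, its parameter names, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration.

```typescript
// Textbook Okapi BM25 score of one term in one document.
// Assumption: k1 = 1.2 and b = 0.75 are commonly quoted defaults,
// not values taken from MiniSearch.
function bm25TermScore(
  termFrequency: number,         // occurrences of the term in the document
  documentFrequency: number,     // number of documents containing the term
  totalDocuments: number,        // number of documents in the collection
  documentLength: number,        // length of this document, in terms
  averageDocumentLength: number, // mean document length in the collection
  k1: number = 1.2,
  b: number = 0.75
): number {
  // Inverse document frequency: rare terms weigh more.
  const idf = Math.log(
    1 + (totalDocuments - documentFrequency + 0.5) / (documentFrequency + 0.5)
  );
  // Length normalisation: b = 0 ignores length, b = 1 normalises fully.
  const lengthNorm = 1 - b + b * (documentLength / averageDocumentLength);
  // Term frequency saturates: the fraction never exceeds k1 + 1.
  return (idf * termFrequency * (k1 + 1)) / (termFrequency + k1 * lengthNorm);
}
```

With these assumptions, going from 1 to 10 occurrences of a term raises the score much more than going from 10 to 100, and the score can never exceed `idf * (k1 + 1)`, so heavy repetition of a term cannot dominate the ranking.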
Arguably, “Killer Queen” vs “The Witch Queen of New Orleans” is not as clear-cut as the rest: more than one of the rules above is at play, pulling in different directions, so I don’t think it’s strongly defined which one should be first, as long as they are both at the top. I do feel that “The Witch Queen of New Orleans” is more likely to be what one is searching for when entering “witch queen”, but second place is a much better outcome than the original one, where it was buried below many other results.
Here's a summary of what I found out so far. BM25 improves on classic TF-IDF scoring in two ways:
BM25 is a really great model, but it is designed for "bags of words", not structured documents (documents with multiple fields). Extending such a model to work with structured documents can be done in at least two ways:
There are trade-offs between both approaches:
There are approaches to solving the issues. I've read two papers that attempt this.
I have now implemented model 1, with the addition of applying a boosting score at the final stage. The boost depends on how many terms were matched. This is a blunt way to reward queries that match more terms. It seems to work reasonably well from what I observed so far; it seems to solve the 'Witch Queen' query issue. This solution is actually quite similar to the 1.5 boosting factor for OR queries that MiniSearch has now, except that it is applied uniformly to all query terms (the current implementation unintentionally (?) rewards terms that occur earliest, if the term count is 3+). It's not without precedent either, Lucene used to have this as well. I have pushed the implementation to my |
I'll summarise how the current implementation of BM25 combined with result boosting implements the rules you mentioned above:
I read up a bit on BM25 to get on the same page, and it definitely sounds like the way to go, also given its adoption in reference implementations of full-text search engines. Thanks for the detailed explanation; model 1 plus the custom boosting indeed sounds like the best trade-off. I am testing out your branch. If all looks good, I’d first do a bugfix release of v4, then we can discuss whether to release BM25 as v5 or v4.1. I am leaning towards releasing a new major, maybe with an initial release candidate so we can test BM25 in the wild a bit. Once it’s stable (it looks like it is already) I am happy to test the release candidate on my production applications.
Sounds like a great plan. I support the idea of a beta version or release candidate so it is easier to test in real applications before releasing it as a stable version. No need to rush this... If adopted, I also think it should be a major version bump. One important reason is that any |
In my tests with your branch on the demo app, the BM25 scoring shows noticeable improvements to the perceived quality of the results, especially in the kind of cases we discussed above. So far, I have not found critical cases where BM25 feels really off; rather, it seems more balanced, closer to the priorities listed above, and less negatively affected by outliers when it comes to field length or repetition of terms. Some quick example queries that show the improvement:
That's great! Excellent that you tested and compared the ranking of both implementations. Do you feel there are any great test cases that would be worthwhile adding to a test set such as in https://github.com/rolftimmermans/minisearch/blob/bm25/src/MiniSearch.test.js#L940?
I think the test cases you added are enough, and they’re great because it’s hopefully easy to agree on the desired result. The comments help a lot in explaining why the expected ranking should be that. If during the beta test we find something to fine-tune, we can add the relevant examples too. One of my production uses of MiniSearch is in a very different domain. I expect the results to improve there too, but it will be interesting to try it in the field with a very different and non-trivial corpus.
I released |
For our application Maybe we could open an issue requesting user feedback, and pointing them to it from the change log? |
That makes perfect sense, I will. Our apps will start using the beta today, tests look promising.
@rolftimmermans I created #142 to collect feedback on I will close this issue in favor of the newly created one. |
Hi! First of all, v4 seems to give slightly better search ranking than v3.
However, there is a crucial issue currently with the scoring of documents in our application for some search terms. I have tried to recreate this with a synthetic example. For that purpose I've collected 5 movies about sheep.
The following are the results:
The issue is the following. I expect, without any doubt, that 'Shaun the Sheep' should be the top result. Why? It matches in both the `title` field and in the `description` field.
So what goes wrong?
Fields with a high variance in length obscure fields with a low variance in length
The issue is that many other movies have very long descriptions, but 'Rams' only has a 6-word description. The relative scoring for field length is `fieldLength / averageFieldLength`. This heavily disadvantages the description of 'Shaun the Sheep', which is only of "average" length. This essentially means that if there is a high variance in a field's length, the documents with a short field get a very large boost, regardless of matches in other fields!
A match in two distinct fields in the same document has no bonus
I would expect that 'Shaun the Sheep' is a great match for the query 'sheep' because it is the only document that has a match in both fields. I think it would be good to give a boost in those cases, similarly to how a document that matches two words in an OR query receives a boost.
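To make the field-length effect described above concrete, here is a small sketch of the `fieldLength / averageFieldLength` ratio. The only number taken from this issue is the 6-word description; the other lengths, the helper name `relativeFieldLength`, and the idea that a term score is divided by this ratio are assumptions made for illustration.

```typescript
// Relative field length: fieldLength / averageFieldLength.
function relativeFieldLength(fieldLength: number, allFieldLengths: number[]): number {
  const total = allFieldLengths.reduce((sum, length) => sum + length, 0);
  const averageFieldLength = total / allFieldLengths.length;
  return fieldLength / averageFieldLength;
}

// Hypothetical description lengths (in words) for five documents:
// one 6-word description among much longer ones. The average is 500 / 5 = 100.
const descriptionLengths: number[] = [6, 100, 110, 134, 150];

const shortFactor = relativeFieldLength(6, descriptionLengths);     // 6 / 100
const averageFactor = relativeFieldLength(100, descriptionLengths); // 100 / 100

// Assumption: the term score is divided by the factor. The 6-word field then
// counts about 16.7 times as much as the average-length field.
const shortFieldAdvantage = averageFactor / shortFactor;
```

Under these assumptions, a single match in the very short description outweighs the same match in an average-length description by more than an order of magnitude, which is enough to drown out a moderate boost on another field.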
So what are the options?
I think we could take a cue from Lucene, which uses `1 / sqrt(numFieldTerms)` as the length normalisation factor.
https://www.compose.com/articles/how-scoring-works-in-elasticsearch/
https://theaidigest.in/how-does-elasticsearch-scoring-work/
Just as a quick test, if I take `1 / sqrt(fieldLength)`, I get the following results:
I get the same results even if I drop the title boosting factor. That's actually exactly what I personally expect: the shorter fields should count more if they match unless I disadvantage them explicitly.
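As a rough illustration of why `1 / sqrt(fieldLength)` behaves more gently, the sketch below compares a short and a long field. The lengths of 4 and 100 words and the helper name `sqrtLengthNorm` are made up for the example; this is not MiniSearch or Lucene code.

```typescript
// Length normalisation of the form 1 / sqrt(fieldLength).
// It depends only on the field itself, not on the other documents.
function sqrtLengthNorm(fieldLength: number): number {
  return 1 / Math.sqrt(fieldLength);
}

// Hypothetical field lengths, in words.
const shortNorm = sqrtLengthNorm(4);   // 1 / 2
const longNorm = sqrtLengthNorm(100);  // 1 / 10

// The short field counts 5 times as much as the long one.
const sqrtAdvantage = shortNorm / longNorm;

// For comparison: when dividing by fieldLength / averageFieldLength, the
// average cancels out and the advantage is the plain length ratio, 25.
const relativeAdvantage = 100 / 4;
```

The square root compresses the gap between short and long fields, so a short field is still favoured, but an unusually short field in one document no longer dominates the ranking.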
Problem solved?! Well, not really. What if I search for a highly specific sheep?
I definitely wasn't looking for Shaun! 'Ringing Bell' should be the top result here, because it is the only match for 'chirin'. So what can we do? Taking another cue from Lucene: it scores the terms in a query with a `coordination` mechanism. It effectively means the more term matches there are, the better the score should be. It uses `matching terms / total terms` as a weight factor for each document. This can also replace the 1.5 boost for OR queries. Hacking that into MiniSearch I get this:
Almost there (1.04 vs 1.03), but not quite yet...
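The coordination idea described above can be sketched in a few lines. This is only an illustration of the `matching terms / total terms` weight, not the change that was hacked into MiniSearch; the function names and the example scores are hypothetical.

```typescript
// Coordination factor: the fraction of query terms that a document matches.
function coordinationFactor(matchedTerms: number, totalQueryTerms: number): number {
  if (totalQueryTerms === 0) return 0; // an empty query matches nothing
  return matchedTerms / totalQueryTerms;
}

// Weight a document's raw score by its coordination factor.
function applyCoordination(
  rawScore: number,
  matchedTerms: number,
  totalQueryTerms: number
): number {
  return rawScore * coordinationFactor(matchedTerms, totalQueryTerms);
}

// Hypothetical two-term query. Both documents start from the same raw score.
const matchesBothTerms = applyCoordination(2.0, 2, 2); // keeps its score: 2
const matchesOneTerm = applyCoordination(2.0, 1, 2);   // halved: 1
```

Because the factor is applied per document rather than per term, it rewards every additional matched term equally, regardless of where that term appears in the query.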
Lucene also uses the inverse document frequency of each term in the query as a factor for determining how unique a term is. I have not tested this (it touches more code in MiniSearch), but my guess is this would raise the score of 'Ringing Bell' to the top position because of the uniqueness of the term 'chirin'.
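To show why inverse document frequency would favour a rare term such as 'chirin', here is a sketch using the plain textbook form `log(N / df)`. Lucene's actual formula differs in its details; the helper name and the assumption that 'sheep' occurs in all five documents while 'chirin' occurs in one are for illustration only.

```typescript
// Textbook inverse document frequency: log(totalDocuments / documentFrequency).
// Assumption: this is the plain textbook form, not Lucene's exact formula.
function inverseDocumentFrequency(
  documentFrequency: number,
  totalDocuments: number
): number {
  if (documentFrequency === 0) return 0; // term does not occur anywhere
  return Math.log(totalDocuments / documentFrequency);
}

// Hypothetical counts for a collection of five documents.
const idfSheep = inverseDocumentFrequency(5, 5);  // log(1) = 0: the term is everywhere
const idfChirin = inverseDocumentFrequency(1, 5); // log(5), about 1.61: the term is rare
```

Under these assumptions, a match on 'chirin' contributes far more to the score than a match on 'sheep', which is the effect needed to lift the only document containing the rare term.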
So, my question to you is this: would you be open to revising the scoring mechanism to be closer to what Lucene uses? I believe it could solve some practical issues with the current document scoring.
If you do, maybe we should collect some test sets which are realistic enough, but also small enough to be able to judge the scoring from the outside.
Looking forward to any thoughts you may have on this!