Improve search backend: add stop words, use "federated search", highlight creators #1319
Merged
LukasKalbertodt
merged 6 commits into
elan-ev:next
from
LukasKalbertodt:improve-search-backend
Jan 23, 2025
This is needed for a few features we want to use. In 1.11, "federated multi search" was added. And in 1.12, the match range contains info about array indices, so matches inside array fields can now be located correctly.
This has several advantages:
- More well-defined ordering, with a simple way to boost some item types.
- Easier to implement pagination.
- Slightly less code in Tobira.
- This could be faster (I haven't really observed a notable difference in my tiny tests, though).

This might change the result set for certain queries. I just quickly checked several cases to make sure it's still useful. Returning the exact same results as before is not important; the results just need to be useful. And the idea is that with this, search can be more useful than before.
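To illustrate why pagination gets easier, here is a minimal sketch of a request body for Meilisearch's `/multi-search` route with a `federation` object. The index names and the helper `build_federated_query` are hypothetical and not taken from Tobira's code; the sketch only builds the payload and does not contact a server.

```python
import json

# Hypothetical index names, chosen for illustration only.
INDEXES = ["events", "series"]

def build_federated_query(user_query, limit=20, offset=0):
    """Build the body for one POST to Meilisearch's /multi-search route."""
    return {
        # The "federation" object asks Meilisearch to merge the hits of all
        # queries into one ranked list. Pagination is then stated once here
        # instead of once per index.
        "federation": {"limit": limit, "offset": offset},
        "queries": [{"indexUid": uid, "q": user_query} for uid in INDEXES],
    }

body = build_federated_query("Quantenelektronik", limit=10, offset=20)
print(json.dumps(body, indent=2))
```

With one merged list, "page 3" is just `offset=20, limit=10`; without federation, the caller would have to interleave per-index result lists itself.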
This is mostly done for one common case: a series where all events have the same title as the series. Searching for the series name previously showed all videos (in pretty much random order) first, and only then the series. With this tiny boost, the series is shown first in these cases. You can try that with the query "Quantenelektronik".
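The effect of such a boost can be shown with a toy model. The sketch below assumes that each hit's ranking score is multiplied by a per-type weight before the merged list is sorted; the weight `1.1`, the scores, and the function `merge_hits` are made up for illustration and are not Tobira's actual values or code.

```python
def merge_hits(hits_by_type, weights):
    """Merge per-type hit lists into one list ordered by weighted score."""
    merged = []
    for item_type, hits in hits_by_type.items():
        weight = weights.get(item_type, 1.0)  # unlisted types get no boost
        for title, score in hits:
            merged.append((score * weight, item_type, title))
    # Stable sort: hits with equal weighted scores keep their input order.
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return [(item_type, title) for _, item_type, title in merged]

# A series whose events all share its title, so all raw scores are equal.
hits = {
    "event": [("Quantenelektronik", 0.9), ("Quantenelektronik", 0.9)],
    "series": [("Quantenelektronik", 0.9)],
}

unboosted = merge_hits(hits, {})
boosted = merge_hits(hits, {"series": 1.1})
print(unboosted[0][0], boosted[0][0])
```

Even a small weight is enough here because it only has to break a tie between otherwise identical scores.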
Stop words are very common words that carry basically no information. Usually, stop word lists are language specific and you can easily see why: "hat" might be a normal word in English, but carries no information in German. "these" might be a stop word in English, but is a useful word in German. Unfortunately, we don't have the luxury of only supporting one language, and in fact, we don't even know the language of a given document. So we are kind of forced to have a combined list.

I created this list semi-manually by combining DE and EN (the only languages we currently support), making sure that words that carry meaning in any of the languages are not marked as stop words. Additional languages can be added in the future, but each new one decreases the usefulness of the list. Once the need arises, we can also easily add the feature to configure your own stop words.

We could just send these stop words to Meili, instructing it to ignore them. Unfortunately, there are some disadvantages to that, as Meili doesn't deal nicely with stop words IMO: especially in phrase search, the highlighting is broken and might confuse users. Phrase search still kind of works, but from reading the docs, I think with the stop words "the" and "a", searching for "foo the bar" will also find documents with the text "foo a bar". See https://github.com/orgs/meilisearch/discussions/793

So instead, we just use the stop words to filter out matches in texts. That doesn't improve indexing speed, search speed, or index size in Meili, but it can vastly reduce the size of the GQL response to the frontend and makes the frontend less likely to choke on these useless matches. We might still use our stop words for more in the future (ignoring matches in metadata, or even sending them to Meili once Meili fixes its problems).
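The filtering step can be sketched as follows. This is a minimal illustration, assuming matches arrive as `(start, length)` spans into the searched text; the stop word set shown is a tiny made-up sample, not the combined DE/EN list from this PR, and `filter_matches` is a hypothetical name.

```python
# Illustrative sample only; the real combined list is much longer.
STOP_WORDS = {"the", "a", "and", "of", "und", "oder", "der", "das"}

def filter_matches(text, matches):
    """Drop match spans (start, length) that cover nothing but a stop word."""
    kept = []
    for start, length in matches:
        word = text[start:start + length].strip().lower()
        if word not in STOP_WORDS:
            kept.append((start, length))
    return kept

text = "the laser and the resonator"
# One span per word, as a search for this whole phrase might produce.
matches = [(0, 3), (4, 5), (10, 3), (14, 3), (18, 9)]

useful = filter_matches(text, matches)
print([text[s:s + n] for s, n in useful])
```

Because the filter runs after the search, it leaves indexing and ranking in Meili untouched and only shrinks what is sent to the frontend.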
owi92
approved these changes
Jan 22, 2025
This was referenced Jan 27, 2025
Fixes #1300
Fixes #1299
See commits for more information. There will be more changes for 3.0, but I wanted to open this already as the other changes can easily get their own PR(s).