Improve search backend: add stop words, use "federated search", highlight creators #1319

Merged (6 commits) on Jan 23, 2025


LukasKalbertodt

Fixes #1300
Fixes #1299

See commits for more information. There will be more changes for 3.0, but I wanted to open this already as the other changes can easily get their own PR(s).

This Meili update is needed for a few features we want to use. In 1.11,
"federated multi search" is added. And in 1.12, the match range contains
info about array indices, so that matches inside array fields can now be
correctly located.

Using federated search has several advantages:
- A better-defined ordering, with a simple way to boost some item types.
- Pagination is easier to implement.
- Slightly less code in Tobira.
- It could be faster (though I haven't observed a notable difference
  in my tiny tests).

This might change the result set for certain queries. I just quickly
checked several cases to make sure it's still useful. Returning the
exact same results as before is not important, it should just be useful.
And the idea is that with this, it can be more useful than before.

This is mostly done for one common case: a series where all events
have the same title as the series. Searching for the series name
previously showed all videos (in pretty much random order) first, and
only then the series. With this tiny boost, the series is shown first
in these cases. You can try that with the query "Quantenelektronik".
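
As a rough illustration of why a small per-type weight is enough here, the following sketch merges hits from several queries by multiplying each hit's ranking score with its query's weight, which is how Meilisearch documents the federation `weight` option. The `Hit` type, the `merge_weighted` function, and the weight value `1.1` are hypothetical and chosen only for this example; they are not taken from Tobira's code.

```rust
/// One search hit with a ranking score in 0.0..=1.0 (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub kind: &'static str,
    pub title: &'static str,
    pub ranking_score: f64,
}

/// Merges the hits of several queries into one list, ordered by
/// `ranking_score * weight` (descending). Each entry of `queries` is a
/// `(weight, hits)` pair. This only mimics the idea of a federated search
/// with per-query weights; it is not Meilisearch's actual implementation.
pub fn merge_weighted(queries: Vec<(f64, Vec<Hit>)>) -> Vec<Hit> {
    let mut scored: Vec<(f64, Hit)> = queries
        .into_iter()
        .flat_map(|(weight, hits)| {
            hits.into_iter()
                .map(move |hit| (hit.ranking_score * weight, hit))
        })
        .collect();

    // `sort_by` is stable, so hits with equal weighted scores keep their
    // relative input order.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().map(|(_, hit)| hit).collect()
}
```

With equal raw scores for a series and its identically titled events, a series weight slightly above 1.0 is enough to move the series to the top.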

Stop words are very common words that carry basically no information.
Usually, stop word lists are language specific and you can easily see
why: "hat" might be a normal word in English, but carries no
information in German. "these" might be a stop word in English, but is
a useful word in German. Unfortunately we don't have the luxury of only
supporting one language and in fact: we don't even know the language
of a certain document. So we are kind of forced to have a combined list.
I created this semi-manually by combining DE and EN (the only languages
we currently support), making sure that words that carry meaning in any
of the languages are not marked as stop words. Additional languages can
be added in the future, but each new one decreases the usefulness of the
list.
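
The combination rule described above can be sketched as follows: take the union of the per-language lists, then remove every word that carries meaning in any supported language. The function name `combine_stop_words` and the tiny word lists in the usage note are made up for illustration; the real list was created semi-manually.

```rust
use std::collections::HashSet;

/// Builds a combined stop word list from several per-language lists.
/// Words listed in `meaningful_somewhere` (words that carry meaning in at
/// least one supported language) are excluded from the result.
/// Hypothetical helper, not part of Tobira.
pub fn combine_stop_words<'a>(
    per_language: &[&[&'a str]],
    meaningful_somewhere: &[&str],
) -> HashSet<&'a str> {
    let meaningful: HashSet<&str> = meaningful_somewhere.iter().copied().collect();

    per_language
        .iter()
        .flat_map(|list| list.iter().copied())
        .filter(|word| !meaningful.contains(*word))
        .collect()
}
```

For example, with a German list containing "hat" and an English list containing "these", passing both words as `meaningful_somewhere` keeps them out of the combined list, so matches on them are preserved.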

Once the need arises, we can also easily add the feature to configure
your own stop words.

We could just send these stop words to Meili, instructing it to ignore
them. Unfortunately, that has some disadvantages, as Meili doesn't deal
nicely with stop words IMO: especially in phrase search, the
highlighting is broken and might confuse users. Phrase search still
kind of works, but from reading the docs, I think that with the stop
words "the" and "a", searching for "foo the bar" will also find
documents with the text "foo a bar".
See https://github.com/orgs/meilisearch/discussions/793

So instead, we just use the stop words to filter out matches in texts.
That doesn't improve indexing speed, search speed, or index size in
Meili, but it can vastly reduce the size of the GQL response to the
frontend and makes the frontend less likely to choke on these useless
matches.
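
A minimal sketch of this kind of filtering, assuming matches are given as byte ranges into the text: each match whose matched word is a stop word is dropped before the response is built. The `ByteSpan` type and the `filter_stop_word_matches` function are hypothetical names for this example and do not reflect Tobira's actual types.

```rust
use std::collections::HashSet;

/// A match inside a text, given as a byte range (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteSpan {
    pub start: usize,
    pub len: usize,
}

/// Returns only those matches whose matched text is not a stop word.
/// The comparison is case-insensitive; `stop_words` is expected to contain
/// lowercase words.
pub fn filter_stop_word_matches(
    text: &str,
    matches: &[ByteSpan],
    stop_words: &HashSet<&str>,
) -> Vec<ByteSpan> {
    matches
        .iter()
        .copied()
        .filter(|m| match text.get(m.start..m.start + m.len) {
            // Keep the match only if the matched word is not a stop word.
            Some(word) => !stop_words.contains(word.to_lowercase().as_str()),
            // Ranges that are out of bounds or do not fall on character
            // boundaries are dropped as well.
            None => false,
        })
        .collect()
}
```

In this sketch, the filtering happens purely on the search results, so it has no effect on what Meili indexes or how it searches.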

We might still use our stop words for more in the future (ignoring
matches in metadata or even sending them to Meili once Meili fixes its
problems).
@LukasKalbertodt LukasKalbertodt added the changelog:breaking Breaking changes label Jan 22, 2025