Highlight stop words if they appear in the query #132

Merged: Toflar merged 11 commits into loupe-php:develop from feat/stopword-highlights on Dec 16, 2024

daun commented Dec 7, 2024

Slight adjustment to how the highlighter handles stop words.

Currently, stop words are never highlighted. With this change, stop words are highlighted if they are part of the query and occur next to other matched words. The goal is to improve the match between what people searched for and what gets highlighted. I found this most useful in the movie example, where there are lots of "The" and "An".

Examples

| Query | Before | After |
| --- | --- | --- |
| Pirates of the Caribbean: The Curse of the Black Pearl | Pirates of the Caribbean: The Curse of the Black Pearl | Pirates of the Caribbean: The Curse of the Black Pearl |
| a table for two | I booked a table for two. | I booked a table for two. |
| racing to a boxing match | While racing to a boxing match... | While racing to a boxing match... |
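
For future reference, the rule described above can be sketched roughly as follows. This is a language-agnostic illustration in Python, not the code from this PR: the `highlight` function, its parameters, and the `<em>` tags are hypothetical, and it assumes "next to other words" means the stop word sits in a contiguous run of query words that contains at least one non-stop word.

```python
import re


def highlight(text, query, stop_words, start="<em>", end="</em>"):
    """Wrap runs of query words in tags; stop words only count inside a run
    that also contains a regular (non-stop) query word. Hypothetical sketch."""
    stop = {w.lower() for w in stop_words}
    query_terms = {w.lower() for w in re.findall(r"\w+", query)}

    # Group consecutive words of the text that also occur in the query.
    runs, current = [], []
    for match in re.finditer(r"\w+", text):
        if match.group().lower() in query_terms:
            current.append(match)
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)

    out, pos = [], 0
    for run in runs:
        # A run made up only of stop words stays unhighlighted.
        if all(m.group().lower() in stop for m in run):
            continue
        run_start, run_end = run[0].start(), run[-1].end()
        out.append(text[pos:run_start])
        out.append(start + text[run_start:run_end] + end)
        pos = run_end
    out.append(text[pos:])
    return "".join(out)


print(highlight("I booked a table for two.", "a table for two", {"a", "for"}))
# I booked <em>a table for two</em>.
```

In this sketch each run is wrapped in a single pair of tags, so punctuation between adjacent matches ends up inside the highlight. A real highlighter may well tag each term separately.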

Prior art

Meilisearch also highlights stop words with the following note in their docs:

Note: attributesToHighlight also highlights terms configured as synonyms and stop words.

daun added 11 commits December 7, 2024 00:12
Signed-off-by: Philipp Daun <post@philippdaun.net>

daun commented Dec 16, 2024

@Toflar What are your thoughts on this one? It only becomes an issue when using stop words, but it's quite noticeable on certain queries.

Toflar commented Dec 16, 2024

Oh sorry, I totally missed this! Nice work! This makes total sense to me; I just wonder how Meilisearch handles this. Did you maybe research that, and would you want to put the notes here for future reference?

daun commented Dec 16, 2024

@Toflar Meilisearch also highlights stop words. They have the note below in their docs. I don't think it makes sense to highlight all stop words, as you'd have the highlights littered with "a" and "the", but the note is pretty vague and can be read either way. I've updated the PR comment with the note and a link to the Meilisearch docs as well.

Note: attributesToHighlight also highlights terms configured as synonyms and stop words.

Toflar merged commit 3bc815b into loupe-php:develop on Dec 16, 2024
18 checks passed

Toflar commented Dec 16, 2024

Thanks a lot for yet another awesome contribution! Thinking about providing some default stop word list. Something like we have for TypoTolerance already, maybe?

  • $configuration->withStopWords(Stopwords::defaultList()) (enabled by default, who does not want to have stop words?)
  • $configuration->withStopWords(Stopwords::disable())
  • $configuration->withStopWords(Stopwords::withAddedToDefault('foo', 'bar', 'baz'))
  • $configuration->withStopWords(Stopwords::withRemovedFromDefault('foo', 'bar', 'baz'))

just thinking out loud.

daun deleted the feat/stopword-highlights branch December 16, 2024 15:56

daun commented Dec 16, 2024

@Toflar A default stop word list sounds great for efficiency. Probably as an opt-in setting, though: you can easily get into trouble with multi-language setups. I can imagine somebody indexing French documents and not finding documents about tea (thé). You're probably only generating support requests with a default-enabled set of stop words 🙃 But providing the list would make sense, I think.

Toflar commented Dec 16, 2024

Which is why I think stop words lists should be language dependent ;)
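
The thé example and the language-dependent fix can be illustrated with a small sketch. It assumes the tokenizer folds diacritics before comparing terms, which this thread does not confirm. The `fold` and `is_stop_word` helpers and the two short word lists are hypothetical and only serve the illustration (Python, not Loupe's code).

```python
import unicodedata

# Tiny placeholder lists keyed by language code; real lists would be much longer.
STOP_WORDS = {
    "en": frozenset({"a", "an", "the"}),
    "fr": frozenset({"le", "la", "les", "un", "une"}),
}


def fold(term):
    """Lowercase the term and strip combining marks, so that 'thé' becomes 'the'."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def is_stop_word(term, language):
    """Check the folded term against the list for the given language only."""
    return fold(term) in STOP_WORDS.get(language, frozenset())


print(is_stop_word("thé", "en"))  # True: tea would be dropped as the English "the"
print(is_stop_word("thé", "fr"))  # False: the French list keeps it searchable
```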
