English-minimal analyzer has bad plural stemming #42892
Comments
Pinging @elastic/es-search |
Is there a general rule, we can't just remove the If this is a hard problem, maybe an alternative would be to recommend using a synonym filter for those terms that get frequently misstemmed. A good list to start with could be added to our docs. If we want to change the behavior of this stemmer, I'd rather make a new one since this one is a direct implementation of a stemmer that is documented in a paper. |
I can do some digging but for a start I would expect For reference - crossword solvers: |
Probably for a different issue, but it would be good to consider how we name token filters and deal with BWC. |
According to the javadocs it's called the "S stemmer". I took the Signal Media million news dataset and used this script to benchmark my proposals. I measured the recall gain that could be had from removing the extra "e" for a number of suffixes.
sses -> ss effects
This was a very positive uplift in recall for the most popular terms. Most of the rare non-zero matches for the S stemmer forms are questionable -
shes -> sh effects
Another positive uplift. Disagreements with S stemmer like
tches -> tch effects
Another good uplift in recall for popular terms.
xes -> x effects
Reasonable uplift from the S stemmer.
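The four suffix rewrites benchmarked above can be sketched in a few lines of Python. This is only an illustration of the proposal, not the Lucene or Elasticsearch implementation: the names `PROPOSED_RULES` and `apply_proposed_rules` are hypothetical, and words that match none of the four suffixes are returned unchanged here rather than being passed on to the S stemmer.

```python
# Hypothetical sketch of the proposed suffix rewrites (not Lucene code).
# Each pair is (plural suffix, replacement), taken from the list above.
PROPOSED_RULES = (
    ("sses", "ss"),
    ("shes", "sh"),
    ("tches", "tch"),
    ("xes", "x"),
)


def apply_proposed_rules(word: str) -> str:
    """Drop the trailing "es" for the four benchmarked suffixes."""
    for suffix, replacement in PROPOSED_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    # Assumption: anything else is left alone in this sketch.
    return word


for plural in ("dresses", "brushes", "watches", "boxes"):
    print(plural, "->", apply_proposed_rules(plural))
```

A plain suffix table like this has no exception handling, so any word that merely happens to end in one of these suffixes is rewritten as well.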
|
Another bizarre choice in s-stemmer is to avoid any stemming of
|
Another proposal: |
|
It's possible that the EnglishMinimalStemmer's implementation of the original algorithm has a bug. This is the original S-stemmer description: The notes accompanying the table state :
For the "The first applicable rule" for I notice this implementation of the s-stemmer makes the same mistake. (Perhaps our Java version was a port of this javascript or vice versa?). @jpountz I've been working on a new TokenFilter but what does this ees/oes discovery mean for the existing EnglishMinimalStemmer code if it falls short of its goal in faithfully implementing the original paper? |
ches rules: It looks like the "es" can be dropped, except for a small number of words adopted into English such as cliche, quiche and avalanche.
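One way to picture this rule is a suffix check guarded by an exception list. The sketch below is an assumption about how it could look, not the code that was merged: `CHES_EXCEPTIONS` and `stem_ches` are hypothetical names, and the exception set contains only the three words mentioned in this comment.

```python
# Hypothetical sketch of a "ches -> ch" rule with an exception list.
# The exceptions are only the three adopted words named in the comment.
CHES_EXCEPTIONS = frozenset({"cliches", "quiches", "avalanches"})


def stem_ches(word: str) -> str:
    """Drop "es" from words ending in "ches", keeping the "e" for exceptions."""
    if not word.endswith("ches"):
        return word
    if word in CHES_EXCEPTIONS:
        return word[:-1]  # quiches -> quiche
    return word[:-2]  # watches -> watch


print(stem_ches("watches"), stem_ches("quiches"))
```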
|
Drops the trailing “e” in taxes, dresses, watches, etc., which otherwise causes mismatches between plural and singular forms. Closes elastic#42892
Final comparison of results
Having heard back from the author of the paper on which Lucene's
|
@markharwood - thanks for the effort put into analyzing this! As a temporary workaround, I gathered your corrected misstems in a synonyms file, here |
@softwaredoug I need to push this. Re your synonyms - note that there's a small amount of collateral damage in this stemming that you probably want to fix in your synonyms file - |
thanks, fixed! |
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
Benchmarks on real data have steered me towards this token filter, as other forms of stemmer are generally too aggressive for ecommerce (e.g. loafers==loaf).
Good plural stemming is ideally what is required, because most user searches are plural and yet product descriptions are singular (e.g. a search for "dresses" should match the product "red dress").
Good examples of plural stemming by this existing filter include:
cases -> case
shades -> shade
bottles -> bottle
However, these terms fail to match because of bad stemming:
dresses -> dresse
watches -> watche
brushes -> brushe
boxes -> boxe
Example reproduction:
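As a rough stand-in for an analyzer call, the Python sketch below emulates S-stemmer-style plural stripping and produces the same outputs as listed above. It is a simplified approximation, not the Lucene `EnglishMinimalStemmer` source: the function name `s_stem_sketch` is hypothetical, and the handling of the "aes", "ees" and "oes" endings (left untouched here) is an assumption.

```python
# Simplified emulation of S-stemmer-style plural stripping.
# This is an illustrative assumption, not the Lucene implementation.
def s_stem_sketch(word: str) -> str:
    if len(word) < 3 or not word.endswith("s"):
        return word
    if word.endswith(("us", "ss")):
        return word  # e.g. "dress" is already singular
    if word.endswith("ies") and not word.endswith(("aies", "eies")):
        return word[:-3] + "y"
    if word.endswith(("aes", "ees", "oes")):
        return word  # assumption: left untouched in this sketch
    return word[:-1]  # only the final "s" is removed, so "es" leaves an "e"


for plural in ("cases", "shades", "bottles", "dresses", "watches", "brushes", "boxes"):
    print(plural, "->", s_stem_sketch(plural))
```

Because only the final "s" is removed, "dresses" becomes "dresse", which no longer matches the indexed term "dress".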
Solution
It would be good to fix these poor examples of stemming, but we would obviously need to worry about backwards compatibility.