English-minimal analyzer has bad plural stemming #42892

Open
markharwood opened this issue Jun 5, 2019 · 15 comments · May be fixed by #43248
Labels
>bug priority:normal A label for assessing bug priority to be used by ES engineers :Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@markharwood
Contributor

markharwood commented Jun 5, 2019

Benchmarks on real data have steered me towards this token filter as other forms of stemmer are generally too aggressive for ecommerce (e.g. loafers==loaf).
Good plural-stemming is ideally what is required because most user searches are plural and yet product descriptions are singular (e.g. "dresses" search should match product "red dress").

Good examples of plural stemming by this existing filter include:

| Search string | Good stemmed form |
| --- | --- |
| cases | case |
| shades | shade |
| bottles | bottle |

However, these terms fail to match because of bad stemming:

| Search string | Bad stemmed form |
| --- | --- |
| dresses | dresse |
| watches | watche |
| brushes | brushe |
| boxes | boxe |

Example reproduction:

DELETE test
PUT test
{
  "settings": {
	"number_of_shards": 1,
	"number_of_replicas": 0,
	"analysis": {
	  "analyzer": {
		"my_analyzer": {
		  "tokenizer": "standard",
		  "filter": [
			"lowercase",
			"filter_english_minimal"
		  ]
		}
	  },
	  "filter": {
		"filter_english_minimal": {
		  "type": "stemmer",
		  "name": "minimal_english"
		}
	  }
	}
  },
  "mappings": {
	"_doc": {
	  "properties": {
		"name": {
		  "type": "text",
		  "analyzer": "my_analyzer"
		}
	  }
	}
  }
}


POST test/_doc/1
{
  "name":"red dress"  
}

# Does not match (search stems to "dresse")
GET test/_search
{
  "query":{
	"match":{
	  "name":"dresses"
	}
  }
}
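
The mismatch can also be illustrated outside Elasticsearch. The sketch below is a simplified stand-in for the analyzer chain in the reproduction, not the real `minimal_english` filter: `toy_minimal_stem` and `analyze` are hypothetical helpers, and the stemmer only strips a trailing `s` unless the word ends in `ss` or `us`, which is enough to reproduce the `dresse` stem reported above.

```python
import re

def toy_minimal_stem(token: str) -> str:
    """Simplified stand-in: strip one trailing 's' unless the word ends in 'ss' or 'us'."""
    if len(token) < 3 or not token.endswith("s"):
        return token
    if token.endswith(("ss", "us")):
        return token
    return token[:-1]

def analyze(text: str) -> list:
    """Rough imitation of: standard tokenizer -> lowercase -> stemmer."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_minimal_stem(t) for t in tokens]

indexed_terms = set(analyze("red dress"))  # terms stored for the document
query_terms = set(analyze("dresses"))      # terms produced for the search

print(sorted(indexed_terms))
print(sorted(query_terms))
print(indexed_terms & query_terms)  # empty set: the query term never meets an indexed term
```

Because the index holds `dress` and the query produces `dresse`, the intersection is empty and the document is not returned.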

Solution

It would be good to fix these poor examples of stemming, but we would obviously need to worry about backwards compatibility.

@markharwood markharwood added the :Search Relevance/Analysis How text is split into tokens label Jun 5, 2019
@markharwood markharwood self-assigned this Jun 5, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@jpountz
Contributor

jpountz commented Jun 5, 2019

Is there a general rule? We can't just remove the e before the s all the time, e.g. forces, votes, xylophones, etc.

If this is a hard problem, maybe an alternative would be to recommend using a synonym filter for those terms that get frequently misstemmed. A good list to start with could be added to our docs.

If we want to change the behavior of this stemmer, I'd rather make a new one since this one is a direct implementation of a stemmer that is documented in a paper.

@markharwood
Contributor Author

markharwood commented Jun 5, 2019

> Is there a general rule, we can't just remove the e before the s all the time?

I can do some digging, but for a start I would expect *ss, *tch, *x and *sh to be patterns where the es part of a plural can always be removed.

For reference - crossword solvers:
*sses examples
*tches examples
*shes examples
*xes examples
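
A minimal sketch of this suggestion, assuming the rule is applied as a plain suffix check ahead of ordinary `s` removal. The function name `proposed_plural_stem` and the `ss`/`us` guard on the fallback are illustrative assumptions, not code from the stemmer.

```python
# Suffixes where the whole "es" is dropped (from the patterns *ss, *tch, *sh, *x).
ES_SUFFIXES = ("sses", "tches", "shes", "xes")

def proposed_plural_stem(word: str) -> str:
    """Drop 'es' after ss/tch/sh/x; otherwise drop a plain trailing 's'."""
    if word.endswith(ES_SUFFIXES):
        return word[:-2]
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word

for w in ["dresses", "watches", "brushes", "boxes", "forces", "votes", "xylophones"]:
    print(w, "->", proposed_plural_stem(w))
```

Words such as forces, votes and xylophones do not end in one of the four suffixes, so they keep their `e` and only lose the final `s`.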

@markharwood
Contributor Author

markharwood commented Jun 5, 2019

> If we want to change the behavior of this stemmer, I'd rather make a new one

Probably for a different issue, but it would be good to consider how we name token filters and deal with BWC.
We end up with a proliferation of analyzer/tokenizer names: e.g. "english_light", "english_minimal" and now perhaps "english_better_minimal" as we evolve.
Perhaps it would be useful to let the names convey the intention (e.g. stemming plurals only) but also include a version number component to allow us to evolve the details of the implementation, e.g. english_minimal_2019.

@markharwood
Contributor Author

> this one is a direct implementation of a stemmer that is documented in a paper.

According to the javadocs it's called "S stemmer"

I took the Signal Media one-million news articles dataset and used this script to benchmark my proposals. I measured the recall gain that could be had from removing the extra "e" for a number of suffixes.
In the tables below it's important to note that the figures for Proposed stem count are all false negatives we have today when using the S-stemmer, so there's a lot of uplift for what I hope is a little loss.
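
The shape of such a comparison can be sketched as follows. This is not the benchmark script referred to above: `suffix_report` is a hypothetical helper, the token list is toy data, and counting raw term occurrences (rather than, say, matching documents) is an assumption.

```python
from collections import Counter

def suffix_report(tokens, plural_suffix, keep):
    """For each token ending in plural_suffix, compare two candidate stems.

    keep is how many characters of the suffix the proposed stem retains,
    e.g. 'sses' with keep=2 proposes 'dress' for 'dresses'.
    """
    counts = Counter(tokens)
    rows = []
    for word, plural_count in counts.items():
        if not word.endswith(plural_suffix):
            continue
        proposed = word[: len(word) - len(plural_suffix) + keep]
        s_stem = word[:-1]  # the form left when only the final 's' is removed
        rows.append((word, plural_count, proposed, counts[proposed],
                     s_stem, counts[s_stem]))
    # Most frequent plurals first, as in the tables that follow.
    return sorted(rows, key=lambda r: r[1], reverse=True)

toy_tokens = "dresses dress dress dresses dress posses posse posse".split()
report = suffix_report(toy_tokens, "sses", keep=2)
for row in report:
    print(row)
```

Each row has the same columns as the tables: plural and its count, the proposed stem and its count, then the S-stemmer form and its count.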

sses -> ss effects

This was a very positive uplift in recall for the most popular terms. Most of the rare non-zero matches for the S stemmer forms are questionable: presses is rarely likely to appear in English as the plural of presse. The valid exceptions that retain the e are crevasse and posse.
Popular wins for typical ecommerce site searches might be the various dresses and mattresses they sell.

Plural count Proposed stem Proposed stem count S stemmer S stemmer count
businesses 38698 business 143764
processes 14592 process 69044
losses 13164 loss 42859
classes 11946 class 48412 classe 23
passes 10998 pass 29525 passe 42
addresses 7491 address 48607
witnesses 5929 witness 8866
weaknesses 3406 weakness 5481
discusses 3301 discuss 25244
successes 3071 success 47715
glasses 3036 glass 12610
bosses 2609 boss 13511 bosse 25
masses 2594 mass 20427 masse 499
dresses 2502 dress 9588
illnesses 2316 illness 6664
misses 2070 miss 25704
crosses 1692 cross 27400 crosse 221
encompasses 1630 encompass 1067
sunglasses 1551 sunglass 76
stresses 1511 stress 11000
possesses 1384 possess 3022
progresses 1298 progress 23792
expresses 1296 express 15843
actresses 1165 actress 10873
kisses 916 kiss 3228
assesses 825 assess 6807 assesse 4
presses 820 press 82998 presse 1068
mattresses 556 mattress 931
dismisses 536 dismiss 2207
harnesses 484 harness 2438
excesses 438 excess 6456
grasses 409 grass 4848 grasse 23
surpasses 408 surpass 1397
confesses 366 confess 951
guesses 350 guess 11651
messes 335 mess 5290 messe 145
impresses 325 impress 2589
princesses 323 princess 3624 princesse 9
eyewitnesses 321 eyewitness 971
tosses 317 toss 3223
asses 233 ass 2104 asse 19
accesses 224 access 66139
molasses 218 molass 0
tresses 217 tress 59
eyeglasses 209 eyeglass 86
carcasses 189 carcass 270
blesses 184 bless 2349
suppresses 156 suppress 971
recesses 152 recess 884
ulysses 143 ulyss 0 ulysse 13
waitresses 141 waitress 483
mistresses 137 mistress 539
goddesses 137 goddess 699
gasses 132 gass 73
busses 118 buss 84 busse 20
congresses 116 congress 20118
thicknesses 112 thickness 1096
bypasses 105 bypass 1559
compresses 104 compress 189
masterclasses 100 masterclass 329
posses 91 poss 36 posse 252
professes 89 profess 287
glosses 87 gloss 1024
likenesses 86 likeness 621
hostesses 80 hostess 383
pisses 70 piss 402
overpasses 69 overpass 352
grosses 64 gross 9838 grosse 104
trusses 62 truss 429
disses 59 diss 226 disse 10
lionesses 58 lioness 74
agribusinesses 58 agribusiness 467
basses 56 bass 3483 basse 17
subclasses 49 subclass 69
obsesses 44 obsess 171
represses 42 repress 89
buttresses 41 buttress 137
trespasses 40 trespass 376
crevasses 40 crevass 0 crevasse 39
plusses 40 pluss 11
caresses 36 caress 115
depresses 33 depress 180
amasses 33 amass 273
headdresses 32 headdress 83
embarrasses 32 embarrass 396
stewardesses 31 stewardess 44
sundresses 29 sundress 42

shes -> sh effects

Another positive uplift. Disagreements with S stemmer like ashe or rushe tended to be people names that again would not account for the majority of the plural forms.
Popular wins for typical ecommerce site searches might be the various brushes, dishes.

Plural count Proposed stem Proposed stem count S stemmer S stemmer count
wishes 5284 wish 20877
dishes 4470 dish 4548
clashes 3388 clash 8264
finishes 3039 finish 21923
publishes 2376 publish 5568
pushes 2188 push 21428
ashes 2179 ash 2203 ashe 1418
crashes 2143 crash 12997
establishes 1163 establish 11758
flashes 1150 flash 7612
bushes 886 bush 8213
rushes 864 rush 8995 rushe 7
brushes 738 brush 3491
parishes 733 parish 3146
lashes 697 lash 516
washes 565 wash 5493
distinguishes 364 distinguish 1718
flourishes 347 flourish 1611
unleashes 344 unleash 1235
diminishes 321 diminish 1061
fishes 286 fish 9491
crushes 282 crush 2946
accomplishes 256 accomplish 4052
blemishes 254 blemish 328
refreshes 224 refresh 2162
skirmishes 220 skirmish 256
relishes 215 relish 862
smashes 214 smash 2139
splashes 204 splash 2338
dashes 197 dash 2802 dashe 7
rashes 172 rash 1290
eyelashes 171 eyelash 107
punishes 168 punish 1522
toothbrushes 161 toothbrush 383
radishes 154 radish 173
polishes 152 polish 2726
slashes 151 slash 1729
marshes 149 marsh 1876
gushes 142 gush 203
nourishes 140 nourish 419
blushes 131 blush 714
vanishes 124 vanish 412
cherishes 111 cherish 1035
leashes 110 leash 584
meshes 87 mesh 1432
languishes 82 languish 220
furnishes 81 furnish 512
flushes 80 flush 1162
bashes 79 bash 1653 bashe 7
replenishes 79 replenish 462
squashes 68 squash 1143
sashes 66 sash 219
garnishes 66 garnish 700
ambushes 64 ambush 610
trashes 60 trash 3324
quashes 59 quash 225
hashes 56 hash 783
demolishes 56 demolish 482
mouthwashes 49 mouthwash 188
backsplashes 48 backsplash 135
tarnishes 43 tarnish 311
paintbrushes 41 paintbrush 166
admonishes 39 admonish 53
cashes 39 cash 30046
varnishes 35 varnish 175
fetishes 34 fetish 256
stashes 34 stash 869
perishes 32 perish 392
banishes 31 banish 261
whitewashes 29 whitewash 297
refurbishes 28 refurbish 209
fleshes 28 flesh 2287
gashes 28 gash 253
abolishes 25 abolish 800
mashes 23 mash 933
brandishes 21 brandish 56
embellishes 17 embellish 152
thrashes 14 thrash 397
rehashes 14 rehash 98
thrushes 13 thrush 75
dervishes 13 dervish 30
extinguishes 13 extinguish 541
lavishes 10 lavish 1283
noshes 9 nosh 75
vanquishes 9 vanquish 78
sloshes 7 slosh 18
reestablishes 7 reestablish 108
swashes 7 swash 15
refinishes 7 refinish 72
knishes 5 knish 9
stoushes 5 stoush 63

tches -> tch

Another good uplift in recall for popular terms.
Disagreement with the S-stemmer is limited to WATCHe, which looks to be a brand name and certainly should not be seen as the stem of the popular watches term.
These words are mostly nouns and would benefit most ecommerce-type searches.

Plural count Proposed stem Proposed stem count S stemmer S stemmer count
matches 16535 match 39680
catches 4186 catch 17512
watches 2966 watch 56591 watche 5
pitches 2426 pitch 11668
stretches 1964 stretch 8756
switches 1769 switch 9901
patches 1523 patch 4377
sketches 889 sketch 2123
batches 888 batch 3122
stitches 694 stitch 688
scratches 648 scratch 3301
witches 540 witch 1492
smartwatches 518 smartwatch 966
glitches 464 glitch 564
clutches 448 clutch 1770
crutches 402 crutch 148
ditches 289 ditch 2272
dispatches 285 dispatch 2112
notches 277 notch 2482
bitches 177 bitch 1002
hatches 172 hatch 1152
swatches 170 swatch 181
mismatches 132 mismatch 481
latches 124 latch 519
snatches 107 snatch 831
hitches 101 hitch 715
fetches 77 fetch 926
wristwatches 39 wristwatch 109
britches 35 britch 1
blotches 33 blotch 10
twitches 33 twitch 518
rematches 26 rematch 846
itches 24 itch 340
splotches 20 splotch 8
crotches 17 crotch 182
scotches 15 scotch 638
cwtches 15 cwtch 5
etches 14 etch 122
masterbatches 14 masterbatch 18
wretches 13 wretch 44
snitches 12 snitch 74
despatches 12 despatch 125
botches 9 botch 86
iwatches 7 iwatch 42

xes -> x effects

Reasonable uplift from the S-stemmer.
Exceptions like taxe and boxe are, again, names or acronyms that wouldn't be the usual interpretation of the plural form from which they stem.
Ecommerce sites will benefit from matches on the various boxes they sell (gearboxes to jewellery boxes). One noticeable false stem is axes, which the proposal stems to ax rather than axe.

Plural count Proposed stem Proposed stem count S stemmer S stemmer count
taxes 10477 tax 24595 taxe 4
boxes 5351 box 26638 boxe 8
indexes 1920 index 19179
fixes 1418 fix 10755 fixe 68
mixes 970 mix 18027
complexes 617 complex 24480 complexe 4
foxes 591 fox 14241
sixes 398 six 87451
sexes 378 sex 16266 sexe 6
axes 281 ax 371 axe 963
remixes 252 remix 1153
exes 238 ex 17191 exe 112
relaxes 195 relax 3978
reflexes 191 reflex 343
mailboxes 182 mailbox 837
hoaxes 135 hoax 1187
inboxes 114 inbox 3963
annexes 111 annex 514 annexe 46
waxes 109 wax 1122
multiplexes 105 multiplex 185
gearboxes 86 gearbox 646
flexes 72 flex 1475
faxes 72 fax 3908
lunchboxes 62 lunchbox 234
duplexes 56 duplex 367
paradoxes 54 paradox 566
tuxes 49 tux 192
climaxes 45 climax 1000
sandboxes 45 sandbox 330
influxes 34 influx 4764
maxes 32 max 8292
prefixes 30 prefix 114
coaxes 29 coax 287
toolboxes 28 toolbox 364
nixes 25 nix 222
premixes 25 premix 25
vortexes 24 vortex 381
fluxes 23 flux 650
suplexes 22 suplex 64
shoeboxes 22 shoebox 106
equinoxes 22 equinox 479
vexes 22 vex 57
hotfixes 21 hotfix 38
connexes 20 connex 16
xerxes 18 xerx 6
alexes 18 alex 12931
suffixes 17 suffix 67
checkboxes 16 checkbox 277
bugfixes 15 bugfix 7
crucifixes 15 crucifix 168
jukeboxes 14 jukebox 141
letterboxes 13 letterbox 83
saxes 13 sax 219 saxe 45
subindexes 13 subindex 34
hexes 13 hex 171
suezmaxes 12 suezmax 69
unboxes 12 unbox 31
perplexes 12 perplex 21
affixes 11 affix 108
detoxes 11 detox 431
pickaxes 9 pickax 5 pickaxe 8
rolexes 9 rolex 285
apexes 8 apex 1214
xboxes 8 xbox 2671
praxes 8 prax 1
aframaxes 8 aframax 16
cineplexes 7 cineplex 68
appendixes 6 appendix 878
flummoxes 6 flummox 9
panamaxes 6 panamax 104
boomboxes 6 boombox 40
transfixes 6 transfix 15
jinxes 5 jinx 244
textboxes 5 textbox 17
muxes 4 mux 20

@markharwood
Contributor Author

Another bizarre choice in the S-stemmer is to avoid any stemming of ees (i.e. the trailing s is not removed).
Tests on the Signal Media 1m news dataset show this overlooks a lot of valid words.
The only ees words that are not plurals here are raees (a Bollywood movie) and drees and dees (names), all of which, when indexed as stemmed, would not clash with other common English words. I recommend making this change to the original S-stemmer algorithm too.

Plural count Proposed stem Proposed stem count
employees 32064 employee 12965
refugees 17323 refugee 12865
sees 12675 see 183619
fees 11210 fee 11549
degrees 8705 degree 21259
trees 7550 tree 9160
attendees 6994 attendee 493
guarantees 4877 guarantee 12974
agrees 3806 agree 18583
oversees 2491 oversee 2879
committees 2451 committee 27464
yankees 2445 yankee 963
knees 2390 knee 9674
nominees 2050 nominee 2900
trustees 1888 trustee 1491
bees 1215 bee 2001
retirees 1067 retiree 472
referees 1006 referee 3752
brees 814 bree 116
franchisees 778 franchisee 476
disagrees 776 disagree 2973
honorees 666 honoree 335
rupees 643 rupee 855
detainees 637 detainee 262
devotees 634 devotee 159
frees 588 free 116658
tees 530 tee 1876
trainees 487 trainee 633
licensees 468 licensee 567
entrees 434 entree 350
rees 418 ree 88
coffees 413 coffee 11426
lees 393 lee 13865
appointees 327 appointee 176
toffees 292 toffee 175
evacuees 246 evacuee 37
foresees 241 foresee 769
flees 226 flee 3335
decrees 218 decree 945
inductees 212 inductee 469
awardees 203 awardee 113
threes 186 three 194122
returnees 178 returnee 72
chimpanzees 173 chimpanzee 94
grantees 169 grantee 100
interviewees 162 interviewee 91
enrollees 156 enrollee 34
invitees 139 invitee 48
escapees 122 escapee 70
pharisees 113 pharisee 41
honeybees 112 honeybee 85
absentees 112 absentee 288
burpees 108 burpee 64
amputees 84 amputee 201
divorcees 77 divorcee 123
gees 74 gee 670
lessees 72 lessee 86
emcees 66 emcee 321
pedigrees 65 pedigree 971
humvees 63 humvee 62
soirees 60 soiree 161
maccabees 53 maccabee 15
sarees 50 saree 58
manatees 50 manatee 133
elysees 50 elysee 86
dees 49 dee 1223
marquees 47 marquee 1350
loanees 44 loanee 183
signees 43 signee 83
pees 41 pee 534
mentees 41 mentee 58
purees 40 puree 438
monkees 39 monkee 4
kees 38 kee 153
jaycees 35 jaycee 24
bumblebees 33 bumblebee 76
fugees 33 fugee 1
transferees 19 transferee 27
drees 12 dree 12

@markharwood
Contributor Author

Another proposal:
The *ies -> *y rule should only apply to words longer than 4 characters.
In tests on the news dataset this proposal loses nothing but gains matches on pies -> pie, lies->lie and ties->tie.
The only two-letter *y word of consequence in English is by, which does not have a plural.
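
A minimal sketch of the length guard, assuming "longer than 4 characters" means `len(word) > 4` and that shorter *ies words simply lose their trailing `s`. The function name `stem_ies` is hypothetical, and only this one rule is shown.

```python
def stem_ies(word: str) -> str:
    """Apply ies -> y only to words longer than four characters."""
    if word.endswith("ies"):
        if len(word) > 4:
            return word[:-3] + "y"   # parties -> party
        return word[:-1]             # pies -> pie
    return word

for w in ["pies", "lies", "ties", "ponies", "parties"]:
    print(w, "->", stem_ies(w))
```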

@markharwood
Contributor Author

oes plurals are not treated at all by the "S" stemmer.
I believe a positive uplift can be had by removing the es part of the suffix.
In the table below I contrast the popularity of the full oes term, an aggressive o stemmed form and a less aggressive oe stem in the million news articles dataset.
The aggressive stem back to the o word looks to be the most useful stemming rule to employ, and a small set of exceptions that should retain the e (canoe, shoe and oboe) could be maintained as a list in the code.
The use of an exception list could make this a more contentious rule, but it should be noted that the alternative is to stick with the current policy of not stemming oes suffixes at all, which can produce a lot of false negatives.

Plural count es stemmed count s stemmed count
shoes 8173 sho 255 shoe 2902
heroes 5076 hero 8280
tomatoes 2507 tomato 2133
potatoes 2271 potato 2530 potatoe 7
echoes 1192 echo 2485
superheroes 647 superhero 1554
mosquitoes 541 mosquito 744
undergoes 495 undergo 3492
volcanoes 353 volcano 668
tornadoes 336 tornado 688
buffaloes 240 buffalo 4746 buffaloe 4
cargoes 238 cargo 3595
throes 237 thro 27
zeroes 179 zero 13047
vetoes 157 veto 1903
canoes 147 cano 221 canoe 531
mangoes 144 mango 902
dominoes 113 domino 524
faroes 103 faro 195 faroe 342
negroes 85 negro 263
horseshoes 85 horseshoe 385
torpedoes 83 torpedo 202
frescoes 73 fresco 330
embargoes 70 embargo 1007
kroes 49 kro 27
backhoes 48 backhoe 73
mementoes 41 memento 275
tiptoes 33 tiptoe 53
floes 27 flo 277 floe 11
outdoes 27 outdo 222
simoes 26 simo 22
marloes 24 marlo 104
dingoes 22 dingo 31
forgoes 21 forgo 441
briscoes 18 brisco 9 briscoe 227
cohoes 16 coho 50
commandoes 14 commando 293
snowshoes 14 snowshoe 20
undoes 14 undo 665
avocadoes 14 avocado 981
mottoes 13 motto 1345
antiheroes 13 antihero 54
siloes 13 silo 253
foregoes 13 forego 271
flamingoes 12 flamingo 233
overdoes 12 overdo 183
sloes 11 slo 145 sloe 25
ghettoes 10 ghetto 300
gittoes 10 gitto 4
innuendoes 10 innuendo 250
manifestoes 9 manifesto 1041
haloes 9 halo 889
coxconservesheroes 8 coxconserveshero 8
aloes 8 alo 41 aloe 283
grottoes 7 grotto 150
ciscoes 7 cisco 2340
acoes 7 aco 113
desperadoes 6 desperado 43
sheroes 6 shero 53
peccadilloes 6 peccadillo 7
erdoes 6 erdo 13
weirdoes 6 weirdo 171
tahoes 5 taho 24 tahoe 433
supervolcanoes 5
ringoes 5 ringo 252
oboes 5 obo 32 oboe 43
domingoes 5 domingo 391
porticoes 3 portico 122
sermoheroes 3
vidoes 3
faeroes 3 faeroe 6
fiascoes 3 fiasco 573
hammertoes 3 hammertoe 6
ricardoes 3 ricardo 1041
groes 3 gro 108
croes 3 cro 274
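
The oes rule with its exception list could look roughly like this. It is a sketch under assumptions: the name `stem_oes` is hypothetical, and matching the exceptions as word endings (so that compounds such as horseshoes also keep their `e`) is my choice rather than something stated above.

```python
# Singular forms that keep their 'e' (the exceptions named in the comment above).
KEEP_E = ("shoe", "canoe", "oboe")

def stem_oes(word: str) -> str:
    """Strip 'es' from *oes plurals, except where the singular ends in a KEEP_E word."""
    if not word.endswith("oes"):
        return word
    with_e = word[:-1]          # shoes -> shoe
    if with_e.endswith(KEEP_E):
        return with_e
    return word[:-2]            # heroes -> hero

for w in ["heroes", "tomatoes", "shoes", "horseshoes", "canoes", "oboes", "toes"]:
    print(w, "->", stem_oes(w))
```

Short words outside the exception list are stemmed aggressively as well, so toes becomes to under this sketch.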

@markharwood
Contributor Author

markharwood commented Jun 17, 2019

It's possible that the EnglishMinimalStemmer's implementation of the original algorithm has a bug.

This is the original S-stemmer description:
[image: the S-stemmer rule table from the paper]

The notes accompanying the table state :

"the first applicable rule encountered is the only one used"

For the ees and oes suffixes I think EnglishMinimalStemmer misinterpreted the rule logic and consequently bees != bee and tomatoes != tomato. The oes and ees suffixes are left intact.

"The first applicable rule" for ees could be interpreted as rule 2 or 3 in the table depending on if you take applicable to mean "the THEN part of the rule has fired" or just that the suffix was referenced in the rule. EnglishMinimalStemmer assumed the latter and I think it should be the former. We should fall through into rule 3 for ees and oes (remove any trailing S). That's certainly the conclusion I came to independently testing on real data.

I notice this implementation of the S-stemmer makes the same mistake. (Perhaps our Java version was a port of this JavaScript or vice versa?)

@jpountz I've been working on a new TokenFilter but what does this ees/oes discovery mean for the existing EnglishMinimalStemmer code if it falls short of its goal in faithfully implementing the original paper?
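
The two readings of "first applicable rule" can be compared in a small sketch. The rule table is an image that is not reproduced in this text, so the three rules below (including the `aies`/`eies` and `us`/`ss` exceptions) are an assumption reconstructed from the discussion, and `s_stem` with its `fall_through` flag is a hypothetical helper rather than the Lucene code.

```python
def s_stem(word: str, fall_through: bool) -> str:
    """S-stemmer sketch.

    fall_through=False: a rule that merely mentions the suffix ends processing.
    fall_through=True:  an excepted suffix moves on to the next rule.
    """
    # Rule 1 (assumed): "ies" -> "y", unless the word ends in "aies" or "eies".
    if word.endswith("ies"):
        if not word.endswith(("aies", "eies")):
            return word[:-3] + "y"
        if not fall_through:
            return word
    # Rule 2 (assumed): "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if word.endswith("es"):
        if not word.endswith(("aes", "ees", "oes")):
            return word[:-1]
        if not fall_through:
            return word
    # Rule 3 (assumed): remove a trailing "s", unless the word ends in "us" or "ss".
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

for w in ["bees", "tomatoes", "dresses", "cases"]:
    print(w, s_stem(w, fall_through=False), s_stem(w, fall_through=True))
```

Under these assumptions, falling through turns bees into bee but only turns tomatoes into tomatoe, and dresses still becomes dresse in both readings because rule 2 fires either way.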

@markharwood
Contributor Author

Ches rules:

Looks like the es can be dropped, but with a small number of exceptions for English-adopted words like cliche, quiche and avalanche.

Plural count Proposed stem Proposed stem count Retain E Retain E count
matches 16535 match 39680
launches 8251 launch 32948
coaches 8189 coach 35967
approaches 6736 approach 36169
inches 5675 inch 9864 inche 7
reaches 4684 reach 45691 reache 5
catches 4186 catch 17512
branches 3606 branch 8197
touches 3598 touch 22720 touche 208
teaches 3384 teach 8592
churches 3166 church 21364
watches 2966 watch 56591 watche 5
speeches 2832 speech 18023
searches 2829 search 31303
breaches 2739 breach 5148
beaches 2641 beach 19419
pitches 2426 pitch 11668
sandwiches 2203 sandwich 2630
stretches 1964 stretch 8756
switches 1769 switch 9901
headaches 1565 headache 1990
patches 1523 patch 4377
punches 1227 punch 4798
lunches 1169 lunch 10838
riches 890 rich 21176 riche 59
sketches 889 sketch 2123
batches 888 batch 3122
benches 810 bench 9126
stitches 694 stitch 688
scratches 648 scratch 3301
trenches 618 trench 547
peaches 618 peach 1247
marches 608 march 30547 marche 48
witches 540 witch 1492
smartwatches 518 smartwatch 966
attaches 496 attach 1896 attache 85
arches 473 arch 1973 arche 15
glitches 464 glitch 564
clutches 448 clutch 1770
pouches 417 pouch 628
researches 414 research 87327
crutches 402 crutch 148
niches 296 nich 6 niche 5715
ditches 289 ditch 2272
dispatches 285 dispatch 2112
notches 277 notch 2482
preaches 271 preach 891
couches 260 couch 2131 couche 45
tranches 253 tranche 469
torches 222 torch 827
bunches 217 bunch 7482 bunche 25
enriches 211 enrich 1720
backbenches 208 backbench 485
ranches 181 ranch 2495
bitches 177 bitch 1002
hatches 172 hatch 1152
swatches 170 swatch 181
cliches 166 clich 20 cliche 332
cockroaches 164 cockroach 79
crunches 152 crunch 2093
porches 143 porch 1408 porche 17
pooches 138 pooch 328
caches 133 cache 983
mismatches 132 mismatch 481
starches 129 starch 400
latches 124 latch 519
clinches 122 clinch 1744
porsches 113 porsche 2109
snatches 107 snatch 831
avalanches 105 avalanche 722
hitches 101 hitch 715
perches 97 perch 623
roaches 94 roach 372 roache 20
wrenches 93 wrench 351
finches 89 finch 860
pinches 80 pinch 2441 pinche 6
fetches 77 fetch 926
leeches 68 leech 109
brunches 66 brunch 924
lurches 63 lurch 239
mustaches 44 mustache 314
relaunches 44 relaunch 362
apaches 44 apache 889
breeches 42 breech 108
brooches 41 brooch 134
slouches 40 slouch 171
wristwatches 39 wristwatch 109
winches 39 winch 143
heartaches 38 heartache 516
psyches 37 psych 305 psyche 581
moustaches 37 moustache 220
haunches 34 haunch 7
hunches 34 hunch 246
blotches 33 blotch 10
beseeches 33 beseech 56
twitches 33 twitch 518
smooches 31 smooch 82
quiches 31 quiche 100
deutsches 29 deutsch 643 deutsche 4088
encroaches 28 encroach 136
entrenches 27 entrench 112
rematches 26 rematch 846
goldfinches 26 goldfinch 51
flinches 26 flinch 180
roches 25 roch 60 roche 1329
outreaches 24 outreach 3569
beeches 23 beech 473
naches 23 nach 57
bleaches 21 bleach 411
detaches 20 detach 162
poaches 19 poach 151
birches 19 birch 601
impeaches 17 impeach 145
crouches 16 crouch 257
belches 16 belch 37
cwtches 15 cwtch 5
masterbatches 14 masterbatch 18
geocaches 13 geocache 16
cinches 12 cinch 140
stiches 12 stich 11
despatches 12 despatch 125
botches 9 botch 86

markharwood added a commit to markharwood/elasticsearch that referenced this issue Jul 4, 2019
Drops the trailing “e” in taxes, dresses, watches etc that otherwise cause mismatches with plural and singular forms

Closes elastic#42892
@markharwood
Contributor Author

markharwood commented Jul 17, 2019

Final comparison of results

Having heard back from the author of the paper on which Lucene's EnglishMinimalStemFilter is based, I've concluded that the S-Stemmer algorithm presented there has muddled logic and the implementation of it in Lucene is also buggy.
Below is a final round-up of the differences between the proposed #43248 stemmer and the Lucene filter based on trials with the Signal Media news dataset. The differences illustrate that the Lucene version either fails to offer any stem (e.g. employees keeps the s) or offers a nonsensical stem (dresses becomes dresse, which means a search for that wouldn't match dress).

Plural count proposed new stem count Lucene stem (blank if not stemmed) count
employees 32063 employee 12965
refugees 17323 refugee 12865
sees 12675 see 183619
fees 11210 fee 11549
degrees 8705 degree 21259
ties 8596 tie 10507 ty 1234
lies 8232 lie 6590 ly 157
shoes 8173 shoe 2902
trees 7550 tree 9160
attendees 6994 attendee 493
heroes 5076 hero 8280
guarantees 4877 guarantee 12974
dies 3967 die 13662 dy 133
agrees 3806 agree 18583
tomatoes 2507 tomato 2133
oversees 2491 oversee 2879
committees 2451 committee 27464
yankees 2445 yankee 963
knees 2390 knee 9674
woes 2365 wo 283
potatoes 2271 potato 2530
nominees 2050 nominee 2900
trustees 1888 trustee 1491
toes 1693 to 917321
foes 1246 fo 358
bees 1215 bee 2001
echoes 1192 echo 2485
retirees 1067 retiree 472
referees 1006 referee 3752
pies 927 pie 3091 py 43
brees 814 bree 116
franchisees 778 franchisee 476
disagrees 776 disagree 2973
honorees 666 honoree 335
superheroes 647 superhero 1554
rupees 643 rupee 855
detainees 637 detainee 262
devotees 634 devotee 159
frees 588 free 116658
mosquitoes 541 mosquito 744
tees 530 tee 1876
undergoes 495 undergo 3492
trainees 487 trainee 633
licensees 468 licensee 567
entrees 434 entree 350
rees 418 ree 88
coffees 413 coffee 11426
aes 398 ae 450
lees 393 lee 13865
volcanoes 353 volcanoe 0
tornadoes 336 tornado 688
paes 328 pae 28
appointees 327 appointee 176
toffees 292 toffee 175
evacuees 246 evacuee 37
foresees 241 foresee 769
buffaloes 240 buffalo 4746
businesses 38694 business 143764 businesse 0
matches 16535 match 39680 matche 1
processes 14592 process 69044 processe 2
losses 13164 loss 42859 losse 3
classes 11946 class 48412 classe 23
passes 10998 pass 29525 passe 42
taxes 10477 tax 24595 taxe 4
launches 8251 launch 32948 launche 3
coaches 8189 coach 35967 coache 2
addresses 7491 address 48607 addresse 0
approaches 6736 approach 36165 approache 2
witnesses 5929 witness 8866 witnesse 1
inches 5675 inch 9864 inche 7
boxes 5351 box 26638 boxe 8
wishes 5284 wish 20877 wishe 3
reaches 4684 reach 45691 reache 5
dishes 4470 dish 4548 dishe 0
catches 4186 catch 17512 catche 0
branches 3606 branch 8197 branche 3
touches 3598 touch 22720 touche 208
weaknesses 3406 weakness 5481 weaknesse 0
clashes 3388 clash 8264 clashe 0
teaches 3384 teach 8592 teache 1
discusses 3301 discuss 25244 discusse 0
churches 3166 church 21364 churche 1
successes 3071 success 47715 successe 1
finishes 3039 finish 21923 finishe 1
glasses 3036 glass 12610 glasse 0
watches 2966 watch 56591 watche 5
speeches 2832 speech 18023 speeche 0
searches 2829 search 31303 searche 0
breaches 2739 breach 5148 breache 0
beaches 2641 beach 19419 beache 0
bosses 2609 boss 13511 bosse 25
masses 2594 mass 20427 masse 499
dresses 2502 dress 9588 dresse 0
pitches 2426 pitch 11668 pitche 0
publishes 2376 publish 5568 publishe 2
illnesses 2316 illness 6664 illnesse 0
sandwiches 2203 sandwich 2630 sandwiche 1
pushes 2188 push 21428 pushe 1
ashes 2179 ash 2203 ashe 1418
crashes 2143 crash 12997 crashe 3
misses 2070 miss 25704 misse 2
stretches 1964 stretch 8756 stretche 1
indexes 1920 index 19179 indexe 2
switches 1769 switch 9901 switche 0
crosses 1692 cross 27398 crosse 221
encompasses 1630 encompass 1067 encompasse 0
sunglasses 1551 sunglass 76 sunglasse 0
patches 1523 patch 4377 patche 0
stresses 1511 stress 11000 stresse 0
fixes 1418 fix 10755 fixe 68
possesses 1384 possess 3022 possesse 1
progresses 1298 progress 23792 progresse 0
expresses 1296 express 15843 expresse 2
punches 1227 punch 4798 punche 0
lunches 1169 lunch 10838 lunche 2
actresses 1165 actress 10873 actresse 0
establishes 1163 establish 11754 establishe 1
flashes 1150 flash 7612 flashe 1
mixes 970 mix 18027 mixe 3
kisses 916 kiss 3228 kisse 0
riches 890 rich 21176 riche 59
sketches 889 sketch 2123 sketche 0
batches 888 batch 3122 batche 0
bushes 886 bush 8213 bushe 2
rushes 864 rush 8995 rushe 7
assesses 825 assess 6807 assesse 4
presses 820 press 82998 presse 1068
benches 810 bench 9126 benche 0
brushes 738 brush 3491 brushe 0
parishes 733 parish 3146 parishe 0
lashes 697 lash 516 lashe 0
stitches 694 stitch 688 stitche 0
scratches 648 scratch 3301 scratche 0
trenches 618 trench 547 trenche 0
peaches 618 peach 1247 peache 0
complexes 617 complex 24476 complexe 4
marches 608 march 30547 marche 48
foxes 591 fox 14241 foxe 3
washes 565 wash 5493 washe 0
mattresses 556 mattress 931 mattresse 0
witches 540 witch 1492 witche 0
dismisses 536 dismiss 2207 dismisse 1
harnesses 484 harness 2438 harnesse 0
glitches 464 glitch 564 glitche 1
clutches 448 clutch 1770 clutche 0
excesses 438 excess 6456 excesse 0
researches 414 research 87327 researche 0
sexes 378 sex 16266 sexe 6
messes 335 mess 5290 messe 145
impresses 325 impress 2589 impresse 2
diminishes 321 diminish 1061 diminishe 0
niches 296 nich 6 niche 5715
notches 277 notch 2482 notche 0

@softwaredoug
Contributor

softwaredoug commented Feb 19, 2020

@markharwood - thanks for the effort put into analyzing this! As a temporary workaround, I gathered your corrected misstems in a synonyms file, here

@markharwood
Contributor Author

@softwaredoug I need to push this. Re your synonyms - note that there's a small amount of collateral damage in this stemming that you probably want to fix in your synonyms file - toes -> to, woes -> wo and foes -> fo
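
A sketch of how such a workaround list might be generated while leaving out the collateral cases. The `corrections` mapping and the helper `synonym_rules` are hypothetical and are not the contents of the linked file; the `plural => singular` lines use the explicit-mapping form of the Solr synonym format, which the Elasticsearch synonym token filter accepts.

```python
def synonym_rules(corrections, skip=frozenset()):
    """Build explicit-mapping synonym lines of the form 'plural => singular'."""
    return [
        f"{plural} => {stem}"
        for plural, stem in sorted(corrections.items())
        if plural not in skip
    ]

# Hypothetical sample of corrected stems, including the three bad ones noted above.
corrections = {
    "dresses": "dress",
    "watches": "watch",
    "boxes": "box",
    "toes": "to",
    "woes": "wo",
    "foes": "fo",
}
collateral = {"toes", "woes", "foes"}

rules = synonym_rules(corrections, skip=collateral)
for line in rules:
    print(line)
```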

@softwaredoug
Contributor

thanks, fixed!

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@javanna javanna added the priority:normal A label for assessing bug priority to be used by ES engineers label Jul 18, 2024
7 participants