English-minimal analyzer has bad plural stemming #42892

markharwood · 2019-06-05T13:20:50Z

Benchmarks on real data have steered me towards this token filter as other forms of stemmer are generally too aggressive for ecommerce (e.g. loafers==loaf).
Good plural-stemming is ideally what is required because most user searches are plural and yet product descriptions are singular (e.g. "dresses" search should match product "red dress").

Good examples of plural stemming by this existing filter include:

Search string	Good stemmed form
`cases`	`case`
`shades`	`shade`
`bottles`	`bottle`

However, these terms fail to match because of bad stemming:

Search string	Bad stemmed form
`dresses`	`dresse`
`watches`	`watche`
`brushes`	`brushe`
`boxes`	`boxe`

Example reproduction:

DELETE test
PUT test
{
  "settings": {
	"number_of_shards": 1,
	"number_of_replicas": 0,
	"analysis": {
	  "analyzer": {
		"my_analyzer": {
		  "tokenizer": "standard",
		  "filter": [
			"lowercase",
			"filter_english_minimal"
		  ]
		}
	  },
	  "filter": {
		"filter_english_minimal": {
		  "type": "stemmer",
		  "name": "minimal_english"
		}
	  }
	}
  },
  "mappings": {
	"_doc": {
	  "properties": {
		"name": {
		  "type": "text",
		  "analyzer": "my_analyzer"
		}
	  }
	}
  }
}


POST test/_doc/1
{
  "name":"red dress"  
}

# Does not match (search stems to "dresse")
GET test/_search
{
  "query":{
	"match":{
	  "name":"dresses"
	}
  }
}

Solution

It would be good to fix these poor examples of stemming, but we would obviously need to worry about backwards compatibility.


elasticmachine · 2019-06-05T13:20:52Z

Pinging @elastic/es-search

jpountz · 2019-06-05T13:40:20Z

Is there a general rule, we can't just remove the e before the s all the time? E.g. forces, votes, xylophones, etc.

If this is a hard problem, maybe an alternative would be to recommend using a synonym filter for those terms that get frequently misstemmed. A good list to start with could be added to our docs.

If we want to change the behavior of this stemmer, I'd rather make a new one since this one is a direct implementation of a stemmer that is documented in a paper.

markharwood · 2019-06-05T13:45:20Z

> Is there a general rule, we can't just remove the e before the s all the time?

I can do some digging, but for a start I would expect *ss, *tch, *x and *sh to be patterns where the es part of a plural can always be removed.

For reference - crossword solvers:
*sses examples
*tches examples
*shes examples
*xes examples
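As a rough illustration of the four patterns above, a minimal Python sketch might look like the following. The helper name `strip_es_plural` and the constant are hypothetical and are not taken from any Lucene or Elasticsearch code; the sketch only covers these four suffixes and leaves every other word untouched.

```python
# Hypothetical constant: plural endings where the whole "es" is dropped.
ES_PLURAL_SUFFIXES = ("sses", "tches", "xes", "shes")


def strip_es_plural(word: str) -> str:
    """Drop a trailing "es" for the four patterns; return other words unchanged."""
    if word.endswith(ES_PLURAL_SUFFIXES):
        return word[:-2]
    return word


if __name__ == "__main__":
    for term in ("dresses", "watches", "boxes", "brushes", "forces"):
        print(term, "->", strip_es_plural(term))
```

Words such as forces or votes do not end in any of the four suffixes, so this helper returns them as they are and leaves them to whatever other rule handles a plain trailing s.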

markharwood · 2019-06-05T14:15:09Z

> If we want to change the behavior of this stemmer, I'd rather make a new one

Probably for a different issue, but it would be good to consider how we name token filters and deal with BWC.
We end up with a proliferation of analyzer/tokenizer names, e.g. "english_light", "english_minimal" and now perhaps "english_better_minimal", as we evolve.
Perhaps it would be useful to let the names convey the intention (e.g. stemming plurals only) but also include a version-number component that allows us to evolve the details of the implementation, e.g. english_minimal_2019.

markharwood · 2019-06-14T11:27:04Z

> this one is a direct implementation of a stemmer that is documented in a paper.

According to the javadocs it's called "S stemmer"

I took the signal media million news dataset and used this script to benchmark my proposals. I measured the recall gain that could be had from removing the extra "e" for a number of suffixes.
In the tables below it's important to note that the figures for Proposed stem count are all the false-negatives we have today when using the S-stemmer so there's a lot of uplift for what I hope is a little loss.
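The benchmarking script itself is not reproduced in this thread. As a sketch of the kind of counting that could produce one row of the tables, assuming whitespace tokenization and document (rather than term) frequencies, both of which are assumptions, something like this would do; the function names are hypothetical.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each lowercased token, how many documents contain it."""
    df = Counter()
    for text in docs:
        # set() so a token is counted once per document
        df.update(set(text.lower().split()))
    return df


def compare_stems(plural, proposed_stem, s_stem, df):
    """Build one row: plural, count, proposed stem, count, S-stemmer stem, count."""
    return (plural, df[plural], proposed_stem, df[proposed_stem], s_stem, df[s_stem])


if __name__ == "__main__":
    docs = ["red dress", "two dresses", "a dress for summer"]
    print(compare_stems("dresses", "dress", "dresse", document_frequencies(docs)))
```

A large count for the proposed stem next to a zero or tiny count for the S-stemmer form is what the tables read as a recall gain: documents containing the singular become reachable from the plural query.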

sses -> ss effects

This was a very positive uplift in recall for the most popular terms. Most of the rare non-zero matches for the S stemmer forms are questionable: presses is rarely likely to appear in English as the plural of presse. The valid exceptions that retain the e are crevasse and posse.
Popular wins for typical ecommerce site searches might be the various dresses, mattresses they sell.

Plural	count	Proposed stem	Proposed stem count	S stemmer	S stemmer count
businesses	38698	business	143764
processes	14592	process	69044
losses	13164	loss	42859
classes	11946	class	48412	classe	23
passes	10998	pass	29525	passe	42
addresses	7491	address	48607
witnesses	5929	witness	8866
weaknesses	3406	weakness	5481
discusses	3301	discuss	25244
successes	3071	success	47715
glasses	3036	glass	12610
bosses	2609	boss	13511	bosse	25
masses	2594	mass	20427	masse	499
dresses	2502	dress	9588
illnesses	2316	illness	6664
misses	2070	miss	25704
crosses	1692	cross	27400	crosse	221
encompasses	1630	encompass	1067
sunglasses	1551	sunglass	76
stresses	1511	stress	11000
possesses	1384	possess	3022
progresses	1298	progress	23792
expresses	1296	express	15843
actresses	1165	actress	10873
kisses	916	kiss	3228
assesses	825	assess	6807	assesse	4
presses	820	press	82998	presse	1068
mattresses	556	mattress	931
dismisses	536	dismiss	2207
harnesses	484	harness	2438
excesses	438	excess	6456
grasses	409	grass	4848	grasse	23
surpasses	408	surpass	1397
confesses	366	confess	951
guesses	350	guess	11651
messes	335	mess	5290	messe	145
impresses	325	impress	2589
princesses	323	princess	3624	princesse	9
eyewitnesses	321	eyewitness	971
tosses	317	toss	3223
asses	233	ass	2104	asse	19
accesses	224	access	66139
molasses	218	molass	0
tresses	217	tress	59
eyeglasses	209	eyeglass	86
carcasses	189	carcass	270
blesses	184	bless	2349
suppresses	156	suppress	971
recesses	152	recess	884
ulysses	143	ulyss	0	ulysse	13
waitresses	141	waitress	483
mistresses	137	mistress	539
goddesses	137	goddess	699
gasses	132	gass	73
busses	118	buss	84	busse	20
congresses	116	congress	20118
thicknesses	112	thickness	1096
bypasses	105	bypass	1559
compresses	104	compress	189
masterclasses	100	masterclass	329
posses	91	poss	36	posse	252
professes	89	profess	287
glosses	87	gloss	1024
likenesses	86	likeness	621
hostesses	80	hostess	383
pisses	70	piss	402
overpasses	69	overpass	352
grosses	64	gross	9838	grosse	104
trusses	62	truss	429
disses	59	diss	226	disse	10
lionesses	58	lioness	74
agribusinesses	58	agribusiness	467
basses	56	bass	3483	basse	17
subclasses	49	subclass	69
obsesses	44	obsess	171
represses	42	repress	89
buttresses	41	buttress	137
trespasses	40	trespass	376
crevasses	40	crevass	0	crevasse	39
plusses	40	pluss	11
caresses	36	caress	115
depresses	33	depress	180
amasses	33	amass	273
headdresses	32	headdress	83
embarrasses	32	embarrass	396
stewardesses	31	stewardess	44
sundresses	29	sundress	42

shes -> sh effects

Another positive uplift. Disagreements with the S stemmer, like ashe or rushe, tended to be people's names that again would not account for the majority of the plural forms.
Popular wins for typical ecommerce site searches might be the various brushes, dishes.

Plural	count	Proposed stem	Proposed stem count	S stemmer	S stemmer count
wishes	5284	wish	20877
dishes	4470	dish	4548
clashes	3388	clash	8264
finishes	3039	finish	21923
publishes	2376	publish	5568
pushes	2188	push	21428
ashes	2179	ash	2203	ashe	1418
crashes	2143	crash	12997
establishes	1163	establish	11758
flashes	1150	flash	7612
bushes	886	bush	8213
rushes	864	rush	8995	rushe	7
brushes	738	brush	3491
parishes	733	parish	3146
lashes	697	lash	516
washes	565	wash	5493
distinguishes	364	distinguish	1718
flourishes	347	flourish	1611
unleashes	344	unleash	1235
diminishes	321	diminish	1061
fishes	286	fish	9491
crushes	282	crush	2946
accomplishes	256	accomplish	4052
blemishes	254	blemish	328
refreshes	224	refresh	2162
skirmishes	220	skirmish	256
relishes	215	relish	862
smashes	214	smash	2139
splashes	204	splash	2338
dashes	197	dash	2802	dashe	7
rashes	172	rash	1290
eyelashes	171	eyelash	107
punishes	168	punish	1522
toothbrushes	161	toothbrush	383
radishes	154	radish	173
polishes	152	polish	2726
slashes	151	slash	1729
marshes	149	marsh	1876
gushes	142	gush	203
nourishes	140	nourish	419
blushes	131	blush	714
vanishes	124	vanish	412
cherishes	111	cherish	1035
leashes	110	leash	584
meshes	87	mesh	1432
languishes	82	languish	220
furnishes	81	furnish	512
flushes	80	flush	1162
bashes	79	bash	1653	bashe	7
replenishes	79	replenish	462
squashes	68	squash	1143
sashes	66	sash	219
garnishes	66	garnish	700
ambushes	64	ambush	610
trashes	60	trash	3324
quashes	59	quash	225
hashes	56	hash	783
demolishes	56	demolish	482
mouthwashes	49	mouthwash	188
backsplashes	48	backsplash	135
tarnishes	43	tarnish	311
paintbrushes	41	paintbrush	166
admonishes	39	admonish	53
cashes	39	cash	30046
varnishes	35	varnish	175
fetishes	34	fetish	256
stashes	34	stash	869
perishes	32	perish	392
banishes	31	banish	261
whitewashes	29	whitewash	297
refurbishes	28	refurbish	209
fleshes	28	flesh	2287
gashes	28	gash	253
abolishes	25	abolish	800
mashes	23	mash	933
brandishes	21	brandish	56
embellishes	17	embellish	152
thrashes	14	thrash	397
rehashes	14	rehash	98
thrushes	13	thrush	75
dervishes	13	dervish	30
extinguishes	13	extinguish	541
lavishes	10	lavish	1283
noshes	9	nosh	75
vanquishes	9	vanquish	78
sloshes	7	slosh	18
reestablishes	7	reestablish	108
swashes	7	swash	15
refinishes	7	refinish	72
knishes	5	knish	9
stoushes	5	stoush	63

tches -> tch effects

Another good uplift in recall for popular terms.
Disagreement with the S-stemmer is limited to WATCHe, which looks to be a brand name and certainly should not be seen as the stem of the popular watches term.
These words are mostly nouns and would benefit most ecommerce-type searches.

Plural	count	Proposed stem	Proposed stem count	S stemmer	S stemmer count
matches	16535	match	39680
catches	4186	catch	17512
watches	2966	watch	56591	watche	5
pitches	2426	pitch	11668
stretches	1964	stretch	8756
switches	1769	switch	9901
patches	1523	patch	4377
sketches	889	sketch	2123
batches	888	batch	3122
stitches	694	stitch	688
scratches	648	scratch	3301
witches	540	witch	1492
smartwatches	518	smartwatch	966
glitches	464	glitch	564
clutches	448	clutch	1770
crutches	402	crutch	148
ditches	289	ditch	2272
dispatches	285	dispatch	2112
notches	277	notch	2482
bitches	177	bitch	1002
hatches	172	hatch	1152
swatches	170	swatch	181
mismatches	132	mismatch	481
latches	124	latch	519
snatches	107	snatch	831
hitches	101	hitch	715
fetches	77	fetch	926
wristwatches	39	wristwatch	109
britches	35	britch	1
blotches	33	blotch	10
twitches	33	twitch	518
rematches	26	rematch	846
itches	24	itch	340
splotches	20	splotch	8
crotches	17	crotch	182
scotches	15	scotch	638
cwtches	15	cwtch	5
etches	14	etch	122
masterbatches	14	masterbatch	18
wretches	13	wretch	44
snitches	12	snitch	74
despatches	12	despatch	125
botches	9	botch	86
iwatches	7	iwatch	42

xes -> x effects

Reasonable uplift from the S stemmer.
Exceptions like taxe and boxe are, again, names or acronyms that wouldn't be the usual interpretation of the plural form from which they stem.
Ecommerce sites will benefit from matches on the various boxes they sell (gearboxes to jewellery boxes). One noticeable false stem is axes, which the proposed rule reduces to ax rather than axe.

Plural	count	Proposed stem	Proposed stem count	S stemmer	S stemmer count
taxes	10477	tax	24595	taxe	4
boxes	5351	box	26638	boxe	8
indexes	1920	index	19179
fixes	1418	fix	10755	fixe	68
mixes	970	mix	18027
complexes	617	complex	24480	complexe	4
foxes	591	fox	14241
sixes	398	six	87451
sexes	378	sex	16266	sexe	6
axes	281	ax	371	axe	963
remixes	252	remix	1153
exes	238	ex	17191	exe	112
relaxes	195	relax	3978
reflexes	191	reflex	343
mailboxes	182	mailbox	837
hoaxes	135	hoax	1187
inboxes	114	inbox	3963
annexes	111	annex	514	annexe	46
waxes	109	wax	1122
multiplexes	105	multiplex	185
gearboxes	86	gearbox	646
flexes	72	flex	1475
faxes	72	fax	3908
lunchboxes	62	lunchbox	234
duplexes	56	duplex	367
paradoxes	54	paradox	566
tuxes	49	tux	192
climaxes	45	climax	1000
sandboxes	45	sandbox	330
influxes	34	influx	4764
maxes	32	max	8292
prefixes	30	prefix	114
coaxes	29	coax	287
toolboxes	28	toolbox	364
nixes	25	nix	222
premixes	25	premix	25
vortexes	24	vortex	381
fluxes	23	flux	650
suplexes	22	suplex	64
shoeboxes	22	shoebox	106
equinoxes	22	equinox	479
vexes	22	vex	57
hotfixes	21	hotfix	38
connexes	20	connex	16
xerxes	18	xerx	6
alexes	18	alex	12931
suffixes	17	suffix	67
checkboxes	16	checkbox	277
bugfixes	15	bugfix	7
crucifixes	15	crucifix	168
jukeboxes	14	jukebox	141
letterboxes	13	letterbox	83
saxes	13	sax	219	saxe	45
subindexes	13	subindex	34
hexes	13	hex	171
suezmaxes	12	suezmax	69
unboxes	12	unbox	31
perplexes	12	perplex	21
affixes	11	affix	108
detoxes	11	detox	431
pickaxes	9	pickax	5	pickaxe	8
rolexes	9	rolex	285
apexes	8	apex	1214
xboxes	8	xbox	2671
praxes	8	prax	1
aframaxes	8	aframax	16
cineplexes	7	cineplex	68
appendixes	6	appendix	878
flummoxes	6	flummox	9
panamaxes	6	panamax	104
boomboxes	6	boombox	40
transfixes	6	transfix	15
jinxes	5	jinx	244
textboxes	5	textbox	17
muxes	4	mux	20

markharwood · 2019-06-17T08:51:36Z

Another bizarre choice in the S-stemmer is to avoid any stemming of ees (i.e. the trailing s is not removed).
Tests on the Signal Media 1m news dataset show this overlooks a lot of valid words.
The only ees words that are not plurals here are raees (a Bollywood movie) and drees and dees (names), all of which, when indexed in stemmed form, would not clash with other common English words. I recommend making this change to the original S-stemmer algorithm too.

Plural	count	Proposed stem	Proposed stem count
employees	32064	employee	12965
refugees	17323	refugee	12865
sees	12675	see	183619
fees	11210	fee	11549
degrees	8705	degree	21259
trees	7550	tree	9160
attendees	6994	attendee	493
guarantees	4877	guarantee	12974
agrees	3806	agree	18583
oversees	2491	oversee	2879
committees	2451	committee	27464
yankees	2445	yankee	963
knees	2390	knee	9674
nominees	2050	nominee	2900
trustees	1888	trustee	1491
bees	1215	bee	2001
retirees	1067	retiree	472
referees	1006	referee	3752
brees	814	bree	116
franchisees	778	franchisee	476
disagrees	776	disagree	2973
honorees	666	honoree	335
rupees	643	rupee	855
detainees	637	detainee	262
devotees	634	devotee	159
frees	588	free	116658
tees	530	tee	1876
trainees	487	trainee	633
licensees	468	licensee	567
entrees	434	entree	350
rees	418	ree	88
coffees	413	coffee	11426
lees	393	lee	13865
appointees	327	appointee	176
toffees	292	toffee	175
evacuees	246	evacuee	37
foresees	241	foresee	769
flees	226	flee	3335
decrees	218	decree	945
inductees	212	inductee	469
awardees	203	awardee	113
threes	186	three	194122
returnees	178	returnee	72
chimpanzees	173	chimpanzee	94
grantees	169	grantee	100
interviewees	162	interviewee	91
enrollees	156	enrollee	34
invitees	139	invitee	48
escapees	122	escapee	70
pharisees	113	pharisee	41
honeybees	112	honeybee	85
absentees	112	absentee	288
burpees	108	burpee	64
amputees	84	amputee	201
divorcees	77	divorcee	123
gees	74	gee	670
lessees	72	lessee	86
emcees	66	emcee	321
pedigrees	65	pedigree	971
humvees	63	humvee	62
soirees	60	soiree	161
maccabees	53	maccabee	15
sarees	50	saree	58
manatees	50	manatee	133
elysees	50	elysee	86
dees	49	dee	1223
marquees	47	marquee	1350
loanees	44	loanee	183
signees	43	signee	83
pees	41	pee	534
mentees	41	mentee	58
purees	40	puree	438
monkees	39	monkee	4
kees	38	kee	153
jaycees	35	jaycee	24
bumblebees	33	bumblebee	76
fugees	33	fugee	1
transferees	19	transferee	27
drees	12	dree	12

markharwood · 2019-06-17T10:15:11Z

Another proposal:
The *ies -> *y rule should only apply to words longer than 4 characters.
In tests on the news dataset this proposal loses nothing but gains matches on pies -> pie, lies->lie and ties->tie.
The only 2 letter *y word of consequence in English is by and does not have a plural.
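A minimal sketch of this length guard, assuming the rule is applied to a single lowercased token: words of more than 4 characters take ies -> y, while shorter ones only lose the trailing s. The function name `stem_ies` is hypothetical, and the sketch deliberately ignores any other exceptions an existing ies rule may carry.

```python
def stem_ies(word: str) -> str:
    """Apply ies -> y only to words longer than 4 characters."""
    if not word.endswith("ies"):
        return word
    if len(word) > 4:
        return word[:-3] + "y"  # e.g. parties -> party
    return word[:-1]            # e.g. pies -> pie


if __name__ == "__main__":
    for term in ("pies", "lies", "ties", "parties"):
        print(term, "->", stem_ies(term))
```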

markharwood · 2019-06-17T13:37:25Z

oes plurals are not treated at all by the "S" stemmer.
I believe a positive uplift can be had by removing the es part of the suffix.
In the table below I contrast the popularity of the full oes term, an aggressive o stemmed form and a less aggressive oe stem in the million news articles dataset.
The aggressive stem back to the o word looks to be the most useful stemming rule to employ, and a small set of exceptions that should retain the e (canoe, shoe and oboe) could be maintained as a list in the code.
The use of an exception list could make this a more contentious rule, but it should be noted that the alternative is to stick with the current policy of not stemming oes suffixes at all, which produces a lot of false negatives.

Plural	count	es stemmed	count	s stemmed	count
shoes	8173	sho	255	shoe	2902
heroes	5076	hero	8280
tomatoes	2507	tomato	2133
potatoes	2271	potato	2530	potatoe	7
echoes	1192	echo	2485
superheroes	647	superhero	1554
mosquitoes	541	mosquito	744
undergoes	495	undergo	3492
volcanoes	353	volcano	668
tornadoes	336	tornado	688
buffaloes	240	buffalo	4746	buffaloe	4
cargoes	238	cargo	3595
throes	237	thro	27
zeroes	179	zero	13047
vetoes	157	veto	1903
canoes	147	cano	221	canoe	531
mangoes	144	mango	902
dominoes	113	domino	524
faroes	103	faro	195	faroe	342
negroes	85	negro	263
horseshoes	85			horseshoe	385
torpedoes	83	torpedo	202
frescoes	73	fresco	330
embargoes	70	embargo	1007
kroes	49	kro	27
backhoes	48			backhoe	73
mementoes	41	memento	275
tiptoes	33			tiptoe	53
floes	27	flo	277	floe	11
outdoes	27	outdo	222
simoes	26	simo	22
marloes	24	marlo	104
dingoes	22	dingo	31
forgoes	21	forgo	441
briscoes	18	brisco	9	briscoe	227
cohoes	16	coho	50
commandoes	14	commando	293
snowshoes	14			snowshoe	20
undoes	14	undo	665
avocadoes	14	avocado	981
mottoes	13	motto	1345
antiheroes	13	antihero	54
siloes	13	silo	253
foregoes	13	forego	271
flamingoes	12	flamingo	233
overdoes	12	overdo	183
sloes	11	slo	145	sloe	25
ghettoes	10	ghetto	300
gittoes	10	gitto	4
innuendoes	10	innuendo	250
manifestoes	9	manifesto	1041
haloes	9	halo	889
coxconservesheroes	8	coxconserveshero	8
aloes	8	alo	41	aloe	283
grottoes	7	grotto	150
ciscoes	7	cisco	2340
acoes	7	aco	113
desperadoes	6	desperado	43
sheroes	6	shero	53
peccadilloes	6	peccadillo	7
erdoes	6	erdo	13
weirdoes	6	weirdo	171
tahoes	5	taho	24	tahoe	433
supervolcanoes	5
ringoes	5	ringo	252
oboes	5	obo	32	oboe	43
domingoes	5	domingo	391
porticoes	3	portico	122
sermoheroes	3
vidoes	3
faeroes	3			faeroe	6
fiascoes	3	fiasco	573
hammertoes	3			hammertoe	6
ricardoes	3	ricardo	1041
groes	3	gro	108
croes	3	cro	274
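The oes rule with an exception list, as described above the table, could be sketched roughly as follows. The names `OE_EXCEPTIONS` and `stem_oes` are hypothetical, and matching exceptions on the whole word is an assumption made here for simplicity, not something the thread specifies.

```python
# Hypothetical exception list: plurals whose singular keeps the "e".
OE_EXCEPTIONS = frozenset({"canoes", "shoes", "oboes"})


def stem_oes(word: str) -> str:
    """Stem *oes back to *o, except for listed words, which only lose the s."""
    if not word.endswith("oes"):
        return word
    if word in OE_EXCEPTIONS:
        return word[:-1]  # shoes -> shoe
    return word[:-2]      # tomatoes -> tomato


if __name__ == "__main__":
    for term in ("tomatoes", "heroes", "shoes", "canoes", "volcanoes"):
        print(term, "->", stem_oes(term))
```

Whole-word matching means compounds such as horseshoes or snowshoes would need their own entries. Matching exceptions by suffix would cover those compounds, but a suffix test on canoes would also match volcanoes, so either choice needs care.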

markharwood · 2019-06-17T15:39:10Z

It's possible that the EnglishMinimalStemmer's implementation of the original algorithm has a bug.

This is the original S-stemmer description:

The notes accompanying the table state:

"the first applicable rule encountered is the only one used"

For the ees and oes suffixes I think EnglishMinimalStemmer misinterpreted the rule logic and consequently bees != bee and tomatoes != tomato. The oes and ees suffixes are left intact.

"The first applicable rule" for ees could be interpreted as rule 2 or 3 in the table depending on if you take applicable to mean "the THEN part of the rule has fired" or just that the suffix was referenced in the rule. EnglishMinimalStemmer assumed the latter and I think it should be the former. We should fall through into rule 3 for ees and oes (remove any trailing S). That's certainly the conclusion I came to independently testing on real data.

I notice this implementation of the S-stemmer makes the same mistake. (Perhaps our Java version was a port of this JavaScript, or vice versa?)

@jpountz I've been working on a new TokenFilter but what does this ees/oes discovery mean for the existing EnglishMinimalStemmer code if it falls short of its goal in faithfully implementing the original paper?
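To make the two readings of "first applicable rule" concrete, here is a Python sketch of both. The rule table referenced above did not survive in this thread, so the three rules below are written from memory of the paper's S-stemmer table and should be treated as an assumption; the function names are hypothetical and neither function is a port of the Lucene code.

```python
def s_stem_strict(word: str) -> str:
    """Reading 1: a rule is 'applicable' as soon as its suffix matches,
    even if its exception list then blocks the rewrite."""
    if word.endswith("ies"):
        if word.endswith(("eies", "aies")):
            return word
        return word[:-3] + "y"
    if word.endswith("es"):
        if word.endswith(("aes", "ees", "oes")):
            return word          # bees and tomatoes stop here, unchanged
        return word[:-1]         # "es" -> "e"
    if word.endswith("s"):
        if word.endswith(("us", "ss")):
            return word
        return word[:-1]
    return word


def s_stem_fallthrough(word: str) -> str:
    """Reading 2: a rule is 'applicable' only if its THEN part fires;
    otherwise the next rule gets a chance."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # bees -> bee, tomatoes -> tomatoe
    return word


if __name__ == "__main__":
    for term in ("bees", "tomatoes", "dresses", "ties"):
        print(term, s_stem_strict(term), s_stem_fallthrough(term))
```

Under the fall-through reading bees becomes bee, while both readings still turn dresses into dresse and ties into ty, which is why the other proposals in this thread are needed in addition.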

markharwood · 2019-07-04T15:02:27Z

Ches rules:

Looks like the es can be dropped, with exceptions for a small number of words adopted into English, such as cliche, quiche and avalanche, which retain the e.

Plural	count	Proposed stem	Proposed stem count	Retain E	Retain E count
matches	16535	match	39680
launches	8251	launch	32948
coaches	8189	coach	35967
approaches	6736	approach	36169
inches	5675	inch	9864	inche	7
reaches	4684	reach	45691	reache	5
catches	4186	catch	17512
branches	3606	branch	8197
touches	3598	touch	22720	touche	208
teaches	3384	teach	8592
churches	3166	church	21364
watches	2966	watch	56591	watche	5
speeches	2832	speech	18023
searches	2829	search	31303
breaches	2739	breach	5148
beaches	2641	beach	19419
pitches	2426	pitch	11668
sandwiches	2203	sandwich	2630
stretches	1964	stretch	8756
switches	1769	switch	9901
headaches	1565			headache	1990
patches	1523	patch	4377
punches	1227	punch	4798
lunches	1169	lunch	10838
riches	890	rich	21176	riche	59
sketches	889	sketch	2123
batches	888	batch	3122
benches	810	bench	9126
stitches	694	stitch	688
scratches	648	scratch	3301
trenches	618	trench	547
peaches	618	peach	1247
marches	608	march	30547	marche	48
witches	540	witch	1492
smartwatches	518	smartwatch	966
attaches	496	attach	1896	attache	85
arches	473	arch	1973	arche	15
glitches	464	glitch	564
clutches	448	clutch	1770
pouches	417	pouch	628
researches	414	research	87327
crutches	402	crutch	148
niches	296	nich	6	niche	5715
ditches	289	ditch	2272
dispatches	285	dispatch	2112
notches	277	notch	2482
preaches	271	preach	891
couches	260	couch	2131	couche	45
tranches	253			tranche	469
torches	222	torch	827
bunches	217	bunch	7482	bunche	25
enriches	211	enrich	1720
backbenches	208	backbench	485
ranches	181	ranch	2495
bitches	177	bitch	1002
hatches	172	hatch	1152
swatches	170	swatch	181
cliches	166	clich	20	cliche	332
cockroaches	164	cockroach	79
crunches	152	crunch	2093
porches	143	porch	1408	porche	17
pooches	138	pooch	328
caches	133			cache	983
mismatches	132	mismatch	481
starches	129	starch	400
latches	124	latch	519
clinches	122	clinch	1744
porsches	113			porsche	2109
snatches	107	snatch	831
avalanches	105			avalanche	722
hitches	101	hitch	715
perches	97	perch	623
roaches	94	roach	372	roache	20
wrenches	93	wrench	351
finches	89	finch	860
pinches	80	pinch	2441	pinche	6
fetches	77	fetch	926
leeches	68	leech	109
brunches	66	brunch	924
lurches	63	lurch	239
mustaches	44			mustache	314
relaunches	44	relaunch	362
apaches	44			apache	889
breeches	42	breech	108
brooches	41	brooch	134
slouches	40	slouch	171
wristwatches	39	wristwatch	109
winches	39	winch	143
heartaches	38			heartache	516
psyches	37	psych	305	psyche	581
moustaches	37			moustache	220
haunches	34	haunch	7
hunches	34	hunch	246
blotches	33	blotch	10
beseeches	33	beseech	56
twitches	33	twitch	518
smooches	31	smooch	82
quiches	31			quiche	100
deutsches	29	deutsch	643	deutsche	4088
encroaches	28	encroach	136
entrenches	27	entrench	112
rematches	26	rematch	846
goldfinches	26	goldfinch	51
flinches	26	flinch	180
roches	25	roch	60	roche	1329
outreaches	24	outreach	3569
beeches	23	beech	473
naches	23	nach	57
bleaches	21	bleach	411
detaches	20	detach	162
poaches	19	poach	151
birches	19	birch	601
impeaches	17	impeach	145
crouches	16	crouch	257
belches	16	belch	37
cwtches	15	cwtch	5
masterbatches	14	masterbatch	18
geocaches	13			geocache	16
cinches	12	cinch	140
stiches	12	stich	11
despatches	12	despatch	125
botches	9	botch	86

Drops the trailing “e” in taxes, dresses, watches etc. that otherwise causes mismatches between plural and singular forms. Closes elastic#42892

markharwood · 2019-07-17T15:05:24Z

Final comparison of results

Having heard back from the author of the paper on which Lucene's EnglishMinimalStemFilter is based, I've concluded that the S-Stemmer algorithm presented there has muddled logic and the implementation of it in Lucene is also buggy.
Below is a final round-up of the differences between the proposed #43248 stemmer and the Lucene filter, based on trials with the Signal Media news dataset. The differences illustrate that the Lucene version either fails to offer any stem (e.g. employees keeps the s) or offers a nonsensical stem (dresses becomes dresse, which means a search for that wouldn't match dress).

Plural	count	proposed new stem	count	Lucene stem (blank if not stemmed)	count
employees	32063	employee	12965
refugees	17323	refugee	12865
sees	12675	see	183619
fees	11210	fee	11549
degrees	8705	degree	21259
ties	8596	tie	10507	ty	1234
lies	8232	lie	6590	ly	157
shoes	8173	shoe	2902
trees	7550	tree	9160
attendees	6994	attendee	493
heroes	5076	hero	8280
guarantees	4877	guarantee	12974
dies	3967	die	13662	dy	133
agrees	3806	agree	18583
tomatoes	2507	tomato	2133
oversees	2491	oversee	2879
committees	2451	committee	27464
yankees	2445	yankee	963
knees	2390	knee	9674
woes	2365	wo	283
potatoes	2271	potato	2530
nominees	2050	nominee	2900
trustees	1888	trustee	1491
toes	1693	to	917321
foes	1246	fo	358
bees	1215	bee	2001
echoes	1192	echo	2485
retirees	1067	retiree	472
referees	1006	referee	3752
pies	927	pie	3091	py	43
brees	814	bree	116
franchisees	778	franchisee	476
disagrees	776	disagree	2973
honorees	666	honoree	335
superheroes	647	superhero	1554
rupees	643	rupee	855
detainees	637	detainee	262
devotees	634	devotee	159
frees	588	free	116658
mosquitoes	541	mosquito	744
tees	530	tee	1876
undergoes	495	undergo	3492
trainees	487	trainee	633
licensees	468	licensee	567
entrees	434	entree	350
rees	418	ree	88
coffees	413	coffee	11426
aes	398	ae	450
lees	393	lee	13865
volcanoes	353	volcanoe	0
tornadoes	336	tornado	688
paes	328	pae	28
appointees	327	appointee	176
toffees	292	toffee	175
evacuees	246	evacuee	37
foresees	241	foresee	769
buffaloes	240	buffalo	4746
businesses	38694	business	143764	businesse	0
matches	16535	match	39680	matche	1
processes	14592	process	69044	processe	2
losses	13164	loss	42859	losse	3
classes	11946	class	48412	classe	23
passes	10998	pass	29525	passe	42
taxes	10477	tax	24595	taxe	4
launches	8251	launch	32948	launche	3
coaches	8189	coach	35967	coache	2
addresses	7491	address	48607	addresse	0
approaches	6736	approach	36165	approache	2
witnesses	5929	witness	8866	witnesse	1
inches	5675	inch	9864	inche	7
boxes	5351	box	26638	boxe	8
wishes	5284	wish	20877	wishe	3
reaches	4684	reach	45691	reache	5
dishes	4470	dish	4548	dishe	0
catches	4186	catch	17512	catche	0
branches	3606	branch	8197	branche	3
touches	3598	touch	22720	touche	208
weaknesses	3406	weakness	5481	weaknesse	0
clashes	3388	clash	8264	clashe	0
teaches	3384	teach	8592	teache	1
discusses	3301	discuss	25244	discusse	0
churches	3166	church	21364	churche	1
successes	3071	success	47715	successe	1
finishes	3039	finish	21923	finishe	1
glasses	3036	glass	12610	glasse	0
watches	2966	watch	56591	watche	5
speeches	2832	speech	18023	speeche	0
searches	2829	search	31303	searche	0
breaches	2739	breach	5148	breache	0
beaches	2641	beach	19419	beache	0
bosses	2609	boss	13511	bosse	25
masses	2594	mass	20427	masse	499
dresses	2502	dress	9588	dresse	0
pitches	2426	pitch	11668	pitche	0
publishes	2376	publish	5568	publishe	2
illnesses	2316	illness	6664	illnesse	0
sandwiches	2203	sandwich	2630	sandwiche	1
pushes	2188	push	21428	pushe	1
ashes	2179	ash	2203	ashe	1418
crashes	2143	crash	12997	crashe	3
misses	2070	miss	25704	misse	2
stretches	1964	stretch	8756	stretche	1
indexes	1920	index	19179	indexe	2
switches	1769	switch	9901	switche	0
crosses	1692	cross	27398	crosse	221
encompasses	1630	encompass	1067	encompasse	0
sunglasses	1551	sunglass	76	sunglasse	0
patches	1523	patch	4377	patche	0
stresses	1511	stress	11000	stresse	0
fixes	1418	fix	10755	fixe	68
possesses	1384	possess	3022	possesse	1
progresses	1298	progress	23792	progresse	0
expresses	1296	express	15843	expresse	2
punches	1227	punch	4798	punche	0
lunches	1169	lunch	10838	lunche	2
actresses	1165	actress	10873	actresse	0
establishes	1163	establish	11754	establishe	1
flashes	1150	flash	7612	flashe	1
mixes	970	mix	18027	mixe	3
kisses	916	kiss	3228	kisse	0
riches	890	rich	21176	riche	59
sketches	889	sketch	2123	sketche	0
batches	888	batch	3122	batche	0
bushes	886	bush	8213	bushe	2
rushes	864	rush	8995	rushe	7
assesses	825	assess	6807	assesse	4
presses	820	press	82998	presse	1068
benches	810	bench	9126	benche	0
brushes	738	brush	3491	brushe	0
parishes	733	parish	3146	parishe	0
lashes	697	lash	516	lashe	0
stitches	694	stitch	688	stitche	0
scratches	648	scratch	3301	scratche	0
trenches	618	trench	547	trenche	0
peaches	618	peach	1247	peache	0
complexes	617	complex	24476	complexe	4
marches	608	march	30547	marche	48
foxes	591	fox	14241	foxe	3
washes	565	wash	5493	washe	0
mattresses	556	mattress	931	mattresse	0
witches	540	witch	1492	witche	0
dismisses	536	dismiss	2207	dismisse	1
harnesses	484	harness	2438	harnesse	0
glitches	464	glitch	564	glitche	1
clutches	448	clutch	1770	clutche	0
excesses	438	excess	6456	excesse	0
researches	414	research	87327	researche	0
sexes	378	sex	16266	sexe	6
messes	335	mess	5290	messe	145
impresses	325	impress	2589	impresse	2
diminishes	321	diminish	1061	diminishe	0
niches	296	nich	6	niche	5715
notches	277	notch	2482	notche	0

softwaredoug · 2020-02-19T02:09:26Z

@markharwood - thanks for the effort put into analyzing this! As a temporary workaround, I gathered your corrected misstems in a synonyms file, here

markharwood · 2020-02-19T10:07:01Z

@softwaredoug I need to push this. Re your synonyms - note that there's a small amount of collateral damage in this stemming that you probably want to fix in your synonyms file - toes -> to, woes -> wo and foes -> fo
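As a sketch of how such a synonyms file could be generated while leaving out the collateral-damage terms, the snippet below emits Solr-style explicit mappings (`plural => stem`), a format the Elasticsearch synonym token filter accepts. The function name and the sample pairs are hypothetical; the contents of the actual file linked above are not shown in this thread.

```python
def synonym_rules(pairs, skip=frozenset()):
    """Yield 'plural => stem' lines, omitting plurals listed in skip."""
    for plural, stem in pairs:
        if plural in skip or plural == stem:
            continue
        yield f"{plural} => {stem}"


if __name__ == "__main__":
    pairs = [("dresses", "dress"), ("toes", "to"), ("watches", "watch")]
    collateral = {"toes", "woes", "foes"}
    print("\n".join(synonym_rules(pairs, skip=collateral)))
```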

softwaredoug · 2020-02-19T12:25:41Z

thanks, fixed!

elasticsearchmachine · 2024-07-12T10:25:50Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

markharwood added the :Search Relevance/Analysis How text is split into tokens label Jun 5, 2019

markharwood self-assigned this Jun 5, 2019

markharwood added the >bug label Jun 5, 2019

markharwood linked a pull request Jun 14, 2019 that will close this issue

Analysis enhancement - better plural stemmer than minimal_english. #43248

Open

markharwood mentioned this issue Jun 17, 2019

s-stemmer deviates from paper? Yomguithereal/talisman#157

Open

rjernst added the Team:Search Meta label for search team label May 4, 2020

javanna unassigned markharwood Jun 20, 2022

nknize mentioned this issue Oct 11, 2022

Better plural stemmer than minimal_english opensearch-project/OpenSearch#4738

Merged

javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024

javanna added the priority:normal A label for assessing bug priority to be used by ES engineers label Jul 18, 2024
