-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support multiple tokens on LHS in stemmer_override rules (#56113) #56484
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
running => run | ||
running, runs => run | ||
|
||
stemmer => stemmer |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,7 +22,6 @@ | |
import org.apache.lucene.analysis.TokenStream; | ||
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter; | ||
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter.StemmerOverrideMap; | ||
import org.elasticsearch.common.Strings; | ||
import org.elasticsearch.common.settings.Settings; | ||
import org.elasticsearch.env.Environment; | ||
import org.elasticsearch.index.IndexSettings; | ||
|
@@ -57,19 +56,23 @@ public TokenStream create(TokenStream tokenStream) { | |
|
||
static void parseRules(List<String> rules, StemmerOverrideFilter.Builder builder, String mappingSep) { | ||
for (String rule : rules) { | ||
String key, override; | ||
List<String> mapping = Strings.splitSmart(rule, mappingSep, false); | ||
if (mapping.size() == 2) { | ||
key = mapping.get(0).trim(); | ||
override = mapping.get(1).trim(); | ||
} else { | ||
String[] sides = rule.split(mappingSep, -1); | ||
if (sides.length != 2) { | ||
throw new RuntimeException("Invalid Keyword override Rule:" + rule); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you add unit testing for this (and the later) exception? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure, I'll do it tonight. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. BTW - what do you need about having more precise error messages (telling exactly what's wrong)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure, feel free to extend the message with more details if you think its helpful and doesn't add to much complexity. |
||
} | ||
|
||
if (key.isEmpty() || override.isEmpty()) { | ||
String[] keys = sides[0].split(",", -1); | ||
String override = sides[1].trim(); | ||
if (override.isEmpty() || override.indexOf(',') != -1) { | ||
throw new RuntimeException("Invalid Keyword override Rule:" + rule); | ||
} else { | ||
builder.add(key, override); | ||
} | ||
|
||
for (String key : keys) { | ||
String trimmedKey = key.trim(); | ||
if (trimmedKey.isEmpty()) { | ||
throw new RuntimeException("Invalid Keyword override Rule:" + rule); | ||
} | ||
builder.add(trimmedKey, override); | ||
} | ||
} | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
/* | ||
* Licensed to Elasticsearch under one or more contributor | ||
* license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright | ||
* ownership. Elasticsearch licenses this file to you under | ||
* the Apache License, Version 2.0 (the "License"); you may | ||
* not use this file except in compliance with the License. | ||
* You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
|
||
package org.elasticsearch.analysis.common; | ||
|
||
import org.apache.lucene.analysis.Tokenizer; | ||
import org.apache.lucene.analysis.core.WhitespaceTokenizer; | ||
import org.elasticsearch.common.settings.Settings; | ||
import org.elasticsearch.env.Environment; | ||
import org.elasticsearch.index.analysis.AnalysisTestsHelper; | ||
import org.elasticsearch.index.analysis.TokenFilterFactory; | ||
import org.elasticsearch.test.ESTestCase; | ||
import org.elasticsearch.test.ESTokenStreamTestCase; | ||
import org.junit.Rule; | ||
import org.junit.rules.ExpectedException; | ||
|
||
import java.io.IOException; | ||
import java.io.StringReader; | ||
import java.util.List; | ||
import java.util.Locale; | ||
|
||
public class StemmerOverrideTokenFilterFactoryTests extends ESTokenStreamTestCase { | ||
@Rule | ||
public ExpectedException expectedException = ExpectedException.none(); | ||
|
||
public static TokenFilterFactory create(String... rules) throws IOException { | ||
ESTestCase.TestAnalysis analysis = AnalysisTestsHelper.createTestAnalysisFromSettings( | ||
Settings.builder() | ||
.put("index.analysis.filter.my_stemmer_override.type", "stemmer_override") | ||
.putList("index.analysis.filter.my_stemmer_override.rules", rules) | ||
.put(Environment.PATH_HOME_SETTING.getKey(), createTempDir().toString()) | ||
.build(), | ||
new CommonAnalysisPlugin()); | ||
|
||
return analysis.tokenFilter.get("my_stemmer_override"); | ||
} | ||
|
||
public void testRuleError() { | ||
for (String rule : List.of( | ||
"", // empty | ||
"a", // no arrow | ||
"a=>b=>c", // multiple arrows | ||
"=>a=>b", // multiple arrows | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This case was valid before when split still happened with |
||
"a=>", // no override | ||
"a=>b,c", // multiple overrides | ||
"=>a", // no keys | ||
"a,=>b" // empty key | ||
)) { | ||
expectThrows(RuntimeException.class, String.format( | ||
Locale.ROOT, "Should fail for invalid rule: '%s'", rule | ||
), () -> create(rule)); | ||
} | ||
} | ||
|
||
public void testRulesOk() throws IOException { | ||
TokenFilterFactory tokenFilterFactory = create( | ||
"a => 1", | ||
"b,c => 2" | ||
); | ||
Tokenizer tokenizer = new WhitespaceTokenizer(); | ||
tokenizer.setReader(new StringReader("a b c")); | ||
assertTokenStreamContents(tokenFilterFactory.create(tokenizer), new String[]{"1", "2", "2"}); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a static method, it should be possible to unit test it directly. I think it would also be an improvement to make it return the StemmerOverrideFilter.Builder instead of passing it in and adding to it inside the method. That way you could use the returned builder to create a
StemmerOverrideMap
from it, which I think can be queried for testing. Its a Lucene class so not so easy to directly work with, but I lifted some parts fromStemmerOverrideFilter
and something like this should work:It would be nice to test a few more cases (with commas, trimming of whitespaces etc...) and the exception cases in a new unit test for StemmerOverrideTokenFilterFactory (that is a new
StemmerOverrideTokenFilterFactoryTests
) in the analysis-common module.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher yes, the only reason I didn't write any tests was that I didn't find any unit test classes for this token filter. But I agree that it needs it, even more now that there is some extra logic.
I'll take a look how you test token filters and address your comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's no need to emulate what other token filters are testing at this point (unless you like to), just adding
StemmerOverrideTokenFilterFactoryTests
toorg.elasticsearch.analysis.common
and add tests for the static method should be okay. Looks like most similar "*FactoryTests" extendESTokenStreamTestCase
, which might be a good idea to do as well, then it might be easier to add more test functionality going forward. There's some more info on how to run certain tests only in the TESTING.asciidoc, if you have other questions let me know.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher is it really worth to use the complicated
StemmerOverrideMap
lookup API to verify thatparseRules
added the right key and value?To test unhappy cases I would simply try to instantiate
StemmerOverrideTokenFilterFactory
and check if it throws (I would do it with a loop andexpectThrows
as you don't seem to use JUnit's parameterized tests).Happy case (valid rules) could be tested with analysis output of its
TokenStream
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, use anything you think makes sense. The
StemmerOverrideMap
API was just a thought. Any approach that is easy to set up and read and that doesn't need to spin up a whole cluster (i.e. integration test) is fine. Testing some valid and invalid cases is the most important thing I think.