Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple tokens on LHS in stemmer_override rules (#56113) #56484

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ Overrides stemming algorithms, by applying a custom mapping, then
protecting these terms from being modified by stemmers. Must be placed
before any stemming filters.

Rules are separated by `=>`
Rules are mappings in the form of `token1[, ..., tokenN] => override`.

[cols="<,<",options="header",]
|=======================================================================
Expand Down Expand Up @@ -69,7 +69,7 @@ PUT /my_index
"custom_stems" : {
"type" : "stemmer_override",
"rules" : [
"running => run",
"running, runs => run",
"stemmer => stemmer"
]
}
Expand Down
2 changes: 1 addition & 1 deletion docs/src/test/cluster/config/analysis/stemmer_override.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
running => run
running, runs => run

stemmer => stemmer
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,6 @@
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter;
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter.StemmerOverrideMap;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
Expand Down Expand Up @@ -57,19 +56,23 @@ public TokenStream create(TokenStream tokenStream) {

static void parseRules(List<String> rules, StemmerOverrideFilter.Builder builder, String mappingSep) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a static method, it should be possible to unit test it directly. I think it would also be an improvement to make it return the StemmerOverrideFilter.Builder instead of passing it in and adding to it inside the method. That way you could use the returned builder to create a StemmerOverrideMap from it, which I think can be queried for testing. Its a Lucene class so not so easy to directly work with, but I lifted some parts from StemmerOverrideFilter and something like this should work:

       StemmerOverrideMap map = builder.build();
        BytesReader fstReader = map.getBytesReader();
        final Arc<BytesRef> scratchArc = new FST.Arc<>();
        String key = "something"
        BytesRef bytesRef = map.get(key.toCharArray(), key.length() , scratchArc, fstReader);
        assertEquals("someValue", bytesRef.utf8ToString());

It would be nice to test a few more cases (with commas, trimming of whitespaces etc...) and the exception cases in a new unit test for StemmerOverrideTokenFilterFactory (that is a new StemmerOverrideTokenFilterFactoryTests) in the analysis-common module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbuescher yes, the only reason I didn't write any tests was that I didn't find any unit test classes for this token filter. But I agree that it needs it, even more now that there is some extra logic.

I'll take a look how you test token filters and address your comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's no need to emulate what other token filters are testing at this point (unless you like to), just adding StemmerOverrideTokenFilterFactoryTests to org.elasticsearch.analysis.common and add tests for the static method should be okay. Looks like most similar "*FactoryTests" extend ESTokenStreamTestCase, which might be a good idea to do as well, then it might be easier to add more test functionality going forward. There's some more info on how to run certain tests only in the TESTING.asciidoc, if you have other questions let me know.

Copy link
Contributor Author

@telendt telendt May 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbuescher is it really worth to use the complicated StemmerOverrideMap lookup API to verify that parseRules added the right key and value?

To test unhappy cases I would simply try to instantiate StemmerOverrideTokenFilterFactory and check if it throws (I would do it with a loop and expectThrows as you don't seem to use JUnit's parameterized tests).

Happy case (valid rules) could be tested with analysis output of its TokenStream.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, use anything you think makes sense. The StemmerOverrideMap API was just a thought. Any approach that is easy to set up and read and that doesn't need to spin up a whole cluster (i.e. integration test) is fine. Testing some valid and invalid cases is the most important thing I think.

for (String rule : rules) {
String key, override;
List<String> mapping = Strings.splitSmart(rule, mappingSep, false);
if (mapping.size() == 2) {
key = mapping.get(0).trim();
override = mapping.get(1).trim();
} else {
String[] sides = rule.split(mappingSep, -1);
if (sides.length != 2) {
throw new RuntimeException("Invalid Keyword override Rule:" + rule);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add unit testing for this (and the later) exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll do it tonight.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW - what do you need about having more precise error messages (telling exactly what's wrong)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, feel free to extend the message with more details if you think its helpful and doesn't add to much complexity.

}

if (key.isEmpty() || override.isEmpty()) {
String[] keys = sides[0].split(",", -1);
String override = sides[1].trim();
if (override.isEmpty() || override.indexOf(',') != -1) {
throw new RuntimeException("Invalid Keyword override Rule:" + rule);
} else {
builder.add(key, override);
}

for (String key : keys) {
String trimmedKey = key.trim();
if (trimmedKey.isEmpty()) {
throw new RuntimeException("Invalid Keyword override Rule:" + rule);
}
builder.add(trimmedKey, override);
}
}
}
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.analysis.common;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.analysis.AnalysisTestsHelper;
import org.elasticsearch.index.analysis.TokenFilterFactory;
import org.elasticsearch.test.ESTestCase;
import org.elasticsearch.test.ESTokenStreamTestCase;
import org.junit.Rule;
import org.junit.rules.ExpectedException;

import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.Locale;

public class StemmerOverrideTokenFilterFactoryTests extends ESTokenStreamTestCase {
@Rule
public ExpectedException expectedException = ExpectedException.none();

public static TokenFilterFactory create(String... rules) throws IOException {
ESTestCase.TestAnalysis analysis = AnalysisTestsHelper.createTestAnalysisFromSettings(
Settings.builder()
.put("index.analysis.filter.my_stemmer_override.type", "stemmer_override")
.putList("index.analysis.filter.my_stemmer_override.rules", rules)
.put(Environment.PATH_HOME_SETTING.getKey(), createTempDir().toString())
.build(),
new CommonAnalysisPlugin());

return analysis.tokenFilter.get("my_stemmer_override");
}

public void testRuleError() {
for (String rule : List.of(
"", // empty
"a", // no arrow
"a=>b=>c", // multiple arrows
"=>a=>b", // multiple arrows
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case was valid before when split still happened with smartSplit but it's not anymore.

"a=>", // no override
"a=>b,c", // multiple overrides
"=>a", // no keys
"a,=>b" // empty key
)) {
expectThrows(RuntimeException.class, String.format(
Locale.ROOT, "Should fail for invalid rule: '%s'", rule
), () -> create(rule));
}
}

public void testRulesOk() throws IOException {
TokenFilterFactory tokenFilterFactory = create(
"a => 1",
"b,c => 2"
);
Tokenizer tokenizer = new WhitespaceTokenizer();
tokenizer.setReader(new StringReader("a b c"));
assertTokenStreamContents(tokenFilterFactory.create(tokenizer), new String[]{"1", "2", "2"});
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -636,7 +636,10 @@
language: dutch
my_stemmer_override:
type: stemmer_override
rules: ["zoeken => override"]
rules: [
"zoeken => override",
"foo, bar => baz"
]
- do:
indices.analyze:
index: test
Expand All @@ -647,6 +650,18 @@
- length: { tokens: 1 }
- match: { tokens.0.token: override }

# Test rule with multiple tokens on LHS:
- do:
indices.analyze:
index: test
body:
text: foo bar
tokenizer: whitespace
filter: [my_stemmer_override]
- length: { tokens: 2 }
- match: { tokens.0.token: baz }
- match: { tokens.1.token: baz }

---
"decompounder":
- do:
Expand Down