Support multiple tokens on LHS in stemmer_override rules (#56113) #56484

telendt · 2020-05-09T09:35:13Z

Support multiple tokens on LHS in stemmer_override rules (#56113)

This commit adds support for rules with multiple tokens on LHS, also
known as "contraction rules", into stemmer override token
filter. Contraction rules are handy into translating multiple
inflected words into the same root form. One side effect of this change is
that it brings stemmer override rules format closer to synonym rules
format so that it makes it easier to translate one into another.

This change also makes stemmer override rules parser more strict so
that it should catch more errors which were previously accepted.

Fixes #56113.

elasticmachine · 2020-05-11T07:58:23Z

Pinging @elastic/es-search (:Search/Analysis)

cbuescher

Hi @telendt, thanks a lot for this PR. Looks great already, I left a few minor notes around potential improvements for testing. Let me know what you think.
Could you also add a short description of the issue derived from the original issue you opened that we can later use as a commit comment? Can be a few lines just summarizing the change.

cbuescher · 2020-05-13T13:36:46Z

...ommon/src/main/java/org/elasticsearch/analysis/common/StemmerOverrideTokenFilterFactory.java

                throw new RuntimeException("Invalid Keyword override Rule:" + rule);
            }

-            if (key.isEmpty() || override.isEmpty()) {
+            List<String> keys = Strings.splitSmart(sides.get(0), ",", false);


One suggestion: maybe we could use splitSmart(input, ",", true) here, I believe this would allow to cover the (admittedly very unlikely) case where ppl need a comma in their rules. I believe a rule like a\\,b would then be resolved to a key a,b. If you think thats worth the troube we'd also need a small test for this (see other comment around unit testing).

@cbuescher

Ugh... Honestly, I didn't even look at what splitSmart does and I simply used that because it was used to split rules on =>. Now I took a look (although that helper itself does not have any test) and I'm confused.
Unlike String.split splitSmart does not return empty tokens. So the result of splitting this:

=>one=>two=>=>

is [one, two] which is considered a valid rule. IMO it should not.

I believe a regular String.split should be used. If we want to support "a\\,b" tokens then we should split on ""(?<!\\\\)," regexp and then we should unescape each token before adding it - this is something that I would normally use StringEscapeUtils.unescapeJava from apache commons-text, but you don't seem to use it. What do you suggest in this case? Do you have a similar helper?

=>one=>two=>=>

Good point, maybe we should use simple String.split for the LHS/RHS separation and reject invalid patterns indeed.
For the LHS split on , I'm find both ways. If "splitSmart" works for this and supports escaping we can still consider using that, but simple String splitting on , is also fine if other solutions add too much complexity. I don't think we really need to support commas in LHS entries if its too complicated.

cbuescher · 2020-05-13T13:37:26Z

...ommon/src/main/java/org/elasticsearch/analysis/common/StemmerOverrideTokenFilterFactory.java

-                override = mapping.get(1).trim();
-            } else {
+            List<String> sides = Strings.splitSmart(rule, mappingSep, false);
+            if (sides.size() != 2) {
                throw new RuntimeException("Invalid Keyword override Rule:" + rule);


Could you add unit testing for this (and the later) exception?

Sure, I'll do it tonight.

BTW - what do you need about having more precise error messages (telling exactly what's wrong)?

Sure, feel free to extend the message with more details if you think its helpful and doesn't add to much complexity.

cbuescher · 2020-05-13T13:57:10Z

...ommon/src/main/java/org/elasticsearch/analysis/common/StemmerOverrideTokenFilterFactory.java

@@ -57,19 +57,23 @@ public TokenStream create(TokenStream tokenStream) {

    static void parseRules(List<String> rules, StemmerOverrideFilter.Builder builder, String mappingSep) {


Since this is a static method, it should be possible to unit test it directly. I think it would also be an improvement to make it return the StemmerOverrideFilter.Builder instead of passing it in and adding to it inside the method. That way you could use the returned builder to create a StemmerOverrideMap from it, which I think can be queried for testing. Its a Lucene class so not so easy to directly work with, but I lifted some parts from StemmerOverrideFilter and something like this should work:

StemmerOverrideMap map = builder.build(); BytesReader fstReader = map.getBytesReader(); final Arc<BytesRef> scratchArc = new FST.Arc<>(); String key = "something" BytesRef bytesRef = map.get(key.toCharArray(), key.length() , scratchArc, fstReader); assertEquals("someValue", bytesRef.utf8ToString());

It would be nice to test a few more cases (with commas, trimming of whitespaces etc...) and the exception cases in a new unit test for StemmerOverrideTokenFilterFactory (that is a new StemmerOverrideTokenFilterFactoryTests) in the analysis-common module.

@cbuescher yes, the only reason I didn't write any tests was that I didn't find any unit test classes for this token filter. But I agree that it needs it, even more now that there is some extra logic.

I'll take a look how you test token filters and address your comments.

I think there's no need to emulate what other token filters are testing at this point (unless you like to), just adding StemmerOverrideTokenFilterFactoryTests to org.elasticsearch.analysis.common and add tests for the static method should be okay. Looks like most similar "*FactoryTests" extend ESTokenStreamTestCase, which might be a good idea to do as well, then it might be easier to add more test functionality going forward. There's some more info on how to run certain tests only in the TESTING.asciidoc, if you have other questions let me know.

@cbuescher is it really worth to use the complicated StemmerOverrideMap lookup API to verify that parseRules added the right key and value?

To test unhappy cases I would simply try to instantiate StemmerOverrideTokenFilterFactory and check if it throws (I would do it with a loop and expectThrows as you don't seem to use JUnit's parameterized tests).

Happy case (valid rules) could be tested with analysis output of its TokenStream.

Sure, use anything you think makes sense. The StemmerOverrideMap API was just a thought. Any approach that is easy to set up and read and that doesn't need to spin up a whole cluster (i.e. integration test) is fine. Testing some valid and invalid cases is the most important thing I think.

cbuescher · 2020-05-26T14:10:43Z

@telendt I left a few comments on your last commit, let me know if you need anything else or things are unclear.

telendt · 2020-05-28T21:38:23Z

@cbuescher I've just pushed the updated version where I provided requested tests and better commit description.

I decided to not support commas (or white spaces, which are/were trimmed from each word) as this would complicate implementation and would probably need to be well documented.
(e.g. - if \, was used to treat comma as a regular comma and not a token separator, how should we treat backslash characters preceding other characters?).

Also, while commas are perfectly fine in some tokens, like keywords or numbers, I don't see why anyone would like to stem them.

Let me know if you have any comments left and I will address them in the following days.

telendt · 2020-05-28T21:41:32Z

.../src/test/java/org/elasticsearch/analysis/common/StemmerOverrideTokenFilterFactoryTests.java

+            "",        // empty
+            "a",       // no arrow
+            "a=>b=>c", // multiple arrows
+            "=>a=>b",  // multiple arrows


This case was valid before when split still happened with smartSplit but it's not anymore.

cbuescher

@telendt thanks for the update, looks great.
Let me run CI to see if we missed any formating issues etc...

cbuescher · 2020-05-29T14:12:29Z

@elasticmachine test this please

telendt · 2020-05-29T14:47:55Z

@cbuescher I noticed there were some style errors and I pushed a fix for that. Sorry, I didn't run all checks on my machine as it takes ages. Can I re-trigger @elasticmachine or is it something that only your org members can do?

cbuescher · 2020-05-29T15:06:15Z

Sorry, I didn't run all checks on my machine as it takes ages.

No problem at all, we realize that running all test without an additional machine is cumbersome at this point. For future reference, there's a gradle "precommit" task that runs only compilation, checkstyle and a few other checks and skips all test that runs considerably faster, especially when you only run it on sub-modules, like e.g. ./gradlew -p modules/analysis-common precommit

Can I re-trigger @elasticmachine or is it something that only your org members can do?

Yes, thats restricted to org so that other bots don't mess around with our CI infrastructure ;-)
But I'll ask our own bot again nicely ;-)
@elasticmachine test this please

cbuescher · 2020-05-29T15:32:56Z

.../src/test/java/org/elasticsearch/analysis/common/StemmerOverrideTokenFilterFactoryTests.java

+            "=>a",     // no keys
+            "a,=>b"    // empty key
+        )) {
+            expectThrows(RuntimeException.class, String.format("Should fail for invalid rule: '%s'", rule), () -> create(rule));


Another small complaint coming from CI here:
"Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]"
The "forbiddenAPIs" plugin is configured to require several string-related methods to specify the locale explicitely, e.g. using Locale.ROOT here would work.

ok, I've just pushed a fix

This commit adds support for rules with multiple tokens on LHS, also known as "contraction rules", into stemmer override token filter. Contraction rules are handy into translating multiple inflected words into the same root form. One side effect of this change is that it brings stemmer override rules format closer to synonym rules format so that it makes it easier to translate one into another. This change also makes stemmer override rules parser more strict so that it should catch more errors which were previously accepted.

cbuescher · 2020-05-29T15:45:36Z

@elasticmachine test this please

cbuescher · 2020-05-29T16:05:56Z

There are some more failures that clearly look unrelated, but I'm merging in master again to see if those got fixed in the meantime.
@elasticmachine update branch

cbuescher · 2020-05-29T16:09:12Z

@elasticmachine update branch

cbuescher · 2020-05-29T16:09:41Z

@elasticmachine test this please

cbuescher · 2020-05-29T20:27:27Z

CI looks happy, will merge to master and the 7.x branch then.
Thanks a lot for opening the issue and working on this PR, very much appreciated!

…6484) This commit adds support for rules with multiple tokens on LHS, also known as "contraction rules", into stemmer override token filter. Contraction rules are handy into translating multiple inflected words into the same root form. One side effect of this change is that it brings stemmer override rules format closer to synonym rules format so that it makes it easier to translate one into another. This change also makes stemmer override rules parser more strict so that it should catch more errors which were previously accepted. Closes #56113

cbuescher added the :Search Relevance/Analysis How text is split into tokens label May 11, 2020

elasticmachine added the Team:Search Meta label for search team label May 11, 2020

cbuescher self-assigned this May 13, 2020

cbuescher requested changes May 13, 2020

View reviewed changes

cbuescher added v7.9.0 v8.0.0 labels May 13, 2020

telendt force-pushed the stemmer-overrride-multiple-tokens-rule branch 2 times, most recently from 4afbdf5 to 91be651 Compare May 28, 2020 21:25

telendt commented May 28, 2020

View reviewed changes

telendt requested a review from cbuescher May 28, 2020 21:43

cbuescher approved these changes May 29, 2020

View reviewed changes

telendt force-pushed the stemmer-overrride-multiple-tokens-rule branch from 91be651 to 32c3fc5 Compare May 29, 2020 14:45

cbuescher reviewed May 29, 2020

View reviewed changes

telendt force-pushed the stemmer-overrride-multiple-tokens-rule branch from 32c3fc5 to d9a882a Compare May 29, 2020 15:42

Merge branch 'master' into stemmer-overrride-multiple-tokens-rule

e649e51

cbuescher merged commit 66ded59 into elastic:master May 29, 2020

pugnascotia added the >enhancement label Jul 17, 2020

russcam mentioned this pull request Jul 23, 2020

7.9.0 Meta ticket elastic/elasticsearch-net#4872

Closed

29 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support multiple tokens on LHS in stemmer_override rules (#56113) #56484

Support multiple tokens on LHS in stemmer_override rules (#56113) #56484

telendt commented May 9, 2020 •

edited

Loading

elasticmachine commented May 11, 2020

cbuescher left a comment

cbuescher May 13, 2020

telendt May 17, 2020 •

edited

Loading

cbuescher May 19, 2020

cbuescher May 13, 2020

telendt May 14, 2020

telendt May 14, 2020

cbuescher May 14, 2020

cbuescher May 13, 2020

telendt May 14, 2020

cbuescher May 14, 2020

telendt May 17, 2020 •

edited

Loading

cbuescher May 19, 2020

cbuescher commented May 26, 2020

telendt commented May 28, 2020 •

edited

Loading

telendt May 28, 2020

cbuescher left a comment

cbuescher commented May 29, 2020

telendt commented May 29, 2020 •

edited

Loading

cbuescher commented May 29, 2020 •

edited

Loading

cbuescher May 29, 2020

telendt May 29, 2020 •

edited

Loading

cbuescher commented May 29, 2020

cbuescher commented May 29, 2020

cbuescher commented May 29, 2020

cbuescher commented May 29, 2020

cbuescher commented May 29, 2020

		@@ -57,19 +57,23 @@ public TokenStream create(TokenStream tokenStream) {

		static void parseRules(List<String> rules, StemmerOverrideFilter.Builder builder, String mappingSep) {

Support multiple tokens on LHS in stemmer_override rules (#56113) #56484

Support multiple tokens on LHS in stemmer_override rules (#56113) #56484

Conversation

telendt commented May 9, 2020 • edited Loading

elasticmachine commented May 11, 2020

cbuescher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

telendt May 17, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

telendt May 17, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbuescher commented May 26, 2020

telendt commented May 28, 2020 • edited Loading

Choose a reason for hiding this comment

cbuescher left a comment

Choose a reason for hiding this comment

cbuescher commented May 29, 2020

telendt commented May 29, 2020 • edited Loading

cbuescher commented May 29, 2020 • edited Loading

Choose a reason for hiding this comment

telendt May 29, 2020 • edited Loading

Choose a reason for hiding this comment

cbuescher commented May 29, 2020

cbuescher commented May 29, 2020

cbuescher commented May 29, 2020

cbuescher commented May 29, 2020

cbuescher commented May 29, 2020

telendt commented May 9, 2020 •

edited

Loading

telendt May 17, 2020 •

edited

Loading

telendt May 17, 2020 •

edited

Loading

telendt commented May 28, 2020 •

edited

Loading

telendt commented May 29, 2020 •

edited

Loading

cbuescher commented May 29, 2020 •

edited

Loading

telendt May 29, 2020 •

edited

Loading