Support multiple tokens on LHS in stemmer_override rules (#56113) #56484

Merged

Contributor

@telendt telendt commented May 9, 2020

Support multiple tokens on LHS in stemmer_override rules (#56113)

This commit adds support for rules with multiple tokens on the LHS, also
known as "contraction rules", to the stemmer override token filter.
Contraction rules are handy for translating multiple inflected words into
the same root form. One side effect of this change is that it brings the
stemmer override rules format closer to the synonym rules format, which
makes it easier to translate one into the other.

This change also makes the stemmer override rules parser stricter, so
that it catches more errors that were previously accepted.

Fixes #56113.
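
To make the rule format concrete, here is a minimal sketch of a strict parser for rules such as "running, runs => run". It is not the code from this PR: the class name OverrideRuleSketch, the use of a plain Map instead of StemmerOverrideFilter.Builder, and the choice of IllegalArgumentException are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration class; not the Elasticsearch implementation.
final class OverrideRuleSketch {

    /** Parses rules such as "running, runs => run" into a key -> override map. */
    static Map<String, String> parse(List<String> rules, String arrow) {
        Map<String, String> overrides = new LinkedHashMap<>();
        for (String rule : rules) {
            // A limit of -1 keeps empty strings, so "=>a=>b" yields three sides and is rejected.
            String[] sides = rule.split(Pattern.quote(arrow), -1);
            if (sides.length != 2) {
                throw new IllegalArgumentException("Expected exactly one '" + arrow + "' in rule: " + rule);
            }
            String override = sides[1].trim();
            if (override.isEmpty()) {
                throw new IllegalArgumentException("Empty override in rule: " + rule);
            }
            // Every comma-separated key on the LHS maps to the same override.
            for (String key : sides[0].split(",", -1)) {
                String trimmed = key.trim();
                if (trimmed.isEmpty()) {
                    throw new IllegalArgumentException("Empty key in rule: " + rule);
                }
                overrides.put(trimmed, override);
            }
        }
        return overrides;
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("running, runs => run", "mice => mouse"), "=>"));
    }
}
```

Under these assumptions, one contraction rule produces one map entry per LHS token, and rules with a missing arrow, extra arrows, or empty keys are rejected rather than silently accepted.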

@cbuescher cbuescher added the :Search Relevance/Analysis How text is split into tokens label May 11, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label May 11, 2020
@cbuescher cbuescher self-assigned this May 13, 2020
Member

@cbuescher cbuescher left a comment

Hi @telendt, thanks a lot for this PR. Looks great already, I left a few minor notes around potential improvements for testing. Let me know what you think.
Could you also add a short description of the issue derived from the original issue you opened that we can later use as a commit comment? Can be a few lines just summarizing the change.

throw new RuntimeException("Invalid Keyword override Rule:" + rule);
}

if (key.isEmpty() || override.isEmpty()) {
List<String> keys = Strings.splitSmart(sides.get(0), ",", false);
Member

One suggestion: maybe we could use splitSmart(input, ",", true) here. I believe this would cover the (admittedly very unlikely) case where people need a comma in their rules: a rule like a\\,b would then be resolved to a key a,b. If you think that's worth the trouble, we'd also need a small test for this (see the other comment about unit testing).

Contributor Author

@telendt telendt May 17, 2020

@cbuescher

Ugh... Honestly, I didn't even look at what splitSmart does and I simply used that because it was used to split rules on =>. Now I took a look (although that helper itself does not have any test) and I'm confused.
Unlike String.split, splitSmart does not return empty tokens. So the result of splitting this:

=>one=>two=>=>

is [one, two], which is considered a valid rule. IMO it should not be.

I believe a regular String.split should be used. If we want to support "a\\,b" tokens, then we should split on the "(?<!\\\\)," regexp and unescape each token before adding it. This is something I would normally use StringEscapeUtils.unescapeJava from Apache commons-text for, but you don't seem to use it. What do you suggest in this case? Do you have a similar helper?
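
For reference, a small sketch of the two splitting behaviours discussed here, using only the JDK. Note that plain String.split(regex) also drops trailing empty strings unless a negative limit is passed. The class and method names are hypothetical, and the unescaping step is a deliberately naive assumption that only turns "\," back into ",".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Elasticsearch.
final class SplitBehaviourSketch {

    /** Splits on commas not preceded by a backslash, then turns "\," back into ",". */
    static List<String> splitOnUnescapedCommas(String lhs) {
        List<String> tokens = new ArrayList<>();
        // The regex (?<!\\), is written with doubled backslashes in a Java string literal.
        for (String token : lhs.split("(?<!\\\\),", -1)) {
            tokens.add(token.replace("\\,", ",").trim());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String rule = "=>one=>two=>=>";
        // Default limit: the leading empty string is kept, trailing empty strings are dropped.
        System.out.println(Arrays.toString(rule.split("=>")));
        // Negative limit: every empty string is kept, so the malformed rule is easy to detect.
        System.out.println(Arrays.toString(rule.split("=>", -1)));
        // The escaped comma stays inside the first token.
        System.out.println(splitOnUnescapedCommas("a\\,b, c"));
    }
}
```

This sketch does not handle a backslash that is itself escaped before a comma, which is the kind of extra complexity that full escaping support would need to define and document.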

Member

=>one=>two=>=>

Good point, maybe we should indeed use a simple String.split for the LHS/RHS separation and reject invalid patterns.
For the LHS split on "," I'm fine both ways. If splitSmart works for this and supports escaping, we can still consider using it, but simple String splitting on "," is also fine if other solutions add too much complexity. I don't think we really need to support commas in LHS entries if it's too complicated.

override = mapping.get(1).trim();
} else {
List<String> sides = Strings.splitSmart(rule, mappingSep, false);
if (sides.size() != 2) {
throw new RuntimeException("Invalid Keyword override Rule:" + rule);
Member

Could you add unit testing for this (and the later) exception?

Contributor Author

Sure, I'll do it tonight.

Contributor Author

BTW - what do you think about having more precise error messages (telling exactly what's wrong)?

Member

Sure, feel free to extend the message with more details if you think it's helpful and doesn't add too much complexity.

@@ -57,19 +57,23 @@ public TokenStream create(TokenStream tokenStream) {

static void parseRules(List<String> rules, StemmerOverrideFilter.Builder builder, String mappingSep) {
Member

Since this is a static method, it should be possible to unit test it directly. I think it would also be an improvement to make it return the StemmerOverrideFilter.Builder instead of passing it in and adding to it inside the method. That way you could use the returned builder to create a StemmerOverrideMap from it, which I think can be queried for testing. It's a Lucene class, so not so easy to work with directly, but I lifted some parts from StemmerOverrideFilter and something like this should work:

        StemmerOverrideMap map = builder.build();
        BytesReader fstReader = map.getBytesReader();
        final Arc<BytesRef> scratchArc = new FST.Arc<>();
        String key = "something";
        BytesRef bytesRef = map.get(key.toCharArray(), key.length(), scratchArc, fstReader);
        assertEquals("someValue", bytesRef.utf8ToString());

It would be nice to test a few more cases (with commas, trimming of whitespaces etc...) and the exception cases in a new unit test for StemmerOverrideTokenFilterFactory (that is a new StemmerOverrideTokenFilterFactoryTests) in the analysis-common module.

Contributor Author

@cbuescher yes, the only reason I didn't write any tests was that I didn't find any unit test classes for this token filter. But I agree that it needs it, even more now that there is some extra logic.

I'll take a look at how you test token filters and address your comments.

Member

I think there's no need to emulate what other token filters are testing at this point (unless you'd like to); just adding StemmerOverrideTokenFilterFactoryTests to org.elasticsearch.analysis.common and adding tests for the static method should be okay. It looks like most similar "*FactoryTests" extend ESTokenStreamTestCase, which might be a good idea to do as well, since it might then be easier to add more test functionality going forward. There's some more info on how to run only certain tests in TESTING.asciidoc. If you have other questions, let me know.

Contributor Author

@telendt telendt May 17, 2020

@cbuescher is it really worth using the complicated StemmerOverrideMap lookup API to verify that parseRules added the right key and value?

To test unhappy cases I would simply try to instantiate StemmerOverrideTokenFilterFactory and check if it throws (I would do it with a loop and expectThrows as you don't seem to use JUnit's parameterized tests).

Happy case (valid rules) could be tested with analysis output of its TokenStream.
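
As a rough illustration of the loop-plus-expectThrows idea, here is a self-contained sketch in plain Java. The expectThrows helper below is a hand-written stand-in for the one provided by the test framework, and create is a hypothetical, much-simplified substitute for instantiating the real factory. Both are assumptions for illustration only.

```java
import java.util.List;

// Hypothetical illustration class; not the actual test from this PR.
final class InvalidRuleLoopSketch {

    /** Minimal stand-in for a test framework's expectThrows: returns the caught exception. */
    static <T extends Throwable> T expectThrows(Class<T> expected, String message, Runnable code) {
        try {
            code.run();
        } catch (Throwable t) {
            if (expected.isInstance(t)) {
                return expected.cast(t);
            }
            throw new AssertionError(message + " (unexpected exception: " + t + ")", t);
        }
        // Reaching this point means the code did not throw at all.
        throw new AssertionError(message);
    }

    /** Simplified stand-in for the code under test: accepts only "lhs=>rhs" with non-empty sides. */
    static void create(String rule) {
        String[] sides = rule.split("=>", -1);
        if (sides.length != 2 || sides[0].trim().isEmpty() || sides[1].trim().isEmpty()) {
            throw new RuntimeException("Invalid rule: " + rule);
        }
    }

    /** Checks that every rule is rejected and returns how many rules were checked. */
    static int checkAllRejected(List<String> invalidRules) {
        int rejected = 0;
        for (String rule : invalidRules) {
            expectThrows(RuntimeException.class, "Should fail for invalid rule: '" + rule + "'", () -> create(rule));
            rejected++;
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println(checkAllRejected(List.of("", "a", "a=>b=>c", "=>a=>b", "=>a")));
    }
}
```

Putting the rule into the failure message keeps a plain loop nearly as readable as a parameterized test, since a failure still names the offending input.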

Member

Sure, use anything you think makes sense. The StemmerOverrideMap API was just a thought. Any approach that is easy to set up and read and that doesn't need to spin up a whole cluster (i.e. integration test) is fine. Testing some valid and invalid cases is the most important thing I think.

@cbuescher
Member

@telendt I left a few comments on your last commit, let me know if you need anything else or things are unclear.

@telendt telendt force-pushed the stemmer-overrride-multiple-tokens-rule branch 2 times, most recently from 4afbdf5 to 91be651 Compare May 28, 2020 21:25
@telendt
Contributor Author

telendt commented May 28, 2020

@cbuescher I've just pushed an updated version with the requested tests and a better commit description.

I decided to not support commas (or white spaces, which are/were trimmed from each word) as this would complicate implementation and would probably need to be well documented.
(e.g. - if \, was used to treat comma as a regular comma and not a token separator, how should we treat backslash characters preceding other characters?).

Also, while commas are perfectly fine in some tokens, like keywords or numbers, I don't see why anyone would like to stem them.

Let me know if you have any comments left and I will address them in the following days.

"", // empty
"a", // no arrow
"a=>b=>c", // multiple arrows
"=>a=>b", // multiple arrows
Contributor Author

This case was valid before, when the split still happened with splitSmart, but it's not anymore.

@telendt telendt requested a review from cbuescher May 28, 2020 21:43
Member

@cbuescher cbuescher left a comment

@telendt thanks for the update, looks great.
Let me run CI to see if we missed any formatting issues etc...

@cbuescher
Member

@elasticmachine test this please

@telendt telendt force-pushed the stemmer-overrride-multiple-tokens-rule branch from 91be651 to 32c3fc5 Compare May 29, 2020 14:45
@telendt
Contributor Author

telendt commented May 29, 2020

@cbuescher I noticed there were some style errors and I pushed a fix for that. Sorry, I didn't run all checks on my machine as it takes ages. Can I re-trigger @elasticmachine or is it something that only your org members can do?

@cbuescher
Member

cbuescher commented May 29, 2020

Sorry, I didn't run all checks on my machine as it takes ages.

No problem at all, we realize that running all tests without an additional machine is cumbersome at this point. For future reference, there's a Gradle "precommit" task that runs only compilation, checkstyle and a few other checks and skips all tests, so it runs considerably faster, especially when you only run it on sub-modules, e.g. ./gradlew -p modules/analysis-common precommit

Can I re-trigger @elasticmachine or is it something that only your org members can do?

Yes, that's restricted to the org so that other bots don't mess around with our CI infrastructure ;-)
But I'll ask our own bot again nicely ;-)
@elasticmachine test this please

"=>a", // no keys
"a,=>b" // empty key
)) {
expectThrows(RuntimeException.class, String.format("Should fail for invalid rule: '%s'", rule), () -> create(rule));
Member

Another small complaint coming from CI here:
"Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]"
The "forbiddenAPIs" plugin is configured to require several string-related methods to specify the locale explicitly; using Locale.ROOT here, for example, would work.
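
A minimal sketch of the locale-explicit form, assuming the same message text as in the test above. The class and method names are hypothetical; String.format(Locale, String, Object...) is the standard JDK overload.

```java
import java.util.Locale;

// Hypothetical illustration class.
final class LocaleFormatSketch {

    static String failureMessage(String rule) {
        // Passing the locale explicitly avoids depending on the JVM's default locale.
        return String.format(Locale.ROOT, "Should fail for invalid rule: '%s'", rule);
    }

    public static void main(String[] args) {
        System.out.println(failureMessage("a=>b=>c"));
    }
}
```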

Contributor Author

@telendt telendt May 29, 2020

ok, I've just pushed a fix

@telendt telendt force-pushed the stemmer-overrride-multiple-tokens-rule branch from 32c3fc5 to d9a882a Compare May 29, 2020 15:42
@cbuescher
Member

@elasticmachine test this please

@cbuescher
Member

There are some more failures that clearly look unrelated, but I'm merging in master again to see if those got fixed in the meantime.
@elasticmachine update branch

@cbuescher
Member

@elasticmachine update branch

@cbuescher
Member

@elasticmachine test this please

@cbuescher
Member

CI looks happy, will merge to master and the 7.x branch then.
Thanks a lot for opening the issue and working on this PR, very much appreciated!

@cbuescher cbuescher merged commit 66ded59 into elastic:master May 29, 2020
cbuescher pushed a commit that referenced this pull request May 29, 2020
…6484)

Closes #56113
Labels
>enhancement :Search Relevance/Analysis How text is split into tokens Team:Search Meta label for search team v7.9.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Multiple tokens on LHS in stemmer_override rules
5 participants