-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support multiple tokens on LHS in stemmer_override rules (#56113) #56484
Support multiple tokens on LHS in stemmer_override rules (#56113) #56484
Conversation
Pinging @elastic/es-search (:Search/Analysis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @telendt, thanks a lot for this PR. Looks great already, I left a few minor notes around potential improvements for testing. Let me know what you think.
Could you also add a short description of the issue derived from the original issue you opened that we can later use as a commit comment? Can be a few lines just summarizing the change.
throw new RuntimeException("Invalid Keyword override Rule:" + rule); | ||
} | ||
|
||
if (key.isEmpty() || override.isEmpty()) { | ||
List<String> keys = Strings.splitSmart(sides.get(0), ",", false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One suggestion: maybe we could use splitSmart(input, ",", true)
here, I believe this would allow to cover the (admittedly very unlikely) case where ppl need a comma in their rules. I believe a rule like a\\,b
would then be resolved to a key a,b
. If you think thats worth the troube we'd also need a small test for this (see other comment around unit testing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh... Honestly, I didn't even look at what splitSmart
does and I simply used that because it was used to split rules on =>
. Now I took a look (although that helper itself does not have any test) and I'm confused.
Unlike String.split
splitSmart
does not return empty tokens. So the result of splitting this:
=>one=>two=>=>
is [one, two]
which is considered a valid rule. IMO it should not.
I believe a regular String.split
should be used. If we want to support "a\\,b"
tokens then we should split on ""(?<!\\\\),"
regexp and then we should unescape each token before adding it - this is something that I would normally use StringEscapeUtils.unescapeJava
from apache commons-text, but you don't seem to use it. What do you suggest in this case? Do you have a similar helper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
=>one=>two=>=>
Good point, maybe we should use simple String.split for the LHS/RHS separation and reject invalid patterns indeed.
For the LHS split on ,
I'm find both ways. If "splitSmart" works for this and supports escaping we can still consider using that, but simple String splitting on ,
is also fine if other solutions add too much complexity. I don't think we really need to support commas in LHS entries if its too complicated.
override = mapping.get(1).trim(); | ||
} else { | ||
List<String> sides = Strings.splitSmart(rule, mappingSep, false); | ||
if (sides.size() != 2) { | ||
throw new RuntimeException("Invalid Keyword override Rule:" + rule); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add unit testing for this (and the later) exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll do it tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW - what do you need about having more precise error messages (telling exactly what's wrong)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, feel free to extend the message with more details if you think its helpful and doesn't add to much complexity.
@@ -57,19 +57,23 @@ public TokenStream create(TokenStream tokenStream) { | |||
|
|||
static void parseRules(List<String> rules, StemmerOverrideFilter.Builder builder, String mappingSep) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a static method, it should be possible to unit test it directly. I think it would also be an improvement to make it return the StemmerOverrideFilter.Builder instead of passing it in and adding to it inside the method. That way you could use the returned builder to create a StemmerOverrideMap
from it, which I think can be queried for testing. Its a Lucene class so not so easy to directly work with, but I lifted some parts from StemmerOverrideFilter
and something like this should work:
StemmerOverrideMap map = builder.build();
BytesReader fstReader = map.getBytesReader();
final Arc<BytesRef> scratchArc = new FST.Arc<>();
String key = "something"
BytesRef bytesRef = map.get(key.toCharArray(), key.length() , scratchArc, fstReader);
assertEquals("someValue", bytesRef.utf8ToString());
It would be nice to test a few more cases (with commas, trimming of whitespaces etc...) and the exception cases in a new unit test for StemmerOverrideTokenFilterFactory (that is a new StemmerOverrideTokenFilterFactoryTests
) in the analysis-common module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher yes, the only reason I didn't write any tests was that I didn't find any unit test classes for this token filter. But I agree that it needs it, even more now that there is some extra logic.
I'll take a look how you test token filters and address your comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's no need to emulate what other token filters are testing at this point (unless you like to), just adding StemmerOverrideTokenFilterFactoryTests
to org.elasticsearch.analysis.common
and add tests for the static method should be okay. Looks like most similar "*FactoryTests" extend ESTokenStreamTestCase
, which might be a good idea to do as well, then it might be easier to add more test functionality going forward. There's some more info on how to run certain tests only in the TESTING.asciidoc, if you have other questions let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher is it really worth to use the complicated StemmerOverrideMap
lookup API to verify that parseRules
added the right key and value?
To test unhappy cases I would simply try to instantiate StemmerOverrideTokenFilterFactory
and check if it throws (I would do it with a loop and expectThrows
as you don't seem to use JUnit's parameterized tests).
Happy case (valid rules) could be tested with analysis output of its TokenStream
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, use anything you think makes sense. The StemmerOverrideMap
API was just a thought. Any approach that is easy to set up and read and that doesn't need to spin up a whole cluster (i.e. integration test) is fine. Testing some valid and invalid cases is the most important thing I think.
@telendt I left a few comments on your last commit, let me know if you need anything else or things are unclear. |
4afbdf5
to
91be651
Compare
@cbuescher I've just pushed the updated version where I provided requested tests and better commit description. I decided to not support commas (or white spaces, which are/were trimmed from each word) as this would complicate implementation and would probably need to be well documented. Also, while commas are perfectly fine in some tokens, like keywords or numbers, I don't see why anyone would like to stem them. Let me know if you have any comments left and I will address them in the following days. |
"", // empty | ||
"a", // no arrow | ||
"a=>b=>c", // multiple arrows | ||
"=>a=>b", // multiple arrows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case was valid before when split still happened with smartSplit
but it's not anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@telendt thanks for the update, looks great.
Let me run CI to see if we missed any formating issues etc...
@elasticmachine test this please |
91be651
to
32c3fc5
Compare
@cbuescher I noticed there were some style errors and I pushed a fix for that. Sorry, I didn't run all checks on my machine as it takes ages. Can I re-trigger @elasticmachine or is it something that only your org members can do? |
No problem at all, we realize that running all test without an additional machine is cumbersome at this point. For future reference, there's a gradle "precommit" task that runs only compilation, checkstyle and a few other checks and skips all test that runs considerably faster, especially when you only run it on sub-modules, like e.g.
Yes, thats restricted to org so that other bots don't mess around with our CI infrastructure ;-) |
"=>a", // no keys | ||
"a,=>b" // empty key | ||
)) { | ||
expectThrows(RuntimeException.class, String.format("Should fail for invalid rule: '%s'", rule), () -> create(rule)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another small complaint coming from CI here:
"Forbidden method invocation: java.lang.String#format(java.lang.String,java.lang.Object[]) [Uses default locale]"
The "forbiddenAPIs" plugin is configured to require several string-related methods to specify the locale explicitely, e.g. using Locale.ROOT here would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I've just pushed a fix
This commit adds support for rules with multiple tokens on LHS, also known as "contraction rules", into stemmer override token filter. Contraction rules are handy into translating multiple inflected words into the same root form. One side effect of this change is that it brings stemmer override rules format closer to synonym rules format so that it makes it easier to translate one into another. This change also makes stemmer override rules parser more strict so that it should catch more errors which were previously accepted.
32c3fc5
to
d9a882a
Compare
@elasticmachine test this please |
There are some more failures that clearly look unrelated, but I'm merging in master again to see if those got fixed in the meantime. |
@elasticmachine update branch |
@elasticmachine test this please |
CI looks happy, will merge to master and the 7.x branch then. |
…6484) This commit adds support for rules with multiple tokens on LHS, also known as "contraction rules", into stemmer override token filter. Contraction rules are handy into translating multiple inflected words into the same root form. One side effect of this change is that it brings stemmer override rules format closer to synonym rules format so that it makes it easier to translate one into another. This change also makes stemmer override rules parser more strict so that it should catch more errors which were previously accepted. Closes #56113
Support multiple tokens on LHS in stemmer_override rules (#56113)
This commit adds support for rules with multiple tokens on LHS, also
known as "contraction rules", into stemmer override token
filter. Contraction rules are handy into translating multiple
inflected words into the same root form. One side effect of this change is
that it brings stemmer override rules format closer to synonym rules
format so that it makes it easier to translate one into another.
This change also makes stemmer override rules parser more strict so
that it should catch more errors which were previously accepted.
Fixes #56113.