-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for inlined user dictionary in Nori #36123
Conversation
This change adds a new option called `user_dictionary_rules` to the Nori a tokenizer`. It can be used to set additional tokenization rules to the Korean tokenizer directly in the settings (instead of using a file). Closes elastic#35842
Pinging @elastic/es-search |
if (settings.get(USER_DICT_PATH_OPTION) != null && settings.get(USER_DICT_RULES_OPTION) != null) { | ||
throw new ElasticsearchException("It is not allowed to use [" + USER_DICT_PATH_OPTION + "] in conjunction" + | ||
" with [" + USER_DICT_RULES_OPTION + "]"); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove empty line. But I don't think its worth changing this if there are no other changes and CI is green, only if anything else needs changing anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then change itself looks good to me but I couldn't get the additional part on the docs to be displayed locally. Maybe you can recheck how this is supposed to work. Left another nit which probably only makes sense to change if the PR needs additions anyway.
@@ -154,6 +154,40 @@ The above `analyze` request returns the following: | |||
|
|||
<1> This is a compound token that spans two positions (`mixed` mode). | |||
|
|||
`user_dictionary_rules`:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason this whole sections doesn't render when I build the docs locally. I played around with it a bit but couldn't get it to work but its probably worth taking another look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks. I forgot to add the end of section (e.g. --
) so the whole section was not displayed. I pushed adcee29 to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jimczi, I took a look at the recent changes and left a few questions for you, mind to take another look?
@@ -48,32 +50,24 @@ public NoriTokenizerFactory(IndexSettings indexSettings, Environment env, String | |||
} | |||
|
|||
public static UserDictionary getUserDictionary(Environment env, Settings settings) { | |||
if (settings.get(USER_DICT_PATH_OPTION) != null && settings.get(USER_DICT_RULES_OPTION) != null) { | |||
throw new ElasticsearchException("It is not allowed to use [" + USER_DICT_PATH_OPTION + "] in conjunction" + | |||
" with [" + USER_DICT_RULES_OPTION + "]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason you decided to not use this check anymore? I cannot find it in the refactored method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch thanks, I restored this check in 5fcfad4
return UserDictionary.open(rulesReader); | ||
} catch (IOException e) { | ||
throw new ElasticsearchException("failed to load nori user dictionary", e); | ||
// check for duplicate terms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does the requirement for checking for duplicates come from? Would it make sense to simply ignore them if they happen or to log it only instead of throwing an error? I can imagine ppl compiling larger dictionaries where duplicates might sneak in over time, maybe this shouldn't block loading the whole analyzer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to align the Kuromoji and Nori regarding duplicates in the user dictionary. The Japanese tokenizer doesn't accept duplicates (#36100) while Nori just ignores them. However I agree that this is not the scope of this pr so I removed the duplicate detection and will open a new pr in a follow up.
for (String line : ruleList) { | ||
String[] split = line.split("\\s+"); | ||
if (terms.add(split[0]) == false) { | ||
throw new IllegalArgumentException("Found duplicate term: [" + split[0] + "] in user dictionary. "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this stays an error or maybe gets converted to a log message, the line number would be a helpful debugging information for the user the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I'll make sure we report the line number in the follow up pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reviews @cbuescher . I pushed a commit to address your latest comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM
Add support for inlined user dictionary in Nori This change adds a new option called `user_dictionary_rules` to the Nori a tokenizer`. It can be used to set additional tokenization rules to the Korean tokenizer directly in the settings (instead of using a file). Closes #35842
I can't thank more @jimczi for your work. This will solve many problems of our customers, especially with Elastic Cloud and ECE. 👍 |
This change adds a new option called
user_dictionary_rules
toNori's tokenizer. It can be used to set additional tokenization rules
to the Korean tokenizer directly in the settings (instead of using a file).
We should do the same for the
kuromoji
tokenizer in a follow up.Closes #35842