Add support for inlined user dictionary in Nori #36123

Merged on Dec 7, 2018 (12 commits)
34 changes: 34 additions & 0 deletions docs/plugins/analysis-nori.asciidoc
@@ -154,6 +154,40 @@ The above `analyze` request returns the following:

<1> This is a compound token that spans two positions (`mixed` mode).

`user_dictionary_rules`::
Member

For some reason this whole section doesn't render when I build the docs locally. I played around with it a bit but couldn't get it to work, so it's probably worth taking another look.

Contributor Author

Good catch, thanks. I forgot to add the end-of-section delimiter (`--`), so the whole section was not displayed. I pushed adcee29 to fix this.

+
--

You can also inline the rules directly in the tokenizer definition using
the `user_dictionary_rules` option:

[source,js]
--------------------------------------------------
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE

The `nori_tokenizer` sets a number of additional attributes per token that are used by token filters
to modify the stream.
You can view all these additional attributes with the following request:
@@ -29,10 +29,14 @@

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class NoriTokenizerFactory extends AbstractTokenizerFactory {
private static final String USER_DICT_OPTION = "user_dictionary";
private static final String USER_DICT_PATH_OPTION = "user_dictionary";
private static final String USER_DICT_RULES_OPTION = "user_dictionary_rules";

private final UserDictionary userDictionary;
private final KoreanTokenizer.DecompoundMode decompoundMode;
@@ -44,14 +48,32 @@ public NoriTokenizerFactory(IndexSettings indexSettings, Environment env, String
}

public static UserDictionary getUserDictionary(Environment env, Settings settings) {
try (Reader reader = Analysis.getReaderFromFile(env, settings, USER_DICT_OPTION)) {
if (reader == null) {
if (settings.get(USER_DICT_PATH_OPTION) != null && settings.get(USER_DICT_RULES_OPTION) != null) {
throw new ElasticsearchException("It is not allowed to use [" + USER_DICT_PATH_OPTION + "] in conjunction" +
" with [" + USER_DICT_RULES_OPTION + "]");

Member

@cbuescher, Dec 2, 2018

nit: remove empty line. But I don't think it's worth changing this if there are no other changes and CI is green; only if anything else needs changing anyway.

}
String path = settings.get(USER_DICT_PATH_OPTION);
if (path != null) {
try (Reader rulesReader = Analysis.getReaderFromFile(env, settings, USER_DICT_PATH_OPTION)) {
return rulesReader == null ? null : UserDictionary.open(rulesReader);
} catch (IOException e) {
throw new ElasticsearchException("failed to load nori user dictionary", e);
}
} else {
List<String> rulesList = settings.getAsList(USER_DICT_RULES_OPTION, Collections.emptyList(), false);
if (rulesList == null || rulesList.size() == 0) {
return null;
} else {
return UserDictionary.open(reader);
}
} catch (IOException e) {
throw new ElasticsearchException("failed to load nori user dictionary", e);
StringBuilder sb = new StringBuilder();
for (String line : rulesList) {
sb.append(line).append(System.lineSeparator());
}
try (Reader rulesReader = new StringReader(sb.toString())) {
return UserDictionary.open(rulesReader);
} catch (IOException e) {
throw new ElasticsearchException("failed to load nori user dictionary", e);
}
}
}
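
The inline-rules path in this hunk can be illustrated outside Elasticsearch with a small standalone sketch. It assumes only the JDK: `InlineRulesSketch`, `rulesToReader`, `checkExclusive`, and `readLines` are hypothetical names introduced for illustration, and the real factory reads from `Settings` and hands the reader to `UserDictionary.open`, both omitted here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the plugin.
final class InlineRulesSketch {

    // Joins inline rules into a Reader with one rule per line,
    // the same layout a user dictionary file would have.
    static Reader rulesToReader(List<String> rules) {
        StringBuilder sb = new StringBuilder();
        for (String rule : rules) {
            sb.append(rule).append(System.lineSeparator());
        }
        return new StringReader(sb.toString());
    }

    // Rejects a configuration that sets both a dictionary path and inline rules.
    // Assumption: "set" is modelled here as "not null".
    static void checkExclusive(String path, List<String> rules) {
        if (path != null && rules != null) {
            throw new IllegalArgumentException(
                "It is not allowed to use [user_dictionary] in conjunction with [user_dictionary_rules]");
        }
    }

    // Reads the rules back line by line, as a dictionary loader might.
    static List<String> readLines(Reader reader) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList("c++", "C샤프", "세종", "세종시 세종 시");
        checkExclusive(null, rules); // only inline rules are set, so this passes
        System.out.println(readLines(rulesToReader(rules)));
    }
}
```

Building a `Reader` from the list lets both configuration styles share one loading path, which appears to be the motivation for the `StringBuilder` loop in the hunk above.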

@@ -76,6 +76,22 @@ public void testNoriAnalyzer() throws Exception {
}

    public void testNoriAnalyzerUserDict() throws Exception {
        Settings settings = Settings.builder()
            .put("index.analysis.analyzer.my_analyzer.type", "nori")
            .putList("index.analysis.analyzer.my_analyzer.user_dictionary_rules", "c++", "C샤프", "세종", "세종시 세종 시")
            .build();
        TestAnalysis analysis = createTestAnalysis(settings);
        Analyzer analyzer = analysis.indexAnalyzers.get("my_analyzer");
        try (TokenStream stream = analyzer.tokenStream("", "세종시")) {
            assertTokenStreamContents(stream, new String[] {"세종", "시"});
        }

        try (TokenStream stream = analyzer.tokenStream("", "c++world")) {
            assertTokenStreamContents(stream, new String[] {"c++", "world"});
        }
    }
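
The rule `"세종시 세종 시"` in this test reads as a surface form followed by its custom segmentation, which is why `세종시` is expected to tokenize into `세종` and `시`. A rough sketch of that reading is shown here; `UserRuleSketch` and `parse` are hypothetical names, and this is not how Lucene's `UserDictionary` is implemented, only an illustration of the rule layout under the assumption that fields are whitespace-separated.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of one user dictionary rule; not the Lucene implementation.
final class UserRuleSketch {
    final String surface;        // the word as it appears in text
    final List<String> segments; // the tokens it should be split into

    private UserRuleSketch(String surface, List<String> segments) {
        this.surface = surface;
        this.segments = segments;
    }

    // Assumption: the first whitespace-separated field is the surface form and
    // any remaining fields are its segmentation; a lone word maps to itself.
    static UserRuleSketch parse(String rule) {
        String[] parts = rule.trim().split("\\s+");
        if (parts[0].isEmpty()) {
            throw new IllegalArgumentException("empty rule");
        }
        List<String> segments = parts.length == 1
            ? Collections.singletonList(parts[0])
            : Arrays.asList(Arrays.copyOfRange(parts, 1, parts.length));
        return new UserRuleSketch(parts[0], segments);
    }

    public static void main(String[] args) {
        for (String rule : Arrays.asList("c++", "세종", "세종시 세종 시")) {
            UserRuleSketch parsed = parse(rule);
            System.out.println(parsed.surface + " -> " + parsed.segments);
        }
    }
}
```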

public void testNoriAnalyzerUserDictPath() throws Exception {
Settings settings = Settings.builder()
.put("index.analysis.analyzer.my_analyzer.type", "nori")
.put("index.analysis.analyzer.my_analyzer.user_dictionary", "user_dict.txt")