Add new token filters for Japanese sutegana (捨て仮名) #12915
Conversation
This looks great! Thanks @daixque. I left some small comments.
I do not know Japanese myself, so I'll leave this open for a few days to give others a chance to review. After that I'll merge (lazy consensus).
@@ -0,0 +1,65 @@
package org.apache.lucene.analysis.ja;
Could you please add the standard Apache copyright header, if that's OK with you? Thanks! I think this will also make the GitHub Actions checks (./gradlew check) happy.
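For reference, the standard header used across Lucene's Java sources is the ASF license block:
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */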
I'm happy to do that, thanks!
String term = termAttr.toString();
// Small letter "ㇷ゚" is not a single character (ㇷ U+31F7 plus the combining mark U+309A), so it is converted to "プ" at the String level
term = term.replace("ㇷ゚", "プ");
char[] src = term.toCharArray();
You could instead call term.buffer() to access the source char[] and save creating a few temporary objects.
Thanks, but it will affect the length of the resulting character array and break the tests, so let me keep the current implementation.
Here is an example of a failing test:
term 0 expected:<ちよつと[]> but was:<ちよつと[sTerm�������]>
Expected :ちよつと
Actual :ちよつとsTerm�������
buffer() returns the internal char[] of the CharTermAttribute, which might have more chars than the actual term length. You need to use term.length() as well.
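A minimal sketch of that contract, using the CharTermAttribute accessors directly:
char[] buf = termAttr.buffer();   // internal buffer; its capacity may exceed the term length
int len = termAttr.length();      // only buf[0..len) holds valid chars for this term
for (int i = 0; i < len; i++) {
  // process buf[i]; reading past len picks up stale garbage like the "sTerm..." output above
}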
}
}
String resultTerm = String.copyValueOf(result);
termAttr.setEmpty().append(resultTerm);
You can avoid making a String here by appending the char[] result instead.
I couldn't find an append method signature that accepts char[] (there is one for CharSequence instead).
It seems you can modify the CharTermAttribute directly by accessing buffer(), which returns the internal buffer:
char[] buffer = termAttr.buffer();
buffer[i] = LETTER_MAPPINGS.get(buffer[i]);
This would eliminate all of the copying. I don't know if we are supposed to do that (but the API allows it). Maybe @mikemccand could have some thoughts here.
This would eliminate all of the copying. I don't know if we are supposed to do that (but the API allows it). Maybe @mikemccand could have some thoughts here.

This is indeed the intended usage for high performance -- directly alter that underlying char[] buffer, asking the term attribute to grow if needed, and setting the length when you are done.
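Roughly, that in-place pattern looks like this (a sketch; newLength stands for whatever length the filter computes):
char[] buffer = termAttr.resizeBuffer(newLength); // grows the internal buffer if needed
// ... write the normalized chars into buffer[0..newLength) ...
termAttr.setLength(newLength); // record how many chars are now valid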
From a Japanese perspective, the necessity sounds reasonable. Thank you for the contribution!
Hi @mikemccand and @kojisekig, thank you for your reviews.
String term = termAttr.toString();
char[] src = term.toCharArray();
I think you can iterate through the term attribute directly. These methods require copying the chars, so they might be inefficient:
for (int i = 0; i < termAttr.length(); i++) {
  char c = termAttr.charAt(i);
  // ... look up and replace c ...
}
Thanks, let me do that.
* legal, contract policies, etc.
*/
public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
private static final Map<Character, Character> s2l;
I think the field should be all-uppercase, as it's a constant?
Also, s2l is a bit cryptic; maybe we could use LETTER_MAPPINGS or something.
Thanks, let me do that.
for (int i = 0; i < src.length; i++) {
Character c = s2l.get(src[i]);
if (c != null) {
result[i] = c;
It seems all small characters are just 1 position ahead of the normal characters, so you can use result[i] = src[i] + 1; and you can use a Set instead of a Map: https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)
It seems all small characters are just 1 position ahead of the normal characters

That's not correct. See ゕ, for example: ゕ (U+3095) + 1 is ゖ (small ke, U+3096), not か (U+304B).
I see, that makes sense. Thank you
* <p>This filter is useful if you want to search against old style Japanese text such as patents,
* legal, contract policies, etc.
*/
public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
This seems to be mostly the same as the other filter, so maybe we can combine them? E.g., you could pass the mapping as a constructor parameter and provide two constant mappings.
@dungba88 What should the constructor look like? Like this?
public JapaneseKanaUppercaseFilter(TokenStream input, boolean hiragana, boolean katakana)
Note that katakana has an exceptional character, ㇷ゚, so the logic is slightly different from hiragana.
You are right, maybe we can consolidate them with a base class as a follow-up. This LGTM.
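A hypothetical sketch of that consolidation (the names here are illustrative, not code from this PR):
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public abstract class JapaneseKanaUppercaseFilter extends TokenFilter {
  protected final Map<Character, Character> letterMappings;

  protected JapaneseKanaUppercaseFilter(TokenStream input, Map<Character, Character> letterMappings) {
    super(input);
    this.letterMappings = letterMappings;
  }
  // Concrete subclasses supply their mapping (e.g. HIRAGANA_MAPPINGS or KATAKANA_MAPPINGS)
  // and implement incrementToken(); the katakana variant would still special-case "ㇷ゚".
}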
Besides the optimization of manipulating the internal char[] directly, I think this is good to go.
char[] result = new char[src.length];
for (int i = 0; i < src.length; i++) {
Character c = s2l.get(src[i]);
char[] result = new char[termAttr.length()];
I think this could instead be something like:
char[] termBuffer = termAttr.buffer();
int termLength = termAttr.length();
for (int i = 0; i < termLength; i++) {
  Character c = LETTER_MAPPINGS.get(termBuffer[i]);
  if (c != null) {
    termBuffer[i] = c;
  }
}
return true;
I.e. you can just directly manipulate the underlying buffer.
But really this is all just optimizing -- not urgent for the first commit of this awesome contribution. It can be done in follow-on PRs.
@mikemccand @dungba88 Yeah, thanks for your suggestions. I've applied them, so please take a look.
Thank you for the change and optimization. LGTM!
I refactored to apply the same kind of enhancement to the Katakana filter as well.
Looks great @daixque -- would you like to add a CHANGES.txt entry?
@mikemccand This is done. Thanks!
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
@mikemccand @dungba88 Let me ping. Is there anything left for me to do on this PR? If not, could you merge it, or let me know when it will be merged?
I think it's good to go, but I don't have merge permission. Mike should be able to help you; otherwise you can try notifying the dev mailing list as suggested by the bot.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
lucene/CHANGES.txt
Outdated
* GITHUB#12915: Add new token filters for Japanese sutegana (捨て仮名). This introduces JapaneseHiraganaUppercaseFilter
  and JapaneseKatakanaUppercaseFilter. (Dai Sugimori)
We have since released 9.10. Could you add your changes to 9.11 and remove from 9.10?
Thanks @benwtrent, this is done.
### Description

Sutegana (捨て仮名) are the small letters of hiragana and katakana in Japanese. In old Japanese text, sutegana were not used, unlike in modern text. For example:
- "ストップウォッチ" is written as "ストツプウオツチ"
- "ちょっとまって" is written as "ちよつとまつて"

So it's meaningful to normalize sutegana to normal (uppercase) characters when searching against a corpus that includes old Japanese text such as patents, legal documents, contract policies, etc.

This pull request introduces 2 token filters:
- JapaneseHiraganaUppercaseFilter for hiragana
- JapaneseKatakanaUppercaseFilter for katakana

so that users can use either one separately. Each filter converts all sutegana (small characters) into normal kana (uppercase characters) to normalize the token.

### Why it is needed

This transformation must be done as a token filter. There is already [MappingCharFilter](https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html), but if we applied that character filter to normalize sutegana, it would affect tokenization, which is not expected.
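For illustration, a minimal sketch of wiring these filters into an analyzer chain (assuming the analysis-kuromoji module; the tokenizer configuration is just an example):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseHiraganaUppercaseFilter;
import org.apache.lucene.analysis.ja.JapaneseKatakanaUppercaseFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
    TokenStream stream = new JapaneseHiraganaUppercaseFilter(tokenizer);
    stream = new JapaneseKatakanaUppercaseFilter(stream);
    return new TokenStreamComponents(tokenizer, stream);
  }
};
// With this chain, old-style "ちよつとまつて" and modern "ちょっとまって" produce the same normalized tokens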
… in kuromoji analysis plugin (#106553): This adds support for `hiragana_uppercase` and `katakana_uppercase` provided in the new Lucene release. Sutegana (捨て仮名) are small letters of hiragana and katakana in Japanese. In old Japanese text, sutegana were not used, unlike in modern text. For example: "ストップウォッチ" is written as "ストツプウオツチ", and "ちょっとまって" is written as "ちよつとまつて". So it's meaningful to normalize sutegana to normal (uppercase) characters when searching against a corpus that includes old Japanese text such as patents, legal documents, and contract policies. Related to: apache/lucene#12915
Cherry-picked from: apache/lucene#12915 by @daixque. Context: Sutegana (捨て仮名) are small letters of hiragana and katakana in Japanese. In old Japanese text, sutegana were not used, unlike in modern texts. For example: "ストップウォッチ" is written as "ストツプウオツチ", and "ちょっとまって" is written as "ちよつとまつて". So it's meaningful to normalize sutegana to normal (uppercase) characters when searching against corpora that include old Japanese texts such as patents, legal documents, and contract policies.
…Sudachi dictionary version (#110)
- Updated to the newer Sudachi dictionary version `20240409`
- Added new token filters for Japanese sutegana (`捨て仮名`)
  - Cherry-picked from: apache/lucene#12915 by @daixque
- Avoid OOM issues during tokenization when the input text is ginormous
  - Cherry-picked from: WorksApplications/elasticsearch-sudachi#132 by @kenmasumitsu
  - Also see WorksApplications/Sudachi#230 and WorksApplications/elasticsearch-sudachi#136