Add new token filters for Japanese sutegana (捨て仮名) #12915
Conversation
This looks great! Thanks @daixque. I left some small comments.
I do not know Japanese myself, so I'll leave this open for a few days to give others a chance to review. After that I'll merge (lazy consensus).
@@ -0,0 +1,65 @@
package org.apache.lucene.analysis.ja;
Could you please add the standard Apache copyright header, if that's OK with you? Thanks! I think this will also make the GitHub Actions checks (./gradlew check) happy.
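For reference, the standard header used across Lucene's Java sources is the ASF license block:
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */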
I'm happy to do that, thanks!
String term = termAttr.toString();
// Small letter "ㇷ゚" is not a single character (ㇷ U+31F7 plus the combining mark U+309A), so it is converted to "プ" at the String level
term = term.replace("ㇷ゚", "プ");
char[] src = term.toCharArray();
You could instead call term.buffer() to access the source char[] and save creating a few temporary objects.
Thanks, but it will affect the length of the resulting character array and break the tests, so let me keep the current implementation.
Here is an example of a failing test:
term 0 expected:<ちよつと[]> but was:<ちよつと[sTerm�������]>
Expected :ちよつと
Actual :ちよつとsTerm�������
buffer() returns the internal char[] of the CharTermAttribute, which might have more chars than the actual term length. You need to use term.length() as well.
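A minimal sketch of that contract, using the CharTermAttribute accessors directly:
char[] buf = termAttr.buffer();   // internal buffer; its capacity may exceed the term length
int len = termAttr.length();      // only buf[0..len) holds valid chars for this term
for (int i = 0; i < len; i++) {
  // process buf[i]; reading past len picks up stale garbage like the "sTerm..." output above
}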
}
}
String resultTerm = String.copyValueOf(result);
termAttr.setEmpty().append(resultTerm);
You can avoid making a String here by appending the char[] result instead.
I couldn't find an append method signature that accepts char[] (there is one for CharSequence instead).
It seems you can modify the CharTermAttribute directly by accessing buffer(), which returns the internal buffer:
char[] buffer = termAttr.buffer();
buffer[i] = LETTER_MAPPINGS.get(buffer[i]);
This would eliminate all of the copying. I don't know if we are supposed to do that (but the API allows it). Maybe @mikemccand could have some thoughts here.
This would eliminate all of the copying. I don't know if we are supposed to do that (but the API allows it). Maybe @mikemccand could have some thoughts here.

This is indeed the intended usage for high performance -- directly alter that underlying char[] buffer, asking the term attribute to grow if needed, and setting the length when you are done.
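Roughly, that in-place pattern looks like this (a sketch; newLength stands for whatever length the filter computes):
char[] buffer = termAttr.resizeBuffer(newLength); // grows the internal buffer if needed
// ... write the normalized chars into buffer[0..newLength) ...
termAttr.setLength(newLength); // record how many chars are now valid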
From a Japanese perspective, the necessity sounds reasonable. Thank you for the contribution!
Hi @mikemccand and @kojisekig, thank you for your reviews.
String term = termAttr.toString();
char[] src = term.toCharArray();
I think you can iterate through the term attribute directly. These methods require copying the chars, so they might be inefficient:
for (int i = 0; i < termAttr.length(); i++) {
  char c = termAttr.charAt(i);
  // ... look up and replace c ...
}
Thanks, let me do that.
* legal, contract policies, etc.
*/
public final class JapaneseHiraganaUppercaseFilter extends TokenFilter {
private static final Map<Character, Character> s2l;
I think the field should be all-uppercase, as it's a constant?
Also, s2l is a bit cryptic; maybe we could use LETTER_MAPPINGS or something.
Thanks, let me do that.
for (int i = 0; i < src.length; i++) {
Character c = s2l.get(src[i]);
if (c != null) {
result[i] = c;
It seems all small characters are just 1 position ahead of the normal characters, so you can use result[i] = src[i] + 1; and you can use a Set instead of a Map: https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)
It seems all small characters are just 1 position ahead of the normal characters

That's not correct. See ゕ, for example: ゕ (U+3095) + 1 is ゖ (small ke, U+3096), not か (U+304B).
I see, that makes sense. Thank you
* <p>This filter is useful if you want to search against old style Japanese text such as patents,
* legal, contract policies, etc.
*/
public final class JapaneseKatakanaUppercaseFilter extends TokenFilter {
This seems to be mostly the same as the other filter, so maybe we can combine them? E.g., you could pass the mapping as a constructor parameter and provide two constant mappings.
@dungba88 What should the constructor look like? Like this?
public JapaneseKanaUppercaseFilter(TokenStream input, boolean hiragana, boolean katakana)
Note that katakana has an exceptional character, ㇷ゚, so the logic is slightly different from hiragana.
You are right, maybe we can consolidate them with a base class as a follow-up. This LGTM.
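A hypothetical sketch of that consolidation (the names here are illustrative, not code from this PR):
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public abstract class JapaneseKanaUppercaseFilter extends TokenFilter {
  protected final Map<Character, Character> letterMappings;

  protected JapaneseKanaUppercaseFilter(TokenStream input, Map<Character, Character> letterMappings) {
    super(input);
    this.letterMappings = letterMappings;
  }
  // Concrete subclasses supply their mapping (e.g. HIRAGANA_MAPPINGS or KATAKANA_MAPPINGS)
  // and implement incrementToken(); the katakana variant would still special-case "ㇷ゚".
}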
Besides the optimization of manipulating the internal char[] directly, I think this is good to go.
char[] result = new char[src.length];
for (int i = 0; i < src.length; i++) {
Character c = s2l.get(src[i]);
char[] result = new char[termAttr.length()];
I think this could instead be something like:
char[] termBuffer = termAttr.buffer();
int termLength = termAttr.length();
for (int i = 0; i < termLength; i++) {
  Character c = LETTER_MAPPINGS.get(termBuffer[i]);
  if (c != null) {
    termBuffer[i] = c;
  }
}
return true;
I.e. you can just directly manipulate the underlying buffer.
But really this is all just optimizing -- not urgent for the first commit of this awesome contribution. It can be done in follow-on PRs.
@mikemccand @dungba88 Yeah, thanks for your suggestions. I've applied them, so please take a look.
Thank you for the change and optimization. LGTM!
I refactored to apply the same kind of enhancement to the Katakana filter as well.
Looks great @daixque -- would you like to add a CHANGES.txt entry?
@mikemccand This is done. Thanks!
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
@mikemccand @dungba88 Let me ping. Is there anything left for me to do on this PR? If not, could you merge it, or let me know when it will be merged?
I think it's good to go, but I don't have merge permission. Mike should be able to help you; otherwise you can try notifying the dev mailing list as suggested by the bot.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
lucene/CHANGES.txt
Outdated
* GITHUB#12915: Add new token filters for Japanese sutegana (捨て仮名). This introduces JapaneseHiraganaUppercaseFilter
  and JapaneseKatakanaUppercaseFilter. (Dai Sugimori)
We have since released 9.10. Could you add your changes to 9.11 and remove from 9.10?
Thanks @benwtrent, this is done.
### Description

Sutegana (捨て仮名) are the small letters of hiragana and katakana in Japanese. In old Japanese text, sutegana were not used, unlike in modern text. For example:
- "ストップウォッチ" is written as "ストツプウオツチ"
- "ちょっとまって" is written as "ちよつとまつて"

So it's meaningful to normalize sutegana to normal (uppercase) characters when searching against a corpus that includes old Japanese text such as patents, legal documents, contract policies, etc.

This pull request introduces 2 token filters:
- JapaneseHiraganaUppercaseFilter for hiragana
- JapaneseKatakanaUppercaseFilter for katakana

so that users can use either one separately. Each filter converts all sutegana (small characters) into normal kana (uppercase characters) to normalize the token.

### Why it is needed

This transformation must be done as a token filter. There is already [MappingCharFilter](https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html), but if we applied that character filter to normalize sutegana, it would affect tokenization, which is not expected.
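For illustration, a minimal sketch of wiring these filters into an analyzer chain (assuming the analysis-kuromoji module; the tokenizer configuration is just an example):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseHiraganaUppercaseFilter;
import org.apache.lucene.analysis.ja.JapaneseKatakanaUppercaseFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
    TokenStream stream = new JapaneseHiraganaUppercaseFilter(tokenizer);
    stream = new JapaneseKatakanaUppercaseFilter(stream);
    return new TokenStreamComponents(tokenizer, stream);
  }
};
// With this chain, old-style "ちよつとまつて" and modern "ちょっとまって" produce the same normalized tokens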
… in kuromoji analysis plugin (#106553): This adds support for `hiragana_uppercase` and `katakana_uppercase` provided in the new Lucene release. Sutegana (捨て仮名) are small letters of hiragana and katakana in Japanese. In old Japanese text, sutegana were not used, unlike in modern text. For example: "ストップウォッチ" is written as "ストツプウオツチ", and "ちょっとまって" is written as "ちよつとまつて". So it's meaningful to normalize sutegana to normal (uppercase) characters when searching against a corpus that includes old Japanese text such as patents, legal documents, and contract policies. Related to: apache/lucene#12915
Cherry-picked from: apache/lucene#12915 by @daixque. Context: Sutegana (捨て仮名) are small letters of hiragana and katakana in Japanese. In old Japanese text, sutegana were not used, unlike in modern texts. For example: "ストップウォッチ" is written as "ストツプウオツチ", and "ちょっとまって" is written as "ちよつとまつて". So it's meaningful to normalize sutegana to normal (uppercase) characters when searching against corpora that include old Japanese texts such as patents, legal documents, and contract policies.
…Sudachi dictionary version (#110)
- Updated to the newer Sudachi dictionary version `20240409`
- Added new token filters for Japanese sutegana (`捨て仮名`)
  - Cherry-picked from: apache/lucene#12915 by @daixque
- Avoid OOM issues during tokenization when the input text is ginormous
  - Cherry-picked from: WorksApplications/elasticsearch-sudachi#132 by @kenmasumitsu
  - Also see WorksApplications/Sudachi#230 and WorksApplications/elasticsearch-sudachi#136