Add support for hiragana_uppercase & katakana_uppercase token filters in kuromoji analysis plugin #106553

benwtrent · 2024-03-20T14:20:55Z

This adds support for the hiragana_uppercase and katakana_uppercase token filters provided in the new Lucene release.

Sutegana (捨て仮名) are the small forms of hiragana and katakana characters in Japanese. Unlike modern Japanese, old Japanese text does not use sutegana. For example:

"ストップウォッチ" is written as "ストツプウオツチ"
"ちょっとまって" is written as "ちよつとまつて"

So it's useful to normalize sutegana to their full-size (uppercase) counterparts when searching a corpus that includes old Japanese text, such as patents, legal documents, and contract policies.
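
To make the normalization concrete, here is a minimal standalone sketch of the character mapping. It is not the Lucene filter implementation: the class name `SuteganaSketch`, the method `toFullSizeKana`, and the mapping table (only the basic small kana) are assumptions for illustration, and the actual filters may cover more characters.

```java
// Illustrative sketch only; SuteganaSketch and toFullSizeKana are hypothetical names,
// not part of Lucene or Elasticsearch.
class SuteganaSketch {
    // Assumed mapping: basic small hiragana/katakana and their full-size forms,
    // aligned by index (SMALL.charAt(i) maps to FULL.charAt(i)).
    private static final String SMALL = "ぁぃぅぇぉっゃゅょゎァィゥェォッャュョヮ";
    private static final String FULL  = "あいうえおつやゆよわアイウエオツヤユヨワ";

    // Replaces each small kana with its full-size counterpart; other characters pass through.
    static String toFullSizeKana(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int idx = SMALL.indexOf(c);
            out.append(idx >= 0 ? FULL.charAt(idx) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toFullSizeKana("ストップウォッチ")); // prints ストツプウオツチ
        System.out.println(toFullSizeKana("ちょっとまって"));   // prints ちよつとまつて
    }
}
```

Applying the same mapping at both index and search time is what lets a modern-spelling query match the old-spelling text, and vice versa.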

Related to: apache/lucene#12915


elasticsearchmachine · 2024-03-20T14:21:19Z

Pinging @elastic/es-search (Team:Search)

benwtrent · 2024-03-20T14:34:02Z

Builds are failing due to the two new token filters created via: apache/lucene#12915

Since the original idea was to expose those in ES, I decided to do that and fix the failing test.

benwtrent · 2024-03-20T16:03:00Z

The test failure is fixed by: apache/lucene#13195

Will rerun CI once new snapshot is pulled in.


tteofili

LGTM

daixque · 2024-03-21T13:39:05Z

Thanks Ben! LGTM.


Add support for hiragana_uppercase & katakana_uppercase token filters…

901c776

… in kuromoji analysis plugin

benwtrent added the >enhancement, :Search Relevance/Analysis, and v8.14.0 labels Mar 20, 2024

elasticsearchmachine added the Team:Search label Mar 20, 2024

benwtrent requested a review from daixque March 20, 2024 14:21

add changelog entry

0525279

Merge remote-tracking branch 'upstream/lucene_snapshot' into testfixe…

327fba6

…s/lucene_snapshot

tteofili approved these changes Mar 21, 2024


benwtrent merged commit cadd5ab into elastic:lucene_snapshot Mar 21, 2024
14 checks passed

benwtrent deleted the testfixes/lucene_snapshot branch March 21, 2024 14:26

benwtrent removed the v8.14.0 label Apr 11, 2024

daixque added a commit to daixque/elasticsearch that referenced this pull request Aug 29, 2024

[DOCS] Add docs for new Lucene's filters for Japanese text.

14025b2

New filters are:
- hiragana_uppercase
- katakana_uppercase

This is related to:
* elastic#106553

daixque mentioned this pull request Aug 29, 2024

[DOCS] Add docs for new Lucene's filters for Japanese text. #112356

Merged
