LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori #740

mocobeta · 2022-03-10T11:12:38Z

Both Kuromoji and Nori have BinaryDictionary and BinaryDictionaryWriter classes, and there is significant code duplication. This PR unifies them by decoupling language-specific information (or morphological information) from the base dictionary interface.
For details, see https://issues.apache.org/jira/browse/LUCENE-10393

rmuir · 2022-03-10T13:14:00Z

I have only looked at the high-level design so far, but this seems to be a good approach, @mocobeta! Thank you for tackling it. I think the bottom-up approach is a good one, and splitting out the morphological data into a separate interface makes sense to me.

I would suggest reconsidering the name MorphAttributes, mostly because "Attributes" already has a complex meaning within Lucene analysis. Some possibilities (not an exhaustive list):

MorphData
DictionaryData

I will do more review and testing; I am digging into it in detail.

rmuir · 2022-03-10T13:17:14Z

I ran ./gradlew regenerate --rerun-tasks on your branch as an additional test and all binary data files were unchanged. So I feel good about correctness!

mocobeta · 2022-03-11T08:51:34Z

> I ran ./gradlew regenerate --rerun-tasks on your branch as an additional test and all binary data files were unchanged. So I feel good about correctness!

Thanks, @rmuir for confirming that! This is code refactoring and shouldn't change the outputs of the BinaryDictionaryWriter.
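For readers who want to reproduce the "binary data files were unchanged" check outside of version control, here is a minimal sketch that compares the SHA-256 digests of a file before and after regeneration. The class and method names are hypothetical and not part of the Lucene build; in practice, looking at the working tree's diff after running the regenerate task gives the same information.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: checks whether a regenerated
// binary file is byte-for-byte equivalent to a previously saved copy.
final class BinaryFileCheck {
  // Returns the SHA-256 digest of the file's full contents.
  static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
    return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
  }

  // True if both files have the same content digest.
  static boolean unchanged(Path before, Path after)
      throws IOException, NoSuchAlgorithmException {
    return Arrays.equals(sha256(before), sha256(after));
  }
}
```

A caller would copy the generated data files aside, rerun the regeneration, and call `BinaryFileCheck.unchanged(saved, regenerated)` for each file.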

About the interface, MorphData sounds fine to me if Attributes is confusing. I'll rename the classes.


mocobeta · 2022-03-14T11:31:13Z

We could have a common DictionaryBuilder class in analyzers-common, but to me that makes the class hierarchy too complex. I'd postpone refactoring XXXDictionaryBuilder until we come up with good interfaces or a framework for it; it may need public interface changes and is out of the scope of this PR.

rmuir · 2022-03-16T16:31:01Z

...e/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/util/UnknownDictionaryWriter.java

@@ -56,9 +59,14 @@ public void putInvokeDefinition(String characterClassName, int invoke, int group
    characterDefinition.putInvokeDefinition(characterClassName, invoke, group, length);
  }

-  @Override
+  // @Override


Was this intentional?

mocobeta · 2022-03-17

No, it looks like the interfaces were still work in progress.
I think an abstract method should be added to the base dictionary writer class. 7d43514
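The fix described here (declaring the method on the base writer so that subclasses can keep `@Override`) can be sketched as follows. This is an illustration only: the class names are hypothetical stand-ins, the file name and contents are placeholders, and the real classes in the PR carry much more logic.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical base writer: declaring write(Path) here is what allows
// subclasses to annotate their implementation with @Override.
abstract class BaseDictionaryWriterSketch {
  /** Writes the dictionary data under the given directory. */
  abstract void write(Path baseDir) throws IOException;
}

// Hypothetical subclass standing in for a language-specific writer.
final class UnknownDictionaryWriterSketch extends BaseDictionaryWriterSketch {
  @Override // compiles only because the base class declares write(Path)
  void write(Path baseDir) throws IOException {
    Files.createDirectories(baseDir);
    // Placeholder payload; a real writer would serialize dictionary entries.
    Files.write(baseDir.resolve("unknown.dat"), "placeholder".getBytes(StandardCharsets.UTF_8));
  }
}
```

Without the abstract declaration, the `@Override` annotation fails to compile, which is presumably why it had been commented out in the work-in-progress revision.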


rmuir · 2022-03-17T10:30:27Z

lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/UnknownMorphData.java

+import org.apache.lucene.util.IOSupplier;
+
+/** Morphological information for unk dictionary. */
+public class UnknownMorphData extends TokenInfoMorphData {


We can remove the public keyword here, I think? (Same for the nori equivalent.)

mocobeta · 2022-03-17

Thanks, yes - they can also be final. b1a0033

mocobeta · 2022-03-17T10:56:54Z

I think I have finished the changes I wanted to make in kuromoji and nori.
Now, ConnectionCosts/ConnectionCostsWriter and CharacterDefinition/CharacterDefinitionWriter also reside in analyzers-common, in a very similar way to BinaryDictionary/BinaryDictionaryWriter.

I manually checked that the shared code is exactly the same for kuromoji and nori in the current main (in other words, it was copied and has never been changed so far). Also, gradlew regenerate --rerun-tasks does not change the generated dictionary files for me.

rmuir · 2022-03-17T11:26:38Z

...sis/kuromoji/src/java/org/apache/lucene/analysis/ja/util/TokenInfoDictionaryEntryWriter.java

+import org.apache.lucene.util.ArrayUtil;
+
+/** Writes system dictionary entries */
+public class TokenInfoDictionaryEntryWriter extends DictionaryEntryWriter {


This also looks like it need not be public? I'm just looking for opportunities to reduce visibility, as it really helps simplify the javadocs output for users.

mocobeta · 2022-03-17T13:19:06Z

The core dictionary logic is now split across two modules (analysis-common and analysis-kuromoji/nori), so I manually tested that the tokenizers work with Java modules; for now, the luke app is a quick way to do so. It might be good to have some automated tests. I will try to locate where we should have such tests, maybe in another patch.
I think these changes are module-friendly, but I checked just in case.

uschindler · 2022-03-17T15:54:25Z

I like the idea to remove the code duplication and have only one implementation.

On the other hand, if you look at LOC before/after: +1,818 −1,492
We now have 326 more lines. In addition, both Nori and Kuromoji now have a hard dependency on code in a different module (analysis-common), which now becomes public.

So I feel a bit ambivalent about whether it really makes sense to forcefully combine those implementations just because we have an overlap of "about 200 lines" (did not count).

mocobeta · 2022-03-17T17:08:22Z

About the increased number of lines, the majority of them are license headers and documentation.
22 files were added (to sort out interfaces) and 4 files were removed; each license header contains 16 lines, so a net 288 license header lines were added. Another reason for the increase is the newly introduced interfaces and their Javadocs. I think the amount of actual code in the implementation classes was substantially reduced by this change, though I didn't count it.

I think the apparent drawback of this patch is that it exposes dictionary internals as public interfaces (and kuromoji and nori depend on them). We would have to choose which is better: keep hiding the internals and maintain duplicated code, or open up some internals and share them. I myself would prefer the latter approach, to ease development and bug fixes that are common to kuromoji and nori, and to prevent them from diverging further.

uschindler · 2022-03-18T09:20:29Z

> I think the apparent demerit of this patch is exposing dictionary internals as public interfaces (and kuromoji and nori depend on it).

This is also my biggest concern, although the additional complexity and additional interfaces also make me sad.

One possibility is to use the module system to at least hide the interface for "modern users" (are there any using the module system?): hide the package :-)

mocobeta · 2022-03-18T09:44:26Z

I just wanted to explain my view about interfaces. From my perspective, I don't think this adds complexities to interfaces.
As I wrote in the Jira issue, there are only two conceptual interfaces.

Dictionary: a high-level interface parameterized by a specific MorphAttributes
MorphAttributes: a high-level interface that represents morphological information. This is supposed to be extended to hold language-specific details.

Our current kuromoji/nori interfaces mix up "dictionary lookup" and "language-specific features". In theory, they should be decoupled, as the original MeCab library and its various ports do, so that a single analysis engine can "switch" between specific dictionaries.
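The two conceptual interfaces described above can be sketched roughly as follows. All names here are hypothetical and chosen for illustration; this is not the API introduced by the PR. The cost formula is a simplified stand-in for what a lattice-based tokenizer would compute.

```java
// Hypothetical base interface: morphological information every dictionary shares.
interface MorphDataSketch {
  int leftId(int wordId);
  int rightId(int wordId);
  int wordCost(int wordId);
}

// Hypothetical language-specific extension, e.g. Japanese entries carry a reading.
interface JaMorphDataSketch extends MorphDataSketch {
  String reading(int wordId);
}

// The dictionary handles lookup only; linguistic details live in the type parameter.
interface DictionarySketch<T extends MorphDataSketch> {
  T morphData();
}

// Language-agnostic engine code depends only on the base interface.
final class CostCalculator {
  // Simplified path cost: the word's own cost plus the connection cost to its neighbor.
  static int pathCost(
      DictionarySketch<? extends MorphDataSketch> dict, int wordId, int connectionCost) {
    return dict.morphData().wordCost(wordId) + connectionCost;
  }
}
```

With this shape, `CostCalculator` never needs to know whether it was handed a Japanese or a Korean dictionary, while language-specific code can still reach `reading(...)` through the narrower type.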

rmuir · 2022-03-18T16:50:00Z

We could do that stuff in another PR. There are enough changes in this PR already, I think? And the problem is really a separate, existing problem...

mocobeta · 2022-03-19T02:04:22Z

I accidentally committed 67ed016 and reverted it. It merges the .util package into the .dict package, reduces the visibility of the internal classes to package-private, and deletes the .util package.
It would be an easy refactoring (with the help of an IDE), but many files are affected. I agree that it should be done in another issue/PR.

-> Here is the issue: https://issues.apache.org/jira/browse/LUCENE-10475

mocobeta · 2022-03-22T11:09:38Z

I added test modules analysis/kuromoji.tests and analysis/nori.tests to make sure that both tokenizers correctly load the dictionary resources and work in module-mode. They are tiny tests but it'd be good to have ones for sanity checks.

mocobeta · 2022-03-23T10:18:58Z

To me, this is already self-contained and ready to be merged. It is not perfect, but I think it would be a starting point for moving forward (having flexibly switchable or modularized dictionaries, or unified Tokenizers at some level, so that we can improve/optimize kuromoji and nori simultaneously; I'll keep working on it once this is successfully merged).

I added the CHANGES entry, but I'd need approval(s) to merge such a large patch. I understand this could be a bit controversial, so I will keep it open and wait for feedback from others.

mocobeta · 2022-03-23T13:25:16Z

Hi @rmuir, I just wanted to ping you to let you know that I requested a review (in github) a little while ago. I don't intend to rush you, thanks.

rmuir

Thanks for this refactoring. It is a good first step!

rmuir · 2022-03-25T00:50:19Z

Also, sorry about the review slowness. I didn't want to just click "approve" without taking another pass through the comments and code. Again, I like the way the concerns were split apart, and the explanation you gave about +/- LOC from GitHub is exactly how I feel, too.

The overall algorithm is the same one here for nori and kuromoji, so it is a shame that we have duplicated implementation code (the holy grail will be factoring the actual tokenization logic!). At the same time, different languages have quirks about them and need different encoding/compression to be efficient. Different dictionaries might have quirks, too. It would be great to give all reasonable options compatible with the apache2 license to the user, without forking thousands of lines of Tokenizer code, each time.

uschindler

I agree with Robert's comment.

uschindler · 2022-03-25T07:34:02Z

> I added test modules analysis/kuromoji.tests and analysis/nori.tests to make sure that both tokenizers correctly load the dictionary resources and work in module-mode. They are tiny tests but it'd be good to have ones for sanity checks.

This is not needed due to the improved TestRandomChains. This one instantiates all analyzer components from the module system.

uschindler · 2022-03-25T07:37:18Z

As said before, I would not add the extra module tests. We have the integration test for all analyzers, so if tokenizer can be instantiated it is happy. If there is an IOException or similar it would fail.

I don't like the many Gradle modules, but I leave it up to you if you want to keep the tests.

mocobeta · 2022-03-25T09:31:02Z

> I added test modules analysis/kuromoji.tests and analysis/nori.tests to make sure that both tokenizers correctly load the dictionary resources and work in module-mode. They are tiny tests but it'd be good to have ones for sanity checks.

> This is not needed due to the improved TestRandomChains. This one instantiates all analyzer components from module system.

I removed the test modules. If we need specific tests for kuromoji or nori at some point in the future, then we will be able to re-add them.

mocobeta · 2022-03-25T09:43:46Z

@rmuir @uschindler
Thank you for your thorough review! I thought it'd take some more time.

I'm merging it now - I think it'd be better to open a follow-up issue to make org.apache.lucene.analysis.morph package visible only to kuromoji and nori.
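As a rough illustration of what such a follow-up could look like, a qualified export in the module descriptor restricts a package to named modules. The package name comes from the comment above; the module names are assumptions made for this sketch, and a real descriptor would contain many other directives.

```java
// module-info.java of the module that contains the shared dictionary code.
// Module names below are assumptions for illustration only.
module org.apache.lucene.analysis.common {
  // Qualified export: only the listed modules can access this package.
  exports org.apache.lucene.analysis.morph to
      org.apache.lucene.analysis.kuromoji,
      org.apache.lucene.analysis.nori;
}
```

This only takes effect for applications running on the module path; classpath users would still see the package.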

uschindler · 2022-03-25T10:26:16Z

FYI, here is this Test (it also checks if all components have factories): https://github.com/apache/lucene/tree/main/lucene/analysis.tests/src/test/org/apache/lucene/analysis/tests

Of course RandomChains won't trigger on every run, because it randomly builds combinations of tokenizers and filters, but a broken Tokenizer Component that can't load its resources will for sure break this test.

mocobeta · 2022-03-28T09:43:47Z

Just a quick note: I opened #772, which slightly improves the encapsulation of the dictionary-related internals. The whole refactoring was done with an IDE, and I don't think there is much to discuss, so I am going to merge it tomorrow.

mikemccand · 2024-02-05T18:33:34Z

@mocobeta just checking: it looks like this was never backported to 9.x (I hit unexpected merge conflicts while backporting an FST change) -- was that intentional? Were there API breaks that prevented backport maybe? Thanks!

uschindler · 2024-02-05T19:11:18Z

Yes, this was intentional. It breaks the API.

mocobeta added 4 commits March 10, 2022 01:23

1e2864e factor out binary dictionary writer/reader for kuromoji
3c0562f apply to nori.
d220198 lint
1863e25 add javadocs

mocobeta added 4 commits March 12, 2022 10:02

8323180 rename MorphAttributes to MorphData
b53cd3c minor refactor on dictionary writer
15e80de lint
4f4a0f2 ConnectionCosts and ConnectionCostsWriter can be shared in a similar way.

rmuir reviewed Mar 16, 2022

mocobeta added 4 commits March 17, 2022 17:36

7d43514 add write(Path) abstract method to BinaryDictionaryWriter
49bf2a5 CharacterDefinition/CharacterDefinitionWriter can be shared in a similar way.
8d0645e small refactoring on codec header constants
6502d63 remove obsolete comments

rmuir reviewed Mar 17, 2022

mocobeta added 2 commits March 17, 2022 19:38

b1a0033 reduce visibility of XXXMorphData; and they can be final
5b23148 add javadocs

rmuir reviewed Mar 17, 2022

mocobeta added 3 commits March 17, 2022 21:13

34cfa68 reduce visibility of DictionaryEntryWriters
38481bc add documentation about ipadic format
4387d3f putEntry() also can be protected

mocobeta added 2 commits March 19, 2022 10:53

67ed016 merge .util package to .dict
6003fae Revert "merge .util package to .dict" (This reverts commit 67ed016.)

c069ecb add module tests for kuromoji and nori

mocobeta force-pushed the jira/lucene-10393-refine-dictionary-api branch from ffef5d4 to c069ecb March 22, 2022 10:53

mocobeta added 2 commits March 23, 2022 18:48

02794d9 add changes entry.
5b09688 Merge branch 'main' into jira/lucene-10393-refine-dictionary-api

mocobeta requested a review from rmuir March 23, 2022 12:06

rmuir approved these changes Mar 25, 2022

uschindler approved these changes Mar 25, 2022

a13bc51 Revert "add module tests for kuromoji and nori" (This reverts commit c069ecb.)

mocobeta merged commit bd22f19 into apache:main Mar 25, 2022

mocobeta deleted the jira/lucene-10393-refine-dictionary-api branch March 25, 2022 09:44

mocobeta mentioned this pull request Apr 29, 2022

LUCENE-10545: Allow to link to github PR from changes #854

Merged

asfimport mentioned this pull request Mar 29, 2022

Reconsider package structure in kuromoji and nori to mininize classes' visibiilty [LUCENE-10475] #11511

Closed
