LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori #740
Conversation
I only looked at the high-level design so far; this seems to be a good approach @mocobeta ! Thank you for tackling it. I think the bottom-up approach is a good one, and splitting out the morphological data into a separate interface makes sense to me. I would suggest reconsidering the name
I will do more review and testing; I am digging into it in detail.
I ran […]
Thanks, @rmuir for confirming that! This is a code refactoring and shouldn't change the outputs of the BinaryDictionaryWriter. About the interface, […]
We could have a common […]
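The interface split discussed above (decoupling the morphological data from the shared dictionary lookup) might look roughly like this. This is a hedged sketch with illustrative names (`MorphData`, `getPartOfSpeech`), not Lucene's actual code:

```java
// Hypothetical sketch only: names are illustrative, not Lucene's API.
interface MorphData {
  // language-specific attribute, looked up by word id
  String getPartOfSpeech(int wordId);
}

// Language-neutral lookup layer shared by kuromoji and nori;
// it only holds and exposes the injected morphological data.
class BinaryDictionary {
  private final MorphData morphData;

  BinaryDictionary(MorphData morphData) {
    this.morphData = morphData;
  }

  MorphData getMorphData() {
    return morphData;
  }
}

// Each language module supplies its own implementation.
class KuromojiMorphData implements MorphData {
  @Override
  public String getPartOfSpeech(int wordId) {
    return "noun"; // stub value for illustration
  }
}
```

With this shape, `new BinaryDictionary(new KuromojiMorphData())` and a nori equivalent would share all of the dictionary-lookup code while keeping the language-specific attributes behind one small interface.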
@@ -56,9 +59,14 @@ public void putInvokeDefinition(String characterClassName, int invoke, int group
    characterDefinition.putInvokeDefinition(characterClassName, invoke, group, length);
  }

-  @Override
+  // @Override
was this intentional?
No, it looks like the interfaces were still work-in-progress.
I think an abstract method should be added to the base dictionary writer class. 7d43514
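The fix referenced in 7d43514 presumably amounts to declaring the method abstract on the shared base writer, so each language-specific subclass is forced to implement it and a genuine `@Override` applies. A minimal sketch (the class and method names here are assumptions, not the exact Lucene code):

```java
import java.io.IOException;
import java.nio.file.Path;

// Sketch: the shared base class declares the operation abstract, so
// forgetting to implement it in a subclass becomes a compile error
// rather than a silently commented-out @Override.
abstract class DictionaryEntryWriter {
  // each dictionary writer must define how its entries are serialized
  protected abstract void writeDictionary(Path baseDir) throws IOException;
}

class TokenInfoDictionaryEntryWriter extends DictionaryEntryWriter {
  @Override
  protected void writeDictionary(Path baseDir) {
    // serialization of TokenInfo entries elided in this sketch
  }
}
```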
import org.apache.lucene.util.IOSupplier;

/** Morphological information for unk dictionary. */
public class UnknownMorphData extends TokenInfoMorphData {
We can remove the public keyword here I think? (Same for the nori equivalent)
Thanks, yes - they can also be final. b1a0033
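The resulting declaration would be along these lines (body elided; a sketch, not the exact committed code - in the real class it extends TokenInfoMorphData):

```java
// package-private and final: not part of the public API surface,
// so it no longer appears in the javadocs users see
final class UnknownMorphData {
  // morphological data for unknown-word entries (elided)
}
```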
I think I finished the changes I'd like to make in kuromoji and nori. I manually checked that the shared code is exactly the same for kuromoji and nori in the current […]
import org.apache.lucene.util.ArrayUtil;

/** Writes system dictionary entries */
public class TokenInfoDictionaryEntryWriter extends DictionaryEntryWriter {
This also looks like it need not be public? I'm just looking for opportunities to reduce visibility, as it really helps simplify the javadocs output for users.
Ah, yes!
The core dictionary logic is split into two modules (analysis-common and analysis-kuromoji/nori). I manually tested that the tokenizers work with Java modules - for now the Luke app is a quick way to do so. It might be good to have some automated tests; I will try to find where such tests should live, maybe in another patch.
I like the idea of removing the code duplication and having only one implementation. On the other hand, if you look at the LOC before/after: +1,818 −1,492. So I feel a bit ambivalent about whether it really makes sense to forcefully combine those implementations just because we have an overlap of "about 200 lines" (did not count).
About the increased number of lines: the majority of them are license headers and documentation. I think the main drawback of this patch is exposing dictionary internals as public interfaces (which kuromoji and nori then depend on). We have to choose which is better - keep hiding the internals and maintain duplicated code, or open up some internals and share them. I myself prefer the latter, to ease development and bug fixes that are common to kuromoji and nori, and to prevent them from diverging further.
This is also my biggest concern, although the additional complexity and additional interfaces also make me sad. One possibility is to use the module system to at least hide the interface from "modern users" (is anyone using the module system?): hide the package :-)
I just wanted to explain my view on the interfaces. From my perspective, I don't think this adds complexity to the interfaces.
Our current kuromoji/nori interfaces mix up "dictionary lookup" and "language-specific features"; in theory they should be decoupled, as the original MeCab library and its various ports do, in order to have only one analysis engine that can "switch" between specific dictionaries.
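The decoupling described here - one analysis engine, pluggable dictionaries, as in MeCab - could be sketched like this. All names (`Dictionary`, `LatticeTokenizer`, `bestCost`) are hypothetical stand-ins, not Lucene's real API, and the "search" is a toy stand-in for the real Viterbi lattice:

```java
import java.util.List;

// Hypothetical sketch: one engine that depends only on a lookup
// contract, so the concrete dictionary (Japanese, Korean, user
// dictionary) can be swapped without forking the tokenizer.
interface Dictionary {
  List<Integer> lookup(String text); // word ids of entries matching text
  int cost(int wordId);              // path cost used by the search
}

final class LatticeTokenizer {
  private final Dictionary dict;

  LatticeTokenizer(Dictionary dict) {
    this.dict = dict;
  }

  // toy stand-in for the Viterbi search: pick the cheapest match
  int bestCost(String text) {
    int best = Integer.MAX_VALUE;
    for (int id : dict.lookup(text)) {
      best = Math.min(best, dict.cost(id));
    }
    return best;
  }
}
```

The same `LatticeTokenizer` could then be constructed with a kuromoji-backed or a nori-backed `Dictionary`, which is the "switchable dictionaries" shape the comment is pointing at.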
we could do that stuff in another PR. there are enough changes in this PR already, I think? And the problem is really a separate, existing problem...
This reverts commit 67ed016.
I accidentally committed 67ed016 and reverted it. Here is the follow-up issue: https://issues.apache.org/jira/browse/LUCENE-10475
Force-pushed from ffef5d4 to c069ecb.
I added test modules.
To me, this is already self-contained and ready to be merged. It is not perfect, though; I think it is a starting point to move forward (having flexibly switchable or modularized dictionaries, or Tokenizers unified at some level, so that we can improve/optimize both kuromoji and nori simultaneously; I'll keep working on it once this is successfully merged). I added the CHANGES entry, but I'd need approval(s) to merge such a large patch. I understand this could perhaps be a bit controversial, so I will keep it open to wait for feedback from others.
Hi @rmuir, I just wanted to ping you to let you know that I requested a review (in GitHub) a little while ago. I don't intend to rush you - thanks.
thanks for this refactoring. it is a good first step!
also, sorry about the review slowness. i didn't want to just click "approve" without taking another pass through the comments and code. Again, I like the way the concerns were split apart; the explanation you gave about the +/- LOC on github is exactly how I feel, too. The overall algorithm is the same for nori and kuromoji, so it is a shame that we have duplicated implementation code (the holy grail will be factoring out the actual tokenization logic!). At the same time, different languages have quirks and need different encoding/compression to be efficient. Different dictionaries might have quirks, too. It would be great to give all reasonable options compatible with the apache2 license to the user, without forking thousands of lines of Tokenizer code each time.
I agree with Robert's comment.
This is not needed thanks to the improved TestRandomChains, which instantiates all analyzer components from the module system.
As said before, I would not add the extra module tests. We have the integration test for all analyzers, so if a tokenizer can be instantiated, it is happy. If there were an IOException or similar, it would fail. I don't like the many Gradle modules, but I leave it up to you whether you want to keep the tests.
This reverts commit c069ecb.
I removed the test modules. If we need specific tests for kuromoji or nori at some point in the future, we will be able to re-add them.
@rmuir @uschindler I'm merging it now - I think it'd be better to open a follow-up issue to make […]
FYI, here is the test (it also checks that all components have factories): https://github.com/apache/lucene/tree/main/lucene/analysis.tests/src/test/org/apache/lucene/analysis/tests Of course RandomChains won't trigger on every run, because it randomly builds combinations of tokenizers and filters, but a broken Tokenizer component that can't load its resources will for sure break this test.
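Mechanically, this kind of integration test boils down to iterating a ServiceLoader: instantiating every registered provider fails fast if one of them cannot load its resources. A generic, self-contained sketch of that mechanism follows - the `ComponentFactory` interface here is a local stand-in, while the real Lucene test iterates the actual analysis factory services contributed by each module:

```java
import java.util.ServiceLoader;

public class ServiceSmokeTest {
  // local stand-in for a factory SPI (the real test uses Lucene's
  // analysis factories, provided by each analysis module)
  public interface ComponentFactory {}

  static int countProviders() {
    int count = 0;
    // instantiating each provider fails fast here if a component
    // cannot load its bundled resources (e.g. a dictionary)
    for (ComponentFactory f : ServiceLoader.load(ComponentFactory.class)) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    // no providers are registered in this self-contained sketch
    System.out.println("instantiated providers: " + countProviders());
  }
}
```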
Just a quick note: I opened #772, which slightly improves the encapsulation of the dictionary-related internals. The whole refactoring was done with the IDE, and I don't think there are big matters to discuss; I am going to merge it tomorrow.
* main: (38 commits)
  remove obsolete image/description from luke/README.md
  Upgrade to forbiddenapis 3.3 (apache#768)
  LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori (apache#740)
  LUCENE-9651 Update benchmark module docs (apache#759)
  LUCENE-10458: BoundedDocSetIdIterator may supply error count in Weigth#count(LeafReaderContext) when missingValue enables (apache#736)
  LUCENE-10481: FacetsCollector will not request scores if it does not use them (apache#760)
  LUCENE-10477: mention 'call multiple times' in Query.rewrite javadoc (apache#758)
  Add back-compat indices for 9.1.0.
  Synchronize CHANGES.
  LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms to rewrite sufficiently (apache#737)
  Add version 9.1.0.
  DOAP changes for release 9.1.0
  LUCENE-10422: Make errorprone happy
  LUCENE-10478: mark Test4GBStoredFields as @monster (apache#757)
  LUCENE-10422: Read-only monitor implementation (apache#679)
  LUCENE-10473: Make tests a bit faster when running nightly. (apache#754)
  LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat
  LUCENE-9614: Fix rare TestKnnVectorQuery failures
  LUCENE-10472: Fix TestMatchAllDocsQuery#testEarlyTermination (apache#753)
  LUCENE-10418: Move CHANGES to the correct section.
  ...
* main: (52 commits, same series as the earlier merge plus later commits)

# Conflicts:
# lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java
# lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestIndexSortSortedNumericDocValuesRangeQuery.java
@mocobeta just checking: it looks like this was never backported to 9.x (I hit unexpected merge conflicts while backporting an FST change) -- was that intentional? Were there API breaks that prevented a backport, maybe? Thanks!
Yes, this was intentional. It breaks the API.
Both Kuromoji and Nori have BinaryDictionary and BinaryDictionaryWriter classes, and there is significant code duplication. This PR unifies them by decoupling language-specific (morphological) information from the base dictionary interface.

For details, see https://issues.apache.org/jira/browse/LUCENE-10393