Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori #740

Merged
merged 23 commits into from
Mar 25, 2022

Conversation

mocobeta
Copy link
Contributor

@mocobeta mocobeta commented Mar 10, 2022

Both Kuromoji and Nori have BinaryDictionary and BinaryDictionaryWriter classes, and there is significant code duplication. This PR unifies them by decoupling language-specific information (or morphological information) from the base dictionary interface.
For details, see https://issues.apache.org/jira/browse/LUCENE-10393

@rmuir
Copy link
Member

rmuir commented Mar 10, 2022

I only looked at the high-level design so far, this seems to be a good approach @mocobeta ! Thank you for tackling it. I think the bottom-up approach is a good one, and splitting out the morphological data into separate interface makes sense to me.

I would suggest reconsidering the name MorphAttributes, mostly because "Attributes" already has a complex meaning within lucene analysis. Some possibilities (not exhaustive list):

  • MorphData
  • DictionaryData

I will do more review and testing, I am digging into it in detail.

@rmuir
Copy link
Member

rmuir commented Mar 10, 2022

I ran ./gradlew regenerate --rerun-tasks on your branch as an additional test and all binary data files were unchanged. So I feel good about correctness!

@mocobeta
Copy link
Contributor Author

I ran ./gradlew regenerate --rerun-tasks on your branch as an additional test and all binary data files were unchanged. So I feel good about correctness!

Thanks, @rmuir for confirming that! This is code refactoring and shouldn't change the outputs of the BinaryDictionaryWriter.

About the interface, MorphData sounds fine to me if Attributes is confusing. I'll rename the classes.

@mocobeta
Copy link
Contributor Author

We could have a common DictionaryBuilder class in analyzers-common but it brings too complex class hierarchy to me. I'd postpone refactoring XXXDictionaryBuilder until we come up with good interfaces or framework for that - it may need public interface changes and is out of the scope of this PR.

@@ -56,9 +59,14 @@ public void putInvokeDefinition(String characterClassName, int invoke, int group
characterDefinition.putInvokeDefinition(characterClassName, invoke, group, length);
}

@Override
// @Override
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it looks the interfaces were still work-in-progress.
I think an abstract method should be added to the base dictionary writer class. 7d43514

import org.apache.lucene.util.IOSupplier;

/** Morphological information for unk dictionary. */
public class UnknownMorphData extends TokenInfoMorphData {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove the public keyword here I think? (Same for the nori equivalent)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yes - they also can be final. b1a0033

@mocobeta
Copy link
Contributor Author

I think I finished what I'd like to make changes in kuromoji and nori.
Now, ConnectionCosts/ConnectionCostsWriter and CharacterDefinition/CharacterDefinitionWriter are also reside in analyzers-common in a very similar way to BinaryDictionary/BinaryDictionaryWriter.

I manually checked the shared code is exactly the same for kuromoji and nori in the current main (in other words, they were copied and have never been changed so far). Also, gradlew regenerate --rerun-tasks does not change the generated dictionary files for me.

import org.apache.lucene.util.ArrayUtil;

/** Writes system dictionary entries */
public class TokenInfoDictionaryEntryWriter extends DictionaryEntryWriter {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also looks like it need not be public? I'm just looking for opportunities to reduce visibility as it really helps simplify the javadocs output on users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes!

@mocobeta
Copy link
Contributor Author

The core dictionary logic is split into two modules (analysis-common and analysis-kuromoji/nori), I manually tested the tokenizers work with Java modules - for now luke app is a quick way to do so. It might be good to have some automated tests. I will try to locate where we should have such tests, maybe in another patch.
I think these changes are module-friendly, but just in case.

@uschindler
Copy link
Contributor

uschindler commented Mar 17, 2022

I like the idea to remove the code duplication and have only one implementation.

On the other hand, if you look at LOC before/after: +1,818 −1,492
We now have 326 lines more, in addition both NORI and Kuromojo hardly depend on code in a different module (analysis-common), which now gets public.

So I feel a bit ambiguous about if it really makes sense to forcefully combine those implementations just because we have an overlap of "about 200 lines" (did not count).

@mocobeta
Copy link
Contributor Author

mocobeta commented Mar 17, 2022

About the increased number of lines, the majority of them are license headers and documentation.
There are added 22 files (to sort out interfaces) and removed 4 files - each license header contains 16 lines so 288 license header lines are added. Another reason for increasing the number of lines is newly introduced interfaces and their Javadocs. I think the substantial amount of code of the implementation classes was reduced by this change, though, I didn't count it.

I think the apparent demerit of this patch is exposing dictionary internals as public interfaces (and kuromoji and nori depend on it). We would have to choose which is better - keep hiding internals and maintain duplicated code, or open up some internals and share them. I myself would prefer the latter approach to ease the development or bug fixes that are common to kuromoji and nori, and prevent further diversifying them.

@uschindler
Copy link
Contributor

I think the apparent demerit of this patch is exposing dictionary internals as public interfaces (and kuromoji and nori depend on it).

This is also my biggest concern, although the additional complexity and additional interfaces also make me sad.

One possibility is to use module system to at least hide the interface for "modern users" (are there any using module system): hide the package :-)

@mocobeta
Copy link
Contributor Author

I just wanted to explain my view about interfaces. From my perspective, I don't think this adds complexities to interfaces.
As I wrote in the Jira issue, there are only two conceptual interfaces.

  • Dictionary: a high-level interface parameterized by a specific MorphAttributes
  • MorphAttributes: a high-level interface that represents morphological information. This is supposed to be extended to hold language-specific details.

Our current kuromoji/nori interfaces mix up "dictionary-lookup" and "language-specific feature", and in theory - they should be decoupled as original mecab library and its various ports do so in order to have only one analysis engine that can "switch" specific dictionaries.

@rmuir
Copy link
Member

rmuir commented Mar 18, 2022

we could do that stuff in another PR. there is enough changes in this PR already I think? And the problem is really a separate, existing problem...

@mocobeta
Copy link
Contributor Author

mocobeta commented Mar 19, 2022

I accidentally committed 67ed016 and reverted it. It merges .util to .dict package, minimize the visibility of the internal classes to package-private, and deletes .util package.
It'd be easy refactoring (with the help of IDE) but many files are affected. I agree that it should be done in another issue/PR.

-> here is the issue. https://issues.apache.org/jira/browse/LUCENE-10475

@mocobeta mocobeta force-pushed the jira/lucene-10393-refine-dictionary-api branch from ffef5d4 to c069ecb Compare March 22, 2022 10:53
@mocobeta
Copy link
Contributor Author

I added test modules analysis/kuromoji.tests and analysis/nori.tests to make sure that both tokenizers correctly load the dictionary resources and work in module-mode. They are tiny tests but it'd be good to have ones for sanity checks.

@mocobeta
Copy link
Contributor Author

mocobeta commented Mar 23, 2022

To me, this is already self-contained and ready to be merged. This is not perfect though, I think it would be a start point to move forward (having flexibly switchable or modularized dictionaries, or unified Tokenizers at some level so that we can simultaneously improve/optimize both of kuromoji and nori; I'll keep going to work on it once this is successfully merged).

I added the CHANGES entry but I'd need approval(s) to merge such a large patch. I understand this perhaps could be a bit controversial, I will keep it open to wait for feedback from others.

@mocobeta mocobeta requested a review from rmuir March 23, 2022 12:06
@mocobeta
Copy link
Contributor Author

Hi @rmuir, I just wanted to ping you to let you know that I requested a review (in github) a little while ago. I don't intend to rush you, thanks.

Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this refactoring. it is a good first step!

@rmuir
Copy link
Member

rmuir commented Mar 25, 2022

also, sorry about the review slowness. i didn't want to just click "approve" without taking another pass through the comments and code. Again, I like the way the concerns were split apart, the explanation you gave about +/- LOC from github is exactly how I feel, too.

The overall algorithm is the same one here for nori and kuromoji, so it is a shame that we have duplicated implementation code (the holy grail will be factoring the actual tokenization logic!). At the same time, different languages have quirks about them and need different encoding/compression to be efficient. Different dictionaries might have quirks, too. It would be great to give all reasonable options compatible with the apache2 license to the user, without forking thousands of lines of Tokenizer code, each time.

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Robert's comment.

@uschindler
Copy link
Contributor

I added test modules analysis/kuromoji.tests and analysis/nori.tests to make sure that both tokenizers correctly load the dictionary resources and work in module-mode. They are tiny tests but it'd be good to have ones for sanity checks.

This is not needed due to the improved TestRandomChains. This one instantiates all analyzer compients from module system.

@uschindler
Copy link
Contributor

As said before, I would not add the extra module tests. We have the integration test for all analyzers, so if tokenizer can be instantiated it is happy. If there is an IOException or similar it would fail.

I don't like the many Gradle modules, but I leave it up to you if you want to keep the tests.

@mocobeta
Copy link
Contributor Author

I added test modules analysis/kuromoji.tests and analysis/nori.tests to make sure that both tokenizers correctly load the dictionary resources and work in module-mode. They are tiny tests but it'd be good to have ones for sanity checks.

This is not needed due to the improved TestRandomChains. This one instantiates all analyzer compients from module system.

I removed the test modules. If we need specific tests for kuromoji or nori at some point in the future, then we will be able to re-add them.

@mocobeta
Copy link
Contributor Author

@rmuir @uschindler
Thank you for your thorough review! I thought it'd take some more time.

I'm merging it now - I think it'd be better to open a follow-up issue to make org.apache.lucene.analysis.morph package visible only to kuromoji and nori.

@mocobeta mocobeta merged commit bd22f19 into apache:main Mar 25, 2022
@mocobeta mocobeta deleted the jira/lucene-10393-refine-dictionary-api branch March 25, 2022 09:44
@uschindler
Copy link
Contributor

FYI, here is this Test (it also checks if all components have factories): https://github.com/apache/lucene/tree/main/lucene/analysis.tests/src/test/org/apache/lucene/analysis/tests

Of course RandomChains won't trigger on every run, because it randomly builds combinations of tokenizers and filters, but a broken Tokenizer Component that can't load its resources will for sure break this test.

@mocobeta
Copy link
Contributor Author

Just a quick note: I opened #772, which slightly improves the encapsulation of the dictionary-related internals. The whole refactoring was done by IDE and I think there wouldn't be big matters to be discussed, I am going to merge it tomorrow.

wjp719 added a commit to wjp719/lucene that referenced this pull request Mar 28, 2022
* main: (38 commits)
  remove obsolete image/description from luke/README.md
  Upgrade to forbiddenapis 3.3 (apache#768)
  LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori (apache#740)
  LUCENE-9651 Update benchmark module docs (apache#759)
  LUCENE-10458: BoundedDocSetIdIterator may supply error count in Weigth#count(LeafReaderContext) when missingValue enables (apache#736)
  LUCENE-10481: FacetsCollector will not request scores if it does not use them (apache#760)
  LUCENE-10477: mention 'call multiple times' in Query.rewrite javadoc (apache#758)
  Add back-compat indices for 9.1.0.
  Synchronize CHANGES.
  LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms to rewrite sufficiently (apache#737)
  Add version 9.1.0.
  DOAP changes for release 9.1.0
  LUCENE-10422: Make errorprone happy
  LUCENE-10478: mark Test4GBStoredFields as @monster (apache#757)
  LUCENE-10422: Read-only monitor implementation (apache#679)
  LUCENE-10473: Make tests a bit faster when running nightly. (apache#754)
  LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat
  LUCENE-9614: Fix rare TestKnnVectorQuery failures
  LUCENE-10472: Fix TestMatchAllDocsQuery#testEarlyTermination (apache#753)
  LUCENE-10418: Move CHANGES to the correct section.
  ...
wjp719 added a commit to wjp719/lucene that referenced this pull request Mar 28, 2022
* main: (52 commits)
  remove obsolete image/description from luke/README.md
  Upgrade to forbiddenapis 3.3 (apache#768)
  LUCENE-10393: Unify binary dictionary and dictionary writer in kuromoji and nori (apache#740)
  LUCENE-9651 Update benchmark module docs (apache#759)
  LUCENE-10458: BoundedDocSetIdIterator may supply error count in Weigth#count(LeafReaderContext) when missingValue enables (apache#736)
  LUCENE-10481: FacetsCollector will not request scores if it does not use them (apache#760)
  LUCENE-10477: mention 'call multiple times' in Query.rewrite javadoc (apache#758)
  Add back-compat indices for 9.1.0.
  Synchronize CHANGES.
  LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms to rewrite sufficiently (apache#737)
  Add version 9.1.0.
  DOAP changes for release 9.1.0
  LUCENE-10422: Make errorprone happy
  LUCENE-10478: mark Test4GBStoredFields as @monster (apache#757)
  LUCENE-10422: Read-only monitor implementation (apache#679)
  LUCENE-10473: Make tests a bit faster when running nightly. (apache#754)
  LUCENE-9905: Fix check in TestPerFieldKnnVectorsFormat#testMergeUsesNewFormat
  LUCENE-9614: Fix rare TestKnnVectorQuery failures
  LUCENE-10472: Fix TestMatchAllDocsQuery#testEarlyTermination (apache#753)
  LUCENE-10418: Move CHANGES to the correct section.
  ...

# Conflicts:
#	lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java
#	lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestIndexSortSortedNumericDocValuesRangeQuery.java
@mikemccand
Copy link
Member

@mocobeta just checking: it looks like this was never backported to 9.x (I hit unexpected merge conflicts while backporting an FST change) -- was that intentional? Were there API breaks that prevented backport maybe? Thanks!

@uschindler
Copy link
Contributor

Yes this was intentional. It breaks API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants