LUCENE-9830: Hunspell: store word length for faster dictionary lookup/enumeration #3

Merged
merged 1 commit into from
Mar 15, 2021

Conversation

donnerpeter
Contributor

Description

Word length could be checked before materializing the whole word.

Solution

Use the spare bits in the hash table int and the collision byte to encode the word length when it's short enough (which is almost always the case).
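
For illustration only, here is a minimal sketch of the general bit-packing idea. The class `PackedEntry`, the 27/5 bit split, and the `canSkip` helper are assumptions made up for this example; they are not the layout or API used in the patch.

```java
// Hypothetical layout (NOT the actual Lucene encoding): the low 27 bits of an
// int hold an entry offset, the high 5 bits hold the word length.
// A stored length of 0 means "too long to encode, length unknown".
final class PackedEntry {
  static final int OFFSET_BITS = 27;
  static final int OFFSET_MASK = (1 << OFFSET_BITS) - 1;
  static final int MAX_STORED_LENGTH = (1 << (32 - OFFSET_BITS)) - 1; // 31

  /** Packs an offset and, if it fits into the spare bits, the word length. */
  static int pack(int offset, int wordLength) {
    if (offset < 0 || offset > OFFSET_MASK) {
      throw new IllegalArgumentException("offset out of range: " + offset);
    }
    boolean fits = wordLength >= 1 && wordLength <= MAX_STORED_LENGTH;
    int stored = fits ? wordLength : 0;
    return (stored << OFFSET_BITS) | offset;
  }

  static int offset(int packed) {
    return packed & OFFSET_MASK;
  }

  /** Unsigned shift, so a length that sets the sign bit is still read correctly. */
  static int storedLength(int packed) {
    return packed >>> OFFSET_BITS;
  }

  /** True if the entry can be rejected by length alone, without reading the word. */
  static boolean canSkip(int packed, int minLength, int maxLength) {
    int len = storedLength(packed);
    return len != 0 && (len < minLength || len > maxLength);
  }
}
```

With a layout like this, an entry whose stored length falls outside the requested range is rejected from the int alone; only entries with an unknown length (stored as 0) or a length in range need the word to be materialized.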

Tests

No new tests; 10-15% speedup in TestPerformance.de_suggest.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

@donnerpeter
Contributor Author

Solr references probably should be removed from the checklist above

@@ -65,6 +65,7 @@
TrigramAutomaton automaton = new TrigramAutomaton(word);

dictionary.words.processAllWords(
Math.max(1, word.length() - 4),
Contributor Author

We now pass minLength in addition to maxLength

@@ -63,17 +75,14 @@
* <li>VINT: a delta pointer to the entry for the same word without the last character.
* Precisely, it's the difference of this entry's start and the prefix's entry start. 0 for
* single-character entries
* <li>Optional, for non-leaf entries only:
* <li>(Optional, for hash-colliding entries only)
Contributor Author

@donnerpeter donnerpeter Mar 10, 2021

I've moved collision info before the forms data to avoid skipping over vInts on mismatches
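
To illustrate why the ordering matters, here is a rough sketch under an assumed, simplified entry layout: one collision byte, then a vInt count, then that many vInts. The class `EntryLayoutSketch` and this layout are hypothetical and differ from the real format; the point is only that a mismatch is detected before any variable-length data is decoded.

```java
// Hypothetical entry layout (NOT the actual Lucene format):
//   [collision byte][vInt formCount][formCount vInts of form data]
final class EntryLayoutSketch {
  /** Reads an unsigned variable-length int: 7 bits per byte, low bits first. */
  static int readVInt(byte[] data, int[] pos) {
    int result = 0;
    int shift = 0;
    byte b;
    do {
      b = data[pos[0]++];
      result |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0); // high bit set means another byte follows
    return result;
  }

  /** Returns how many bytes were touched before accepting or rejecting the entry. */
  static int bytesReadToDecide(byte[] entry, byte expectedCollisionByte) {
    int[] pos = {0};
    if (entry[pos[0]++] != expectedCollisionByte) {
      return pos[0]; // mismatch: rejected after one byte, no vInts decoded
    }
    int formCount = readVInt(entry, pos);
    for (int i = 0; i < formCount; i++) {
      readVInt(entry, pos); // form data is decoded only for matching entries
    }
    return pos[0];
  }
}
```

If the collision byte came after the form data instead, every mismatching entry would first have to step over all of its vInts, whose byte lengths are not known in advance.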

@rmuir rmuir merged commit 8913a98 into apache:main Mar 15, 2021
@donnerpeter donnerpeter deleted the storeLen branch March 18, 2021 15:03
mikemccand pushed a commit to mikemccand/lucene that referenced this pull request Sep 3, 2021
mocobeta pushed a commit to mocobeta/lucene that referenced this pull request Dec 2, 2021
Add a utility task to list all existing pacage names
jpountz pushed a commit to jpountz/lucene that referenced this pull request Mar 22, 2024
benwtrent pushed a commit to benwtrent/lucene that referenced this pull request Jul 26, 2024