LUCENE-9830: Hunspell: store word length for faster dictionary lookup/enumeration #3

donnerpeter · 2021-03-10T13:51:13Z

Description

Word length could be checked before materializing the whole word.

Solution

Use the spare bits in the hash table int and collision byte to encode word length, if it's short enough (almost always).
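As a rough illustration of the idea (not the actual `WordStorage` layout, which this description does not spell out), the sketch below packs an entry offset and a short word length into a single `int`. The 5-bit length field, the zero sentinel for "length not stored", and all names here are assumptions made for the example.

```java
/** Hypothetical sketch: keep a short word length in the spare low bits of an int slot. */
final class LengthPackingSketch {
  // Assumed split: low 5 bits hold the length, the remaining 27 bits hold the entry offset.
  static final int LENGTH_BITS = 5;
  static final int LENGTH_MASK = (1 << LENGTH_BITS) - 1; // 31
  // Sentinel meaning "the word is too long for the field; read the entry to learn its length".
  static final int UNKNOWN_LENGTH = 0;

  static int pack(int offset, int wordLength) {
    if (offset < 0 || offset >= (1 << (32 - LENGTH_BITS))) {
      throw new IllegalArgumentException("offset does not fit in 27 bits: " + offset);
    }
    int stored = (wordLength >= 1 && wordLength <= LENGTH_MASK) ? wordLength : UNKNOWN_LENGTH;
    return (offset << LENGTH_BITS) | stored;
  }

  static int offset(int packed) {
    return packed >>> LENGTH_BITS; // unsigned shift, so large offsets survive the sign bit
  }

  static int storedLength(int packed) {
    return packed & LENGTH_MASK;
  }

  /** Cheap pre-check: false means the entry cannot match, so the word need not be decoded. */
  static boolean mayMatch(int packed, int queryLength) {
    int stored = storedLength(packed);
    return stored == UNKNOWN_LENGTH || stored == queryLength;
  }
}
```

With such a layout, a lookup can reject most mismatching entries from the packed value alone and only decode the word when `mayMatch` returns true.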

Tests

No new tests, 10-15% speedup in TestPerformance.de_suggest.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the master branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Ref Guide (for Solr changes only).

donnerpeter · 2021-03-10T13:51:31Z

Solr references should probably be removed from the checklist above.

donnerpeter · 2021-03-10T13:55:13Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java

@@ -65,6 +65,7 @@
    TrigramAutomaton automaton = new TrigramAutomaton(word);

    dictionary.words.processAllWords(
+        Math.max(1, word.length() - 4),


We now pass minLength in addition to maxLength
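A minimal sketch of what a length window buys during enumeration, assuming a plain list of strings in place of the real dictionary storage. Only the `Math.max(1, word.length() - 4)` lower bound comes from the hunk above; the class, the method names, and the upper bound passed by the caller are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: skip entries whose length falls outside [minLength, maxLength]. */
final class LengthWindowSketch {
  // Lower bound taken from the diff: at most 4 characters shorter, but never below 1.
  static int minLength(int misspelledLength) {
    return Math.max(1, misspelledLength - 4);
  }

  static List<String> candidates(List<String> dictionary, int minLength, int maxLength) {
    List<String> result = new ArrayList<>();
    for (String entry : dictionary) {
      // String.length() stands in for a stored length that is readable without decoding the word.
      int length = entry.length();
      if (length >= minLength && length <= maxLength) {
        result.add(entry);
      }
    }
    return result;
  }
}
```

For a 7-character misspelling the lower bound is 3, so one- and two-character entries are skipped before any further work is done on them.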

donnerpeter · 2021-03-10T13:56:35Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java

@@ -63,17 +75,14 @@
   *   <li>VINT: a delta pointer to the entry for the same word without the last character.
   *       Precisely, it's the difference of this entry's start and the prefix's entry start. 0 for
   *       single-character entries
-   *   <li>Optional, for non-leaf entries only:
+   *   <li>(Optional, for hash-colliding entries only)


I've moved collision info before the forms data to avoid skipping over vInts on mismatches

LUCENE-9830: Hunspell: store word length for faster dictionary lookup/enumeration

c9d55ef

rmuir merged commit 8913a98 into apache:main Mar 15, 2021

donnerpeter deleted the storeLen branch March 18, 2021 15:03

mikemccand pushed a commit to mikemccand/lucene that referenced this pull request Sep 3, 2021

Correct some of the jdk17-offending javadocs (Lucene apache#3)

dc37a3d

mocobeta pushed a commit to mocobeta/lucene that referenced this pull request Dec 2, 2021

Merge pull request apache#3 from mocobeta/add-task-show-all-packages

fc45e59

Add a utility task to list all existing pacage names

msokolov mentioned this pull request Jun 7, 2022

LUCENE-10577: enable quantization of HNSW vectors to 8 bits #947

Closed

This was referenced Dec 8, 2021

Hunspell: store word length for faster dictionary lookup/enumeration [LUCENE-9830] #10869

Closed

another idea for updatable fields [LUCENE-4272] #5341

Open

Add unsigned packed int impls in oal.util [LUCENE-1990] #3065

Closed

gsmiller mentioned this pull request Oct 12, 2023

Ensure LeafCollector#finish is only called once on the main collector during drill-sideways #12642

Merged

luozhuang mentioned this pull request Jan 3, 2024

NullPointerException in IndexSearcher.search() when searching with SpanfirstQuery and a customized collector #12991

Closed

jpountz pushed a commit to jpountz/lucene that referenced this pull request Mar 22, 2024

Add IOContext.randomAccess (apache#3)

5873d7c

benwtrent pushed a commit to benwtrent/lucene that referenced this pull request Jul 26, 2024

Merge pull request apache#3 from benwtrent/rabitq/add-hnsw

ea0790d

Add HNSW building to search tests
