Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit #12320

Merged

gsmiller merged 6 commits into apache:main from gsmiller:tis/fsa-builder-only

Jun 2, 2023

Contributor

gsmiller commented May 21, 2023

Description

Adds the ability to build a binary automaton directly for a string union using the Daciuk-Mihov algorithm, and uses it to make the TermInSetQuery#visit implementation a little more efficient. I'm hoping we end up moving to an automaton approach in general for TermInSetQuery (see #12312), but I think this is a good iterative step for now, as suggested by @rmuir / @mikemccand in #12310.

This was referenced May 21, 2023

#12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton #12310

Merged

Can we make DaciukMihovAutomatonBuilder pkg-private? #12321

Closed

mikemccand reviewed

lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java Outdated

+                    final int len = current.length();
+                    while (fromIndex < len) {
+                      int cp = Character.codePointAt(current, fromIndex);
+                      state = state.newState(cp);

Member

mikemccand May 25, 2023

Hmm, I wonder how this is creating a minimal Automaton? It seems to create a new path for every suffix without sharing the common suffixes?

(This is not a problem with this PR but rather a pre-existing issue and likely my not understanding this algorithm!).

Actually, I think this is nearly the same algorithm as the FST Builder, just applied to automaton (no outputs) instead of FST.

Edit: maybe minimizing the "tail" of the automaton happens in convert?

Edit 2: actually, I think we could maybe further optimize this builder to directly build Automaton instead of first creating its intermediate (and more RAM consuming?) automaton representation. Future work :)

Contributor Author

gsmiller May 25, 2023

This is minimizing through the replaceOrRegister method that gets called when "moving on" to a new suffix. Once the common prefix has been found, we can minimize its most recently added transition since it's now immutable (thanks to adding terms in sorted order).

Also, +1 to the idea of building to an Automaton directly instead of going through convert at the end.
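
As an illustration of the "minimize as we go" step described above, here is a minimal, self-contained sketch of incremental construction from sorted input. It is not Lucene's implementation: the class `SortedDawgSketch`, its `State` type, and the helpers `build`, `accepts`, and `countStates` are hypothetical names chosen for this example, and labels are UTF-16 `char` values (so plain `String` ordering matches label ordering) rather than code points or UTF-8 bytes.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative sketch only; this is not Lucene's DaciukMihovAutomatonBuilder. */
final class SortedDawgSketch {

  /** A state whose equality is structural: same final flag, same labels, same targets. */
  static final class State {
    final TreeMap<Character, State> next = new TreeMap<>();
    boolean isFinal;

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof State)) return false;
      State other = (State) o;
      if (isFinal != other.isFinal || next.size() != other.next.size()) return false;
      for (Map.Entry<Character, State> e : next.entrySet()) {
        // Targets are compared by identity: children are canonical before their parent is checked.
        if (other.next.get(e.getKey()) != e.getValue()) return false;
      }
      return true;
    }

    @Override
    public int hashCode() {
      int h = Boolean.hashCode(isFinal);
      for (Map.Entry<Character, State> e : next.entrySet()) {
        h = 31 * (31 * h + e.getKey()) + System.identityHashCode(e.getValue());
      }
      return h;
    }
  }

  private final State root = new State();
  private final Map<State, State> registry = new HashMap<>();

  /** Adds one term. Terms must be added in sorted order (assumption of this sketch). */
  void add(String term) {
    State state = root;
    int pos = 0;
    // Follow the prefix shared with the previously added term.
    while (pos < term.length()) {
      State target = state.next.get(term.charAt(pos));
      if (target == null) break;
      state = target;
      pos++;
    }
    // The previous term's suffix below this point can no longer change, so minimize it now.
    if (!state.next.isEmpty()) replaceOrRegister(state);
    // Append the new suffix as a fresh chain of states.
    for (; pos < term.length(); pos++) {
      State created = new State();
      state.next.put(term.charAt(pos), created);
      state = created;
    }
    state.isFinal = true;
  }

  /** Minimizes the last added term's path and returns the root. */
  State finish() {
    if (!root.next.isEmpty()) replaceOrRegister(root);
    return root;
  }

  /** Replaces the most recently added child with an equivalent registered state, if any. */
  private void replaceOrRegister(State state) {
    Map.Entry<Character, State> last = state.next.lastEntry();
    State child = last.getValue();
    if (!child.next.isEmpty()) replaceOrRegister(child);
    State canonical = registry.get(child);
    if (canonical != null) {
      state.next.put(last.getKey(), canonical);
    } else {
      registry.put(child, child);
    }
  }

  static State build(String... sortedTerms) {
    SortedDawgSketch builder = new SortedDawgSketch();
    for (String term : sortedTerms) builder.add(term);
    return builder.finish();
  }

  static boolean accepts(State root, String s) {
    State state = root;
    for (int i = 0; i < s.length() && state != null; i++) state = state.next.get(s.charAt(i));
    return state != null && state.isFinal;
  }

  /** Counts reachable states by identity. */
  static int countStates(State root) {
    Set<State> seen = Collections.newSetFromMap(new IdentityHashMap<>());
    Deque<State> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      State s = stack.pop();
      if (seen.add(s)) stack.addAll(s.next.values());
    }
    return seen.size();
  }

  public static void main(String[] args) {
    State root = build("tap", "taps", "top", "tops");
    System.out.println(countStates(root) + " states");
  }
}
```

For the four terms in `main`, a plain trie needs 8 states, while this sketch ends up with 5 because the `p`/`ps` suffix is shared between the `ta` and `to` branches. The sorted-input requirement is what makes it safe to freeze a state as soon as the next term diverges from it.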

Contributor Author

gsmiller May 25, 2023

I looked briefly at what it would take to build directly and not convert at the end, and I think it's better tackled as a follow-up. It will require a little bit of work to handle the "minimization as we go" bit without our own intermediate state representation. I'll open a spin-off issue.

lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java Outdated

+                  protected void doAdd(BytesRef current) {
+                    // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+                    // labels.
+                    scratch.copyUTF8Bytes(current);

Member

mikemccand May 25, 2023

It looks like we were already doing this conversion previously? So this change is not adding more cost in the CharsRef case?

Contributor Author

gsmiller May 25, 2023

That's correct.

lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java Outdated

-                    int cp = Character.codePointAt(current, fromIndex);
-                    state = state.newState(cp);
-                    fromIndex += Character.charCount(cp);
+                private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {

Member

mikemccand May 25, 2023

final too?

Contributor Author

gsmiller May 26, 2023

Removed these classes since I was able to make some simplifications based on your other feedback. Thanks!

lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java Outdated

+                      state = state.newState(cp);
+                      fromIndex += Character.charCount(cp);
+                    }
+                    state.is_final = true;

Member

mikemccand May 25, 2023

Do we have a unit test for this class that generates random strings over a smallish alphabet and builds the minimal automaton two ways: once with this builder, and once by building an inefficient automaton with the existing union methods and minimizing it at the end? The test would then assert that the two approaches (simple yet slow, complex but fast) produce identical (isomorphic) automata.

Contributor Author

gsmiller May 25, 2023

Oh, I like that idea!

lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java Outdated

+                        break;
+                      }
+                      int codePoint = Character.codePointAt(currentChars, pos);

Member

mikemccand May 25, 2023

We could also decode the next Unicode code point directly from the UTF-8 bytes, instead of converting up front to a CharsRef? Or maybe just convert to int[] (UnicodeUtil.UTF8toUTF32)?

If we did the former (decode directly from BytesRef) we could perhaps not even make subclasses here and just have a small if in each of the add/addSuffix methods to pull the "next int" (either a UTF-8 unit or Unicode code point) on each loop.

Member

mikemccand May 25, 2023

I don't think Lucene has a BytesRef (UTF8) equivalent of Character.codePointAt and Character.charCount (byteCount), but it's quite trivial to implement ... UTF-8 makes this easy by just looking at the top (sign) bit of each byte to see if the character "continues", I think.
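
For reference, a minimal sketch of what such helpers could look like is below. It is not Lucene code: `Utf8CodePointSketch`, `byteCount`, and `codePointAt` are hypothetical names for this example. In UTF-8 a set top bit only marks a byte as non-ASCII; continuation bytes have the form `10xxxxxx`, and the high bits of the lead byte give the sequence length. Like the approach discussed in this thread, the sketch assumes valid input and does not validate continuation bytes.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; not part of Lucene's UnicodeUtil. Assumes valid UTF-8 input. */
final class Utf8CodePointSketch {

  /** Returns the length in bytes of the UTF-8 sequence that starts with the given lead byte. */
  static int byteCount(byte leadByte) {
    int b = leadByte & 0xFF;
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    // Continuation bytes (10xxxxxx) and 0xF8..0xFF are not valid lead bytes.
    throw new IllegalArgumentException("invalid UTF-8 lead byte: 0x" + Integer.toHexString(b));
  }

  /** Decodes the code point starting at pos; pos must be the start of a character. */
  static int codePointAt(byte[] utf8, int pos) {
    int n = byteCount(utf8[pos]);
    int lead = utf8[pos] & 0xFF;
    // Keep the payload bits of the lead byte: 7, 5, 4, or 3 bits for lengths 1 to 4.
    int cp = (n == 1) ? lead : lead & (0xFF >> (n + 1));
    for (int i = 1; i < n; i++) {
      // Each continuation byte contributes its low 6 bits; truncated input
      // surfaces here as an ArrayIndexOutOfBoundsException.
      cp = (cp << 6) | (utf8[pos + i] & 0x3F);
    }
    return cp;
  }

  public static void main(String[] args) {
    byte[] utf8 = "a\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
    for (int pos = 0; pos < utf8.length; pos += byteCount(utf8[pos])) {
      System.out.println("U+" + Integer.toHexString(codePointAt(utf8, pos)).toUpperCase());
    }
  }
}
```

The loop in `main` plays the role that `Character.codePointAt` plus `Character.charCount` play for UTF-16: decode at `pos`, then advance by the sequence length.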

Contributor Author

gsmiller May 25, 2023

Hmm... yeah good idea. I'll explore this a bit since it could make the implementation simpler. Thanks for the idea!

gsmiller force-pushed the tis/fsa-builder-only branch from c2b042d to a756d80 Compare

May 26, 2023 02:09

Contributor Author

gsmiller commented May 26, 2023 (edited)

Updated based on the prior feedback, except for one outstanding testing suggestion. I'll have a look at that soon. I think the builder logic is much cleaner now between building string/binary automata.

Apologies for squashing the commit history (makes it harder to see the updates). Resolving the class naming conflicts from main was a bit of a hassle with an incremental git history.

gsmiller commented

lucene/CHANGES.txt

		@@ -139,6 +139,9 @@ Improvements

		* GITHUB#12305: Minor cleanup and improvements to DaciukMihovAutomatonBuilder. (Greg Miller)

		* GITHUB#12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit.

Contributor Author

gsmiller May 26, 2023

Kept the old name here since I'm proposing this change for 9.x (and the rename will come in 10)

Member

mikemccand commented May 26, 2023

Resolving the class naming conflicts from main was a bit of a hassle with an incremental git history.

Woops, sorry!

mikemccand reviewed

Member

mikemccand left a comment

Looks great! I left some small comments. Thanks @gsmiller.

lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java Outdated

+                 * class, this assumes valid UTF8 input and <strong>does not perform</strong> full UTF8
+                 * validation.
+                 *
+                 * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is

Member

mikemccand May 26, 2023

I think we may also throw ArrayIndexOutOfBoundsException on a byte[] that is really badly not UTF-8? The utf8CodeLength array is, I think, length 248 (256 - 8). Also, it has a bunch of v entries in it, which I think are invalid UTF-8 first bytes and should throw the IllegalArgumentException.

Maybe either catch the AIOOBE and rethrow as IAE, or, soften the statement to say "throws various exceptions on invalid UTF-8, or, if the provided pos is NOT the start of a Unicode character". I don't think we want to promise we will always detect invalid UTF-8 and throw a clean exception.

Contributor Author

gsmiller May 26, 2023

You're correct that it could AIOOBE on a particularly malformed header byte. I think the v business is OK since the default switch case translates that to IAE, but I agree with your suggestion to make a more general statement that this method may do all sorts of terrible and unexpected things if you feed it invalid UTF-8 (or reference an invalid start position).

lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java Outdated

+                    case 2 -> v = leadByte & 31; // 5 useful bits
+                    case 3 -> v = leadByte & 15; // 4 useful bits
+                    case 4 -> v = leadByte & 7; // 3 useful bits
+                    default -> throw new IllegalArgumentException("invalid utf8");

Member

mikemccand May 26, 2023

Maybe include the Arrays.toString(utf8) and pos in the exception message? Or perhaps just the fragment where the malformed UTF-8 started (utf8[pos:] in Python syntax).

Contributor Author

gsmiller May 26, 2023

How about the header byte that resulted in an illegal parse? I'm a little nervous of including the whole substring of bytes as it has unbounded length and could be a bit unwieldy?

lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java

                   }
-                  return utf32Count;
+                  // TODO: this may read past utf8's limit.

Member

mikemccand May 26, 2023

Ahh yes another AIOOBE case. I think it's fine if we throw whatever exceptions if you pass invalid UTF-8.

lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java Outdated

+                /** Holds a codepoint along with the number of bytes required to represent it in UTF8 */
+                public static final class UTF8CodePoint {
+                  public int codePoint;
+                  public int codePointBytes;

Member

mikemccand May 26, 2023

Maybe rename to numBytes? The codePoint prefix seems redundant.

gsmiller force-pushed the tis/fsa-builder-only branch from 1e85293 to 3d1e852 Compare

May 26, 2023 15:15

Contributor Author

gsmiller commented May 26, 2023

Thanks @mikemccand! Did a pass to address your comments. Much appreciated! I also added some testing around the minimization aspect of the automaton building. I think all feedback has been addressed at this point, but no rush on having another look. Thanks again!

gsmiller force-pushed the tis/fsa-builder-only branch from 774dc11 to 15e3a68 Compare

May 26, 2023 17:24

mikemccand approved these changes

Member

mikemccand left a comment

Thanks @gsmiller -- looks great!

Contributor Author

gsmiller commented Jun 2, 2023

Thanks @mikemccand! Appreciate you making time for this! 🎉

gsmiller added 6 commits

June 2, 2023 09:19


          Add "direct to binary" option for DaciukMihovAutomatonBuilder and use…

… in TermInSetQuery#visit


          plumb through Automata

681f384


          little more cleanup

3163d4d


          test basic minimization

e7ce7e3


          addressing some feedback

84061ee


          fixup rebase

485ebc8

gsmiller force-pushed the tis/fsa-builder-only branch from 15e3a68 to 485ebc8 Compare

June 2, 2023 16:19

gsmiller merged commit 52ace7e into apache:main

gsmiller deleted the tis/fsa-builder-only branch

June 2, 2023 16:34

gsmiller added a commit that referenced this pull request


          Add "direct to binary" option for DaciukMihovAutomatonBuilder and use…

349b458

… it in TermInSetQuery#visit (#12320)

gsmiller added this to the 9.7.0 milestone

hiteshk25 pushed a commit to cowpaths/lucene that referenced this pull request

…dc8ca633e8bcf`) (#20)

* Add next minor version 9.7.0

* Fix SynonymQuery equals implementation (apache#12260)

The term member of TermAndBoost used to be a Term instance and became a
BytesRef with apache#11941, which means its equals impl won't take the field
name into account. The SynonymQuery equals impl needs to be updated
accordingly to take the field into account as well, otherwise synonym
queries with same term and boost across different fields are equal which
is a bug.

* Fix MMapDirectory documentation for Java 20 (apache#12265)

* Don't generate stacktrace in CollectionTerminatedException (apache#12270)

CollectionTerminatedException is always caught and never exposed to users so there's no point in filling
in a stack-trace for it.

* add missing changelog entry for apache#12260

* Add missing author to changelog entry for apache#12220

* Make query timeout members final in ExitableDirectoryReader (apache#12274)

There's a couple of places in the Exitable wrapper classes where
queryTimeout is set within the constructor and never modified. This
commit makes such members final.

* Update javadocs for QueryTimeout (apache#12272)

QueryTimeout was introduced together with ExitableDirectoryReader but is
now also optionally set to the IndexSearcher to wrap the bulk scorer
with a TimeLimitingBulkScorer. Its javadocs need updating.

* Make TimeExceededException members final (apache#12271)

TimeExceededException has three members that are set within its constructor and never modified. They can be made final.

* DOAP changes for release 9.6.0

* Add back-compat indices for 9.6.0

* `ToParentBlockJoinQuery` Explain Support Score Mode (apache#12245) (apache#12283)

* `ToParentBlockJoinQuery` Explain Support Score Mode

---------

Co-authored-by: Marcus <marcuseagan@gmail.com>

* Simplify SliceExecutor and QueueSizeBasedExecutor (apache#12285)

The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden.

* [Backport] GITHUB-11838 Add api to allow concurrent query rewrite (apache#12197)

* GITHUB-11838 Change API to allow concurrent query rewrite (apache#11840)

Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher)

Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>

Backport of apache#11840

Changes from original:
 - Query keeps `rewrite(IndexReader)`, but it is now deprecated
 - VirtualMethod is used to correctly delegate to the overridden methods
 - The changes to `RewriteMethod` type classes are reverted, this increased the backwards compatibility impact. 

------------------------------

### Description
Issue: apache#11838 

#### Updated Proposal
 * Change signature of rewrite to `rewrite(IndexSearcher)`
 * How did I migrate the usage:
   * Use Intellij to do preliminary refactoring for me
   * For test usage, use searcher whenever is available, otherwise create one using `newSearcher(reader)`
   * For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using `new IndexSearcher(reader)`, tried my best to avoid creating it recurrently (Especially in `FieldQuery`)
   * For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher.

* Backport: Concurrent rewrite for KnnVectorQuery (apache#12160) (apache#12288)

* Concurrent rewrite for KnnVectorQuery (apache#12160)


- Reduce overhead of non-concurrent search by preserving original execution
- Improve readability by factoring into separate functions

---------

Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>

* adjusting for backport

---------

Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>

* toposort use iterator to avoid stackoverflow (apache#12286)

Co-authored-by: tangdonghai <tangdonghai@meituan.com>
# Conflicts:
#	lucene/CHANGES.txt

* Fix test to compile with Java 11 after backport of apache#12286

* Update Javadoc for topoSortStates method after apache#12286 (apache#12292)

* Optimize HNSW diversity calculation (apache#12235)

* Minor cleanup and improvements to DaciukMihovAutomatonBuilder (apache#12305)

* GITHUB-12291: Skip blank lines from stopwords list. (apache#12299)

* Wrap Query rewrite backwards layer with AccessController (apache#12308)

* Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (apache#12315)

* Add multi-thread searchability to OnHeapHnswGraph (apache#12257)

* Fix backport error

* [MINOR] Update javadoc in Query class (apache#12233)

- add a few missing full stops
- update wording in the description of Query#equals method

* [Backport] Integrate the Incubating Panama Vector API apache#12311 (apache#12327)

Leverage accelerated vector hardware instructions in Vector Search.

Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Panama Foreign API. This change expands this mechanism to include the Incubating Panama Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously).

Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21.
---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Robert Muir <rmuir@apache.org>

* Parallelize knn query rewrite across slices rather than segments (apache#12325)

The concurrent query rewrite for knn vector query introduced with apache#12160
requests one thread per segment to the executor. To align this with the
IndexSearcher parallel behaviour, we should rather parallelize across
slices. Also, we can reuse the same slice executor instance that the
index searcher already holds, in that way we are using a
QueueSizeBasedExecutor when a thread pool executor is provided.

* Optimize ConjunctionDISI.createConjunction (apache#12328)

This method is showing up as a little hot when profiling some queries.
Almost all the time spent in this method is just burnt on ceremony
around stream indirections that don't inline.
Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min
cost) makes this method far cheaper and easier to read.

* Update changes to be correct with ARM (it is called NEON there)

* GH#12321: Marked DaciukMihovAutomatonBuilder as deprecated (apache#12332)

Preparing to reduce visibility of this class in a future release

* add BitSet.clear() (apache#12268)

# Conflicts:
#	lucene/CHANGES.txt

* Cleanup and update changes and synchronize with 9.x

* Update TestVectorUtilProviders.java (apache#12338)

* Don't generate stacktrace for TimeExceededException (apache#12335)

The exception is package private and never rethrown, we can avoid
generating a stacktrace for it.

* Introduced the Word2VecSynonymFilter (apache#12169)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* Word2VecSynonymFilter constructor null check (apache#12169)

* Use thread-safe search version of HnswGraphSearcher (apache#12246)

Addressing comment received in the PR apache#12246

* Word2VecSynonymProvider to use standard Integer max value for hnsw searches (apache#12235)
We observed this change was not ported previously from main in an old cherry-pick

* Fix searchafter high latency when after value is out of range for segment (apache#12334)

* Make memory fence in `ByteBufferGuard` explicit (apache#12290)

* Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit (apache#12320)

* Add updateDocuments API which accept a query (reopen) (apache#12346)

* GITHUB#11350: Handle backward compatibility when merging segments with different FieldInfo

This commit restores Lucene 9's ability to handle indices created with Lucene 8 where there are discrepancies in FieldInfos, such as different IndexOptions

* [Tessellator] Improve the checks that validate the diagonal between two polygon nodes (apache#12353)

# Conflicts:
#	lucene/CHANGES.txt

* feat: soft delete optimize (apache#12339)

* Better paging when random reads go backwards (apache#12357)

When reading data from outside the buffer, BufferedIndexInput always resets
its buffer to start at the new read position. If we are reading backwards (for example,
using an OffHeapFSTStore for a terms dictionary) then this can have the effect of
re-reading the same data over and over again.

This commit changes BufferedIndexInput to use paging when reading backwards,
so that if we ask for a byte that is before the current buffer, we read a block of data
of bufferSize that ends at the previous buffer start.

Fixes apache#12356

* Work around SecurityManager issues during initialization of vector api (JDK-8309727) (apache#12362)

* Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth (apache#12249)

* Implement MMapDirectory with Java 21 Project Panama Preview API (apache#12294)
Backport incl JDK21 apijar file with java.util.Objects regenerated

* remove relic in apijar folder caused by vector additions

* Speed up IndexedDISI Sparse #AdvanceExactWithinBlock for tiny step advance (apache#12324)

* Add checks in KNNVectorField / KNNVectorQuery to only allow non-null, non-empty and finite vectors (apache#12281)


---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>

* Implement VectorUtilProvider with Java 21 Project Panama Vector API (apache#12363) (apache#12365)

This commit enables the Panama Vector API for Java 21. The version of
VectorUtilPanamaProvider for Java 21 is identical to that of Java 20.
As such, there is no specific 21 version - the Java 20 version will be
loaded from the MRJAR.

* Add CHANGES.txt for apache#12334 Honor after value for skipping documents even if queue is not full for PagingFieldCollector (apache#12368)

Signed-off-by: gashutos <gashutos@amazon.com>

* Move TermAndBoost back to its original location. (apache#12366)

PR apache#12169 accidentally moved the `TermAndBoost` class to a different location,
which would break custom sub-classes of `QueryBuilder`. This commit moves it
back to its original location.

* GITHUB-12252: Add function queries for computing similarity scores between knn vectors (apache#12253)

Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>

* hunspell (minor): reduce allocations when processing compound rules (apache#12316)

(cherry picked from commit a454388)

* hunspell (minor): reduce allocations when reading the dictionary's morphological data (apache#12323)

there can be many entries with morph data, so we'd better avoid compiling and matching regexes and even stream allocation

(cherry picked from commit 4bf1b94)

* TestHunspell: reduce the flakiness probability (apache#12351)

* TestHunspell: reduce the flakiness probability

We need to check how the timeout interacts with custom exception-throwing checkCanceled.
The default timeout seems not enough for some CI agents, so let's increase it.

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
(cherry picked from commit 5b63a18)

* This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (apache#12376)

---------

Signed-off-by: gashutos <gashutos@amazon.com>
Co-authored-by: Alan Woodward <romseygeek@apache.org>
Co-authored-by: Luca Cavanna <javanna@apache.org>
Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Armin Braun <me@obrown.io>
Co-authored-by: Mikhail Khludnev <mkhludnev@users.noreply.github.com>
Co-authored-by: Marcus <marcuseagan@gmail.com>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Kaival Parikh <46070017+kaivalnp@users.noreply.github.com>
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
Co-authored-by: tang donghai <tangdhcs@gmail.com>
Co-authored-by: Patrick Zhai <zhaih@users.noreply.github.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
Co-authored-by: Jerry Chin <metrxqin@gmail.com>
Co-authored-by: Patrick Zhai <zhai7631@gmail.com>
Co-authored-by: Andrey Bozhko <andybozhko@gmail.com>
Co-authored-by: Chris Hegarty <62058229+ChrisHegarty@users.noreply.github.com>
Co-authored-by: Robert Muir <rmuir@apache.org>
Co-authored-by: Jonathan Ellis <jbellis@datastax.com>
Co-authored-by: Daniele Antuzi <daniele.antuzi@gmail.com>
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
Co-authored-by: Chaitanya Gohel <104654647+gashutos@users.noreply.github.com>
Co-authored-by: Petr Portnov | PROgrm_JARvis <pportnov@ozon.ru>
Co-authored-by: Tomas Eduardo Fernandez Lobbe <tflobbe@apache.org>
Co-authored-by: Ignacio Vera <ivera@apache.org>
Co-authored-by: fudongying <30896830+fudongyingluck@users.noreply.github.com>
Co-authored-by: Chris Fournier <chris.fournier@shopify.com>
Co-authored-by: gf2121 <52390227+gf2121@users.noreply.github.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Elia Porciani <e.porciani@sease.io>
Co-authored-by: Peter Gromov <peter@jetbrains.com>

Contributor

jpountz commented Jul 20, 2023

We have had 3 failures of TestStringsToAutomaton on Policeman/Apache Jenkins since this change was merged that we were not getting before, so I wonder if it's related. I opened #12451 earlier today that has a reproducible seed.

Contributor Author

gsmiller commented Jul 20, 2023

Thanks @jpountz. I'll have a look soon.
