Random access term dictionary #12688
Conversation
…ccessDictionaryPostingsFormat
I'll try to review this soon -- it sounds compelling @Tony-X! I like how it is inspired by Tantivy's term dictionary format (which holds all terms + their metadata in RAM). Also, with the upcoming ability to cleanly limit how much RAM the …
I like this start! I left a few comments.
I'm really curious how big the FST will be if you encode realistic terms. Maybe use the new IndexToFST tool in luceneutil? It pulls all unique terms from a Lucene index, and then builds an FST from them. I wrote it (and used it) for the "limit RAM in FSTCompiler" PR.
I'll also try to review!
Thanks @bruno-roustant! If you're okay with sharing it, feel free to post it here. I'm in the process of baking my own specific implementation (which internally uses a single long as a bit buffer), but I might absorb some interesting ideas from your impl.
motivation: We will need to deal with encoding `IntBlockTermState` for different types of terms. Instead of having a dedicated class for each term type, which would be 8 types in total, we can spell out the individual components of `IntBlockTermState` and then implement a codec which works with a composition of those components. This way we can have a single implementation of the codec and construct the composition (really just an array of components) per term type.
TermStateCodecImpl implements TermStateCodec, which supports encoding a block of IntBlockTermStates and decoding a single entry within that block at a given index.
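As a rough illustration of that composition idea (SimpleTermState, TermStateComponent and ComposedTermStateCodec below are hypothetical names, not the PR's actual classes), each component reads and writes exactly one field of the term state, and a codec is just an ordered list of components:

```java
// Illustrative sketch only -- the class names are hypothetical.
import java.util.List;

final class SimpleTermState {
  long docFreq, totalTermFreq, docStartFP, posStartFP; // a subset of IntBlockTermState's fields
}

// Each component extracts/injects exactly one field of the term state.
interface TermStateComponent {
  long get(SimpleTermState state);
  void set(SimpleTermState state, long value);
}

// One codec implementation; the per-TermType difference is only the list.
final class ComposedTermStateCodec {
  private final List<TermStateComponent> components;

  ComposedTermStateCodec(List<TermStateComponent> components) {
    this.components = components;
  }

  // The real codec would bit-pack these values with per-component bitwidths;
  // here we just flatten them to show the composition.
  long[] encode(SimpleTermState state) {
    long[] out = new long[components.size()];
    for (int i = 0; i < out.length; i++) {
      out[i] = components.get(i).get(state);
    }
    return out;
  }

  void decode(long[] values, SimpleTermState reuse) {
    for (int i = 0; i < values.length; i++) {
      components.get(i).set(reuse, values[i]);
    }
  }
}
```

A term type with no positions, for example, would simply compose a list without the posStartFP component, so one codec implementation covers all 8 term types.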
bump
Looks like this is only used by tests? Maybe move to the tests package? I'm also curious every time I see a new bit packer, as we do this a lot throughout the code. Is there some reuse from another class impl maybe? PackedInts? DataOutput#writeVLong? Do we need it?
Hey Knize, thanks for reviewing.
This is not test-only. It is used by TermStateCodecImpl (and exercised by TestTermStateCodecImpl). I'm in the process of building the real compact bit packer.
I'm also curious every time I see a new bit packer, as we do this a lot throughout the code. Is there some reuse from another class impl maybe? PackedInts? DataOutput#writeVLong? Do we need it?
I did search through the code base and couldn't find something I can use. The goal here is to pack a sequence of values that have different bitwidths. We can't use PackedInts, as it requires all values to have the same bitwidth. We can't use VLong either, since we aim to write fixed-size records so that we can do random access.
More detailed discussion can be found in this email thread: https://lists.apache.org/list?dev@lucene.apache.org:lte=1M:packedInts
I'm in the process of building the real compact bit packer.
+1. I'm assuming that will be added to this PR?
The goal here is to pack a sequence of values that have different bitwidths... We can't use VLong either since we aim to write fixed size record, so that we can do random access
Hmm.. I'll have to look deeper at this. The reason I ask is because I did a similar bit packing w/ "random access" when serializing the ShapeDocValues binary tree, and it feels like we often re-implement this logic in different forms for different use cases. Can we generalize this and lean out our code base to make it more usable and readable?
+1. I'm assuming that will be added to this PR?
Yes! This PR can be large, so I took the advice from @mikemccand to open it early to avoid a massive diff in one shot. The goal of the PR is to have a fully functional PostingsFormat (or Codec).
The reason I ask is because I did a similar bit packing w/ "random access" when serializing the ShapeDocValues binary tree and it feels like we often re-implement this logic in different forms for different use cases. Can we generalize this and lean out our code base to make it more usable and readable?
Not sure if this code does the same thing. I could be wrong, but at a quick glance it seems to me it encodes values with variable length (VInt, VLong). Maybe the random access is achieved in different ways?
Here in this PR, the use case is: I have a bunch of bits in the form of a byte[] that represents a block of records that all have the same size (measured in bits, but the size can be > 64, so we can't use PackedInts). Since they are of the same size, we can randomly access any record by index and read the bits at [index * size, (index + 1) * size).
I do agree that we should seek opportunities to unify. But for now since this is under sandbox, I'll make it specific to this implementation.
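For concreteness, a minimal sketch of that access pattern (the class name and layout are assumptions, not the PR's code): records of a fixed bit width laid out back to back in a byte[], addressed by index:

```java
// Hypothetical sketch: fixed-bit-width records packed back to back in a byte[].
// Record i occupies bits [i * recordBits, (i + 1) * recordBits).
final class BitRecordReader {
  private final byte[] data;
  private final int recordBits; // may exceed 64; callers read fields piecewise

  BitRecordReader(byte[] data, int recordBits) {
    this.data = data;
    this.recordBits = recordBits;
  }

  /** Reads {@code numBits} (up to 64) starting at {@code bitOffset} inside record {@code index}. */
  long read(int index, int bitOffset, int numBits) {
    long start = (long) index * recordBits + bitOffset;
    long value = 0;
    for (int i = 0; i < numBits; i++) { // bit-at-a-time for clarity, not speed
      long bit = start + i;
      int b = (data[(int) (bit >>> 3)] >>> (7 - (int) (bit & 7))) & 1;
      value = (value << 1) | b;
    }
    return value;
  }
}
```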
Since they are of the same size...
That's the difference. In your use case the records (blocks) are guaranteed to be the same size, whereas in the serialized tree case the records (tree nodes) are not guaranteed to be the same size. This is by design, to ensure the resulting docvalue disk consumption is as efficient (small) as possible.
...at a quick glance it seems to me it encodes values with variable length (VInt, VLong). Maybe the random access is achieved in different ways?
Yes to variable-length encoding. The "random-ness" isn't purely random, in that traversal of the serialized tree is DFS. Because the tree nodes are variable size, the serialized array includes copious "book-keeping" in the form of "sizeOf" values. During DFS traversal the first "sizeOf" value provides the size of the entire left tree, so pruning the left tree just means we skip that many bytes to get to the right tree; this continues recursively. In practice we don't expect to ever "back up" in our DFS traversal, so there is only a rewind method that simply resets the offset values to 0.
Seems the two use cases are subtly different, but I could see roughly 80% overlap in the implementation. I'd love to efficiently encapsulate this logic for the next contributor that wants a random serialized traversal mechanism without a ridiculous amount of Java object overhead. Sounds like @bruno-roustant had the same need? Could be a good follow-on PR.
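For contrast with the fixed-width case, here is a hedged sketch of the variable-size DFS layout described above (hypothetical code, not the actual ShapeDocValues implementation), where each node is prefixed with a vlong holding the byte size of its left subtree:

```java
// Hypothetical sketch of the variable-size DFS layout, not the actual
// ShapeDocValues code. Node layout: [vlong sizeOfLeftSubtree][left][right]
// (node payload omitted for brevity).
import java.io.IOException;
import org.apache.lucene.store.ByteArrayDataInput;

final class SerializedTreeReader {
  private final ByteArrayDataInput in;

  SerializedTreeReader(byte[] serialized) {
    this.in = new ByteArrayDataInput(serialized);
  }

  /** Prunes the left subtree: the "sizeOf" book-keeping value tells us how far to skip. */
  void skipLeftSubtree() throws IOException {
    long sizeOfLeft = in.readVLong();
    in.skipBytes(sizeOfLeft); // lands directly on the right subtree
  }

  /** DFS never backs up, so "rewind" just resets the offset to 0. */
  void rewind() {
    in.rewind();
  }
}
```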
Looking good @Tony-X! I'm very curious how the FST behaves in this terms dict ... it should be quite compact compared to our existing "whole terms dict in an FST" impls, because this one just stores the packed long (term type + ordinal) in the FST and derefs out to the detailed metadata for this term.
It seems like you have the low-level encode/decode working? So all that remains is to hook that up with the Codec components that read/write the terms dict ... then you can test the Codec by passing -Dtests.codec=<name> and Lucene will run all test cases using your Codec.
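To make the packed-long idea concrete, here is a hedged sketch (the 3-bit type / 60-bit ord split is an assumption, not necessarily the PR's actual layout; bit 63 stays clear because PositiveIntOutputs only accepts non-negative values). A MAX_ORD bound like the one in the snippet below falls out of this packing:

```java
// Illustrative packing of (term type, ord) into one FST output value.
// The 3/60 bit split is an assumption, not necessarily the PR's layout.
final class PackedTermOutput {
  static final int TERM_TYPE_BITS = 3; // 8 term types
  static final int ORD_BITS = 63 - TERM_TYPE_BITS;
  static final long MAX_ORD = (1L << ORD_BITS) - 1;

  static long pack(int termType, long ord) {
    if (ord > MAX_ORD) {
      throw new IllegalArgumentException("ord " + ord + " too large for TermType " + termType);
    }
    return ((long) termType << ORD_BITS) | ord;
  }

  static int termType(long packed) {
    return (int) (packed >>> ORD_BITS);
  }

  static long ord(long packed) {
    return packed & MAX_ORD;
  }
}
```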
}
if (ord > MAX_ORD) {
  throw new IllegalArgumentException(
      "Input ord is too large for TermType: "
More user-friendly message here too?
+1
Did you mean to spell out the input ord?
Well, maybe something like "Can only index XXX unique terms, but saw YYY terms for field ZZZ" or so? (If that's what this error message really means to the user.)
I see. That's a good point. I'll keep that in mind when implementing the main FieldsConsumer!
Thanks for the tips! Yes, almost there. I'm working on the real compact bit packer and unpacker. I still need to implement the PostingsFormat afterwards. Do you think I need to implement a new Codec?
For those classes: TermType, TermsIndexBuilder
Just realized that we have the Lucene99 Codec out! I'll update the code to reflect that, as this postings format aims to work with the latest Codec.
Also with a concrete implementation based on fixed-size byte[]
Try it with TermsIndexPrimitive and verify basic functionality
+1, nice!
OK this makes sense, and it is a (sad) measure of how slow the emulated (on top of …
This is odd. Though, the …
Hmm maybe pulsing? Are we still inlining single-occurrence terms directly into the terms dict with your new terms dict?
Profiling shows lots of allocations to build a name for such slices
…c/transitions FST nodes have different variants. For non-variable-length encoded nodes we can more efficiently look up a target label. Similarly, for FSAs the TransitionAccessor allows access to a list of [min, max] ranges in order, on which we can perform binary search to advance to the applicable transition for a given target
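A sketch of the binary search that commit message describes, against Lucene's TransitionAccessor API (the TransitionSearch helper is illustrative; the PR's intersect code is more involved since it also advances to the next applicable transition on a miss):

```java
// Sketch of binary search over a state's sorted, non-overlapping [min, max]
// transition ranges. TransitionSearch is illustrative; TransitionAccessor and
// Transition are Lucene's actual automaton APIs.
import org.apache.lucene.util.automaton.Transition;
import org.apache.lucene.util.automaton.TransitionAccessor;

final class TransitionSearch {
  /** Returns the index of the transition whose [min, max] contains {@code label}, or -1. */
  static int find(TransitionAccessor fsa, int state, int label, Transition scratch) {
    int lo = 0;
    int hi = fsa.getNumTransitions(state) - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      fsa.getTransition(state, mid, scratch);
      if (scratch.max < label) {
        lo = mid + 1; // whole range is below the target label
      } else if (scratch.min > label) {
        hi = mid - 1; // whole range is above the target label
      } else {
        return mid; // label falls inside [min, max]
      }
    }
    return -1;
  }
}
```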
Since the first working version, I have iterated with a list of profiling-guided allocation optimizations, as they stood out quite obviously from the merged JFR reports (thanks to luceneutil!). Some of them come from my code that implements the term dictionary data lookup, and a few of them are at a more general Lucene level. I want to highlight the general issues I see from this work, and maybe we can open separate issues to improve them! Here is the first heap profile comparison (search-only, no indexing).
FST doesn't play nicely with primitive types (I know, this is more or less a Java issue)
For this work, I forked the FST class and manually templated it with long just to see how much difference it makes. Here is a diff in heap profile and bench results, before and after.
Bench diff
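To illustrate the boxing problem: with the generic FST, outputs flow through Outputs<Long>, so summing arc outputs along a path allocates a Long per step once values leave the JVM's small-integer cache. A minimal demonstration (not the PR's forked code):

```java
// Minimal demonstration of the boxing cost, not the PR's forked code.
import org.apache.lucene.util.fst.Outputs;
import org.apache.lucene.util.fst.PositiveIntOutputs;

public class FstBoxingDemo {
  public static void main(String[] args) {
    Outputs<Long> outputs = PositiveIntOutputs.getSingleton();
    Long sum = outputs.getNoOutput();
    // Summing outputs along an FST path: each add() autoboxes once the
    // values leave the JVM's Long cache (-128..127), allocating per arc.
    for (long arcOutput = 1_000; arcOutput < 1_010; arcOutput++) {
      sum = outputs.add(sum, arcOutput); // arcOutput boxed; the result is boxed too
    }
    System.out.println(sum);
  }
}
```

A primitive-long fork removes these allocations by specializing the output type instead of going through generics.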
Non-trivial amount of allocations for ... building IndexInput slice descriptions!?
AFAIK, the description is only used for debugging and tracking purposes. I didn't expect it'd cause that much allocation. So I made a change to pass null when building the description, and those allocations are gone.
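The pattern looks roughly like the following hedged sketch (the slice names and helper class are hypothetical; IndexInput.slice takes the description as its first argument):

```java
// Hedged sketch of the pattern; the record/slice names are hypothetical.
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

final class TermDataSlices {
  // Allocation-heavy: builds a fresh description String on every lookup.
  static IndexInput sliceWithName(IndexInput in, long offset, long length, long ord)
      throws IOException {
    return in.slice("term-data-ord=" + ord, offset, length);
  }

  // The fix described above: the description is only for debugging/tracing,
  // so pass null in the hot path and the allocations disappear.
  static IndexInput sliceQuiet(IndexInput in, long offset, long length) throws IOException {
    return in.slice(null, offset, length);
  }
}
```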
Here is the even more interesting stuff. After all those allocation optimizations, I also implemented the on-paper more "efficient" algorithm to intersect the FST and the FSA for Terms.intersect(), which leverages the sorted nature of the FST arcs and FSA transitions from a given state (so at least we can binary search to advance, with some skipping). FSTs in some cases have direct addressing, which is exploited too. As a side note -- it also uncovered a bug in the NFA impl, which is tracked in #12906. But that change is not moving the needle at all for …
I tried to modify the bench task file and only run … (the "My version" and "mainline" profile details are omitted here). So we can see that most of the time is spent in actually reading out the FST arcs and FSA transitions... My intuitive explanation for why this is slower than the blocktree is that it has worse locality in its data access pattern. (@mikemccand maybe you can shed some light.) Here are some relevant factors: …
Just out of curiosity I altered my code to load the FST on-heap, to compare with the default off-heap option. It did not help much with … The PKLookup task is a great proxy for FST performance, as both versions of the code visit the exact same number of Arcs.
These are very interesting results @Tony-X! I'll try to give a deeper response soon, but one idea that jumped out about … Maybe block tree is able to use this opto more effectively than the "all terms in FST" approach? But I think you could implement such an opto too, maybe: just find the one node (or, maybe, more than one) whose suffix is the common suffix, and fast-check somehow?
The … Especially interesting are the off-heap -> on-heap gains for that task. We are somehow paying a high price for going through Lucene's IO APIs instead of …
Thanks @mikemccand for taking a look! I see the getCommonSuffixBytesRef method from … I wonder if it is really applicable to the FST... i.e. for the FST, is it guaranteed that there exists one and only one state that all valid output paths sharing the same suffix go through? Or to put it another way, how many sub-graphs of the FST are there that represent the same suffix?
private final FSTCompiler<Long> fstCompiler;

TermsIndexBuilder() throws IOException {
  fstCompiler =
Just FYI, you can plug the indexOutput directly into the FSTCompiler (passing it from RandomAccessTermsDictWriter) and make the FST writing entirely off-heap (apart from the heap savings, it also eliminates the time taken to write to the on-heap DataOutput). I have some example PRs at: …
Note that the FST metadata and data cannot be written to the same file, but it seems you already separated metaOutput and indexOutput, so that should be fine.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
Description
Related issue #12513
Opening this PR early to avoid massive diffs in one shot
TODO: