-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SortedSet DV Multi Range query #13974
base: main
Are you sure you want to change the base?
Conversation
fails 2nd pass on -Dtests.seed=C73CAF65600D946E
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mkhludnev ! I haven't fully combed through this change but did an initial pass and wanted to share some thoughts since I meant to look at this a while ago and it slipped through the cracks. I appreciate you taking this on!
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java
Outdated
Show resolved
Hide resolved
if (ord >= finalMinOrd | ||
&& ((finalMatchesAbove < values.getValueCount() | ||
&& ord >= finalMatchesAbove) | ||
|| finalMatchingOrdsShifted.get(ord - finalMinOrd))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach of densely memorizing all ordinals that fall within the ranges has me a bit worried. This could potentially be a quite large bitset. I wonder if we could instead just keep track of all the ranges actually represented in the query and then just check against them for each doc ord? Like what we do in SortedSetDocValuesRangeQuery
but keeping tracking of multiple min/max tuples to check against instead of only one? Is there a reason you didn't go with that approach and feel the dense bitset is necessary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was my first though too. However, then I've found DocValuesRewriteMethod where there's the set over ordinals. So, if it's feasible for set of point values, why not do the same for set of ranges?
Perhaps, it needs to set too many 1s for ranges which fully cover all ords? I dont know.
If we think about Numeric or Binary DVs, there are no ords, and we have to search in the tree of ranges. Also, for me it's purposed for queries with more that 1000 ranges, because smaller sets can be handled via plain BooleanQuery.SHOULD.
Anyway, I'm really open for suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. As you point out, there are definitely other situations where we use this bitset approach. My only thought here is that we should be able to use less memory by keeping track of the range min/max values and not needing the dense representation (this isn't really as feasible with DocValuesRewriteMethod
since it's encoding a set of unique terms that aren't particularly likely to be contiguous ranges). If we track ranges, we can take advantage of them being sorted and the doc-level ordinals also being sorted. A binary search approach might be a good fit, and we can keep pushing up the lower bound of the search based on the last range that matched (since doc-level ordinals are sorted). I suspect this would still be less efficient than the dense bitset approach, but I bet it wouldn't be that big of a difference.
Without a way to benchmark and test though, it's just a bit of a guess. I'm OK with keeping it the way you have it for now if that's your preference. We can always benchmark alternatives in the future and evolve this if we want. Should we do that for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly agree. Thanks for sharing your considerations. Let me clarify your points:
isn't really as feasible with DocValuesRewriteMethod
Why this "set" query can't use a tree of unique values with log(n) access complexity? These queries might be quite close to each other. In the edge case ranges are quite narrow having 1 or 2 ordinals, thus SsDvMultyRange would be the same as TermsInSet. If ranges are wide, it definitely spends more time on setting 1s in the bitset, but maybe it's not lethal due to SIMD intrinsic (who knows) with a gain on search time per doc check.
I agree it's a field for benchmarking and picking optimizations dynamically (later). Also, something like KD-tree might be applicable here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can take advantage of them being sorted and the doc-level ordinals also being sorted.
Frankly speaking, I suppose singular values are more common than long sets of values, so, I don't think it get much gain for many users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your attention @gsmiller .
You are right.
resolveOverlaps
literally merges overlapping ranges. Why it's necessary? I think just for some pathological queries. Imagine, many distinct ranges covered by the huge one, it may drop covered ranges from the future processing whether it be bitset or a tree set approach. I'm open for any stronger arguments.
Then, createOrdRanges turns bytearrays to ordinals (btw, this step won't be available for NumericDVs and other w/o ords), at here some ranges might become adjacent, not really overlapping, but eligible for merging into bigger ones. I checked the code and comments there, there's no occurrence of overlap
only adjacent
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think just for some pathological queries. Imagine, many distinct ranges covered by the huge one, it may drop covered ranges from the future processing [...]
It makes sense to me that we may want to do some optimizations for pathological cases with many ranges that overlap. I do wonder a bit on the importance of this. My reasoning here is that the cost only hits when creating the bitset, which is an upfront cost and not a per-doc cost. I'm not opposed to pursuing this, but I'm trying to find a way to balance some of the added code complexity. In this case, it looks like you're relying on a couple helper classes (OrdRange
and Edge
) along with a reasonably complex algorithm for finding all the overlaps (which to be perfectly honest, I haven't dug into, only glanced at).
[...] whether it be bitset or a tree set approach
I'm not sure I understand this point completely. I think what you're saying is that you might want to follow up this PR with another approach that uses a tree set approach instead of a bitset for matching (I believe you had an iteration of this PR that introduced the idea). I wonder if a common solution for merging overlapping ranges might become more important later if we really do have multiple implementations? Maybe this is a candidate for a spin-off issue that gets pursued as a later PR? (This goes back to my motivation of trying to reduce complexity of this initial change until it becomes more obvious we need the added complexity).
[...] this step won't be available for NumericDVs and other w/o ords
Ah yes, that's fair. I think this is a continued discussion of the last point. Maybe we can pursue some merging when we do have multiple implementations that could obviously benefit from something like this? I can an optimization becoming more important in these other situations since it will become a per-doc cost as opposed to an up-front setup cost. Again though, I wonder if this could be tackled as a spin-off issue once we have more implementations?
createOrdRanges [...] at here some ranges might become adjacent, not really overlapping, but eligible for merging into bigger ones. I checked the code and comments there, there's no occurrence of overlap only adjacent.
Are you suggesting that createOrdRanges
as you currently have it implemented would not merge ranges that overlap? It looks to me like it would, but I might be missing something. If the ranges are sorted by their min
, then I believe this method will merge away all the overlaps (as well as adjacent) wouldn't it? What do you think? Maybe I'm missing some cases or not properly wrapping my head around what you've got implemented here?
Apologies if it feels like I'm pushing back on this too much. I'm not against pursuing these optimizations by any means, I'm just seeing if there's a way to simplify this PR so we can get it merged and follow up with spin-off issues. Please let me know what you think (and thanks for your patience!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. let's drop resolveOverlaps
, createOrdRanges
. Let's just loop ranges and call seekCeil()
twice per range. Does it make sense to find min and max ords attempting to reduce bitset allocation upfront?
If it does should it turn byte array ranges into ord ranges and store it for the second pass filling bitset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your thoughts @mkhludnev. I'd suggest one-of-two options. I'd be happy to move forward with either:
- Keep it as simple as possible and drop all range merging, as you outline above (e.g., drop
resolveOverlaps
andcreateOrdRanges
). Then consider adding these ideas back in as spin-off SIMs after the initial implementation is merged. - Drop
resolveOverlaps
but keep the logic increateOrdRanges
as you currently have it. I like how you have this implemented currently. Seems like a pretty clean way of reducing unnecessary calls toseekCeil
and bitset sets by recognizing adjacent or overlapping ranges without adding much code complexity.
Apologies on my above communication. It may not have been as specific as it could have been. The complexity that I'd like to avoid (at least in the initial version of this) is resolveOverlaps
. I don't mind createOrdRanges
since it is immediately applicable in this implementation and isn't particularly complex to reason about. The resolveOverlaps
method seems more future-looking to other implementations, etc., so the value vs. complexity trade-off isn't as clear to me right now.
What do you think? (I'm happy to go with whatever you think will help get this merged and reduce headaches for you since we're discussed plenty at this point and I don't want to slow you down further )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsmiller thanks for reviewing. Looking into! |
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/DocValuesMultiRangeQuery.java
Outdated
Show resolved
Hide resolved
introduce ordinal ranges renaming
open items from my POV:
|
Description
Here's a sketch of DV analog of MultiRangeQuery. This draft uses bitset for expanding multiple ranges.
It passes random test, but internals are rough yet.
What do you think about this idea overall?
@atris, may I request your comments as an author of MultiRangeQuery?