SortedSet DV Multi Range query #13974

mkhludnev · 2024-11-03T19:54:54Z

Description

Here's a sketch of DV analog of MultiRangeQuery. This draft uses bitset for expanding multiple ranges.
It passes random test, but internals are rough yet.
What do you think about this idea overall?

@atris, may I request your comments as an author of MultiRangeQuery?

fails 2nd pass on -Dtests.seed=C73CAF65600D946E

github-actions · 2024-12-07T00:24:18Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

gsmiller

Thanks @mkhludnev ! I haven't fully combed through this change but did an initial pass and wanted to share some thoughts since I meant to look at this a while ago and it slipped through the cracks. I appreciate you taking this on!

lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java

gsmiller · 2024-12-19T01:26:09Z

lucene/sandbox/src/java/org/apache/lucene/sandbox/search/SortedSetMultiRangeQuery.java

+                      if (ord >= finalMinOrd
+                          && ((finalMatchesAbove < values.getValueCount()
+                                  && ord >= finalMatchesAbove)
+                              || finalMatchingOrdsShifted.get(ord - finalMinOrd))) {


This approach of densely memorizing all ordinals that fall within the ranges has me a bit worried. This could potentially be a quite large bitset. I wonder if we could instead just keep track of all the ranges actually represented in the query and then just check against them for each doc ord? Like what we do in SortedSetDocValuesRangeQuery but keeping tracking of multiple min/max tuples to check against instead of only one? Is there a reason you didn't go with that approach and feel the dense bitset is necessary here?

This was my first though too. However, then I've found DocValuesRewriteMethod where there's the set over ordinals. So, if it's feasible for set of point values, why not do the same for set of ranges?
Perhaps, it needs to set too many 1s for ranges which fully cover all ords? I dont know.
If we think about Numeric or Binary DVs, there are no ords, and we have to search in the tree of ranges. Also, for me it's purposed for queries with more that 1000 ranges, because smaller sets can be handled via plain BooleanQuery.SHOULD.
Anyway, I'm really open for suggestions.

Right. As you point out, there are definitely other situations where we use this bitset approach. My only thought here is that we should be able to use less memory by keeping track of the range min/max values and not needing the dense representation (this isn't really as feasible with DocValuesRewriteMethod since it's encoding a set of unique terms that aren't particularly likely to be contiguous ranges). If we track ranges, we can take advantage of them being sorted and the doc-level ordinals also being sorted. A binary search approach might be a good fit, and we can keep pushing up the lower bound of the search based on the last range that matched (since doc-level ordinals are sorted). I suspect this would still be less efficient than the dense bitset approach, but I bet it wouldn't be that big of a difference.

Without a way to benchmark and test though, it's just a bit of a guess. I'm OK with keeping it the way you have it for now if that's your preference. We can always benchmark alternatives in the future and evolve this if we want. Should we do that for now?

I mostly agree. Thanks for sharing your considerations. Let me clarify your points:

isn't really as feasible with DocValuesRewriteMethod

Why this "set" query can't use a tree of unique values with log(n) access complexity? These queries might be quite close to each other. In the edge case ranges are quite narrow having 1 or 2 ordinals, thus SsDvMultyRange would be the same as TermsInSet. If ranges are wide, it definitely spends more time on setting 1s in the bitset, but maybe it's not lethal due to SIMD intrinsic (who knows) with a gain on search time per doc check.
I agree it's a field for benchmarking and picking optimizations dynamically (later). Also, something like KD-tree might be applicable here.

we can take advantage of them being sorted and the doc-level ordinals also being sorted.

Frankly speaking, I suppose singular values are more common than long sets of values, so, I don't think it get much gain for many users.

Thanks for your attention @gsmiller .
You are right.
resolveOverlaps literally merges overlapping ranges. Why it's necessary? I think just for some pathological queries. Imagine, many distinct ranges covered by the huge one, it may drop covered ranges from the future processing whether it be bitset or a tree set approach. I'm open for any stronger arguments.
Then, createOrdRanges turns bytearrays to ordinals (btw, this step won't be available for NumericDVs and other w/o ords), at here some ranges might become adjacent, not really overlapping, but eligible for merging into bigger ones. I checked the code and comments there, there's no occurrence of overlap only adjacent.

I think just for some pathological queries. Imagine, many distinct ranges covered by the huge one, it may drop covered ranges from the future processing [...]

It makes sense to me that we may want to do some optimizations for pathological cases with many ranges that overlap. I do wonder a bit on the importance of this. My reasoning here is that the cost only hits when creating the bitset, which is an upfront cost and not a per-doc cost. I'm not opposed to pursuing this, but I'm trying to find a way to balance some of the added code complexity. In this case, it looks like you're relying on a couple helper classes (OrdRange and Edge) along with a reasonably complex algorithm for finding all the overlaps (which to be perfectly honest, I haven't dug into, only glanced at).

[...] whether it be bitset or a tree set approach

I'm not sure I understand this point completely. I think what you're saying is that you might want to follow up this PR with another approach that uses a tree set approach instead of a bitset for matching (I believe you had an iteration of this PR that introduced the idea). I wonder if a common solution for merging overlapping ranges might become more important later if we really do have multiple implementations? Maybe this is a candidate for a spin-off issue that gets pursued as a later PR? (This goes back to my motivation of trying to reduce complexity of this initial change until it becomes more obvious we need the added complexity).

[...] this step won't be available for NumericDVs and other w/o ords

Ah yes, that's fair. I think this is a continued discussion of the last point. Maybe we can pursue some merging when we do have multiple implementations that could obviously benefit from something like this? I can an optimization becoming more important in these other situations since it will become a per-doc cost as opposed to an up-front setup cost. Again though, I wonder if this could be tackled as a spin-off issue once we have more implementations?

createOrdRanges [...] at here some ranges might become adjacent, not really overlapping, but eligible for merging into bigger ones. I checked the code and comments there, there's no occurrence of overlap only adjacent.

Are you suggesting that createOrdRanges as you currently have it implemented would not merge ranges that overlap? It looks to me like it would, but I might be missing something. If the ranges are sorted by their min, then I believe this method will merge away all the overlaps (as well as adjacent) wouldn't it? What do you think? Maybe I'm missing some cases or not properly wrapping my head around what you've got implemented here?

Apologies if it feels like I'm pushing back on this too much. I'm not against pursuing these optimizations by any means, I'm just seeing if there's a way to simplify this PR so we can get it merged and follow up with spin-off issues. Please let me know what you think (and thanks for your patience!).

ok. let's drop resolveOverlaps, createOrdRanges. Let's just loop ranges and call seekCeil() twice per range. Does it make sense to find min and max ords attempting to reduce bitset allocation upfront?
If it does should it turn byte array ranges into ord ranges and store it for the second pass filling bitset?

Thanks for your thoughts @mkhludnev. I'd suggest one-of-two options. I'd be happy to move forward with either:

Keep it as simple as possible and drop all range merging, as you outline above (e.g., drop resolveOverlaps and createOrdRanges). Then consider adding these ideas back in as spin-off SIMs after the initial implementation is merged.

Drop resolveOverlaps but keep the logic in createOrdRanges as you currently have it. I like how you have this implemented currently. Seems like a pretty clean way of reducing unnecessary calls to seekCeil and bitset sets by recognizing adjacent or overlapping ranges without adding much code complexity.

Apologies on my above communication. It may not have been as specific as it could have been. The complexity that I'd like to avoid (at least in the initial version of this) is resolveOverlaps. I don't mind createOrdRanges since it is immediately applicable in this implementation and isn't particularly complex to reason about. The resolveOverlaps method seems more future-looking to other implementations, etc., so the value vs. complexity trade-off isn't as clear to me right now.

What do you think? (I'm happy to go with whatever you think will help get this merged and reduce headaches for you since we're discussed plenty at this point and I don't want to slow you down further )

2. Drop resolveOverlaps but keep the logic in createOrdRanges as you currently have i

done. 1751e70

Thanks for looking into @gsmiller !

mkhludnev · 2024-12-19T13:21:48Z

@gsmiller thanks for reviewing. Looking into!

- copied range into

lucene/sandbox/src/java/org/apache/lucene/sandbox/search/DocValuesMultiRangeQuery.java

introduce ordinal ranges renaming

mkhludnev · 2025-01-10T07:51:00Z

open items from my POV:

improve testing via copying TestDocValuesQueries
may DVSkipper limit and reduce the given ranges before matching to limit bitset?
if ranges reaches ordinals boundaries: [0 TO a],[b TO c],...[y TO dv.getValueCount()-1] we can reduce bitset to [a+1 TO y-1], and check boundary ranges explicitly.

mikhail-khludnev and others added 7 commits October 30, 2024 00:36

first failed test. just an api

7fb1b0f

silly bug fxd

0957905

add some pivot randomnes

af41025

horrible impl.

b74dd36

fails 2nd pass on -Dtests.seed=C73CAF65600D946E

it works

e310500

some cleanup

79971e5

some cleanup

cbc588a

mkhludnev mentioned this pull request Nov 6, 2024

[Feature Request] relax max Clauses Count limitation of termS query over IP field opensearch-project/OpenSearch#16200

Closed

mkhludnev mentioned this pull request Nov 16, 2024

Support more than 1024 IP/masks with indexed field opensearch-project/OpenSearch#16391

Merged

3 tasks

mkhludnev added 2 commits November 21, 2024 23:28

sweep

f0b2700

javadoc

3ad7140

mkhludnev marked this pull request as ready for review November 22, 2024 06:35

mkhludnev changed the title ~~DRAFT: SortedSet DV Multi Range query~~ SortedSet DV Multi Range query Nov 22, 2024

github-actions bot added the Stale label Dec 7, 2024

gsmiller reviewed Dec 19, 2024

View reviewed changes

github-actions bot removed the Stale label Dec 20, 2024

mkhludnev added 6 commits December 23, 2024 10:11

in the middle of PR feedback

eb2990a

review in progress

8b2655d

- renamed Builder

4fe35c0

- copied range into

added duel with PointsMultiRange

416f085

tidy

9fcd549

expose only builder with java function interface.

db161d4

mkhludnev commented Dec 27, 2024

View reviewed changes

lucene/sandbox/src/java/org/apache/lucene/sandbox/search/DocValuesMultiRangeQuery.java Outdated Show resolved Hide resolved

mkhludnev added 5 commits January 1, 2025 19:08

remove ranges overlaps with sweepline alg.

25d4420

introduce ordinal ranges renaming

check dvSkipper via index sorter

308dd0d

linter

adaeca2

introduce ordRande tree lookup algm

a4640ea

sweep

eea9e81

mkhludnev added 4 commits January 2, 2025 00:27

import

a9eacea

tudy

d848f77

bike shedding

e63d9ae

remove range tree impl

71bf057

mkhludnev added 2 commits February 1, 2025 16:37

remove sweep line range merge

1751e70

tidy

c56caeb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SortedSet DV Multi Range query #13974

SortedSet DV Multi Range query #13974

mkhludnev commented Nov 3, 2024

github-actions bot commented Dec 7, 2024

gsmiller left a comment

gsmiller Dec 19, 2024

mkhludnev Dec 19, 2024

gsmiller Dec 19, 2024

mkhludnev Dec 24, 2024

mkhludnev Dec 24, 2024

mkhludnev Jan 22, 2025

gsmiller Jan 25, 2025

mkhludnev Jan 25, 2025

gsmiller Jan 27, 2025

mkhludnev Feb 1, 2025

mkhludnev commented Dec 19, 2024

mkhludnev commented Jan 10, 2025

SortedSet DV Multi Range query #13974

Are you sure you want to change the base?

SortedSet DV Multi Range query #13974

Conversation

mkhludnev commented Nov 3, 2024

Description

github-actions bot commented Dec 7, 2024

gsmiller left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mkhludnev commented Dec 19, 2024

mkhludnev commented Jan 10, 2025