-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Made DocIdsWriter use DISI when reading documents with an IntersectVisitor #13149
Conversation
0319617
to
dacb2a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a great optimization! I just had one small concern about advance
.
|
||
@Override | ||
public int advance(int target) throws IOException { | ||
while (index < count && scratch[index] < target) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need to implement this method? Can we throw UOE
instead? It's a bit scary that it's O(N), though, in practice, scratch
is a smallish buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we don't need to - but I don't see why this would be scarier than advancing using the nextDoc
method? But I might be missing something - if you'd like that change, I'll make it. It might be that a DISI is not the right interface for visiting docids within the BKD tree - but I'm guessing changing that would be larger and more controversial.
For large segments the buffer size would almost always be 512 - but there would be a lot of buffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried changing the advance method to just throw an throw new UnsupportedOperationException()
- and tests do pass - I can commit that if you prefer that. I guess we should add some kind of comment of why it is unsupported though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be that a DISI is not the right interface for visiting docids within the BKD tree - but I'm guessing changing that would be larger and more controversial.
Yeah indeed DISI may not be right, but let's not try to fix that here/now? Perhaps there are use cases that might want to call .advance
on the DISI? Maybe some sort of curious combined filtering / points query?
OK you convinced me!: let's leave the method as is. You're right that if such a use case existed today, it is suffering through the linear scan of .visit
calls already, so this is no worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah indeed DISI may not be right, but let's not try to fix that here/now?
Agreed, but I'm tempted to experiment with it in another PR :)
visitor.visit(scratch[i]); | ||
} | ||
scratchDocIdSetIterator.reset(count); | ||
visitor.visit(scratchDocIdSetIterator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@antonha Maybe add a small comment here explaining that we found using a DISI was much faster than visting docs individually?
People usually attach the PR id to the comment in case someone wants to dig deeper. Many a times, I have learnt about why something was implemented a certain way through these comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a comment higher up in the file, next to the ScratchDocIdSetIterator. Would moving that down here and referring to this PR make things better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That comment is good enough. I had not noticed it I guess..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@antonha could you add a CHANGES.txt
entry explaining the optimization? Thanks!
Thanks for the approval! I do however think that it would be good to prove this performance improvement in luceneutil before merging, to make the benchmark more easily reproducible than the benchmarking I have done. I have started looking into it, but will have more time for it next week. Does that sound reasonable? The only downside I see is a later merge of this PR.
Yes, will do once I have run some luceneutil benchmarks (if we want to wait for that) |
Confusingly named, the The geospatial benchmarks (a different benchmark than the normal/nightly Still, given that you saw gains in the "OpenSearch vs Elasticsearch" benchmarks, even if the results are flat with the existing luceneutil benchmarks, I think we should just merge the change. The night after we can watch Lucene's nightly benchmarks and see if the nightly box measured anything. (Hmm, curiously/separately, it looks like something caused a jump in some geo tasks' performance e.g. distance filter ... I'll try to find the cause and add an annotation!). |
I spent some time "proving" this is luceneutil - luceneutil/pull/257 adds a reproduction - if run with The reason that the larger segment is needed is that Lucene otherwise chooses to store document ids in the BKD leaves as int24, meaning that the optimization in this PR does nothing. With the luceneutil changes, I get the following output when comparing this PR to master:
The interesting parts here is in the bottom two lines - IntNRQ and LongNRQ becomes much faster. it might be that I messed up the "minimal" part of reproduction, maybe all that was needed is the single-segment and Regardless, it looks promising - a 94% to 162% increase in QPS for range queries with this PR in the slightly modified benchmark. |
Would we be subject to the same issue if/when 3+ different implementations of |
Yes. Your question makes me think that maybe we should not add a new |
Relatedly indeed, I was wondering if the API should expose an IntsRef or something like that, so that there is a single virtual call per block of doc IDs anyway (IntsRef cannot be extended, it's a single impl). |
@jpountz I had a quick look at the code, and it seems to me like there are, with this PR, only two implementations used for the DISI used for the Do we believe that we can merge this PR and then continue with changing the BKD visit API in a later change, or should we try to change the abstraction in this PR? |
++ on progress over perfection That said, I wonder if this change is legal: |
You are correct - I had (falsely) assumed that the document ids in the DocIdsWriter were written in order - thus the PR as of now ( The good news is that this should not have affected the benchmarks - neither the I will try adding a |
…sitor. Instead of calling IntersectVisitor.visit for each doc in the readDelta16 and readInts32 methods, create a DocIdSetIterator and call IntersectVisitor.visit(DocIdSetIterator) instead. This seems to make Lucene faster at sorting and range querying tasks - the hypothesis being that it is due to fewer virtual calls.
b5fbbf1
to
60495d4
Compare
I've changed the readInts16/34 methods to now use the One thing that I would love input on is whether or not to manually inline the for (int i = ref.offset; i < ref.offset + ref.length; i++) {
result.clear(ref.ints[i]);
cost[0]--;
} While we could do for (int i = ref.offset; i < ref.offset + ref.length; i++) {
visit(ref.ints[i]);
} The latter would, most likely, be inlined by the JVM. The first is manually inlined, so we don't need to trust the JVM on it. Thoughts? It is important to note that just having the default method in |
I benchmarked the current commit (
(the baseline being master) |
@@ -185,6 +186,13 @@ public void visit(DocIdSetIterator iterator) throws IOException { | |||
adder.add(iterator); | |||
} | |||
|
|||
@Override | |||
public void visit(IntsRef ref) { | |||
for (int i = ref.offset; i < ref.offset + ref.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we delegate this to BulkAdder
and take advantage of System#arraycopy
in BufferAdder
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would probably be better. I don't think we should expect huge performance increases from it (since it would only affect narrow range queries). But it would make a large difference if a third implementation of the adder is ever used.
I have limited time this week to look at it - I will try to find some time for it though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I unfortunately did not find time enough for this this week - and since this PR seems to be wrapping up I will leave it out. It looks like a simple change but I got stuck implementing tests (which I feel would be very needed). I will leave this for future improvement!
This is what I meant indeed. Before merging I'd be curious to better understand why the JVM doesn't optimize this better. Presumably, it should be able to resolve the virtual call once for the entire for loop rather than doing it again on every iteration? I wonder if there is actually a performance bug, or if we are just insufficiently warming up the JVM and this for loop never gets compiled through C2? |
Valid questions. I ran the same luceneutils benchmark with I also tried reproducing this behavior in a shorter java program for demonstration purposes - with virtual calls similar to the import java.util.Random;
public class C2Inlining {
static int ITERATIONS = 50_000;
static int NUM_VALUES = 100_000;
public static void main(String[] args) {
//Generate ints
Random r = new Random();
int[] arr = new int[NUM_VALUES];
for (int i = 0; i < NUM_VALUES; i++) {
arr[i] = r.nextInt();
}
IntProcessor[] intProcessors = {
new IntProcessor1(),
new IntProcessor2(),
//Comment this last one out to trigger bimorphic behaviour
new IntProcessor3()
};
processOneByOne(intProcessors, arr);
processBatch(intProcessors, arr);
}
private static void processOneByOne(IntProcessor[] intProcessors, int[] arr) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
for (IntProcessor intProcessor : intProcessors) {
for (int value : arr) {
intProcessor.process(value);
}
}
}
long end = System.nanoTime();
long took = end - start;
System.out.printf("One-by-one: Time per iteration: %.3f ms%n", (((double) took) / ITERATIONS) / 1_000_000d);
}
private static void processBatch(IntProcessor[] intProcessors, int[] arr) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
for (IntProcessor intProcessor : intProcessors) {
intProcessor.process(arr);
}
}
long end = System.nanoTime();
long took = end - start;
System.out.printf("Batch: Time per iteration: %.3f ms%n", (((double) took) / ITERATIONS) / 1_000_000d);
}
interface IntProcessor {
void process(int i);
void process(int[] arr);
}
static class IntProcessor1 implements IntProcessor {
static int value;
@Override
public void process(int i) {
value = i;
}
@Override
public void process(int[] arr) {
for (int i = 0; i < arr.length; i++) {
value = arr[i];
}
}
}
static class IntProcessor2 implements IntProcessor {
static int value;
@Override
public void process(int i) {
value = i;
}
@Override
public void process(int[] arr) {
for (int i = 0; i < arr.length; i++) {
value = arr[i];
}
}
}
static class IntProcessor3 implements IntProcessor {
static int value;
@Override
public void process(int i) {
value = i;
}
@Override
public void process(int[] arr) {
for (int i = 0; i < arr.length; i++) {
value = arr[i];
}
}
}
} I ran this in this form, and with one implementation commented out (to trigger the bimorphic inlining behavior). The timing results are quite extreme (running with jdk21 and
I.e. with three implementations, batching is ~60 times faster than single implementations. With two, it is ~16 times faster. This is the same pattern that we see with the lucene code - although the difference is much more extreme (I'm guessing due to the implementation). I do think that we can draw the conclusion that helping the JVM with batch versions of virtual calls can help performance significantly. One might argue that the JVM should be able to figure this out on it's own. But until it does, let's maybe help it a bit? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for double checking that the problem still occurs with significant warmup. The change looks reasonable, I'd like to have someone else take a look though. @mikemccand Would you be able to do that since you already started looking?
Ideally we'd add a query that adds another implementation of an IntersectVisitor
to the nightly benchmarks before merging that PR so that we can see the performance bump?
public void visit(IntsRef ref) throws IOException { | ||
for (int i = ref.offset; i < ref.offset + ref.length; i++) { | ||
result.clear(ref.ints[i]); | ||
cost[0]--; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could move this outside of the loop and decrement by ref.length
in one go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, done :)
Yes - this would be ideal imo. Some caveats that I found while testing:
|
+1 -- PNP!
The nightly benchy uses But, the nightly benchy does not Actually, I don't think we should block this change on fixing nightly benchy to reveal its gains? Your above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- I love the state this iterated to. Thank you for the deep analysis, and persisting through the feedback @antonha. This is a very exciting optimization!
013a520
to
cc39cc1
Compare
Thank you all for helping to get this PR into a better state. The only thing that still irked me is that I couldn't get the tests to execute all newly added methods. I made a few changes in the last commit to:
I would love it if you, @mikemccand or @jpountz, could give a thumbs up to these last changes? After this, what's the protocol? Do I merge the PR myself or should a committer push the button? |
How long does the test take to run? It'd be nice to exercise the optimized path in "ordinary" (non-nightly) test runs too ...
I'll have a look!
One of we committers must merge it! It sounds like we are super close ... I'll try to review today and maybe merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a tiny javadoc wording improvement. Thanks @antonha!
@@ -298,6 +299,17 @@ default void visit(DocIdSetIterator iterator) throws IOException { | |||
} | |||
} | |||
|
|||
/** | |||
* Similar to {@link IntersectVisitor#visit(int)}, but a bulk visit and implements may have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implements
-> implementations
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops, fixed!
I timed this a bit ... looks like it's ~.25 - .5 seconds. I think that's OK to run always. |
Sounds great - the javadoc fix is done. Thanks a lot for having a look and timing the test! |
Thanks @antonha -- sorry, maybe also add a |
My bad - I should have added one the first time you asked 🙈. I've Added one now, let me know if you feel like it lacks something! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @antonha!
…sitor (#13149) * Made DocIdsWriter use DISI when reading documents with an IntersectVisitor. Instead of calling IntersectVisitor.visit for each doc in the readDelta16 and readInts32 methods, create a DocIdSetIterator and call IntersectVisitor.visit(DocIdSetIterator) instead. This seems to make Lucene faster at sorting and range querying tasks - the hypothesis being that it is due to fewer virtual calls. * Spotless * Changed bulk iteration to use IntsRef instead. * Clearer comment. * Decrease cost outside loop * Added test to make inverse intsref be used. * Wording improvement in javadoc. * Added CHANGES.txt entry.
OK I merged a backported to 9.11.0 -- I think that's safe: we added a new default method to |
Instead of calling
IntersectVisitor.visit
for each doc in thereadDelta16
andreadInts32
methods, create aDocIdSetIterator
and callIntersectVisitor.visit(DocIdSetIterator)
instead.This seems to make Lucene faster at some sorting and range querying tasks - I saw 35-45% reduction in execution time. In learnt this through running this benchmark setup by Elastic: https://github.com/elastic/elasticsearch-opensearch-benchmark.
The hypothesis is that this is due to fewer virtual calls being made - once per BKD leaf, instead of once per document. Note that this is only measurable if the readInts methods have been called with at least 3 implementation of the IntersectVisitor interface - otherwise the JIT inlining takes away the virtual call. In real life Lucene deployments, I would judge that it is very likely that at least 3 implementations are used. For more details on method etc, there are details in this blog post: https://blunders.io/posts/es-benchmark-4-inlining
I tried benchmarking this with luceneutil, but did not see any significant change with the default benchmark - I suspect that I'm using the wrong luceneutil tasks to see any major difference. Which luceneutils benchmarks should I be using for these changes?