Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Made DocIdsWriter use DISI when reading documents with an IntersectVisitor #13149

Merged
merged 8 commits into from
Apr 2, 2024

Conversation

antonha
Copy link
Contributor

@antonha antonha commented Mar 1, 2024

Instead of calling IntersectVisitor.visit for each doc in the readDelta16 and readInts32 methods, create a DocIdSetIterator and call IntersectVisitor.visit(DocIdSetIterator) instead.

This seems to make Lucene faster at some sorting and range querying tasks - I saw 35-45% reduction in execution time. In learnt this through running this benchmark setup by Elastic: https://github.com/elastic/elasticsearch-opensearch-benchmark.

The hypothesis is that this is due to fewer virtual calls being made - once per BKD leaf, instead of once per document. Note that this is only measurable if the readInts methods have been called with at least 3 implementation of the IntersectVisitor interface - otherwise the JIT inlining takes away the virtual call. In real life Lucene deployments, I would judge that it is very likely that at least 3 implementations are used. For more details on method etc, there are details in this blog post: https://blunders.io/posts/es-benchmark-4-inlining

I tried benchmarking this with luceneutil, but did not see any significant change with the default benchmark - I suspect that I'm using the wrong luceneutil tasks to see any major difference. Which luceneutils benchmarks should I be using for these changes?

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great optimization! I just had one small concern about advance.


@Override
public int advance(int target) throws IOException {
while (index < count && scratch[index] < target) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need to implement this method? Can we throw UOE instead? It's a bit scary that it's O(N), though, in practice, scratch is a smallish buffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't need to - but I don't see why this would be scarier than advancing using the nextDoc method? But I might be missing something - if you'd like that change, I'll make it. It might be that a DISI is not the right interface for visiting docids within the BKD tree - but I'm guessing changing that would be larger and more controversial.

For large segments the buffer size would almost always be 512 - but there would be a lot of buffers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried changing the advance method to just throw an throw new UnsupportedOperationException() - and tests do pass - I can commit that if you prefer that. I guess we should add some kind of comment of why it is unsupported though?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be that a DISI is not the right interface for visiting docids within the BKD tree - but I'm guessing changing that would be larger and more controversial.

Yeah indeed DISI may not be right, but let's not try to fix that here/now? Perhaps there are use cases that might want to call .advance on the DISI? Maybe some sort of curious combined filtering / points query?

OK you convinced me!: let's leave the method as is. You're right that if such a use case existed today, it is suffering through the linear scan of .visit calls already, so this is no worse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah indeed DISI may not be right, but let's not try to fix that here/now?
Agreed, but I'm tempted to experiment with it in another PR :)

visitor.visit(scratch[i]);
}
scratchDocIdSetIterator.reset(count);
visitor.visit(scratchDocIdSetIterator);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antonha Maybe add a small comment here explaining that we found using a DISI was much faster than visting docs individually?

People usually attach the PR id to the comment in case someone wants to dig deeper. Many a times, I have learnt about why something was implemented a certain way through these comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a comment higher up in the file, next to the ScratchDocIdSetIterator. Would moving that down here and referring to this PR make things better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That comment is good enough. I had not noticed it I guess..

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antonha could you add a CHANGES.txt entry explaining the optimization? Thanks!

@antonha
Copy link
Contributor Author

antonha commented Mar 9, 2024

Thanks for the approval!

I do however think that it would be good to prove this performance improvement in luceneutil before merging, to make the benchmark more easily reproducible than the benchmarking I have done. I have started looking into it, but will have more time for it next week. Does that sound reasonable? The only downside I see is a later merge of this PR.

@antonha could you add a CHANGES.txt entry explaining the optimization? Thanks!

Yes, will do once I have run some luceneutil benchmarks (if we want to wait for that)

@mikemccand
Copy link
Member

I tried benchmarking this with luceneutil, but did not see any significant change with the default benchmark - I suspect that I'm using the wrong luceneutil tasks to see any major difference. Which luceneutils benchmarks should I be using for these changes?

Confusingly named, the IntsNRQ task is the only task using points, I think. It runs simple 1D range queries.

The geospatial benchmarks (a different benchmark than the normal/nightly luceneutil) use multi-dimensional points.

Still, given that you saw gains in the "OpenSearch vs Elasticsearch" benchmarks, even if the results are flat with the existing luceneutil benchmarks, I think we should just merge the change. The night after we can watch Lucene's nightly benchmarks and see if the nightly box measured anything.

(Hmm, curiously/separately, it looks like something caused a jump in some geo tasks' performance e.g. distance filter ... I'll try to find the cause and add an annotation!).

@antonha
Copy link
Contributor Author

antonha commented Mar 13, 2024

I spent some time "proving" this is luceneutil - luceneutil/pull/257 adds a reproduction - if run with wikimediumall, optimize = True for indexing and commitPoint = 'single'.

The reason that the larger segment is needed is that Lucene otherwise chooses to store document ids in the BKD leaves as int24, meaning that the optimization in this PR does nothing. With the luceneutil changes, I get the following output when comparing this PR to master:

                           TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                    OrHighNotLow      527.76      (7.9%)      516.38      (8.0%)   -2.2% ( -16% -   14%) 0.392
                    OrHighNotMed      523.37      (7.3%)      512.63      (7.5%)   -2.1% ( -15% -   13%) 0.384
                   OrHighNotHigh      418.72      (6.4%)      411.39      (6.3%)   -1.8% ( -13% -   11%) 0.384
           BrowseMonthTaxoFacets       11.50     (22.5%)       11.32     (26.2%)   -1.6% ( -41% -   60%) 0.839
                   OrNotHighHigh      419.16      (5.4%)      412.88      (5.4%)   -1.5% ( -11% -    9%) 0.381
                     AndHighHigh       43.95      (5.4%)       43.42      (3.8%)   -1.2% (  -9% -    8%) 0.413
            HighIntervalsOrdered        4.60      (5.6%)        4.55      (6.0%)   -1.1% ( -12% -   11%) 0.561
               HighTermMonthSort    13015.73      (3.3%)    12889.39      (4.0%)   -1.0% (  -7% -    6%) 0.401
                          Fuzzy1      285.37      (3.3%)      282.87      (2.2%)   -0.9% (  -6% -    4%) 0.320
        AndHighHighDayTaxoFacets        9.00      (4.6%)        8.92      (4.8%)   -0.8% (  -9% -    8%) 0.577
                    HighSpanNear        4.59      (2.8%)        4.55      (2.8%)   -0.8% (  -6% -    4%) 0.356
                    OrNotHighLow     1091.32      (2.8%)     1082.56      (3.3%)   -0.8% (  -6% -    5%) 0.402
                     LowSpanNear       12.72      (2.4%)       12.64      (2.5%)   -0.7% (  -5% -    4%) 0.377
            HighTermTitleBDVSort        5.69      (2.4%)        5.65      (2.5%)   -0.7% (  -5% -    4%) 0.387
                          Fuzzy2      165.14      (2.4%)      164.10      (1.8%)   -0.6% (  -4% -    3%) 0.343
                     MedSpanNear       12.61      (1.9%)       12.53      (2.2%)   -0.6% (  -4% -    3%) 0.327
                        HighTerm      578.67      (6.5%)      575.09      (5.1%)   -0.6% ( -11% -   11%) 0.739
            MedTermDayTaxoFacets       19.91      (5.1%)       19.79      (4.3%)   -0.6% (  -9% -    9%) 0.683
          OrHighMedDayTaxoFacets        3.43     (12.0%)        3.41     (11.4%)   -0.5% ( -21% -   25%) 0.885
                         MedTerm      748.03      (5.6%)      744.26      (4.5%)   -0.5% ( -10% -   10%) 0.755
         AndHighMedDayTaxoFacets       32.72      (1.5%)       32.57      (1.6%)   -0.5% (  -3% -    2%) 0.335
                         LowTerm      526.39      (2.7%)      524.40      (2.7%)   -0.4% (  -5% -    5%) 0.653
                HighSloppyPhrase        9.56      (2.3%)        9.53      (2.9%)   -0.3% (  -5% -    5%) 0.689
                      HighPhrase       33.59      (3.8%)       33.49      (3.8%)   -0.3% (  -7% -    7%) 0.803
             MedIntervalsOrdered       14.95      (4.2%)       14.92      (4.4%)   -0.2% (  -8% -    8%) 0.876
       BrowseDayOfYearSSDVFacets        6.37      (1.4%)        6.36      (1.3%)   -0.2% (  -2% -    2%) 0.665
                 MedSloppyPhrase       15.68      (1.6%)       15.65      (2.8%)   -0.2% (  -4% -    4%) 0.799
                      AndHighMed      109.07      (3.9%)      108.94      (3.2%)   -0.1% (  -6% -    7%) 0.915
                 LowSloppyPhrase       12.64      (1.5%)       12.63      (2.2%)   -0.1% (  -3% -    3%) 0.855
                    OrNotHighMed      377.59      (3.0%)      377.64      (3.3%)    0.0% (  -6% -    6%) 0.989
           BrowseMonthSSDVFacets        6.65      (0.4%)        6.65      (0.4%)    0.0% (   0% -    0%) 0.829
                       OrHighLow      541.18      (3.3%)      541.47      (3.3%)    0.1% (  -6% -    6%) 0.960
             LowIntervalsOrdered       12.42      (4.0%)       12.43      (3.9%)    0.1% (  -7% -    8%) 0.948
     BrowseRandomLabelSSDVFacets        5.35      (1.3%)        5.36      (1.1%)    0.1% (  -2% -    2%) 0.809
                       MedPhrase       47.26      (2.3%)       47.31      (2.4%)    0.1% (  -4% -    4%) 0.889
                         Prefix3      246.21      (4.9%)      246.48      (7.3%)    0.1% ( -11% -   12%) 0.956
                       LowPhrase       34.24      (2.5%)       34.30      (2.4%)    0.2% (  -4% -    5%) 0.824
               HighTermTitleSort       98.67      (2.9%)       98.84      (4.6%)    0.2% (  -7% -    7%) 0.884
                      OrHighHigh       42.99      (8.8%)       43.10      (8.6%)    0.3% ( -15% -   19%) 0.924
            BrowseDateSSDVFacets        2.24     (13.7%)        2.25     (13.2%)    0.6% ( -23% -   31%) 0.895
                         Respell      255.75      (1.7%)      257.22      (1.7%)    0.6% (  -2% -    4%) 0.281
       BrowseDayOfYearTaxoFacets        7.39      (1.9%)        7.44      (4.1%)    0.7% (  -5% -    6%) 0.489
                      AndHighLow      961.38      (3.8%)      970.95      (4.1%)    1.0% (  -6% -    9%) 0.427
            BrowseDateTaxoFacets        7.08      (1.3%)        7.16      (4.6%)    1.1% (  -4% -    7%) 0.291
                      TermDTSort       77.89      (7.4%)       78.84      (4.8%)    1.2% ( -10% -   14%) 0.535
                        PKLookup      250.25      (3.3%)      253.54      (2.7%)    1.3% (  -4% -    7%) 0.169
           HighTermDayOfYearSort      107.64      (2.6%)      109.12      (2.1%)    1.4% (  -3% -    6%) 0.064
                       OrHighMed      109.62      (6.2%)      111.24      (6.1%)    1.5% ( -10% -   14%) 0.447
     BrowseRandomLabelTaxoFacets        6.17      (5.0%)        6.27      (3.9%)    1.6% (  -6% -   11%) 0.253
                        Wildcard      183.93      (3.7%)      187.03      (4.4%)    1.7% (  -6% -   10%) 0.193
                         LongNRQ       42.83      (6.1%)       83.47     (34.5%)   94.9% (  51% -  144%) 0.000
                          IntNRQ       20.29      (3.4%)       53.24     (34.2%)  162.4% ( 120% -  207%) 0.000

The interesting parts here is in the bottom two lines - IntNRQ and LongNRQ becomes much faster. it might be that I messed up the "minimal" part of reproduction, maybe all that was needed is the single-segment and taskCountPerCat increase.

Regardless, it looks promising - a 94% to 162% increase in QPS for range queries with this PR in the slightly modified benchmark.

@jpountz
Copy link
Contributor

jpountz commented Mar 13, 2024

Would we be subject to the same issue if/when 3+ different implementations of DocIdSetIterator get used in IntersectVisitor#visit?

@antonha
Copy link
Contributor Author

antonha commented Mar 13, 2024

Would we be subject to the same issue if/when 3+ different implementations of DocIdSetIterator get used in IntersectVisitor#visit?

Yes.

Your question makes me think that maybe we should not add a new DocIdSetIterator for this PR - maybe that was your thought as well?

@jpountz
Copy link
Contributor

jpountz commented Mar 13, 2024

Relatedly indeed, I was wondering if the API should expose an IntsRef or something like that, so that there is a single virtual call per block of doc IDs anyway (IntsRef cannot be extended, it's a single impl).

@antonha
Copy link
Contributor Author

antonha commented Mar 13, 2024

@jpountz I had a quick look at the code, and it seems to me like there are, with this PR, only two implementations used for the DISI used for the IntersectVisitor#visit method - which is good in the sense that it would probably be fine to merge it now, but bad in the sense that it would make adding a third hurt performance.

Do we believe that we can merge this PR and then continue with changing the BKD visit API in a later change, or should we try to change the abstraction in this PR?

@jpountz
Copy link
Contributor

jpountz commented Mar 14, 2024

++ on progress over perfection

That said, I wonder if this change is legal: DocIdSetIterator must return doc IDs in order, but it looks like it wouldn't always be the case with your change?

@antonha
Copy link
Contributor Author

antonha commented Mar 15, 2024

That said, I wonder if this change is legal: DocIdSetIterator must return doc IDs in order, but it looks like it wouldn't always be the case with your change?

You are correct - I had (falsely) assumed that the document ids in the DocIdsWriter were written in order - thus the PR as of now (dacb2a5d3ad707a555ed1fa2925a082d81aa4443) is not a legal change.

The good news is that this should not have affected the benchmarks - neither the PointRangeQuery nor the NumericLeafComparator seems to rely on the DISI being ordered. They just iterate until DocIdSetIterator.NO_MORE_DOCS.

I will try adding a visit() method taking an IntsRef (I believe that is what you meant @jpountz?).

antonha added 3 commits March 15, 2024 11:30
…sitor.

Instead of calling IntersectVisitor.visit for each doc in the
readDelta16 and readInts32 methods, create a DocIdSetIterator
and call IntersectVisitor.visit(DocIdSetIterator) instead.

This seems to make Lucene faster at sorting and range querying
tasks - the hypothesis being that it is due to fewer virtual calls.
@antonha antonha force-pushed the docidswriter_disi_visit branch from b5fbbf1 to 60495d4 Compare March 15, 2024 10:31
@antonha
Copy link
Contributor Author

antonha commented Mar 15, 2024

I've changed the readInts16/34 methods to now use the IntsRef instead - I kind of like how it turned out, but the downside is that we will have to add many implementations to see the benefits. I've added the 2 in the PointRangeQuery - if we want to continue down this path, I can add more.

One thing that I would love input on is whether or not to manually inline the visit(docid) version or not. E.g. I now do:

for (int i = ref.offset; i < ref.offset + ref.length; i++) {
  result.clear(ref.ints[i]);
  cost[0]--;
 }

While we could do

for (int i = ref.offset; i < ref.offset + ref.length; i++) {
  visit(ref.ints[i]);
 }

The latter would, most likely, be inlined by the JVM. The first is manually inlined, so we don't need to trust the JVM on it. Thoughts?

It is important to note that just having the default method in IntersectVisitor would not be good, since that would do virtual calls to visit(int). Copy-pasting that method body into each implementation would however, most likely, inline better.

@antonha
Copy link
Contributor Author

antonha commented Mar 15, 2024

I benchmarked the current commit (e826696fd235cb1ef215745944cac6e34faa0d20) using the same modified lucenebench as above, it seems to like this change even better:

TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
LongNRQ       41.11      (6.6%)      107.13     (13.4%)  160.6% ( 131% -  193%) 0.000
 IntNRQ       23.23      (7.0%)       85.58     (19.0%)  268.4% ( 226% -  316%) 0.000

(the baseline being master)

@@ -185,6 +186,13 @@ public void visit(DocIdSetIterator iterator) throws IOException {
adder.add(iterator);
}

@Override
public void visit(IntsRef ref) {
for (int i = ref.offset; i < ref.offset + ref.length; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we delegate this to BulkAdder and take advantage of System#arraycopy in BufferAdder ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would probably be better. I don't think we should expect huge performance increases from it (since it would only affect narrow range queries). But it would make a large difference if a third implementation of the adder is ever used.

I have limited time this week to look at it - I will try to find some time for it though!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I unfortunately did not find time enough for this this week - and since this PR seems to be wrapping up I will leave it out. It looks like a simple change but I got stuck implementing tests (which I feel would be very needed). I will leave this for future improvement!

@jpountz
Copy link
Contributor

jpountz commented Mar 19, 2024

I will try adding a visit() method taking an IntsRef (I believe that is what you meant @jpountz?).

This is what I meant indeed.

Before merging I'd be curious to better understand why the JVM doesn't optimize this better. Presumably, it should be able to resolve the virtual call once for the entire for loop rather than doing it again on every iteration? I wonder if there is actually a performance bug, or if we are just insufficiently warming up the JVM and this for loop never gets compiled through C2?

@antonha
Copy link
Contributor Author

antonha commented Mar 24, 2024

Before merging I'd be curious to better understand why the JVM doesn't optimize this better. Presumably, it should be able to resolve the virtual call once for the entire for loop rather than doing it again on every iteration? I wonder if there is actually a performance bug, or if we are just insufficiently warming up the JVM and this for loop never gets compiled through C2?

Valid questions.

I ran the same luceneutils benchmark with -XX:+LogCompilation and -XX:PrintInlining and I can see c2/level4 compilations being done, including BKDPointTree::addAll and DocIdsWriter::readInts32. The Elasticsearch benchmarks mentioned in my post (where we saw similar results) were also run long enough that I would be very surprised if c2 had not kicked in.

I also tried reproducing this behavior in a shorter java program for demonstration purposes - with virtual calls similar to the IntersectVisitor. The program has 3 implementations of an interface processing int values, with both single value and batch implementations:

import java.util.Random;

public class C2Inlining {
  static int ITERATIONS = 50_000;
  static int NUM_VALUES = 100_000;
  public static void main(String[] args) {
    //Generate ints
    Random r = new Random();
    int[] arr = new int[NUM_VALUES];
    for (int i = 0; i < NUM_VALUES; i++) {
      arr[i] = r.nextInt();
    }

    IntProcessor[] intProcessors = {
        new IntProcessor1(),
        new IntProcessor2(),
        //Comment this last one out to trigger bimorphic behaviour
        new IntProcessor3()
    };
    processOneByOne(intProcessors, arr);
    processBatch(intProcessors, arr);
  }

  private static void processOneByOne(IntProcessor[] intProcessors, int[] arr) {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      for (IntProcessor intProcessor : intProcessors) {
        for (int value : arr) {
          intProcessor.process(value);
        }
      }
    }
    long end = System.nanoTime();
    long took = end - start;
    System.out.printf("One-by-one: Time per iteration: %.3f ms%n", (((double) took) / ITERATIONS) / 1_000_000d);
  }

  private static void processBatch(IntProcessor[] intProcessors, int[] arr) {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      for (IntProcessor intProcessor : intProcessors) {
        intProcessor.process(arr);
      }
    }
    long end = System.nanoTime();
    long took = end - start;
    System.out.printf("Batch: Time per iteration: %.3f ms%n", (((double) took) / ITERATIONS) / 1_000_000d);
  }

  interface IntProcessor {
    void process(int i);
    void process(int[] arr);
  }

  static class IntProcessor1 implements IntProcessor {

    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }

  static class IntProcessor2 implements IntProcessor {
    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }

  static class IntProcessor3 implements IntProcessor {
    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }
}

I ran this in this form, and with one implementation commented out (to trigger the bimorphic inlining behavior). The timing results are quite extreme (running with jdk21 and -server)

Variant One-by-One Batch
Three implementations 1.101 ms/iter 0.018 ms/iter
Two implementations 0.198 ms/iter 0.012 ms/iter

I.e. with three implementations, batching is ~60 times faster than single implementations. With two, it is ~16 times faster.
Both versions saw c2 compilations.

This is the same pattern that we see with the lucene code - although the difference is much more extreme (I'm guessing due to the implementation). I do think that we can draw the conclusion that helping the JVM with batch versions of virtual calls can help performance significantly.

One might argue that the JVM should be able to figure this out on it's own. But until it does, let's maybe help it a bit?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for double checking that the problem still occurs with significant warmup. The change looks reasonable, I'd like to have someone else take a look though. @mikemccand Would you be able to do that since you already started looking?

Ideally we'd add a query that adds another implementation of an IntersectVisitor to the nightly benchmarks before merging that PR so that we can see the performance bump?

public void visit(IntsRef ref) throws IOException {
for (int i = ref.offset; i < ref.offset + ref.length; i++) {
result.clear(ref.ints[i]);
cost[0]--;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could move this outside of the loop and decrement by ref.length in one go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, done :)

@antonha
Copy link
Contributor Author

antonha commented Mar 26, 2024

Ideally we'd add a query that adds another implementation of an IntersectVisitor to the nightly benchmarks before merging that PR so that we can see the performance bump?

Yes - this would be ideal imo. Some caveats that I found while testing:

  • PointRangeQuerys are executed with different IntersectVisitors depending on expected number of matches, which means that several query intervals need to be used in each JVM. In luceneutil, I believe that this corresponds to setting taskCountPerCat high.
  • To see significant change, the number of documents in the segments needs to either fit into the 16bit or 32bit implementation of bkd.DocIdsWriter - this was not the case with the simple localrun.py script out of the box.
  • It is possible that we see a performance decrease when we add more IntersectVisitor variants, since the benchmark becomes more realistically slow with the additional implementations.

@mikemccand
Copy link
Member

Do we believe that we can merge this PR and then continue with changing the BKD visit API in a later change, or should we try to change the abstraction in this PR?

+1 -- PNP!

  • In luceneutil, I believe that this corresponds to setting taskCountPerCat high.

The nightly benchy uses taskCountPerCat=5 -- is that enough to reveal the impact of this optimization?

But, the nightly benchy does not forceMerge down to one segment, so, we may not have the right number of docs to use 16 or 32 bit int encoding in the BKD file.

Actually, I don't think we should block this change on fixing nightly benchy to reveal its gains? Your above luceneutil repro is sufficient? I'm wary of allowing benchmarking to unduly hold up otherwise good Lucene changes in general... it's a helpful tool, not a crutch ;)

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- I love the state this iterated to. Thank you for the deep analysis, and persisting through the feedback @antonha. This is a very exciting optimization!

@antonha antonha force-pushed the docidswriter_disi_visit branch from 013a520 to cc39cc1 Compare March 30, 2024 22:40
@antonha
Copy link
Contributor Author

antonha commented Mar 30, 2024

Thank you all for helping to get this PR into a better state. The only thing that still irked me is that I couldn't get the tests to execute all newly added methods. I made a few changes in the last commit to:

  • Add a new test case which calls doTestRandomLongs with 20 000 - without this I couldn't get the IntsRef to trigger often enough. This should maybe be a @Nightly?
  • Added a higher chance of having 0 missing values in doTestRandomLongs - otherwise the inverse intersect visitor would only have a chance of happening every 100 runs, which I thought a bit low.

I would love it if you, @mikemccand or @jpountz, could give a thumbs up to these last changes? After this, what's the protocol? Do I merge the PR myself or should a committer push the button?

@mikemccand
Copy link
Member

  • Add a new test case which calls doTestRandomLongs with 20 000 - without this I couldn't get the IntsRef to trigger often enough. This should maybe be a @Nightly?

How long does the test take to run? It'd be nice to exercise the optimized path in "ordinary" (non-nightly) test runs too ...

I would love it if you, @mikemccand or @jpountz, could give a thumbs up to these last changes?

I'll have a look!

Do I merge the PR myself or should a committer push the button?

One of we committers must merge it! It sounds like we are super close ... I'll try to review today and maybe merge.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a tiny javadoc wording improvement. Thanks @antonha!

@@ -298,6 +299,17 @@ default void visit(DocIdSetIterator iterator) throws IOException {
}
}

/**
* Similar to {@link IntersectVisitor#visit(int)}, but a bulk visit and implements may have
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implements -> implementations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, fixed!

@mikemccand
Copy link
Member

  • Add a new test case which calls doTestRandomLongs with 20 000 - without this I couldn't get the IntsRef to trigger often enough. This should maybe be a @Nightly?

How long does the test take to run? It'd be nice to exercise the optimized path in "ordinary" (non-nightly) test runs too ...

I timed this a bit ... looks like it's ~.25 - .5 seconds. I think that's OK to run always.

@antonha
Copy link
Contributor Author

antonha commented Apr 1, 2024

One of we committers must merge it! It sounds like we are super close ... I'll try to review today and maybe merge.

Sounds great - the javadoc fix is done. Thanks a lot for having a look and timing the test!

@mikemccand
Copy link
Member

Thanks @antonha -- sorry, maybe also add a CHANGES.txt entry? This is an exciting opto!

@antonha
Copy link
Contributor Author

antonha commented Apr 1, 2024

Thanks @antonha -- sorry, maybe also add a CHANGES.txt entry? This is an exciting opto!

My bad - I should have added one the first time you asked 🙈. I've Added one now, let me know if you feel like it lacks something!

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @antonha!

@mikemccand mikemccand merged commit 4eeb3a7 into apache:main Apr 2, 2024
3 checks passed
mikemccand pushed a commit that referenced this pull request Apr 2, 2024
…sitor (#13149)

* Made DocIdsWriter use DISI when reading documents with an IntersectVisitor.

Instead of calling IntersectVisitor.visit for each doc in the
readDelta16 and readInts32 methods, create a DocIdSetIterator
and call IntersectVisitor.visit(DocIdSetIterator) instead.

This seems to make Lucene faster at sorting and range querying
tasks - the hypothesis being that it is due to fewer virtual calls.

* Spotless

* Changed bulk iteration to use IntsRef instead.

* Clearer comment.

* Decrease cost outside loop

* Added test to make inverse intsref be used.

* Wording improvement in javadoc.

* Added CHANGES.txt entry.
mikemccand added a commit that referenced this pull request Apr 2, 2024
@mikemccand
Copy link
Member

OK I merged a backported to 9.11.0 -- I think that's safe: we added a new default method to IntersectVisitor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants