LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager #240

zacharymorn · 2021-08-12T05:30:11Z

Description / Solution

Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager), with the following changes:

Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager, ~~TotalHitCountCollectorManager and FixedBitSetCollector~~
Refactor some tests to use the above collector manager
~~3. Refactor DrillSideways to use extracted out collector managers~~

#11041

Tests

Passed updated tests with ./gradlew clean; ./gradlew check -Ptests.nightly=true (currently there are nocommit messages that are failing precommit though)

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.

… of IndexSearcher#search(Query, CollectorManager) Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager, TotalHitCountCollectorManager and FixedBitSetCollector Refactor some tests to use the above collector manager

…anagers

jpountz · 2021-08-13T06:33:17Z

lucene/core/src/java/org/apache/lucene/search/TopFieldCollectorManager.java

+    /*
+    nocommit
+    Should the following two be passed in instead? Possible custom initialization based on executor status and slices?
+    On the other hand, in a single-threaded environment, shared HitsThresholdChecker and MaxScoreAccumulator should be fast without lock contention anyway?


Can you check if luceneutil benchmarks become slower with this change? Hopefully they're not slower and we can keep this logic simple.

Thanks for the great suggestion! I just ran luceneutil, and it turns out that the changes do get slower due to this thread-safe HitsThresholdChecker (confirmed by playing with the atomic long inside a bit). Here is the full benchmark result:

TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value HighTermTitleBDVSort 52.86 (14.0%) 42.72 (6.6%) -19.2% ( -34% - 1%) 0.000 HighTermMonthSort 31.66 (10.7%) 25.90 (6.3%) -18.2% ( -31% - -1%) 0.000 HighTermDayOfYearSort 24.00 (15.7%) 20.02 (8.5%) -16.6% ( -35% - 9%) 0.000 TermDTSort 20.22 (9.0%) 17.13 (6.6%) -15.3% ( -28% - 0%) 0.000 MedTerm 1270.02 (3.9%) 1208.29 (5.1%) -4.9% ( -13% - 4%) 0.001 HighTerm 941.14 (4.0%) 900.05 (5.9%) -4.4% ( -13% - 5%) 0.006 OrHighNotMed 531.63 (5.6%) 514.68 (6.2%) -3.2% ( -14% - 9%) 0.089 LowTerm 1112.50 (5.1%) 1081.48 (6.2%) -2.8% ( -13% - 8%) 0.118 MedSpanNear 12.50 (2.2%) 12.18 (2.2%) -2.5% ( -6% - 1%) 0.000 HighSpanNear 12.82 (1.7%) 12.52 (1.7%) -2.4% ( -5% - 1%) 0.000 OrHighNotLow 562.50 (6.9%) 550.15 (5.6%) -2.2% ( -13% - 11%) 0.269 OrHighNotHigh 577.10 (5.6%) 565.51 (6.7%) -2.0% ( -13% - 10%) 0.303 OrNotHighHigh 633.91 (6.7%) 621.85 (6.0%) -1.9% ( -13% - 11%) 0.345 HighIntervalsOrdered 6.94 (3.1%) 6.81 (3.2%) -1.9% ( -7% - 4%) 0.061 PKLookup 174.06 (4.2%) 171.17 (4.5%) -1.7% ( -9% - 7%) 0.230 OrHighLow 283.56 (5.3%) 279.05 (5.8%) -1.6% ( -12% - 10%) 0.365 MedIntervalsOrdered 19.00 (3.7%) 18.74 (3.6%) -1.4% ( -8% - 6%) 0.237 Fuzzy1 59.38 (11.6%) 58.65 (12.0%) -1.2% ( -22% - 25%) 0.740 HighSloppyPhrase 12.68 (2.7%) 12.53 (2.1%) -1.1% ( -5% - 3%) 0.135 Fuzzy2 63.95 (8.6%) 63.40 (9.0%) -0.9% ( -17% - 18%) 0.759 MedSloppyPhrase 17.53 (2.7%) 17.38 (2.2%) -0.8% ( -5% - 4%) 0.279 LowSloppyPhrase 23.90 (3.0%) 23.70 (2.8%) -0.8% ( -6% - 5%) 0.354 Respell 39.23 (3.1%) 38.90 (3.1%) -0.8% ( -6% - 5%) 0.403 AndHighMed 61.17 (3.8%) 60.72 (3.8%) -0.7% ( -8% - 7%) 0.542 OrHighMed 48.81 (3.0%) 48.46 (3.0%) -0.7% ( -6% - 5%) 0.444 LowIntervalsOrdered 14.64 (2.3%) 14.55 (2.3%) -0.6% ( -5% - 4%) 0.380 LowSpanNear 11.65 (1.6%) 11.57 (1.6%) -0.6% ( -3% - 2%) 0.223 AndHighHigh 16.57 (3.6%) 16.46 (4.0%) -0.6% ( -7% - 7%) 0.608 IntNRQ 17.25 (28.0%) 17.17 (27.3%) -0.5% ( -43% - 76%) 0.959 OrHighHigh 9.28 (2.5%) 9.24 (2.5%) -0.4% ( -5% - 4%) 0.635 MedPhrase 72.99 (3.3%) 72.75 (3.7%) -0.3% ( -7% - 6%) 0.763 LowPhrase 78.02 (3.5%) 77.77 (4.0%) -0.3% ( -7% - 7%) 0.786 BrowseMonthSSDVFacets 3.19 (1.8%) 3.19 (1.7%) -0.2% ( -3% - 3%) 0.653 BrowseDayOfYearSSDVFacets 3.01 (1.5%) 3.01 (1.2%) -0.0% ( -2% - 2%) 0.991 Wildcard 27.98 (2.8%) 27.98 (2.7%) 0.0% ( -5% - 5%) 0.981 AndHighLow 454.42 (5.4%) 455.10 (4.6%) 0.1% ( -9% - 10%) 0.925 HighPhrase 289.77 (5.1%) 290.54 (4.0%) 0.3% ( -8% - 9%) 0.854 OrNotHighLow 521.43 (6.5%) 523.38 (5.7%) 0.4% ( -11% - 13%) 0.846 Prefix3 39.82 (4.8%) 40.02 (3.1%) 0.5% ( -7% - 8%) 0.707 OrNotHighMed 560.49 (5.8%) 563.51 (5.5%) 0.5% ( -10% - 12%) 0.762 BrowseMonthTaxoFacets 0.99 (4.8%) 1.00 (4.6%) 0.6% ( -8% - 10%) 0.702 BrowseDateTaxoFacets 0.94 (4.6%) 0.94 (4.6%) 0.6% ( -8% - 10%) 0.683 BrowseDayOfYearTaxoFacets 0.94 (4.5%) 0.94 (4.5%) 0.7% ( -7% - 10%) 0.638

To fix the performance regression, but still try to keep the logic simple then, I'm considering adding an additional flag supportsConcurrency to the collector managers' constructor like such:

public TopScoreDocCollectorManager(int numHits, ScoreDoc after, int totalHitsThreshold, boolean supportsConcurrency) { ... this.hitsThresholdChecker = supportsConcurrency ? HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, numHits)) : HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits)); this.minScoreAcc = supportsConcurrency ? new MaxScoreAccumulator() : null; } public TopFieldCollectorManager(Sort sort, int numHits, FieldDoc after, int totalHitsThreshold, boolean supportsConcurrency) { ... this.hitsThresholdChecker = supportsConcurrency ? HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, numHits)) : HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits)); this.minScoreAcc = supportsConcurrency ? new MaxScoreAccumulator() : null; }

What do you think about this approach?

Hi @jpountz, for now I've implemented the above approach in 4af7740, and the lucenenutil benchmark shows there's no obvious impact to performance now:

TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value MedSpanNear 6.59 (5.6%) 6.30 (10.4%) -4.5% ( -19% - 12%) 0.091 BrowseDayOfYearTaxoFacets 0.45 (6.5%) 0.43 (10.0%) -4.3% ( -19% - 13%) 0.106 OrNotHighLow 321.23 (8.0%) 308.14 (7.5%) -4.1% ( -18% - 12%) 0.097 HighSpanNear 2.96 (6.4%) 2.84 (9.5%) -3.9% ( -18% - 12%) 0.129 BrowseDateTaxoFacets 0.45 (6.9%) 0.43 (9.6%) -3.9% ( -19% - 13%) 0.142 HighIntervalsOrdered 4.12 (4.7%) 3.97 (8.7%) -3.6% ( -16% - 10%) 0.102 BrowseMonthSSDVFacets 1.68 (6.6%) 1.63 (11.7%) -3.5% ( -20% - 15%) 0.239 HighSloppyPhrase 6.41 (8.6%) 6.18 (10.0%) -3.5% ( -20% - 16%) 0.233 IntNRQ 14.11 (6.2%) 13.62 (11.2%) -3.5% ( -19% - 14%) 0.225 Wildcard 14.90 (7.5%) 14.39 (8.2%) -3.4% ( -17% - 13%) 0.170 LowSpanNear 4.53 (5.7%) 4.38 (8.1%) -3.3% ( -16% - 11%) 0.135 OrHighLow 164.11 (6.2%) 158.84 (8.5%) -3.2% ( -16% - 12%) 0.174 BrowseMonthTaxoFacets 0.47 (6.1%) 0.45 (9.2%) -3.2% ( -17% - 12%) 0.193 LowIntervalsOrdered 3.02 (5.5%) 2.93 (9.1%) -3.2% ( -16% - 12%) 0.185 LowSloppyPhrase 8.80 (8.7%) 8.53 (9.7%) -3.0% ( -19% - 16%) 0.298 MedIntervalsOrdered 6.46 (5.8%) 6.26 (10.1%) -3.0% ( -17% - 13%) 0.248 AndHighMed 41.11 (6.6%) 39.91 (8.0%) -2.9% ( -16% - 12%) 0.205 Fuzzy1 30.54 (13.2%) 29.75 (14.1%) -2.6% ( -26% - 28%) 0.550 BrowseDayOfYearSSDVFacets 1.46 (6.1%) 1.43 (10.6%) -2.6% ( -18% - 15%) 0.350 HighPhrase 104.16 (5.5%) 101.53 (8.9%) -2.5% ( -16% - 12%) 0.284 OrHighNotLow 394.47 (6.8%) 385.27 (8.3%) -2.3% ( -16% - 13%) 0.329 Fuzzy2 25.00 (8.9%) 24.42 (9.8%) -2.3% ( -19% - 18%) 0.440 MedSloppyPhrase 9.24 (8.3%) 9.03 (9.5%) -2.3% ( -18% - 16%) 0.421 OrHighHigh 4.58 (5.9%) 4.47 (7.7%) -2.2% ( -14% - 12%) 0.304 OrNotHighMed 326.95 (6.6%) 319.95 (7.1%) -2.1% ( -14% - 12%) 0.322 LowPhrase 128.75 (5.9%) 126.06 (8.9%) -2.1% ( -15% - 13%) 0.382 MedPhrase 34.71 (5.7%) 34.01 (8.0%) -2.0% ( -14% - 12%) 0.354 OrNotHighHigh 303.59 (6.7%) 297.55 (8.0%) -2.0% ( -15% - 13%) 0.394 OrHighNotMed 304.72 (5.9%) 299.56 (7.5%) -1.7% ( -14% - 12%) 0.429 Prefix3 62.22 (5.1%) 61.22 (7.5%) -1.6% ( -13% - 11%) 0.431 OrHighMed 32.41 (6.8%) 32.02 (6.8%) -1.2% ( -13% - 13%) 0.579 HighTermTitleBDVSort 24.77 (20.2%) 24.53 (18.1%) -1.0% ( -32% - 46%) 0.871 Respell 20.64 (5.3%) 20.47 (7.1%) -0.8% ( -12% - 12%) 0.682 AndHighHigh 23.51 (6.0%) 23.34 (6.7%) -0.7% ( -12% - 12%) 0.727 LowTerm 724.06 (7.1%) 719.45 (5.6%) -0.6% ( -12% - 12%) 0.751 AndHighLow 272.87 (5.9%) 271.24 (7.3%) -0.6% ( -13% - 13%) 0.777 PKLookup 90.10 (5.5%) 89.65 (8.8%) -0.5% ( -14% - 14%) 0.831 HighTerm 876.85 (7.3%) 875.67 (7.8%) -0.1% ( -14% - 16%) 0.955 OrHighNotHigh 291.95 (8.8%) 291.77 (8.8%) -0.1% ( -16% - 19%) 0.982 HighTermDayOfYearSort 37.60 (16.6%) 37.65 (17.9%) 0.1% ( -29% - 41%) 0.981 HighTermMonthSort 18.43 (18.6%) 18.50 (20.3%) 0.4% ( -32% - 48%) 0.954 TermDTSort 41.45 (15.4%) 41.99 (15.1%) 1.3% ( -25% - 37%) 0.786 MedTerm 700.62 (6.8%) 711.10 (6.0%) 1.5% ( -10% - 15%) 0.461

Please let me know how this looks to you.

…hread-safe internal states are needed, and remove static methods in favor of constructors

…rillSideways#searchSequentially(query, collectorManager) to by pass caching This is needed to pass test case from ./gradlew test --tests TestDrillSideways.testRandom -Dtests.seed=69A3EF02D8E3465E -Dtests.nightly=true

gsmiller

Left a few comments. Looks good overall. Lots of little changes! Those test cases... :)

gsmiller · 2021-08-19T13:15:45Z

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java

@@ -180,6 +185,7 @@ protected int withTopDocs(IndexSearcher searcher, Query q, TopDocs hits) throws
    return res;
  }

+  @Deprecated


I wonder if anyone out there has sub-classed this to provide their own Collector through overriding this method? It's possible right? Would we want to support this with a CollectorManager now instead? You might consider creating a new protected method--createCollectorManager--that sub-classes could override if they want, then add javadoc here with a @lucene.deprecated tag pointing users to the new method.

Also, if this change is going onto main, I might kill this method entirely and then add it back in as deprecated if you choose to backport to 8x.

I wonder if anyone out there has sub-classed this to provide their own Collector through overriding this method? It's possible right?

Hmm you are right. I reverted some changes in this class to still use the ReadTask#createCollector method, in case there's overriding from users.

Would we want to support this with a CollectorManager now instead? You might consider creating a new protected method--createCollectorManager--that sub-classes could override if they want, then add javadoc here with a @lucene.deprecated tag pointing users to the new method.

I think this createCollector method was originally added (11 years ago) to support benchmarking, and not sure how widely this is used actually? So hopefully marking it as deprecated for release 9.0 and removing it entirely once we release 10.0 should be good enough?

I was more thinking along the lines of keeping the change you had but providing a protected method that would allow sub-classes to provide their own CollectorManager if they wanted to do so (which would give users a migration path if they were previously providing their own Collector). So in 9.0, I think you'd move completely to the CollectorManager approach like you had. What do you think of that approach?

I see where you are coming from. However, upon looking at this method more closely, I'm afraid this method is effectively not useful, since the result of using this collector was commented out :D :

lucene/lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java

Lines 119 to 121 in 3b3f960

Collector collector = createCollector();

searcher.search(q, collector);

// hits = collector.topDocs();

So I'm now leaning more towards just removing this method altogether if no users ever noticed / complaint about this. What do you think @gsmiller @jpountz ?

I suspect it would still be useful to assess the performance of a collector but you probably have a point that if performance is the only thing that the benchmark module reports, then there's a chance that it's not actually used much.

I did a quick GitHub search to see how much code I could find that extends this code:

"extends ReadTask" createCollector: no hits

"extends SearchTask" createCollector -SearchWithCollectorTask: no hits (I had to exclude SearchWithCollectorTask or all Lucene forks would be returned).

Given that the benchmark module is quite expert, I'd be ok with breaking hard here.

Sounds good! I've removed this method, and also fixed the commented out code.

gsmiller · 2021-08-19T13:17:36Z

lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java

@@ -527,7 +503,10 @@ public TopDocs search(Query query, int n) throws IOException {
   *
   * @throws TooManyClauses If a query would exceed {@link IndexSearcher#getMaxClauseCount()}
   *     clauses.
+   * @deprecated This method is being deprecated in favor of {@link IndexSearcher#search(Query,


I think we like the @lucene.deprecated tag?

Is your reasoning behind keeping this in there for now that all of our uses of this (internal to Lucene) haven't yet migrated as part of this change? Would the plan me to migrate all usages off of this internally and then actually remove it on main/9.0, or are you thinking of keeping it around until 10.0? I think our backwards compatibility policy is such that we could just directly remove this on main/9.0, but then leave it like you have it (marked deprecated) if you choose to backport this to 8x. Since this method is so fundamental though, I could easily see an argument to keep it around for an extra major release to give users more time to migrate. Then again, the migration path seems pretty straight-forward. What do you think?

I think we like the @lucene.deprecated tag?

Do you mean @lucene.deprecated should be created and used instead of using @deprecated? I can't seems to find that tag is being used in lucene actually...I think @lucene.deprecated most likely won't be recognized by IDE or build tool to signal / warn the use of deprecated methods though.

Is your reasoning behind keeping this in there for now that all of our uses of this (internal to Lucene) haven't yet migrated as part of this change? Would the plan me to migrate all usages off of this internally and then actually remove it on main/9.0, or are you thinking of keeping it around until 10.0? I think our backwards compatibility policy is such that we could just directly remove this on main/9.0, but then leave it like you have it (marked deprecated) if you choose to backport this to 8x. Since this method is so fundamental though, I could easily see an argument to keep it around for an extra major release to give users more time to migrate. Then again, the migration path seems pretty straight-forward. What do you think?

As explained in https://issues.apache.org/jira/browse/LUCENE-10002?focusedCommentId=17397827&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17397827, I use deprecation instead of direct removal in this PR exactly for the reasons you mentioned:

Not all internal uses have been migrated in this PR (which may require another few thousand lines of changes).

I personally feel that we should give users more time for migration and testing out the changes, so 10.0 release might be a good timing for complete removal.

I would be interested in seeing how folks feel about the timing of removal, but after #1 is completed, complete removal of IndexSearcher#search(Query, Collector) in lucene codebase should require only straightforward deletion changes.

I think we should try to keep IndexSearcher#search(Query, Collector) and the TopXXXCollector#create factory methods during the 9.x series, as I expect it to be quite commonly used. However in my opinion it would be fine to cut over expert classes like those in the benchmark module without providing such long bw compatibility.

+1 makes sense to keep it in 9.0. As for my comment about @lucene.deprecated, I was getting this mixed up with something else. Apologies for any confusion!

As for my comment about @lucene.deprecated, I was getting this mixed up with something else. Apologies for any confusion!

Though, I think there are both @deprecated (javadoc tag) and @Deprecated (Java class annotation) -- we do both of them typically (looks like you did here @zacharymorn, great!).

lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java

lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java

lucene/test-framework/src/java/org/apache/lucene/util/FixedBitSetCollector.java

lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollectorManager.java

lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java

zacharymorn · 2021-08-20T07:19:46Z

Left a few comments. Looks good overall. Lots of little changes! Those test cases... :)

Thanks @gsmiller for the review! Yes there are indeed many small changes, and more will come :D

jpountz · 2021-08-24T15:47:04Z

lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java

@@ -527,7 +503,10 @@ public TopDocs search(Query query, int n) throws IOException {
   *
   * @throws TooManyClauses If a query would exceed {@link IndexSearcher#getMaxClauseCount()}
   *     clauses.
+   * @deprecated This method is being deprecated in favor of {@link IndexSearcher#search(Query,


I think we should try to keep IndexSearcher#search(Query, Collector) and the TopXXXCollector#create factory methods during the 9.x series, as I expect it to be quite commonly used. However in my opinion it would be fine to cut over expert classes like those in the benchmark module without providing such long bw compatibility.

jpountz · 2021-08-24T15:49:23Z

lucene/core/src/java/org/apache/lucene/search/TopFieldCollectorManager.java

+    this.hitsThresholdChecker =
+        supportsConcurrency
+            ? HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, numHits))
+            : HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits));


let's also store supportsConcurrency in a member of this class so that we can check that users never create more than one collector when it is false?

Ah I like this idea better than the approach described in #240 (comment), since it doesn't require an additional API to be added to CollectorManager for thread-safety! I've implemented this in the latest commit.

…or) to DrillSideways#searchSequentially(query, collectorManager) to by pass caching" This reverts commit 86bac52.

zacharymorn · 2021-09-10T08:29:54Z

Hi @jpountz @gsmiller , just want to check back on this PR to see if you have any further feedback?

jpountz · 2021-09-15T16:35:14Z

Sorry @zacharymorn I have not forgotten about it but the change is very large and it's hard to find enough adjacent time to review it. I'll do my best to find time in the coming week.

zacharymorn · 2021-09-16T06:09:45Z

Sorry @zacharymorn I have not forgotten about it but the change is very large and it's hard to find enough adjacent time to review it. I'll do my best to find time in the coming week.

No worry @jpountz , this PR is indeed large and time consuming! Please take your time and I appreciate your help on the review!

jpountz

I left some comments but the approach that you took so far looks good to me. I see that we still need to migrate some collectors like LargeNumHitsTopDocsCollector or the internal JoinUtil collectors to a CollectorManager?

jpountz · 2021-09-16T06:54:33Z

lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java

@@ -180,6 +185,7 @@ protected int withTopDocs(IndexSearcher searcher, Query q, TopDocs hits) throws
    return res;
  }

+  @Deprecated


I suspect it would still be useful to assess the performance of a collector but you probably have a point that if performance is the only thing that the benchmark module reports, then there's a chance that it's not actually used much.

I did a quick GitHub search to see how much code I could find that extends this code:

"extends ReadTask" createCollector: no hits

"extends SearchTask" createCollector -SearchWithCollectorTask: no hits (I had to exclude SearchWithCollectorTask or all Lucene forks would be returned).

Given that the benchmark module is quite expert, I'd be ok with breaking hard here.

jpountz · 2021-09-16T06:55:10Z

...ne/classification/src/java/org/apache/lucene/classification/CachingNaiveBayesClassifier.java


-        int ret = totalHitCountCollector.getTotalHits();
+        int ret = indexSearcher.search(booleanQuery.build(), new TotalHitCountCollectorManager());


We could make it even better by using IndexSearcher#count.

jpountz · 2021-09-16T07:04:19Z

lucene/classification/src/java/org/apache/lucene/classification/SimpleNaiveBayesClassifier.java

@@ -179,8 +178,7 @@ protected int countDocsWithClass() throws IOException {
      if (query != null) {
        q.add(query, BooleanClause.Occur.MUST);
      }
-      indexSearcher.search(q.build(), classQueryCountCollector);
-      docCount = classQueryCountCollector.getTotalHits();
+      docCount = indexSearcher.search(q.build(), new TotalHitCountCollectorManager());


let's simplify by using IndexSearcher#count?

Done. Also updated the implementation of IndexSearcher#count to use TotalHitCountCollectorManager.

jpountz · 2021-09-16T07:04:26Z

lucene/classification/src/java/org/apache/lucene/classification/SimpleNaiveBayesClassifier.java

-    TotalHitCountCollector totalHitCountCollector = new TotalHitCountCollector();
-    indexSearcher.search(booleanQuery.build(), totalHitCountCollector);
-    return totalHitCountCollector.getTotalHits();
+    return indexSearcher.search(booleanQuery.build(), new TotalHitCountCollectorManager());


likewise here

jpountz · 2021-09-16T07:08:45Z

lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java

-   * Same as above with additional parameters to allow passing in the threshold checker and the max
-   * score accumulator.
-   */
-  static TopFieldCollector create(


Indeed we don't support access to pkg-private code.

jpountz · 2021-09-16T07:10:45Z

lucene/core/src/java/org/apache/lucene/search/TopFieldCollectorManager.java

+
+    if (sort.fields.length == 0) {
+      throw new IllegalArgumentException("Sort must contain at least one field");
+    }


Can we move as much validation as possible to the constructor, so that we would detect inconsistencies earlier: when users create the collector manager rather than when Lucene starts collecting hits?

Ahh good catch! Updated.

jpountz · 2021-09-16T07:11:15Z

lucene/core/src/java/org/apache/lucene/search/TopFieldCollectorManager.java

+    return TopDocs.merge(sort, 0, numHits, topDocs);
+  }
+
+  public List<TopFieldCollector> getCollectors() {


Can we make it pkg-private?

So this method was added to support early termination detection in tests like the following:

collectorManager.getCollectors().stream().anyMatch(TopFieldCollector::isEarlyTerminated)

I think Lucene users may have the need for this check as well?

jpountz · 2021-09-16T07:20:31Z

lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestLargeNumHitsTopDocsCollector.java


    searcher.search(testQuery, largeCollector);
-    searcher.search(testQuery, regularCollector);
+    TopDocs topDocs = searcher.search(testQuery, regularCollector);


keep it called secondTopDocs, I think it helped readability

jpountz · 2021-09-16T07:37:35Z

@zacharymorn @gsmiller If we try to do everything in a single PR, I worry that this will become very hard to review. I wonder if we should split by replacing the deprecation warnings of IndexSearcher#search(Query,Collector) and TopXXXCollector#create by some words about how the CollectorManager variants are recommended as they make it possible to search using multiple threads. Then when this PR is merged we can work more iteratively on replacing usage of Collector with CollectorManager, and we could add the deprecation warnings in the end once we have migrated all our code?

zacharymorn · 2021-09-17T04:39:27Z

I left some comments but the approach that you took so far looks good to me. I see that we still need to migrate some collectors like LargeNumHitsTopDocsCollector or the internal JoinUtil collectors to a CollectorManager?

Thanks @jpountz for the review! Indeed there are more that need to be migrated as well. In fact, I have been working on the 2nd PR to migrate more "easy" ones, and after that, we still have the following:

Explanation related

ExplanationAsserter
MatchesAsserter

Facet related

FacetsCollector
RandomSamplingFacetsCollector

Grouping related

GroupFacetCollector
TopGroupsCollector
BlockGroupingCollector
FirstPassGroupingCollector
AllGroupHeadsCollector
AllGroupsCollector

Global ordinal related

GlobalOrdinalsCollector
GlobalOrdinalsWithScoreCollector

Anonymous classes

Anonymous Collector used in JoinUtil
Anonymous Collector used in TestJoinUtil
Anonymous Collector used in QueryUtils

Profiling related

ProfilerCollector
MemoryAccountingBitsetCollector
MonitorQueryCollector

Counting related

DocValuesStatsCollector
MyHitCollector
MatchCollector

Result diversification related

DistinctValuesCollector
DiversifiedTopDocsCollector

Other

CachingCollector
TerminateAfterCollector
TimeLimitingCollector
LargeNumHitsTopDocsCollector

As each of these categories / collectors will likely require more thoughts / time to understand the current logic, and come up with CollectorManager implementation that's also thread-safe, I'm thinking to create follow-up Jira sub-tickets for each of these categories to track them. What do you think?

I wonder if we should split by replacing the deprecation warnings of IndexSearcher#search(Query,Collector) and TopXXXCollector#create by some words about how the CollectorManager variants are recommended as they make it possible to search using multiple threads. Then when this PR is merged we can work more iteratively on replacing usage of Collector with CollectorManager, and we could add the deprecation warnings in the end once we have migrated all our code?

Yeah I can definitely follow this strategy as well. The only concern I may have is that, as can be seen above, it may still take some time to migrate all existing collectors to collectorManagers. So using suggestion comment for now and adding deprecation warning at the end may likely see a few more collectors without collectorManagers added accidentally (i.e. folks may not notice the comment when they just code against the IndexSearch#search(Query, Collector) API)?

javanna · 2022-01-10T10:24:58Z

hi @zacharymorn I looked at this issue a long while ago, before you started working on it, and I am now catching up. I see you made great progress on it! I also see that reviewing it as a single change is a bit of a challenge due to the size of the PR. Could I help out splitting this in smaller PRs so that we make progress towards getting the change in, step by step?

zacharymorn · 2023-11-09T03:59:40Z

We are still a ways away (from seeing Lucene fully utilize available hardware concurrency available at search time to reduce query latencies)

For example: query concurrency is still tied to segment geometry, which is insane. An "optimized" (forceMerge to 1 segment) index loses all of its concurrency! Hideously, at my team (Amazon product search) we had to create a neutered merge policy that intentionally avoids "accidentally" making a lopsided index (where too many docs are in a single segment) because it harms our long pole query latencies (we run all queries concurrently, except close to red-line), due to this silly Lucene limitation.

We have a longstanding issue open for this: #9721, and it ought not be so difficult to fix because Lucene already has strong support for one thread to search a "slice" of a big segment, i.e. we can make the thread work units slices of segments instead of whole segments or collections of segments (what IndexSearcher now calls slices, confusingly).

Thanks @mikemccand for sharing the additional insight and context! Looks like there's already a lot of good discussions in that ticket as well. Will definitely take a look at that and see what I can contribute, once most of the work for deprecating IndexSearch#search(Query, Collector) has been completed.

mikemccand

Looks great, thanks @zacharymorn! I just think we should revert the lucene/benchmark change that breaks testing of a custom collector for this first step ...

zacharymorn · 2023-11-14T07:11:46Z

Looks great, thanks @zacharymorn! I just think we should revert the lucene/benchmark change that breaks testing of a custom collector for this first step ...

Thanks @mikemccand for the review feedback and approval!

mikemccand

Great, thank you @zacharymorn! Awesome to see Lucene bit by bit progressing towards fully utilizing concurrent hardware by default. It's about time we pay down this longstanding tech debt.

zacharymorn · 2023-11-15T08:01:20Z

Hi @mikemccand @jpountz @javanna @gsmiller , I have updated this PR to pick up the latest from main, as well as revert some changes to save them for follow-up PRs that address other collectors. This PR is now focused on the following:

Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager

Refactor some tests to use the above collector manager

Could you please take a look, and let me know if you have any feedback?

Hi @jpountz @javanna @gsmiller, just want to have a quick follow-up to see if you may have any further feedback on this PR? I plan to merge it in the next few days if there's no major concern.

javanna · 2023-11-16T10:08:04Z

Thanks for reviving this PR @zacharymorn ! the changes look good to me, having top score doc and top field collector managers sounds like a natural next step, and removes code duplication. I wish that we could avoid having the supportConcurrency flag in them, but that's ok for now, and I don't have a good alternative in mind to avoid unnecessary overhead when there's a single slice.

One thing that I wonder is whether we are ok already deprecating search(Query, Collector) given that we have a lot of usages still within Lucene. I was thinking that we may want to replace them all before deprecating, and then remove the method in main, but this is where I got distracted from this project last year and ended up dropping it after all. Should we consider getting your changes in, minus the deprecation, perhaps, and split the work on removing usages of the method?

zacharymorn · 2023-11-17T00:42:27Z

Thanks @javanna for the feedback!

One thing that I wonder is whether we are ok already deprecating search(Query, Collector) given that we have a lot of usages still within Lucene. I was thinking that we may want to replace them all before deprecating, and then remove the method in main, but this is where I got distracted from this project last year and ended up dropping it after all. Should we consider getting your changes in, minus the deprecation, perhaps, and split the work on removing usages of the method?

My current preference is actually to get the deprecation annotation in early, so that while we take on the remaining work to migrate other collector usage (which could take some time due to the amount of it), other contributors / application developers could start to take note of those deprecation messages, and stay away from using search(Query, Collector) in favor of search(Query, CollectorManager) as early as possible. Otherwise, there may be more and more of those usage being added into lucene and we will always be playing catch-up to migrate them until we deprecate those methods. What do you think?

javanna · 2023-11-21T12:38:54Z

That is fine with me @zacharymorn . Indeed I have observed as well that there will be new usages introduced while we work on removing current usages, and deprecating early can help with that.

…r this test class

javanna · 2023-11-30T19:29:32Z

heya @zacharymorn given that this is a deprecation, I guess you meant on backporting to branch_9x as well?

zacharymorn · 2023-11-30T22:09:29Z

Hi @javanna , I was actually thinking to have it for 10.0.0 (added an entry into that section in CHANGES.txt), as deprecating IndexSearcher#search(Query, Collector) has a rather large impact to lucene applications (not that it requires immediate changes, but the method itself is very frequently used and hence requires some attentions). But I can change it to branch_9x instead if that's preferred?

javanna · 2023-12-04T19:02:36Z

I think we would do the removal in 10, as that's the type of change that we would make for a major release. Or can decide to delay the removal if need be, but delaying the deprecation kind of defeats the purpose of deprecating?

zacharymorn · 2023-12-04T20:14:52Z

I think we would do the removal in 10, as that's the type of change that we would make for a major release. Or can decide to delay the removal if need be, but delaying the deprecation kind of defeats the purpose of deprecating?

Do you know what's the timeline we have for releasing version 10? Based on my quick look over the last few days, most likely it will take a good few months for us to completely remove that usage in lucene codebase (I'm also a bit occupied at the moment), hence we may need to delay the removal if version 10 is planned to be released within like a year.

With the above said, like I shared earlier, I feel getting the deprecation warning message into our codebase as early as possible is critical for lucene contributors to start moving away from using that method, so that we won't be playing catch-up with ourselves and will be able to remove that method one day. Checking the deprecation warning into main already achieved this purpose. It will be great for lucene application developer to see this as early as possible as well, but its not as urgent (and it will take them longer time to migrate anyway), hence I was having it for version 10 given its potential impact. But I could change it to be released for 9x if that's still preferred.

javanna · 2023-12-06T20:17:02Z

I think that it would be even better to get the deprecation out with 9x, regardless of whether we will effectively remove with 10. At least we let users know that they should use a different method. I don't see advantages around delaying a deprecation somehow. Thoughts @jpountz ?

mikemccand · 2023-12-07T11:14:31Z

+1 to backport the deprecation message/tags to 9.x, even if we cannot fully remove our own internal usage of the deprecated APIs before releasing 10.0. The two can / should be decoupled since it's so much work?

Let's also open a dedicated spinoff issue to "remove all deprecated IS.search usage / IS.search methods in 10.0", and mark it as Blocker for now? If 10.0 wants to get out and we haven't done this blocker we can revisit how/when to do it, whether to downgrade, etc.?

… of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager (apache#240)

zacharymorn · 2023-12-07T22:05:47Z

Thanks both @javanna @mikemccand . I have adjusted the change entry and opened a backport PR, as well as a new spinoff issue for 10.0.0 #12892.

… of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager (#240) (#12891)

…ethods (#create, #createSharedManager) (#13500) These methods were deprecated in #240 which is part of Lucene 10.0. Addresses #13499

zacharymorn added 2 commits August 11, 2021 21:18

LUCENE-10002: Refactor DrillSideways to use extracted out collector m…

776759d

…anagers

zacharymorn requested review from dweiss, jpountz, mikemccand and gsmiller August 12, 2021 05:30

LUCENE-10002: Refactor some more tests to use collector manager

0c2c12b

jpountz reviewed Aug 13, 2021

View reviewed changes

zacharymorn added 2 commits August 18, 2021 19:09

Add flag to collector managers' constructors to allow specifying if t…

4af7740

…hread-safe internal states are needed, and remove static methods in favor of constructors

zacharymorn requested a review from jpountz August 19, 2021 05:28

gsmiller reviewed Aug 19, 2021

View reviewed changes

Update per feedback

3a3a86a

zacharymorn requested a review from gsmiller August 24, 2021 00:55

jpountz reviewed Aug 24, 2021

View reviewed changes

Use supportsConcurrency to detect potentially incorrect usage

d21e5f9

zacharymorn requested a review from jpountz August 25, 2021 03:43

zacharymorn added 2 commits August 25, 2021 19:33

Revert "Add the same hack used in DrillSideways#search(query, collect…

be6977d

…or) to DrillSideways#searchSequentially(query, collectorManager) to by pass caching" This reverts commit 86bac52.

Merge branch 'main' into LUCENE-10002-IndexSearcherCollectorAPIRemoval

2df81c5

jpountz reviewed Sep 16, 2021

View reviewed changes

address comment

81163de

Merge branch 'main' into LUCENE-10002-IndexSearcherCollectorAPIRemoval

aa114d7

bryanlb mentioned this pull request Jun 2, 2022

Lucene query improvements slackhq/astra#311

Merged

mikemccand approved these changes Nov 13, 2023

View reviewed changes

Revert some changes

0b4f9cd

mikemccand approved these changes Nov 14, 2023

View reviewed changes

zacharymorn added 2 commits November 27, 2023 11:53

Merge branch 'main' into LUCENE-10002-IndexSearcherCollectorAPIRemoval

d588a82

Add change entry

8f2797f

zacharymorn merged commit 38ca8d3 into apache:main Nov 28, 2023

mikemccand added a commit that referenced this pull request Nov 29, 2023

#240: disable threads in the random IndexSearcher from newSearcher fo…

8703b54

…r this test class

zacharymorn mentioned this pull request Dec 7, 2023

[Backporting] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager #12891

Merged

zacharymorn added a commit that referenced this pull request Dec 9, 2023

LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor…

59c0f82

… of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager (#240) (#12891)

javanna modified the milestones: 10.0.0, 9.10.0 Dec 11, 2023

asfimport mentioned this pull request Aug 24, 2022

Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager) [LUCENE-10002] #11041

Closed

This was referenced Jun 18, 2024

Remove the @Deprecated methods from TopScoreDocCollector and TopFieldCollector #13499

Closed

Removing usage of TopScoreDocCollector + TopFieldCollector deprecated methods (#create, #createSharedManager) #13500

Merged

javanna pushed a commit that referenced this pull request Jul 29, 2024

Remove usage of TopScoreDocCollector + TopFieldCollector deprecated m…

8a7d484

…ethods (#create, #createSharedManager) (#13500) These methods were deprecated in #240 which is part of Lucene 10.0. Addresses #13499

	Collector collector = createCollector();
	searcher.search(q, collector);
	// hits = collector.topDocs();


		int ret = totalHitCountCollector.getTotalHits();
		int ret = indexSearcher.search(booleanQuery.build(), new TotalHitCountCollectorManager());

LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager #240

LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager #240

Conversation

zacharymorn commented Aug 12, 2021 • edited Loading

Description / Solution

Tests

Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gsmiller left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zacharymorn Aug 20, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zacharymorn commented Aug 20, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zacharymorn commented Sep 10, 2021

jpountz commented Sep 15, 2021

zacharymorn commented Sep 16, 2021

jpountz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jpountz commented Sep 16, 2021

zacharymorn commented Sep 17, 2021

Explanation related

Facet related

Grouping related

Global ordinal related

Anonymous classes

Profiling related

Counting related

Result diversification related

Other

javanna commented Jan 10, 2022

zacharymorn commented Nov 9, 2023

mikemccand left a comment

Choose a reason for hiding this comment

zacharymorn commented Nov 14, 2023

mikemccand left a comment

Choose a reason for hiding this comment

zacharymorn commented Nov 15, 2023

javanna commented Nov 16, 2023

zacharymorn commented Nov 17, 2023

javanna commented Nov 21, 2023

javanna commented Nov 30, 2023

zacharymorn commented Nov 30, 2023

javanna commented Dec 4, 2023

zacharymorn commented Dec 4, 2023

javanna commented Dec 6, 2023

mikemccand commented Dec 7, 2023

zacharymorn commented Dec 7, 2023

zacharymorn commented Aug 12, 2021 •

edited

Loading

zacharymorn Aug 20, 2021 •

edited

Loading