-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager #240
LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager #240
Conversation
… of IndexSearcher#search(Query, CollectorManager) Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager, TotalHitCountCollectorManager and FixedBitSetCollector Refactor some tests to use the above collector manager
/* | ||
nocommit | ||
Should the following two be passed in instead? Possible custom initialization based on executor status and slices? | ||
On the other hand, in a single-threaded environment, shared HitsThresholdChecker and MaxScoreAccumulator should be fast without lock contention anyway? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check if luceneutil benchmarks become slower with this change? Hopefully they're not slower and we can keep this logic simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great suggestion! I just ran luceneutil, and it turns out that the changes do get slower due to this thread-safe HitsThresholdChecker
(confirmed by playing with the atomic long inside a bit). Here is the full benchmark result:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
HighTermTitleBDVSort 52.86 (14.0%) 42.72 (6.6%) -19.2% ( -34% - 1%) 0.000
HighTermMonthSort 31.66 (10.7%) 25.90 (6.3%) -18.2% ( -31% - -1%) 0.000
HighTermDayOfYearSort 24.00 (15.7%) 20.02 (8.5%) -16.6% ( -35% - 9%) 0.000
TermDTSort 20.22 (9.0%) 17.13 (6.6%) -15.3% ( -28% - 0%) 0.000
MedTerm 1270.02 (3.9%) 1208.29 (5.1%) -4.9% ( -13% - 4%) 0.001
HighTerm 941.14 (4.0%) 900.05 (5.9%) -4.4% ( -13% - 5%) 0.006
OrHighNotMed 531.63 (5.6%) 514.68 (6.2%) -3.2% ( -14% - 9%) 0.089
LowTerm 1112.50 (5.1%) 1081.48 (6.2%) -2.8% ( -13% - 8%) 0.118
MedSpanNear 12.50 (2.2%) 12.18 (2.2%) -2.5% ( -6% - 1%) 0.000
HighSpanNear 12.82 (1.7%) 12.52 (1.7%) -2.4% ( -5% - 1%) 0.000
OrHighNotLow 562.50 (6.9%) 550.15 (5.6%) -2.2% ( -13% - 11%) 0.269
OrHighNotHigh 577.10 (5.6%) 565.51 (6.7%) -2.0% ( -13% - 10%) 0.303
OrNotHighHigh 633.91 (6.7%) 621.85 (6.0%) -1.9% ( -13% - 11%) 0.345
HighIntervalsOrdered 6.94 (3.1%) 6.81 (3.2%) -1.9% ( -7% - 4%) 0.061
PKLookup 174.06 (4.2%) 171.17 (4.5%) -1.7% ( -9% - 7%) 0.230
OrHighLow 283.56 (5.3%) 279.05 (5.8%) -1.6% ( -12% - 10%) 0.365
MedIntervalsOrdered 19.00 (3.7%) 18.74 (3.6%) -1.4% ( -8% - 6%) 0.237
Fuzzy1 59.38 (11.6%) 58.65 (12.0%) -1.2% ( -22% - 25%) 0.740
HighSloppyPhrase 12.68 (2.7%) 12.53 (2.1%) -1.1% ( -5% - 3%) 0.135
Fuzzy2 63.95 (8.6%) 63.40 (9.0%) -0.9% ( -17% - 18%) 0.759
MedSloppyPhrase 17.53 (2.7%) 17.38 (2.2%) -0.8% ( -5% - 4%) 0.279
LowSloppyPhrase 23.90 (3.0%) 23.70 (2.8%) -0.8% ( -6% - 5%) 0.354
Respell 39.23 (3.1%) 38.90 (3.1%) -0.8% ( -6% - 5%) 0.403
AndHighMed 61.17 (3.8%) 60.72 (3.8%) -0.7% ( -8% - 7%) 0.542
OrHighMed 48.81 (3.0%) 48.46 (3.0%) -0.7% ( -6% - 5%) 0.444
LowIntervalsOrdered 14.64 (2.3%) 14.55 (2.3%) -0.6% ( -5% - 4%) 0.380
LowSpanNear 11.65 (1.6%) 11.57 (1.6%) -0.6% ( -3% - 2%) 0.223
AndHighHigh 16.57 (3.6%) 16.46 (4.0%) -0.6% ( -7% - 7%) 0.608
IntNRQ 17.25 (28.0%) 17.17 (27.3%) -0.5% ( -43% - 76%) 0.959
OrHighHigh 9.28 (2.5%) 9.24 (2.5%) -0.4% ( -5% - 4%) 0.635
MedPhrase 72.99 (3.3%) 72.75 (3.7%) -0.3% ( -7% - 6%) 0.763
LowPhrase 78.02 (3.5%) 77.77 (4.0%) -0.3% ( -7% - 7%) 0.786
BrowseMonthSSDVFacets 3.19 (1.8%) 3.19 (1.7%) -0.2% ( -3% - 3%) 0.653
BrowseDayOfYearSSDVFacets 3.01 (1.5%) 3.01 (1.2%) -0.0% ( -2% - 2%) 0.991
Wildcard 27.98 (2.8%) 27.98 (2.7%) 0.0% ( -5% - 5%) 0.981
AndHighLow 454.42 (5.4%) 455.10 (4.6%) 0.1% ( -9% - 10%) 0.925
HighPhrase 289.77 (5.1%) 290.54 (4.0%) 0.3% ( -8% - 9%) 0.854
OrNotHighLow 521.43 (6.5%) 523.38 (5.7%) 0.4% ( -11% - 13%) 0.846
Prefix3 39.82 (4.8%) 40.02 (3.1%) 0.5% ( -7% - 8%) 0.707
OrNotHighMed 560.49 (5.8%) 563.51 (5.5%) 0.5% ( -10% - 12%) 0.762
BrowseMonthTaxoFacets 0.99 (4.8%) 1.00 (4.6%) 0.6% ( -8% - 10%) 0.702
BrowseDateTaxoFacets 0.94 (4.6%) 0.94 (4.6%) 0.6% ( -8% - 10%) 0.683
BrowseDayOfYearTaxoFacets 0.94 (4.5%) 0.94 (4.5%) 0.7% ( -7% - 10%) 0.638
To fix the performance regression, but still try to keep the logic simple then, I'm considering adding an additional flag supportsConcurrency
to the collector managers' constructor like such:
public TopScoreDocCollectorManager(int numHits, ScoreDoc after, int totalHitsThreshold, boolean supportsConcurrency) {
...
this.hitsThresholdChecker = supportsConcurrency ?
HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, numHits)) :
HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits));
this.minScoreAcc = supportsConcurrency ? new MaxScoreAccumulator() : null;
}
public TopFieldCollectorManager(Sort sort, int numHits, FieldDoc after, int totalHitsThreshold, boolean supportsConcurrency) {
...
this.hitsThresholdChecker = supportsConcurrency ?
HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, numHits)) :
HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits));
this.minScoreAcc = supportsConcurrency ? new MaxScoreAccumulator() : null;
}
What do you think about this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jpountz, for now I've implemented the above approach in 4af7740, and the lucenenutil benchmark shows there's no obvious impact to performance now:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
MedSpanNear 6.59 (5.6%) 6.30 (10.4%) -4.5% ( -19% - 12%) 0.091
BrowseDayOfYearTaxoFacets 0.45 (6.5%) 0.43 (10.0%) -4.3% ( -19% - 13%) 0.106
OrNotHighLow 321.23 (8.0%) 308.14 (7.5%) -4.1% ( -18% - 12%) 0.097
HighSpanNear 2.96 (6.4%) 2.84 (9.5%) -3.9% ( -18% - 12%) 0.129
BrowseDateTaxoFacets 0.45 (6.9%) 0.43 (9.6%) -3.9% ( -19% - 13%) 0.142
HighIntervalsOrdered 4.12 (4.7%) 3.97 (8.7%) -3.6% ( -16% - 10%) 0.102
BrowseMonthSSDVFacets 1.68 (6.6%) 1.63 (11.7%) -3.5% ( -20% - 15%) 0.239
HighSloppyPhrase 6.41 (8.6%) 6.18 (10.0%) -3.5% ( -20% - 16%) 0.233
IntNRQ 14.11 (6.2%) 13.62 (11.2%) -3.5% ( -19% - 14%) 0.225
Wildcard 14.90 (7.5%) 14.39 (8.2%) -3.4% ( -17% - 13%) 0.170
LowSpanNear 4.53 (5.7%) 4.38 (8.1%) -3.3% ( -16% - 11%) 0.135
OrHighLow 164.11 (6.2%) 158.84 (8.5%) -3.2% ( -16% - 12%) 0.174
BrowseMonthTaxoFacets 0.47 (6.1%) 0.45 (9.2%) -3.2% ( -17% - 12%) 0.193
LowIntervalsOrdered 3.02 (5.5%) 2.93 (9.1%) -3.2% ( -16% - 12%) 0.185
LowSloppyPhrase 8.80 (8.7%) 8.53 (9.7%) -3.0% ( -19% - 16%) 0.298
MedIntervalsOrdered 6.46 (5.8%) 6.26 (10.1%) -3.0% ( -17% - 13%) 0.248
AndHighMed 41.11 (6.6%) 39.91 (8.0%) -2.9% ( -16% - 12%) 0.205
Fuzzy1 30.54 (13.2%) 29.75 (14.1%) -2.6% ( -26% - 28%) 0.550
BrowseDayOfYearSSDVFacets 1.46 (6.1%) 1.43 (10.6%) -2.6% ( -18% - 15%) 0.350
HighPhrase 104.16 (5.5%) 101.53 (8.9%) -2.5% ( -16% - 12%) 0.284
OrHighNotLow 394.47 (6.8%) 385.27 (8.3%) -2.3% ( -16% - 13%) 0.329
Fuzzy2 25.00 (8.9%) 24.42 (9.8%) -2.3% ( -19% - 18%) 0.440
MedSloppyPhrase 9.24 (8.3%) 9.03 (9.5%) -2.3% ( -18% - 16%) 0.421
OrHighHigh 4.58 (5.9%) 4.47 (7.7%) -2.2% ( -14% - 12%) 0.304
OrNotHighMed 326.95 (6.6%) 319.95 (7.1%) -2.1% ( -14% - 12%) 0.322
LowPhrase 128.75 (5.9%) 126.06 (8.9%) -2.1% ( -15% - 13%) 0.382
MedPhrase 34.71 (5.7%) 34.01 (8.0%) -2.0% ( -14% - 12%) 0.354
OrNotHighHigh 303.59 (6.7%) 297.55 (8.0%) -2.0% ( -15% - 13%) 0.394
OrHighNotMed 304.72 (5.9%) 299.56 (7.5%) -1.7% ( -14% - 12%) 0.429
Prefix3 62.22 (5.1%) 61.22 (7.5%) -1.6% ( -13% - 11%) 0.431
OrHighMed 32.41 (6.8%) 32.02 (6.8%) -1.2% ( -13% - 13%) 0.579
HighTermTitleBDVSort 24.77 (20.2%) 24.53 (18.1%) -1.0% ( -32% - 46%) 0.871
Respell 20.64 (5.3%) 20.47 (7.1%) -0.8% ( -12% - 12%) 0.682
AndHighHigh 23.51 (6.0%) 23.34 (6.7%) -0.7% ( -12% - 12%) 0.727
LowTerm 724.06 (7.1%) 719.45 (5.6%) -0.6% ( -12% - 12%) 0.751
AndHighLow 272.87 (5.9%) 271.24 (7.3%) -0.6% ( -13% - 13%) 0.777
PKLookup 90.10 (5.5%) 89.65 (8.8%) -0.5% ( -14% - 14%) 0.831
HighTerm 876.85 (7.3%) 875.67 (7.8%) -0.1% ( -14% - 16%) 0.955
OrHighNotHigh 291.95 (8.8%) 291.77 (8.8%) -0.1% ( -16% - 19%) 0.982
HighTermDayOfYearSort 37.60 (16.6%) 37.65 (17.9%) 0.1% ( -29% - 41%) 0.981
HighTermMonthSort 18.43 (18.6%) 18.50 (20.3%) 0.4% ( -32% - 48%) 0.954
TermDTSort 41.45 (15.4%) 41.99 (15.1%) 1.3% ( -25% - 37%) 0.786
MedTerm 700.62 (6.8%) 711.10 (6.0%) 1.5% ( -10% - 15%) 0.461
Please let me know how this looks to you.
…hread-safe internal states are needed, and remove static methods in favor of constructors
…rillSideways#searchSequentially(query, collectorManager) to by pass caching This is needed to pass test case from ./gradlew test --tests TestDrillSideways.testRandom -Dtests.seed=69A3EF02D8E3465E -Dtests.nightly=true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments. Looks good overall. Lots of little changes! Those test cases... :)
@@ -180,6 +185,7 @@ protected int withTopDocs(IndexSearcher searcher, Query q, TopDocs hits) throws | |||
return res; | |||
} | |||
|
|||
@Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if anyone out there has sub-classed this to provide their own Collector
through overriding this method? It's possible right? Would we want to support this with a CollectorManager
now instead? You might consider creating a new protected method--createCollectorManager
--that sub-classes could override if they want, then add javadoc here with a @lucene.deprecated
tag pointing users to the new method.
Also, if this change is going onto main
, I might kill this method entirely and then add it back in as deprecated if you choose to backport to 8x.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if anyone out there has sub-classed this to provide their own Collector through overriding this method? It's possible right?
Hmm you are right. I reverted some changes in this class to still use the ReadTask#createCollector
method, in case there's overriding from users.
Would we want to support this with a CollectorManager now instead? You might consider creating a new protected method--createCollectorManager--that sub-classes could override if they want, then add javadoc here with a @lucene.deprecated tag pointing users to the new method.
I think this createCollector
method was originally added (11 years ago) to support benchmarking, and not sure how widely this is used actually? So hopefully marking it as deprecated for release 9.0 and removing it entirely once we release 10.0 should be good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was more thinking along the lines of keeping the change you had but providing a protected method that would allow sub-classes to provide their own CollectorManager
if they wanted to do so (which would give users a migration path if they were previously providing their own Collector
). So in 9.0, I think you'd move completely to the CollectorManager
approach like you had. What do you think of that approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see where you are coming from. However, upon looking at this method more closely, I'm afraid this method is effectively not useful, since the result of using this collector was commented out :D :
lucene/lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java
Lines 119 to 121 in 3b3f960
Collector collector = createCollector(); | |
searcher.search(q, collector); | |
// hits = collector.topDocs(); |
So I'm now leaning more towards just removing this method altogether if no users ever noticed / complaint about this. What do you think @gsmiller @jpountz ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect it would still be useful to assess the performance of a collector but you probably have a point that if performance is the only thing that the benchmark module reports, then there's a chance that it's not actually used much.
I did a quick GitHub search to see how much code I could find that extends this code:
"extends ReadTask" createCollector
: no hits"extends SearchTask" createCollector -SearchWithCollectorTask
: no hits (I had to excludeSearchWithCollectorTask
or all Lucene forks would be returned).
Given that the benchmark module is quite expert, I'd be ok with breaking hard here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I've removed this method, and also fixed the commented out code.
@@ -527,7 +503,10 @@ public TopDocs search(Query query, int n) throws IOException { | |||
* | |||
* @throws TooManyClauses If a query would exceed {@link IndexSearcher#getMaxClauseCount()} | |||
* clauses. | |||
* @deprecated This method is being deprecated in favor of {@link IndexSearcher#search(Query, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we like the @lucene.deprecated
tag?
Is your reasoning behind keeping this in there for now that all of our uses of this (internal to Lucene) haven't yet migrated as part of this change? Would the plan me to migrate all usages off of this internally and then actually remove it on main
/9.0, or are you thinking of keeping it around until 10.0? I think our backwards compatibility policy is such that we could just directly remove this on main
/9.0, but then leave it like you have it (marked deprecated) if you choose to backport this to 8x. Since this method is so fundamental though, I could easily see an argument to keep it around for an extra major release to give users more time to migrate. Then again, the migration path seems pretty straight-forward. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we like the @lucene.deprecated tag?
Do you mean @lucene.deprecated
should be created and used instead of using @deprecated
? I can't seems to find that tag is being used in lucene actually...I think @lucene.deprecated
most likely won't be recognized by IDE or build tool to signal / warn the use of deprecated methods though.
Is your reasoning behind keeping this in there for now that all of our uses of this (internal to Lucene) haven't yet migrated as part of this change? Would the plan me to migrate all usages off of this internally and then actually remove it on main/9.0, or are you thinking of keeping it around until 10.0? I think our backwards compatibility policy is such that we could just directly remove this on main/9.0, but then leave it like you have it (marked deprecated) if you choose to backport this to 8x. Since this method is so fundamental though, I could easily see an argument to keep it around for an extra major release to give users more time to migrate. Then again, the migration path seems pretty straight-forward. What do you think?
As explained in https://issues.apache.org/jira/browse/LUCENE-10002?focusedCommentId=17397827&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17397827, I use deprecation instead of direct removal in this PR exactly for the reasons you mentioned:
- Not all internal uses have been migrated in this PR (which may require another few thousand lines of changes).
- I personally feel that we should give users more time for migration and testing out the changes, so 10.0 release might be a good timing for complete removal.
I would be interested in seeing how folks feel about the timing of removal, but after #1 is completed, complete removal of IndexSearcher#search(Query, Collector) in lucene codebase should require only straightforward deletion changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should try to keep IndexSearcher#search(Query, Collector)
and the TopXXXCollector#create
factory methods during the 9.x series, as I expect it to be quite commonly used. However in my opinion it would be fine to cut over expert classes like those in the benchmark module without providing such long bw compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 makes sense to keep it in 9.0. As for my comment about @lucene.deprecated
, I was getting this mixed up with something else. Apologies for any confusion!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for my comment about
@lucene.deprecated
, I was getting this mixed up with something else. Apologies for any confusion!
Though, I think there are both @deprecated
(javadoc tag) and @Deprecated
(Java class annotation) -- we do both of them typically (looks like you did here @zacharymorn, great!).
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java
Outdated
Show resolved
Hide resolved
lucene/test-framework/src/java/org/apache/lucene/util/FixedBitSetCollector.java
Outdated
Show resolved
Hide resolved
lucene/test-framework/src/java/org/apache/lucene/util/FixedBitSetCollector.java
Outdated
Show resolved
Hide resolved
lucene/test-framework/src/java/org/apache/lucene/util/FixedBitSetCollector.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollectorManager.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java
Outdated
Show resolved
Hide resolved
Thanks @gsmiller for the review! Yes there are indeed many small changes, and more will come :D |
@@ -527,7 +503,10 @@ public TopDocs search(Query query, int n) throws IOException { | |||
* | |||
* @throws TooManyClauses If a query would exceed {@link IndexSearcher#getMaxClauseCount()} | |||
* clauses. | |||
* @deprecated This method is being deprecated in favor of {@link IndexSearcher#search(Query, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should try to keep IndexSearcher#search(Query, Collector)
and the TopXXXCollector#create
factory methods during the 9.x series, as I expect it to be quite commonly used. However in my opinion it would be fine to cut over expert classes like those in the benchmark module without providing such long bw compatibility.
this.hitsThresholdChecker = | ||
supportsConcurrency | ||
? HitsThresholdChecker.createShared(Math.max(totalHitsThreshold, numHits)) | ||
: HitsThresholdChecker.create(Math.max(totalHitsThreshold, numHits)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also store supportsConcurrency
in a member of this class so that we can check that users never create more than one collector when it is false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I like this idea better than the approach described in #240 (comment), since it doesn't require an additional API to be added to CollectorManager
for thread-safety! I've implemented this in the latest commit.
…or) to DrillSideways#searchSequentially(query, collectorManager) to by pass caching" This reverts commit 86bac52.
Sorry @zacharymorn I have not forgotten about it but the change is very large and it's hard to find enough adjacent time to review it. I'll do my best to find time in the coming week. |
No worry @jpountz , this PR is indeed large and time consuming! Please take your time and I appreciate your help on the review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments but the approach that you took so far looks good to me. I see that we still need to migrate some collectors like LargeNumHitsTopDocsCollector
or the internal JoinUtil
collectors to a CollectorManager
?
@@ -180,6 +185,7 @@ protected int withTopDocs(IndexSearcher searcher, Query q, TopDocs hits) throws | |||
return res; | |||
} | |||
|
|||
@Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect it would still be useful to assess the performance of a collector but you probably have a point that if performance is the only thing that the benchmark module reports, then there's a chance that it's not actually used much.
I did a quick GitHub search to see how much code I could find that extends this code:
"extends ReadTask" createCollector
: no hits"extends SearchTask" createCollector -SearchWithCollectorTask
: no hits (I had to excludeSearchWithCollectorTask
or all Lucene forks would be returned).
Given that the benchmark module is quite expert, I'd be ok with breaking hard here.
|
||
int ret = totalHitCountCollector.getTotalHits(); | ||
int ret = indexSearcher.search(booleanQuery.build(), new TotalHitCountCollectorManager()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make it even better by using IndexSearcher#count
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -179,8 +178,7 @@ protected int countDocsWithClass() throws IOException { | |||
if (query != null) { | |||
q.add(query, BooleanClause.Occur.MUST); | |||
} | |||
indexSearcher.search(q.build(), classQueryCountCollector); | |||
docCount = classQueryCountCollector.getTotalHits(); | |||
docCount = indexSearcher.search(q.build(), new TotalHitCountCollectorManager()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's simplify by using IndexSearcher#count
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also updated the implementation of IndexSearcher#count
to use TotalHitCountCollectorManager
.
TotalHitCountCollector totalHitCountCollector = new TotalHitCountCollector(); | ||
indexSearcher.search(booleanQuery.build(), totalHitCountCollector); | ||
return totalHitCountCollector.getTotalHits(); | ||
return indexSearcher.search(booleanQuery.build(), new TotalHitCountCollectorManager()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* Same as above with additional parameters to allow passing in the threshold checker and the max | ||
* score accumulator. | ||
*/ | ||
static TopFieldCollector create( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed we don't support access to pkg-private code.
|
||
if (sort.fields.length == 0) { | ||
throw new IllegalArgumentException("Sort must contain at least one field"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move as much validation as possible to the constructor, so that we would detect inconsistencies earlier: when users create the collector manager rather than when Lucene starts collecting hits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh good catch! Updated.
return TopDocs.merge(sort, 0, numHits, topDocs); | ||
} | ||
|
||
public List<TopFieldCollector> getCollectors() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make it pkg-private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this method was added to support early termination detection in tests like the following:
collectorManager.getCollectors().stream().anyMatch(TopFieldCollector::isEarlyTerminated)
I think Lucene users may have the need for this check as well?
|
||
searcher.search(testQuery, largeCollector); | ||
searcher.search(testQuery, regularCollector); | ||
TopDocs topDocs = searcher.search(testQuery, regularCollector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep it called secondTopDocs
, I think it helped readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
@zacharymorn @gsmiller If we try to do everything in a single PR, I worry that this will become very hard to review. I wonder if we should split by replacing the deprecation warnings of |
Thanks @jpountz for the review! Indeed there are more that need to be migrated as well. In fact, I have been working on the 2nd PR to migrate more "easy" ones, and after that, we still have the following: Explanation related
Facet related
Grouping related
Global ordinal related
Anonymous classesAnonymous Profiling related
Counting related
Result diversification related
Other
As each of these categories / collectors will likely require more thoughts / time to understand the current logic, and come up with
Yeah I can definitely follow this strategy as well. The only concern I may have is that, as can be seen above, it may still take some time to migrate all existing collectors to collectorManagers. So using suggestion comment for now and adding deprecation warning at the end may likely see a few more collectors without collectorManagers added accidentally (i.e. folks may not notice the comment when they just code against the |
hi @zacharymorn I looked at this issue a long while ago, before you started working on it, and I am now catching up. I see you made great progress on it! I also see that reviewing it as a single change is a bit of a challenge due to the size of the PR. Could I help out splitting this in smaller PRs so that we make progress towards getting the change in, step by step? |
Thanks @mikemccand for sharing the additional insight and context! Looks like there's already a lot of good discussions in that ticket as well. Will definitely take a look at that and see what I can contribute, once most of the work for deprecating |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks @zacharymorn! I just think we should revert the lucene/benchmark
change that breaks testing of a custom collector for this first step ...
Thanks @mikemccand for the review feedback and approval! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thank you @zacharymorn! Awesome to see Lucene bit by bit progressing towards fully utilizing concurrent hardware by default. It's about time we pay down this longstanding tech debt.
Hi @jpountz @javanna @gsmiller, just want to have a quick follow-up to see if you may have any further feedback on this PR? I plan to merge it in the next few days if there's no major concern. |
Thanks for reviving this PR @zacharymorn ! the changes look good to me, having top score doc and top field collector managers sounds like a natural next step, and removes code duplication. I wish that we could avoid having the supportConcurrency flag in them, but that's ok for now, and I don't have a good alternative in mind to avoid unnecessary overhead when there's a single slice. One thing that I wonder is whether we are ok already deprecating search(Query, Collector) given that we have a lot of usages still within Lucene. I was thinking that we may want to replace them all before deprecating, and then remove the method in main, but this is where I got distracted from this project last year and ended up dropping it after all. Should we consider getting your changes in, minus the deprecation, perhaps, and split the work on removing usages of the method? |
Thanks @javanna for the feedback!
My current preference is actually to get the deprecation annotation in early, so that while we take on the remaining work to migrate other collector usage (which could take some time due to the amount of it), other contributors / application developers could start to take note of those deprecation messages, and stay away from using |
That is fine with me @zacharymorn . Indeed I have observed as well that there will be new usages introduced while we work on removing current usages, and deprecating early can help with that. |
heya @zacharymorn given that this is a deprecation, I guess you meant on backporting to branch_9x as well? |
Hi @javanna , I was actually thinking to have it for |
I think we would do the removal in 10, as that's the type of change that we would make for a major release. Or can decide to delay the removal if need be, but delaying the deprecation kind of defeats the purpose of deprecating? |
Do you know what's the timeline we have for releasing version 10? Based on my quick look over the last few days, most likely it will take a good few months for us to completely remove that usage in lucene codebase (I'm also a bit occupied at the moment), hence we may need to delay the removal if version 10 is planned to be released within like a year. With the above said, like I shared earlier, I feel getting the deprecation warning message into our codebase as early as possible is critical for lucene contributors to start moving away from using that method, so that we won't be playing catch-up with ourselves and will be able to remove that method one day. Checking the deprecation warning into |
I think that it would be even better to get the deprecation out with 9x, regardless of whether we will effectively remove with 10. At least we let users know that they should use a different method. I don't see advantages around delaying a deprecation somehow. Thoughts @jpountz ? |
+1 to backport the deprecation message/tags to 9.x, even if we cannot fully remove our own internal usage of the deprecated APIs before releasing 10.0. The two can / should be decoupled since it's so much work? Let's also open a dedicated spinoff issue to "remove all deprecated IS.search usage / IS.search methods in 10.0", and mark it as Blocker for now? If 10.0 wants to get out and we haven't done this blocker we can revisit how/when to do it, whether to downgrade, etc.? |
… of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager (apache#240)
Thanks both @javanna @mikemccand . I have adjusted the change entry and opened a backport PR, as well as a new spinoff issue for |
Description / Solution
Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager), with the following changes:
TotalHitCountCollectorManager and FixedBitSetCollector3. Refactor DrillSideways to use extracted out collector managers#11041
Tests
Passed updated tests with
./gradlew clean; ./gradlew check -Ptests.nightly=true
(currently there are nocommit messages that are failing precommit though)Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.