Make TermStates#build concurrent #12183
55bafc1
to
2020435
Compare
The change to BlendedTermQuery makes the code nearly unreadable. It is important to be able to understand how it is literally "blending the stats". Can we please find a way to make this feature less invasive?
lucene/core/src/java/org/apache/lucene/search/BlendedTermQuery.java
Personally I don't think we should do this. Maybe the API needs to be rethought. I see race conditions being added in this change, which adds potential for all kinds of crazy bugs in Lucene. I don't think any performance gain is worth that.
lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java
Thanks for the quick feedback @rmuir @uschindler !
Sure, this was to get some initial feedback on whether this would make sense, so I haven't focused much on those aspects yet. I'll try to make it cleaner in the next revision and see how that turns out.
Agreed. We should definitely not have any race conditions here. I'll try to address these and the other suggestions. Maybe we could then evaluate whether the possible benefits of concurrency make this worthwhile.
When we changed
Indeed, compound queries should not rewrite their sub-queries via the executor; otherwise we could have threads in the executor joining other threads from the same executor, which can lead to deadlocks.
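To illustrate the deadlock risk described above, here is a minimal sketch using only `java.util.concurrent`, not Lucene's actual classes. The class and method names (`NestedSubmitSketch`, `nestedJoinStarves`) are hypothetical. A task running on the pool submits a nested task to the same pool and waits for it; when every pool thread is occupied by a waiting parent, the nested task can never start. The timeout exists only so the demonstration terminates; an unbounded join would hang.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration; not part of Lucene. */
final class NestedSubmitSketch {

  /**
   * Runs an outer task that submits an inner task to the same pool and waits for it.
   * Returns true if the inner task could not complete within the timeout.
   */
  static boolean nestedJoinStarves(int poolSize, long timeoutMillis) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(poolSize);
    try {
      Future<Boolean> outer =
          executor.submit(
              () -> {
                // The "compound query" task hands its "sub-query" to the same executor.
                Future<String> inner = executor.submit(() -> "rewritten sub-query");
                try {
                  inner.get(timeoutMillis, TimeUnit.MILLISECONDS);
                  return false; // a free thread ran the inner task
                } catch (TimeoutException e) {
                  return true; // no free thread: the inner task never started
                }
              });
      return outer.get();
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("pool of 1 starves: " + nestedJoinStarves(1, 200));
    System.out.println("pool of 2 starves: " + nestedJoinStarves(2, 5000));
  }
}
```

With a single thread the outer task holds the only worker while it waits, so the inner task stays queued. A larger pool only hides the problem until enough parents wait at once.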
Yes, I'd drop
Yeah, that answers why it was not working for compound ones. Thanks @jpountz!
If there is something that might benefit from being made concurrent, it may be |
@jpountz That sounds like a really good idea to me. Thanks for pointing that out!
2020435
to
ac7fcca
Compare
So in the new revision I have changed the |
Sorry for the lag! I had another look and the general direction looks good to me; I left some suggestions. I'd be interested in getting feedback from others, as this would be quite a new usage of the IndexSearcher executor in the sense that the code being called is not especially CPU-intensive but could do quite some random I/O, so we'd be mostly using the executor to improve I/O concurrency.
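As a rough sketch of the pattern under discussion (one lookup task per segment, all submitted to a shared executor and then combined), the following uses plain `java.util.concurrent` rather than Lucene's `TaskExecutor`. The class `PerLeafLookupSketch`, its methods, and the use of a `Map` to stand in for a segment's terms dictionary are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration; a Map stands in for one segment's terms dictionary. */
final class PerLeafLookupSketch {

  /** Doc frequency of the term in one "segment", or 0 if the term is absent. */
  static int lookup(Map<String, Integer> segment, String term) {
    return segment.getOrDefault(term, 0);
  }

  /** Looks the term up in every segment concurrently and sums the doc frequencies. */
  static int totalDocFreq(
      List<Map<String, Integer>> segments, String term, ExecutorService executor)
      throws InterruptedException, ExecutionException {
    List<Callable<Integer>> tasks = new ArrayList<>();
    for (Map<String, Integer> segment : segments) {
      tasks.add(() -> lookup(segment, term));
    }
    int total = 0;
    // invokeAll blocks until all tasks are done and returns futures in task order.
    for (Future<Integer> future : executor.invokeAll(tasks)) {
      total += future.get();
    }
    return total;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    try {
      List<Map<String, Integer>> segments =
          List.of(Map.of("lucene", 3), Map.of("solr", 1), Map.of("lucene", 2));
      System.out.println(totalDocFreq(segments, "lucene", executor)); // prints 5
    } finally {
      executor.shutdown();
    }
  }
}
```

In a real index each lookup may touch disk, which is why the gain here would come from overlapping I/O rather than from spreading CPU work.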
lucene/core/src/java/org/apache/lucene/document/FeatureField.java
I also like the direction this is going! This might be a nice latency reduction for primary key lookups on idle-ish hosts. It is certainly a new use of Unfortunately, Lucene's nightly benchmarks ( |
…nts (apache#12325)" This reverts commit 10bebde. Based on a recent discussion in apache#12183 (comment) we agreed it makes more sense to parallelize knn query vector rewrite across leaves rather than leaf slices.
…nts (#12325)" (#12385) This reverts commit 10bebde. Based on a recent discussion in #12183 (comment) we agreed it makes more sense to parallelize knn query vector rewrite across leaves rather than leaf slices.
ac7fcca
to
68dca5a
Compare
I left some comments on exception handling but otherwise it looks good to me. Can you add a CHANGES entry under 9.8 as well?
@jpountz As I was running the tests(
To reproduce
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java
Hey @shubhamvishu, heads up: I merged #12569 to address the deadlock issue and opened #12574 to adjust TaskExecutor visibility outside of this PR. Hopefully you will be able to merge your PR next!
Thanks a lot @javanna! I'll push the new revision once #12574 is merged.
ef8efe1
to
408d0fb
Compare
I have rebased the PR on the changes in #12574. Could someone please review the changes? Thanks!
I have reviewed the concurrent execution part, interaction with TaskExecutor and it looks good to me. I'd ask others to double check the parallelism introduced in TermStates because I am not super familiar with it.
The change looks good to me, thanks for your tenacity @shubhamvishu and @javanna for the help!
return null;
}))
.toList();
List<TaskExecutor.Task<TermStateInfo>> taskList = new ArrayList<>(tasks);
Why do we need to clone the list?
To me it makes no sense to clone the list, as it is a new list already (created by stream.toList()). In addition the original list is not used anymore.
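The point about `stream.toList()` can be checked with a small standalone sketch. The class `ToListSketch` and its `upper` method are hypothetical names; the only library behavior relied on is that `Stream.toList()` (Java 16+) returns a new, unmodifiable list that is independent of the source collection.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of why copying the result of Stream.toList() is redundant. */
final class ToListSketch {

  /** Maps each element to upper case; the returned list is a fresh, unmodifiable list. */
  static List<String> upper(List<String> source) {
    return source.stream().map(String::toUpperCase).toList();
  }

  public static void main(String[] args) {
    List<String> source = new ArrayList<>(List.of("a", "b"));
    List<String> result = upper(source);
    source.add("c"); // later changes to the source do not affect the result
    System.out.println(result); // prints [A, B]
  }
}
```

A defensive copy such as `new ArrayList<>(tasks)` is only needed if the caller has to mutate the list afterwards, which was not the case in the code under review.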
Yes, we don't. It was required in earlier revisions but became obsolete with the rebased changes. I have removed it in the new revision.
@@ -211,4 +244,40 @@ public String toString() {

return sb.toString();
}

/** Wrapper over TermState, ordinal value, term doc frequency and total term frequency */
In main branch we could make this a record (unfortunately not in Java 11).
Indeed, I didn't use a record because I could find hardly any other usages of record in the whole codebase (just 1-2), which convinced me to use a static class instead. Is there a minimum JDK that has to be supported for Lucene (JDK 11 or 17)? In build.gradle I see minJavaVersion = JavaVersion.VERSION_17, so maybe it's fine/safe to use a record in the above change (I'm not very familiar with this)? Let me know if I should change this to a record or keep it as is. Thanks!
Lucene main (future Lucene 10.x) will use a minimum of Java 17. branch_9x (for Lucene 9.x) uses Java 11.
So a record is fine in the Lucene main branch but requires additional work when backporting changes. At some point we should use records for this type of class; this would be a good start.
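For comparison, here is a hedged sketch of the two styles being discussed. The names `RecordSketch`, `StatsClass` and `StatsRecord` are hypothetical, and the `TermState` field from the PR is omitted so the example stays self-contained. The static class compiles on Java 11; the record needs Java 16 or later.

```java
/** Hypothetical illustration; field names follow the Javadoc quoted in the diff above. */
final class RecordSketch {

  /** Java 11 compatible form: explicit fields, constructor and accessors. */
  static final class StatsClass {
    private final int ordinal;
    private final int docFreq;
    private final long totalTermFreq;

    StatsClass(int ordinal, int docFreq, long totalTermFreq) {
      this.ordinal = ordinal;
      this.docFreq = docFreq;
      this.totalTermFreq = totalTermFreq;
    }

    int ordinal() {
      return ordinal;
    }

    int docFreq() {
      return docFreq;
    }

    long totalTermFreq() {
      return totalTermFreq;
    }
  }

  /** Java 16+ form: the compiler generates the constructor, accessors, equals, hashCode and toString. */
  record StatsRecord(int ordinal, int docFreq, long totalTermFreq) {}

  public static void main(String[] args) {
    StatsClass asClass = new StatsClass(0, 3, 7L);
    StatsRecord asRecord = new StatsRecord(0, 3, 7L);
    System.out.println(asClass.docFreq() == asRecord.docFreq()); // prints true
  }
}
```

The backporting cost mentioned above follows from this: a change written with the record form would have to be rewritten in the class form for a branch that targets Java 11.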
Thanks for explaining and reviewing @uschindler!
Lucene main (Future Lucene 10.x) will use minimum Java 17. branch_9x ....
Got it, yes that would be great! I would be happy to open a separate PR that converts all such classes in Lucene into records, which would at least get us started (I don't know whether that should rather be an incremental effort, though).
408d0fb
to
4a23df8
Compare
Great to see this merged, thanks @shubhamvishu for all the work as well as patience as we were figuring out a way forward!
Description
This PR now makes TermStates#build run concurrently using the IndexSearcher's executor.
Old idea/description:
This change tries to let some queries with heavy rewrites make use of concurrency via the API added in #11840. In #12160 this turned out to give good gains for KnnVectorQuery. This is an initial rough implementation to get early thoughts/feedback on whether this could be useful and worth pursuing for other queries that do heavy lifting in rewrite. This change initially tries to get some benefit from concurrency for FieldExistsQuery, BlendedTermQuery and CompletionQuery. There are others I could find that are not covered here but might also benefit from concurrency.
Queries changed to have || rewrite in this PR:
Some possible candidate(s) with non-trivial rewrites ?
Benchmarks : yet to be run
PS: Some query rewrites call rewrite on their child/sub-queries. Can we achieve this in those queries as well? I tried to make DisjunctionMaxQuery#rewrite concurrent, but that seems to get stuck on my machine when running the TestDisjunctionMaxQuery unit test. Maybe that's too much parallelism, as the test randomly generates a very large disjunctive query, and making that rewrite concurrent is not correct or helpful?