-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Search After job iterators #57875
Conversation
Pinging @elastic/ml-core (:ml) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some concerns around search after.
I like the design + abstractions.
The decision around using the provider when ID patterns are supplied vs *
is good.
.size(BATCH_SIZE) | ||
.query(getQuery()) | ||
.fetchSource(shouldFetchSource()) | ||
.trackTotalHits(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only need total hits for the first query. Would be good to have this be false
once we have the total hits recorded.
*/ | ||
@Override | ||
public boolean hasNext() { | ||
return count != totalHits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does search after protect us from new docs being added?
It seems to me that there is a possibility of count > totalHits
if new documents are added + the index is refreshed.
It might be good to have this as count <= totalHits
as adjust the initial values + initialization of values accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a very good point and with deletes count
may never == totalHits
.
I changed the logic so that hasNext will return false only if the last search returned less than the requested number for hits e.g. I asked for 10 results but only got 8 back so there are no more search results and that is the end of the iteration.
Search after is a better choice for the delete expired data iterators where processing takes a long time as unlike scroll a context does not have to be kept alive. Also changes the delete expired data endpoint to 404 if the job is unknown
The changes made in #57337 have a few shortcomings:
In the case of deleting expired data, if there are > 10,000 jobs then processing those will almost certainly time out the scroll context (5 minutes). The first part of this change is to add a class
SearchAfterDocumentsIterator
similar in function to theBatchedDocumentsIterator
but uses search after instead of scroll.The problem is that the check that a job is present can only take place for a single search response not for multiple responses, this is how
JobConfigProvider
works.Theoretically for the job id expression
bar-*,foo
if there are more than 10,000bar-*
jobs thenfoo
would come in the second page but theExpandedIdsMatcher
would throw becausefoo
is not in the results. This is a known limitation.In order to get the behaviour that a bad job Id should throw a
ResourceNotFoundException
I've used theJobConfigProvider
if the request uses a job Id that is not*
,_all
or null/empty. When all jobs are requested theSearchAfterJobsIterator
is used.