
Use Search After job iterators #57875

Merged: 7 commits merged into elastic:master on Jun 10, 2020

Conversation
davidkyle (Member) commented:

The changes made in #57337 have a few shortcomings:

  1. Scrolled searches are not safe when processing the search results takes a long time, because the scroll context will expire.

In the case of deleting expired data, if there are more than 10,000 jobs then processing them will almost certainly outlive the scroll context (5 minutes). The first part of this change adds a class, SearchAfterDocumentsIterator, similar in function to BatchedDocumentsIterator but using search after instead of scroll.

  2. A bad job Id in a delete expired data request does not return a 404 when the job does not exist. This is a leniency we always try to avoid.

The problem is that the check that a job is present can only be performed on a single search response, not across multiple responses; this is how JobConfigProvider works.

Theoretically, for the job Id expression bar-*,foo, if there are more than 10,000 bar-* jobs then foo would arrive in the second page and ExpandedIdsMatcher would throw because foo is not in the first page of results. This is a known limitation.

To get the behaviour that a bad job Id throws a ResourceNotFoundException, I've used the JobConfigProvider whenever the request uses a job Id that is not *, _all or null/empty. When all jobs are requested, the SearchAfterJobsIterator is used.
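The search_after pattern described in point 1 can be sketched in plain Java. This is a minimal simulation over an in-memory "index" with hypothetical names (SearchAfterSketch, search), not the PR's actual code and not the Elasticsearch client API: each request passes the sort value of the last hit from the previous page, so no server-side context has to be kept alive between pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SearchAfterSketch {
    // Simulated index: doc ids sorted ascending (search_after requires a sort).
    static final List<Integer> INDEX =
        IntStream.rangeClosed(1, 25).boxed().collect(Collectors.toList());

    // One simulated "search" request: up to `size` docs with sort value > searchAfter.
    static List<Integer> search(int searchAfter, int size) {
        return INDEX.stream()
            .filter(id -> id > searchAfter)
            .limit(size)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int batchSize = 10;
        int cursor = 0;          // sort value of the last hit seen so far
        List<Integer> all = new ArrayList<>();
        while (true) {
            List<Integer> page = search(cursor, batchSize);
            all.addAll(page);
            // Fewer hits than requested means we have reached the end.
            if (page.size() < batchSize) {
                break;
            }
            cursor = page.get(page.size() - 1);
        }
        System.out.println(all.size());   // 25
        System.out.println(all.get(24));  // 25
    }
}
```

Unlike a scroll, the cursor here is just the last sort value, so the iteration can pause for arbitrarily long between pages without anything expiring.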

@elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core (:ml)

@benwtrent benwtrent self-requested a review June 9, 2020 13:28
@benwtrent (Member) left a comment:

Some concerns around search after.

I like the design and abstractions.

The decision to use the provider when Id patterns are supplied vs * is good.
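The routing decision discussed above could look something like the following plain-Java sketch. The class and method names (JobIdRouting, isAllJobsExpression) are hypothetical, not the PR's actual code; the point is only the dispatch rule: "all jobs" expressions take the SearchAfterJobsIterator path, while anything else goes through JobConfigProvider so that a missing job produces a ResourceNotFoundException (404).

```java
public class JobIdRouting {
    // Hypothetical helper mirroring the behaviour described in the PR:
    // only *, _all, null, or empty mean "all jobs"; any other expression
    // (including wildcards like bar-*) is resolved via JobConfigProvider.
    static boolean isAllJobsExpression(String expression) {
        return expression == null
            || expression.isEmpty()
            || "*".equals(expression)
            || "_all".equals(expression);
    }

    public static void main(String[] args) {
        System.out.println(isAllJobsExpression("*"));     // true
        System.out.println(isAllJobsExpression("_all"));  // true
        System.out.println(isAllJobsExpression("foo"));   // false
        System.out.println(isAllJobsExpression("bar-*")); // false
    }
}
```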

.size(BATCH_SIZE)
.query(getQuery())
.fetchSource(shouldFetchSource())
.trackTotalHits(true)

We only need total hits for the first query. It would be good to set this to false once we have the total hits recorded.
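The suggestion above amounts to tracking whether the iterator has issued its first search yet. A minimal plain-Java sketch of that state (hypothetical class name, not the PR's actual code; the real iterator would feed this into the trackTotalHits builder call shown in the quoted snippet):

```java
public class TotalHitsTracking {
    // Track total hits only on the first request; on later pages the
    // count is already known, so recomputing it is wasted work.
    private boolean firstSearch = true;

    boolean shouldTrackTotalHits() {
        boolean track = firstSearch;
        firstSearch = false;
        return track;
    }

    public static void main(String[] args) {
        TotalHitsTracking t = new TotalHitsTracking();
        System.out.println(t.shouldTrackTotalHits()); // true
        System.out.println(t.shouldTrackTotalHits()); // false
    }
}
```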

*/
@Override
public boolean hasNext() {
return count != totalHits;

Does search after protect us from new docs being added?

It seems to me that count > totalHits is possible if new documents are added and the index is refreshed.

It might be better to make this count <= totalHits and adjust the initial values and their initialization accordingly.

davidkyle (Member, Author) replied:

That is a very good point, and with deletes count may never equal totalHits.

I changed the logic so that hasNext returns false only if the last search returned fewer than the requested number of hits, e.g. if I asked for 10 results but only got 8 back, then there are no more search results and the iteration ends.
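The fixed termination logic can be sketched in plain Java. This is a simulation with a hypothetical class (LastPageIterator) consuming pre-built pages instead of real search responses; the point is the hasNext condition, which depends on the size of the last page rather than on a count == totalHits comparison, so concurrent deletes or inserts cannot make the iterator loop forever.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class LastPageIterator implements Iterator<List<Integer>> {
    private final Iterator<List<Integer>> pages;
    private final int batchSize;
    private int lastPageSize;

    LastPageIterator(List<List<Integer>> pages, int batchSize) {
        this.pages = pages.iterator();
        this.batchSize = batchSize;
        this.lastPageSize = batchSize; // assume a full page until proven otherwise
    }

    // A short page (fewer hits than requested) means the previous
    // batch was the last one, so the iteration ends.
    @Override
    public boolean hasNext() {
        return lastPageSize >= batchSize && pages.hasNext();
    }

    @Override
    public List<Integer> next() {
        List<Integer> page = pages.next();
        lastPageSize = page.size();
        return page;
    }

    public static void main(String[] args) {
        // Two simulated pages: a full page of 3, then a short page of 2.
        LastPageIterator it = new LastPageIterator(
            Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(4, 5)), 3);
        int batches = 0;
        while (it.hasNext()) {
            it.next();
            batches++;
        }
        System.out.println(batches); // 2
    }
}
```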

@davidkyle davidkyle merged commit 96a6de2 into elastic:master Jun 10, 2020
@davidkyle davidkyle deleted the search-after branch June 10, 2020 08:53
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Jun 10, 2020
Search after is a better choice for the delete expired data iterators, where processing takes a long time, because unlike scroll no context has to be kept alive. Also changes the delete expired data endpoint to return a 404 if the job is unknown.
davidkyle added a commit that referenced this pull request Jun 11, 2020
Search after is a better choice for the delete expired data iterators, where processing takes a long time, because unlike scroll no context has to be kept alive. Also changes the delete expired data endpoint to return a 404 if the job is unknown.
4 participants