Use Search After job iterators #57875

davidkyle · 2020-06-09T12:33:09Z

The changes made in #57337 have a few shortcomings:

Using scrolled searches is not safe if processing the search results takes a long time, because the scroll context will expire.

In the case of deleting expired data, if there are more than 10,000 jobs then processing them will almost certainly time out the scroll context (5 minutes). The first part of this change adds a class, SearchAfterDocumentsIterator, which is similar in function to BatchedDocumentsIterator but uses search after instead of scroll.

A bad job Id in the delete expired data request will not return a 404 if the job does not exist. This is a leniency we always try to avoid.

The problem is that the check that a job is present can only be made against a single search response, not across multiple responses; this is how JobConfigProvider works.

Theoretically, for the job ID expression bar-*,foo, if there are more than 10,000 bar-* jobs then foo would come in the second page, but the ExpandedIdsMatcher would throw because foo is not in the first page of results. This is a known limitation.
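To illustrate the limitation with the bar-*,foo expression, here is a minimal sketch of a per-page "every token must match something" check. `PagedMatchSketch` and `unmatchedTokens` are hypothetical names invented for this example; this is not the real ExpandedIdsMatcher, and the wildcard handling is deliberately simplified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a required-match check; NOT the real ExpandedIdsMatcher.
final class PagedMatchSketch {

    /** Returns the tokens of the comma-separated expression that match none of the ids in this page. */
    static List<String> unmatchedTokens(String expression, List<String> pageOfIds) {
        List<String> unmatched = new ArrayList<>();
        for (String token : expression.split(",")) {
            // Quote the token literally, then let each '*' match any run of characters.
            Pattern pattern = Pattern.compile(Pattern.quote(token).replace("*", "\\E.*\\Q"));
            boolean matched = pageOfIds.stream().anyMatch(id -> pattern.matcher(id).matches());
            if (matched == false) {
                unmatched.add(token);
            }
        }
        return unmatched;
    }
}
```

Checked against only the first page of sorted IDs (all bar-* jobs), `foo` is reported as unmatched even though it exists on a later page; checked against the full set of IDs, nothing is unmatched.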

In order to get the behaviour that a bad job Id should throw a ResourceNotFoundException I've used the JobConfigProvider if the request uses a job Id that is not *, _all or null/empty. When all jobs are requested the SearchAfterJobsIterator is used.
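The dispatch described above might look roughly like the following. This is a sketch only: `JobIdDispatchSketch` and `isAllJobs` are hypothetical names, and the real change may structure the check differently.

```java
// Hypothetical helper illustrating the decision; not code from the PR.
final class JobIdDispatchSketch {

    /** True when the request asks for every job: "*", "_all", or a null/empty id. */
    static boolean isAllJobs(String jobId) {
        return jobId == null || jobId.isEmpty() || "*".equals(jobId) || "_all".equals(jobId);
    }

    /** Names the path the request would take under the rule described in the PR text. */
    static String choosePath(String jobId) {
        // All jobs: page through them with search after, no existence check needed.
        // Anything else: expand via the config provider so an unknown id can produce a 404.
        return isAllJobs(jobId) ? "SearchAfterJobsIterator" : "JobConfigProvider";
    }
}
```

The trade-off is that specific IDs and patterns keep the single-response limit of the provider in exchange for strict not-found behaviour, while the all-jobs case has nothing to validate and can page freely.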

elasticmachine · 2020-06-09T12:33:11Z

Pinging @elastic/ml-core (:ml)

benwtrent

Some concerns around search after.

I like the design + abstractions.

The decision around using the provider when ID patterns are supplied vs * is good.

benwtrent · 2020-06-09T13:47:57Z

...src/main/java/org/elasticsearch/xpack/ml/utils/persistence/SearchAfterDocumentsIterator.java

+            .size(BATCH_SIZE)
+            .query(getQuery())
+            .fetchSource(shouldFetchSource())
+            .trackTotalHits(true)


We only need total hits for the first query. Would be good to have this be false once we have the total hits recorded.

benwtrent · 2020-06-09T13:51:03Z

...src/main/java/org/elasticsearch/xpack/ml/utils/persistence/SearchAfterDocumentsIterator.java

+     */
+    @Override
+    public boolean hasNext() {
+        return count != totalHits;


Does search after protect us from new docs being added?

It seems to me that there is a possibility of count > totalHits if new documents are added + the index is refreshed.

It might be good to have this as count <= totalHits and adjust the initial values + initialization of values accordingly.

That is a very good point, and with deletes count may never equal totalHits.

I changed the logic so that hasNext returns false only if the last search returned fewer hits than requested. For example, if I asked for 10 results but only got 8 back, there are no more search results and the iteration ends.
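A minimal in-memory sketch of that termination rule, assuming documents are sorted by an integer key and each search returns up to `batchSize` keys strictly greater than the last one seen. `SearchAfterSketch` is a hypothetical class written for this illustration; it is not the real SearchAfterDocumentsIterator and it performs no actual search requests.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical in-memory stand-in; NOT the real SearchAfterDocumentsIterator.
final class SearchAfterSketch implements Iterator<List<Integer>> {
    private final List<Integer> sortedIndex; // pretend index: sort keys in ascending order
    private final int batchSize;
    private Integer lastSortValue = null;    // the "search after" cursor; null before the first page
    private boolean lastPageWasFull = true;  // assume more results until a short page proves otherwise

    SearchAfterSketch(List<Integer> sortedIndex, int batchSize) {
        this.sortedIndex = sortedIndex;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        // No counting against a total: a short page is the only end-of-results signal.
        return lastPageWasFull;
    }

    @Override
    public List<Integer> next() {
        if (hasNext() == false) {
            throw new NoSuchElementException();
        }
        List<Integer> page = new ArrayList<>();
        for (Integer key : sortedIndex) {
            boolean afterCursor = lastSortValue == null || key > lastSortValue;
            if (afterCursor && page.size() < batchSize) {
                page.add(key);
            }
        }
        lastPageWasFull = page.size() == batchSize;
        if (page.isEmpty() == false) {
            lastSortValue = page.get(page.size() - 1); // remember where to resume
        }
        return page;
    }
}
```

Because the only state carried between pages is the last sort value, nothing expires while a page is being processed. One consequence of this rule in the sketch is that when the number of documents is an exact multiple of the batch size, one extra search is made and returns an empty page.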

Search after is a better choice for the delete expired data iterators, where processing takes a long time, because unlike scroll it does not need a context to be kept alive. This also changes the delete expired data endpoint to return a 404 if the job is unknown.

davidkyle added >refactoring :ml Machine learning v8.0.0 v7.9.0 labels Jun 9, 2020

benwtrent self-requested a review June 9, 2020 13:28

benwtrent reviewed Jun 9, 2020

davidkyle added 7 commits June 9, 2020 18:02

WIP

d16bc5c

Add SearchAfterDocumentsIterator

3cc6d16

Use search after iterator in expired data removers

c95ca06

Expand jobs with config provider

e4b995c

Remove unused class

24fde63

Fix yml test match

22b320c

Use number of search results to determine hasNext

87d25f1

davidkyle force-pushed the search-after branch from 83313a8 to 87d25f1 Compare June 9, 2020 19:25

benwtrent approved these changes Jun 9, 2020

davidkyle merged commit 96a6de2 into elastic:master Jun 10, 2020

davidkyle deleted the search-after branch June 10, 2020 08:53

davidkyle mentioned this pull request Jun 10, 2020

[7.x] Use Search After job iterators (#57875) #57923

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
