
Delete expired data by job #57337

Merged · 14 commits into elastic:master · Jun 5, 2020

Conversation

@davidkyle (Member) commented May 29, 2020

Deleting expired data can take a long time, leading to timeouts if there are many jobs. Often the problem is due to a few large jobs which prevent the regular maintenance of the remaining jobs. This change adds a job_id parameter to the delete expired data endpoint to help clean up those problematic jobs.

This change only affects model snapshots and results. Forecasts cannot yet be removed by job_id; if desired, that could be implemented.

TODO HLRC
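
For reference, a minimal sketch of how the new endpoint could be called with the Java low-level REST client. The job name ("large-job") and the client setup are placeholders; the body parameters are the ones the endpoint already accepts.

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class DeleteExpiredDataByJobExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Target a single, hypothetical job instead of all jobs
            Request request = new Request("DELETE", "/_ml/_delete_expired_data/large-job");
            // timeout and requests_per_second are the existing optional parameters
            request.setJsonEntity("{ \"timeout\": \"1h\", \"requests_per_second\": 1000.0 }");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}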

@elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core (:ml)

if (restRequest.hasContent()) {
request = DeleteExpiredDataAction.Request.PARSER.apply(restRequest.contentParser(), null);
} else {
request = new DeleteExpiredDataAction.Request();
davidkyle (Member Author) commented on this diff:

requests_per_second and timeout can now be query parameters
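
As a sketch of what that could look like after the snippet above, assuming the Request class exposes setRequestsPerSecond and setTimeout setters (the setter names are an assumption here, not taken from the diff):

// Query parameters can complement or replace the request body
if (restRequest.hasParam("requests_per_second")) {
    request.setRequestsPerSecond(restRequest.paramAsFloat("requests_per_second", -1.0f));
}
if (restRequest.hasParam("timeout")) {
    request.setTimeout(restRequest.paramAsTime("timeout", null));
}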

() -> deleteExpiredData(request, listener, isTimedOutSupplier)
);
jobConfigProvider.expandJobs(request.getJobId(), true, true, ActionListener.wrap(
jobBuilders -> {
davidkyle (Member Author) commented on this diff:

This is the most controversial change, I think. Previously each data remover would get all jobs using a BatchedJobsIterator, which fetches the jobs in batches of 10,000 using a scroll search, so even if there are more than 10,000 jobs the scroll search will return them all.

The config provider performs a normal search and cannot return more than 10,000 jobs. The 10,000-job limit is already known, as GET jobs would never return more than that number. Using the config provider hugely simplifies the code, but it is a change in behaviour, no matter how unlikely it is that there are > 10,000 jobs.
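
Roughly, the config-provider path has the shape sketched below; the Job.Builder.build() call and the list-taking deleteExpiredData overload are assumptions used for illustration, not the exact code in this PR.

jobConfigProvider.expandJobs(request.getJobId(), true, true, ActionListener.wrap(
    jobBuilders -> {
        // A single search-backed expansion, capped at 10,000 jobs
        List<Job> jobs = jobBuilders.stream()
            .map(Job.Builder::build)
            .collect(Collectors.toList());
        // Each remover now works from an in-memory list rather than a scroll-backed iterator
        deleteExpiredData(request, jobs, listener, isTimedOutSupplier);
    },
    listener::onFailure
));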

A reviewer (Member) replied:

IMO, if there are > 10,000 jobs, the cleanup is not likely to finish anyway. This seems like a simple throttle we get for free.

A reviewer (Member) replied:

One thing to think about is that the 10,001st job would NEVER have its data cleaned up. I suppose that is OK, but it should at least be documented. I agree that having more than 10k jobs is rare.

FWIW, the code using the iterator could be just as simple: you only have to update the iterator to restrict its search, then pass in the iterator instead of a list of jobs.
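
A hedged sketch of that alternative, with a hypothetical BatchedJobsIterator constructor that accepts a job expression to restrict its search:

// Hypothetical constructor argument restricting the scroll search to the requested job(s)
BatchedJobsIterator jobsIterator =
    new BatchedJobsIterator(client, AnomalyDetectorsIndex.configIndexName(), request.getJobId());
// Removers would consume the iterator, so > 10,000 matching jobs would still be scrolled through
deleteExpiredData(request, jobsIterator, listener, isTimedOutSupplier);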

@@ -22,7 +24,8 @@

@Override
public List<Route> routes() {
-        return Collections.emptyList();
+        return Collections.singletonList(
+            new Route(DELETE, MachineLearning.BASE_PATH + "_delete_expired_data/{" + Fields.JOB_ID.getPreferredName() + "}"));
A reviewer (Member) commented:

We need to make sure that we take the route currently in replacedRoutes and also put it here before it is removed.
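
As a sketch, keeping both paths registered might look like the following; whether the plain route belongs in routes() or stays in replacedRoutes() is exactly the open question in this comment.

@Override
public List<Route> routes() {
    return Collections.unmodifiableList(Arrays.asList(
        // Existing behaviour: clean up expired data for all jobs
        new Route(DELETE, MachineLearning.BASE_PATH + "_delete_expired_data"),
        // New behaviour: clean up expired data for a single job
        new Route(DELETE, MachineLearning.BASE_PATH + "_delete_expired_data/{" + Fields.JOB_ID.getPreferredName() + "}")));
}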

@@ -34,3 +56,96 @@ setup:
body: >
{ "timeout": "10h", "requests_per_second": 100000.0 }
- match: { deleted: true}

---
"Test delete expired data with job id":
A reviewer (Member) commented:

What happens if somebody calls _delete_expired_data with a job that does not exist?

@davidkyle (Member Author) commented:

run elasticsearch-ci/packaging-sample-matrix-windows

davidkyle merged commit bbeda64 into elastic:master on Jun 5, 2020
davidkyle deleted the expire-by-job-id branch on June 5, 2020 at 12:32
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Jun 8, 2020
Deleting expired data can take a long time leading to timeouts if there
are many jobs. Often the problem is due to a few large jobs which 
prevent the regular maintenance of the remaining jobs. This change adds
a job_id parameter to the delete expired data endpoint to help clean up
those problematic jobs.
davidkyle added a commit that referenced this pull request Jun 8, 2020
davidkyle added a commit that referenced this pull request Jun 8, 2020
For the ml delete expired data request changes in #57337
davidkyle added a commit that referenced this pull request Jun 11, 2020
High level rest client changes for #57337
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Jun 12, 2020
davidkyle added a commit that referenced this pull request Jun 12, 2020