-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Fixes for stop datafeed edge cases #49191
[ML] Fixes for stop datafeed edge cases #49191
Conversation
The following edge cases were fixed: 1. A request to force-stop a stopping datafeed is no longer ignored. Force-stop is an important recovery mechanism if normal stop doesn't work for some reason, and needs to operate on a datafeed in any state other than stopped. 2. If the node that a datafeed is running on is removed from the cluster during a normal stop then the stop request is retried (and will likely succeed on this retry by simply cancelling the persistent task for the affected datafeed). 3. If there are multiple simultaneous force-stop requests for the same datafeed we no longer fail the one that is processed second. The previous behaviour was wrong as stopping a stopped datafeed is not an error, so stopping a datafeed twice simultaneously should not be either. Fixes elastic#43670 Fixes elastic#48931
Pinging @elastic/ml-core (:ml) |
} else { | ||
listener.onFailure(e); | ||
} | ||
}); | ||
|
||
super.doExecute(task, request, finalListener); | ||
} | ||
|
||
private void forceStopDatafeed(final StopDatafeedAction.Request request, final ActionListener<StopDatafeedAction.Response> listener, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method seems like it no longer needs to distinguish started
and stopping
datafeeds. Should we pass it a single list of datafeeds and do the concatenation a level up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's done in the second commit.
While I was doing this I noticed that although we have a starting
state we never use it. So I also added a TODO about using that for 8.0. (Looking back through the history, starting
was added in 5.5, but couldn't be used until 6.x because 5.4 wouldn't understand it. Now we're beyond that we can use it, but we should only do this in a major version just in case somebody is relying on stopped
, started
and stopping
being the only 3 states that exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need to use it. I recall adding it so we had it available without BWC in case we needed it. But surely, if it improves things we can definitely use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not using it creates a potential race condition where if you start and stop a datafeed in very quick succession then the stop will be ignored. It seems that none of our current tests do this.
Also added a TODO, as while doing this I noticed that the "starting" state is never used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The following edge cases were fixed: 1. A request to force-stop a stopping datafeed is no longer ignored. Force-stop is an important recovery mechanism if normal stop doesn't work for some reason, and needs to operate on a datafeed in any state other than stopped. 2. If the node that a datafeed is running on is removed from the cluster during a normal stop then the stop request is retried (and will likely succeed on this retry by simply cancelling the persistent task for the affected datafeed). 3. If there are multiple simultaneous force-stop requests for the same datafeed we no longer fail the one that is processed second. The previous behaviour was wrong as stopping a stopped datafeed is not an error, so stopping a datafeed twice simultaneously should not be either. Backport of elastic#49191
The following edge cases were fixed: 1. A request to force-stop a stopping datafeed is no longer ignored. Force-stop is an important recovery mechanism if normal stop doesn't work for some reason, and needs to operate on a datafeed in any state other than stopped. 2. If the node that a datafeed is running on is removed from the cluster during a normal stop then the stop request is retried (and will likely succeed on this retry by simply cancelling the persistent task for the affected datafeed). 3. If there are multiple simultaneous force-stop requests for the same datafeed we no longer fail the one that is processed second. The previous behaviour was wrong as stopping a stopped datafeed is not an error, so stopping a datafeed twice simultaneously should not be either. Backport of #49191
If a datafeed is stopped normally and force stopped at the same time then it is possible that the force stop removes the persistent task while the normal stop is performing actions. Currently this causes the normal stop to error, but since stopping a stopped datafeed is not an error this doesn't make sense. Instead the force stop should just take precedence. This is a followup to elastic#49191 and should really have been included in the changes in that PR.
If a datafeed is stopped normally and force stopped at the same time then it is possible that the force stop removes the persistent task while the normal stop is performing actions. Currently this causes the normal stop to error, but since stopping a stopped datafeed is not an error this doesn't make sense. Instead the force stop should just take precedence. This is a followup to #49191 and should really have been included in the changes in that PR.
If a datafeed is stopped normally and force stopped at the same time then it is possible that the force stop removes the persistent task while the normal stop is performing actions. Currently this causes the normal stop to error, but since stopping a stopped datafeed is not an error this doesn't make sense. Instead the force stop should just take precedence. This is a followup to #49191 and should really have been included in the changes in that PR.
The following edge cases were fixed: 1. A request to force-stop a stopping datafeed is no longer ignored. Force-stop is an important recovery mechanism if normal stop doesn't work for some reason, and needs to operate on a datafeed in any state other than stopped. 2. If the node that a datafeed is running on is removed from the cluster during a normal stop then the stop request is retried (and will likely succeed on this retry by simply cancelling the persistent task for the affected datafeed). 3. If there are multiple simultaneous force-stop requests for the same datafeed we no longer fail the one that is processed second. The previous behaviour was wrong as stopping a stopped datafeed is not an error, so stopping a datafeed twice simultaneously should not be either. Backport of #49191
The following edge cases were fixed: 1. A request to force-stop a stopping datafeed is no longer ignored. Force-stop is an important recovery mechanism if normal stop doesn't work for some reason, and needs to operate on a datafeed in any state other than stopped. 2. If the node that a datafeed is running on is removed from the cluster during a normal stop then the stop request is retried (and will likely succeed on this retry by simply cancelling the persistent task for the affected datafeed). 3. If there are multiple simultaneous force-stop requests for the same datafeed we no longer fail the one that is processed second. The previous behaviour was wrong as stopping a stopped datafeed is not an error, so stopping a datafeed twice simultaneously should not be either. Backport of #49191
The following edge cases were fixed:
ignored. Force-stop is an important recovery mechanism
if normal stop doesn't work for some reason, and needs
to operate on a datafeed in any state other than stopped.
the cluster during a normal stop then the stop request is
retried (and will likely succeed on this retry by simply
cancelling the persistent task for the affected datafeed).
the same datafeed we no longer fail the one that is
processed second. The previous behaviour was wrong as
stopping a stopped datafeed is not an error, so stopping
a datafeed twice simultaneously should not be either.
Fixes #43670
Fixes #48931