[ML] Fixes for stop datafeed edge cases #49191

droberts195 · 2019-11-15T17:36:11Z

The following edge cases were fixed:

1. A request to force-stop a stopping datafeed is no longer
   ignored. Force-stop is an important recovery mechanism
   if normal stop doesn't work for some reason, and needs
   to operate on a datafeed in any state other than stopped.
2. If the node that a datafeed is running on is removed from
   the cluster during a normal stop then the stop request is
   retried (and will likely succeed on this retry by simply
   cancelling the persistent task for the affected datafeed).
3. If there are multiple simultaneous force-stop requests for
   the same datafeed we no longer fail the one that is
   processed second. The previous behaviour was wrong as
   stopping a stopped datafeed is not an error, so stopping
   a datafeed twice simultaneously should not be either.
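
The first edge case comes down to which datafeeds a stop request acts on. The following is a minimal sketch of that selection rule, not the code from this PR: `StopSelectionSketch`, its `State` enum and `datafeedsToStop` are hypothetical names, and the choice that a normal stop only selects started datafeeds is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model for illustration; these are not the Elasticsearch classes.
class StopSelectionSketch {
    enum State { STOPPED, STARTED, STOPPING }

    /** Returns the IDs of the datafeeds that a stop request should act on. */
    static List<String> datafeedsToStop(Map<String, State> states, boolean force) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, State> entry : states.entrySet()) {
            State state = entry.getValue();
            // Assumption: a normal stop only needs to act on started datafeeds.
            boolean normalTarget = state == State.STARTED;
            // Force-stop operates on a datafeed in any state other than stopped.
            boolean forceTarget = force && state != State.STOPPED;
            if (normalTarget || forceTarget) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}
```

With this rule a stopped datafeed is never selected, so a repeated stop request has nothing to do and does not need to fail.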

Fixes elastic#43670
Fixes elastic#48931

elasticmachine · 2019-11-15T17:36:13Z

Pinging @elastic/ml-core (:ml)

dimitris-athanasiou · 2019-11-18T11:59:24Z

...k/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportStopDatafeedAction.java

+                    } else {
+                        listener.onFailure(e);
+                    }
+                });

        super.doExecute(task, request, finalListener);
    }

    private void forceStopDatafeed(final StopDatafeedAction.Request request, final ActionListener<StopDatafeedAction.Response> listener,


This method seems like it no longer needs to distinguish started and stopping datafeeds. Should we pass it a single list of datafeeds and do the concatenation a level up?

Sure, that's done in the second commit.

While I was doing this I noticed that although we have a starting state, we never use it, so I also added a TODO about using it for 8.0. (Looking back through the history, starting was added in 5.5, but couldn't be used until 6.x because 5.4 wouldn't understand it. Now that we're beyond that we can use it, but we should only do this in a major version, just in case somebody is relying on stopped, started and stopping being the only 3 states that exist.)

I'm not sure we need to use it. I recall adding it so we had it available without BWC in case we needed it. But surely, if it improves things we can definitely use it.

I think not using it creates a potential race condition where if you start and stop a datafeed in very quick succession then the stop will be ignored. It seems that none of our current tests do this.
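
A minimal sketch of the race described above, under a stated assumption: while a start is still in progress, the datafeed is reported as stopped unless the starting state is used. `StartingStateSketch`, `reportedState` and `stopActsOn` are hypothetical names for illustration, not the Elasticsearch classes.

```java
// Hypothetical, simplified model for illustration; not the Elasticsearch classes.
class StartingStateSketch {
    enum State { STOPPED, STARTING, STARTED, STOPPING }

    /**
     * State reported for a datafeed whose start has been requested but has not completed.
     * Assumption: without the STARTING state, such a datafeed still looks STOPPED.
     */
    static State reportedState(boolean startInProgress, boolean useStartingState) {
        if (!startInProgress) {
            return State.STOPPED;
        }
        return useStartingState ? State.STARTING : State.STOPPED;
    }

    /** A stop request has nothing to do for a datafeed that is reported as stopped. */
    static boolean stopActsOn(State state) {
        return state != State.STOPPED;
    }
}
```

In this model a stop that arrives immediately after a start sees `STOPPED` and does nothing, whereas with `STARTING` in use the same stop has something to act on.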

dimitris-athanasiou

LGTM

If a datafeed is stopped normally and force stopped at the same time, then it is possible that the force stop removes the persistent task while the normal stop is still performing actions. Currently this causes the normal stop to fail with an error, but since stopping a stopped datafeed is not an error, this doesn't make sense. Instead, the force stop should just take precedence. This is a follow-up to elastic#49191 and should really have been included in the changes in that PR.
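
One way to picture "force stop takes precedence" is a normal stop that treats a missing persistent task as a successful outcome rather than an error. This is a minimal sketch only: `StopPrecedenceSketch`, `TaskMissingException` and `runNormalStop` are hypothetical names, and the real change may detect the removed task differently.

```java
// Hypothetical, simplified model for illustration; not the Elasticsearch classes.
class StopPrecedenceSketch {
    /** Hypothetical stand-in for a "persistent task no longer exists" failure. */
    static class TaskMissingException extends RuntimeException {
        TaskMissingException(String message) {
            super(message);
        }
    }

    /**
     * Runs the steps of a normal stop.
     * Returns true if the normal stop ran to completion, false if the task had already
     * been removed (for example by a concurrent force stop). Neither case is an error.
     * Any other exception is a genuine failure and propagates to the caller.
     */
    static boolean runNormalStop(Runnable stopSteps) {
        try {
            stopSteps.run();
            return true;
        } catch (TaskMissingException e) {
            // The datafeed is already stopped, and stopping a stopped datafeed is not an error.
            return false;
        }
    }
}
```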

droberts195 added >bug :ml Machine learning v8.0.0 v7.6.0 v6.8.6 v7.5.1 labels Nov 15, 2019

droberts195 requested a review from dimitris-athanasiou November 15, 2019 17:36

dimitris-athanasiou reviewed Nov 18, 2019

Split started/stopping jobs earlier on

a451e61

Also added a TODO, as while doing this I noticed that the "starting" state is never used.

dimitris-athanasiou approved these changes Nov 19, 2019

droberts195 merged commit 8bbbe28 into elastic:master Nov 19, 2019

droberts195 deleted the fix_stop_datafeed_edge_cases branch November 19, 2019 09:26

droberts195 mentioned this pull request Nov 19, 2019

[7.x][ML] Fixes for stop datafeed edge cases #49284

Merged

This was referenced Nov 19, 2019

[7.5][ML] Fixes for stop datafeed edge cases #49286

Merged

[6.8][ML] Fixes for stop datafeed edge cases #49290

Merged

jimczi added v7.5.0 and removed v7.5.1 labels Nov 19, 2019

droberts195 added v7.5.1 and removed v7.5.0 labels Nov 20, 2019

droberts195 mentioned this pull request Nov 20, 2019

[ML] Fix simultaneous stop and force stop datafeed #49367

Merged

droberts195 added v7.5.0 and removed v7.5.1 labels Nov 26, 2019

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

droberts195 commented Nov 15, 2019

elasticmachine commented Nov 15, 2019

dimitris-athanasiou Nov 18, 2019

droberts195 Nov 18, 2019

dimitris-athanasiou Nov 19, 2019

droberts195 Nov 19, 2019

dimitris-athanasiou left a comment
