Failure in CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted #89867
Pinging @elastic/es-distributed (Team:Distributed)
Looks like the stabilisation timeout is just too short, because the following patch fixes this failure:

```diff
diff --git a/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java b/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
index ca7cb106be6..49cd4a3a15b 100644
--- a/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
+++ b/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
@@ -1934,7 +1934,7 @@ public class CoordinatorTests extends AbstractCoordinatorTestCase {
             // detection and ongoing publications do not time out
             cluster.runFor(ELECTION_INITIAL_TIMEOUT_SETTING.get(Settings.EMPTY).millis() + delayVariabilityMillis
                 // two round trips for pre-voting and voting
-                + 4 * delayVariabilityMillis
+                + 4 * delayVariabilityMillis + 300000
                 // and then the election update
                 + clusterStateUpdateDelay, "stabilising");
```

Of course that's not the actual fix; we must work out what this computation is not accounting for.
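For illustration, the budget passed to runFor can be read as a sum of expected worst-case delays. The sketch below is not the actual test code; the parameter names mirror the snippet above, and the concrete values in main are made up:

```java
// Illustrative sketch (not CoordinatorTests itself): how a stabilisation
// budget like the one above is assembled from its component delays.
public class StabilisationBudget {

    // All values in milliseconds; the real ones come from the test's settings.
    static long budgetMillis(long electionInitialTimeoutMillis,
                             long delayVariabilityMillis,
                             long clusterStateUpdateDelayMillis) {
        return electionInitialTimeoutMillis + delayVariabilityMillis
            // two round trips for pre-voting and voting
            + 4 * delayVariabilityMillis
            // and then the election update
            + clusterStateUpdateDelayMillis;
    }

    public static void main(String[] args) {
        // Hypothetical values, purely for demonstration.
        System.out.println(budgetMillis(10_000, 2_000, 1_000)); // prints 21000
    }
}
```

The point of the discussion above is that some source of delay in the failing run is not represented by any term of this sum.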
This is a hard one. I went back in history to figure out when this failure started appearing. It started with https://github.com/elastic/elasticsearch/pull/86921/files , which removed the MockSinglePrioritizingExecutor in favor of a more Java-native thread executor. With the old MockSinglePrioritizingExecutor, this specific test (and random seed) passes. However, in trying to understand the difference, I found that MockSinglePrioritizingExecutor was flawed in several ways: it didn't emit a good string for the task to be executed, and it threw a KillWorkerError after executing a task, which created a different order of execution than the new Java-native approach. My impression is that during the KillWorkerError, the run() of the MockSinglePrioritizingExecutor was called again, and it might insert additional "node join" tasks into the DeterministicTaskQueue, which might be why it could reach stabilization faster. In the end, due to all these differences, I stopped comparing against MockSinglePrioritizingExecutor and will try to figure it out with the code on the main branch.
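To illustrate why a respawned worker that enqueues extra tasks can change execution order, here is a minimal, self-contained sketch of a single-threaded deterministic task queue. This is a hypothetical stand-in, not the real DeterministicTaskQueue; the task names are invented:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal deterministic queue: tasks run one at a time, in insertion order,
// and a running task may enqueue follow-up tasks behind everything queued.
public class TinyDeterministicQueue {
    private final Queue<Runnable> tasks = new ArrayDeque<>();

    public void schedule(Runnable task) {
        tasks.add(task);
    }

    public void runAll() {
        while (!tasks.isEmpty()) {
            tasks.poll().run();
        }
    }

    public static void main(String[] args) {
        StringBuilder trace = new StringBuilder();
        TinyDeterministicQueue q = new TinyDeterministicQueue();
        q.schedule(() -> {
            trace.append("election ");
            // A re-spawned worker (as with the old KillWorkerError pattern)
            // could enqueue an extra task here; it lands behind "publish".
            q.schedule(() -> trace.append("node-join "));
        });
        q.schedule(() -> trace.append("publish "));
        q.runAll();
        System.out.println(trace); // prints: election publish node-join
    }
}
```

Whether the extra task is enqueued (and when) changes the interleaving, which is consistent with the two executors stabilising at different speeds.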
@DaveCTurner I observe that simply adding another …
I'd like to understand why we need another delay here. It could well be because #86921 added another delay, but how exactly does that feed into the calculation? Does it mean that …
I see; I am not yet fully aware of all these delays and what they mean. I will try to figure it out from the logs by comparing with MockSinglePrioritizingExecutor.
Recording some notes here for future discussion: here's an excerpt of a log showing a commit which took from
I don't think this was affected by #86921 because even the old …
* Add delay to DEFAULT_CLUSTER_STATE_UPDATE_DELAY (fixes #89867)
* Convert to int
* Update server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

Co-authored-by: David Turner <david.turner@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
I am afraid this happened again, see https://gradle-enterprise.elastic.co/s/cwlcjoocvaehg/tests/:server:test/org.elasticsearch.cluster.coordination.CoordinatorTests/testSingleNodeDiscoveryStabilisesEvenWhenDisrupted?top-execution=1. I am reopening this issue; if this is not the right thing to do, let me know and I will create a new one.
Mary's issue is reproducible. Adding another
So, I believe this is now a different problem. I tried to compare the outputs between the failed run and the successful one (where I added …). Here is the failed one:
And here's the successful one (which simply runs longer):
From my rather inexperienced understanding, and looking at the above comparisons, I believe that this specific seed makes the test cluster unstable for much longer than expected by the test's runFor() formula. I see it because we see … I hope @DaveCTurner may be able to shed more light on how to continue with this one.
This seems to be a genuine (but very low-impact) bug. In the failing run the delay variability is set to a massive value. However, when using single-node discovery there should be no need to schedule multiple elections like this, because there's only one node involved, so the first election should succeed. Maybe something like this is enough?

```diff
diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
index 33e3528246d..7d9b0b666e3 100644
--- a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
+++ b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
@@ -1628,6 +1628,10 @@ public class Coordinator extends AbstractLifecycleComponent implements ClusterSt
         }

         if (prevotingRound != null) {
+            if (singleNodeDiscovery) {
+                logger.debug("skip prevoting as single-node discovery is enabled and prevoting is already in progress");
+                return;
+            }
             prevotingRound.close();
         }
         prevotingRound = preVoteCollector.start(lastAcceptedState, getDiscoveredNodes());
```
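The guard in the patch above boils down to a small decision. This is a hypothetical, self-contained sketch, not Coordinator code; it only shows that with single-node discovery an in-flight prevoting round is reused rather than replaced:

```java
// Hypothetical reduction of the patched logic: decide whether to start a
// fresh prevoting round, given the discovery mode and current state.
public class PrevoteGuard {

    static String startElection(boolean singleNodeDiscovery, boolean prevoteInProgress) {
        if (prevoteInProgress) {
            if (singleNodeDiscovery) {
                // Only one node can win anyway, so reuse the in-flight round.
                return "skipped";
            }
            // Multi-node: the old round is closed and a fresh one started.
        }
        return "started";
    }

    public static void main(String[] args) {
        System.out.println(startElection(true, true));   // prints: skipped
        System.out.println(startElection(false, true));  // prints: started
        System.out.println(startElection(true, false));  // prints: started
    }
}
```

The effect is that a single-node cluster cannot keep cancelling its own election by rescheduling, which is what was burning through the stabilisation budget.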
@DaveCTurner wow, nice, thanks for taking a look and understanding it so quickly! I just tried that and it works. Will open a PR. I'm a bit confused with the debug message …

I didn't put much thought into the message, feel free to adjust as you see fit :)
I had one of those failures on my feature branch: https://gradle-enterprise.elastic.co/s/lcvafnq7rt4jm/tests/:server:test/org.elasticsearch.cluster.coordination.CoordinatorTests/testSingleNodeDiscoveryStabilisesEvenWhenDisrupted?top-execution=1. I believe I had the fix from #91255 when the failure occurred.

Thanks Ievgen, this looks to be a problem caused by #91255.

This ticket is haunted. You fix one thing, another thing pops up :)

Yes, the problem in this failure case is that an election happens within …
I cannot reproduce the new failure on main (I ran 10 iterations of it). @DaveCTurner, how did you understand all that? :) I cannot see any logs in the test failure.

Is this something that can happen in reality, and thus we have introduced a new bug?

The …

Yes I think so.

I think we should revert #91255 because the failure it introduces seems much more frequent than the one it fixes (a few hundred iterations vs a few thousand).
BTW, I also reproduced it on main; it did not take many iterations until it popped up. I was thinking about how to solve both issues. I think either:

@DaveCTurner what do you think? Guessing from your last comment, would the first approach be worth trying?

I think the second approach would be more robust, but also much more effort, because the plumbing for that failure handling just doesn't exist today. And this is a super-minor bug that almost certainly eventually goes away with enough retries, so it's not really worth the effort. Therefore I favour the first. I've just pushed bdf3485, which reverts #91255. I'll also push a commit which mutes this test while we continue to work on it.
I think the best setting to adjust would be …

I can test this.

You mean it's OK to set the delay variability to up to 5 mins instead of 10 mins?

No, I mean it should be OK to relax the upper bound on …

@DaveCTurner, so I should change the max of the …

By the way, I notice that cluster.election.max_timeout also has 5 mins as its maximum. Is that related to the election duration max value? If yes, should I change max_timeout's max value to also be 10 mins?

FYI, setting the …

OK, after many iterations (>1000) it looks good. I will open the PR.
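To make the bound discussion concrete, here is a small, self-contained sketch (this is not Elasticsearch's Setting API; the names and values are illustrative) of clamping a requested election duration to a [min, max] range, and what changes when the cap is raised from 5 to 10 minutes:

```java
import java.time.Duration;

// Hypothetical sketch: a bounded time setting truncates any requested value
// that exceeds its configured maximum.
public class ElectionDurationBound {

    static Duration clamp(Duration value, Duration min, Duration max) {
        if (value.compareTo(min) < 0) return min;
        if (value.compareTo(max) > 0) return max;
        return value;
    }

    public static void main(String[] args) {
        // e.g. an election duration driven up by a large delay variability
        Duration requested = Duration.ofMinutes(8);

        // With a 5-minute cap the requested duration is silently truncated,
        // so voting rounds can expire before a slow election completes.
        System.out.println(clamp(requested, Duration.ofMillis(1), Duration.ofMinutes(5)));  // prints: PT5M

        // A 10-minute cap lets the longer duration through unchanged.
        System.out.println(clamp(requested, Duration.ofMillis(1), Duration.ofMinutes(10))); // prints: PT8M
    }
}
```

This is why raising the maximum, rather than adding yet another fixed delay to the test, addresses the endless repetition of voting rounds under extreme delay variability.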
By increasing the voting duration in case of high delays, to avoid the possible endless repetition of voting rounds. Fixes elastic#89867
CI Link
https://gradle-enterprise.elastic.co/s/yckwl5zjomfas
Repro line
Does it reproduce?
Yes
Applicable branches
main
Failure history
https://ci-stats.elastic.co/s/jenkins/goto/da1547f0-2ea6-11ed-8233-4bd40c8511bb - 3 similar-looking failures over the last 90 days
Failure excerpt