Failure in CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted #89867

DaveCTurner · 2022-09-07T12:20:15Z

CI Link

https://gradle-enterprise.elastic.co/s/yckwl5zjomfas

Repro line

./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted" -Dtests.seed=949AB28501F8ED37:BBC20DD1A5D4EF81

Does it reproduce?

Yes

Applicable branches

main

Failure history

https://ci-stats.elastic.co/s/jenkins/goto/da1547f0-2ea6-11ed-8233-4bd40c8511bb - 3 similar-looking failures over the last 90 days

Failure excerpt

    java.lang.AssertionError: leaders
    Expected: not an empty collection
         but: was <[]>
        at __randomizedtesting.SeedInfo.seed([949AB28501F8ED37:BBC20DD1A5D4EF81]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.getAnyLeader(AbstractCoordinatorTestCase.java:791)
        at org.elasticsearch.cluster.coordination.CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted(CoordinatorTests.java:1939)


Suite: Test class org.elasticsearch.cluster.coordination.CoordinatorTests
  2> REPRODUCE WITH: ./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted" -Dtests.seed=949AB28501F8ED37 -Dtests.locale=en-AU -Dtests.timezone=Europe/Vienna -Druntime.java=18
  2> java.lang.AssertionError: leaders
    Expected: not an empty collection
         but: was <[]>
        at __randomizedtesting.SeedInfo.seed([949AB28501F8ED37:BBC20DD1A5D4EF81]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.getAnyLeader(AbstractCoordinatorTestCase.java:791)
        at org.elasticsearch.cluster.coordination.CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted(CoordinatorTests.java:1939)
  2> NOTE: leaving temporary files on disk at: /home/davidturner/src/elasticsearch/server/build/testrun/test/temp/org.elasticsearch.cluster.coordination.CoordinatorTests_949AB28501F8ED37-002
  2> NOTE: test params are: codec=Asserting(Lucene94): {}, docValues:{}, maxPointsInLeafNode=1598, maxMBSortInHeap=6.485414969465074, sim=Asserting(RandomSimilarity(queryNorm=true): {}), locale=en-AU, timezone=Europe/Vienna
  2> NOTE: Linux 5.4.0-125-generic amd64/Oracle Corporation 18.0.2.1 (64-bit)/cpus=8,threads=1,free=465272544,total=536870912
  2> NOTE: All tests run in this JVM: [CoordinatorTests]

Tests with failures:
 - org.elasticsearch.cluster.coordination.CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted

The text was updated successfully, but these errors were encountered:

elasticsearchmachine · 2022-09-07T12:20:38Z

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner · 2022-09-07T12:42:08Z

Looks like the stabilisation timeout is just too short, because the following patch fixes this failure:

diff --git a/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java b/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
index ca7cb106be6..49cd4a3a15b 100644
--- a/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
+++ b/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
@@ -1934,7 +1934,7 @@ public class CoordinatorTests extends AbstractCoordinatorTestCase {
             // detection and ongoing publications do not time out
             cluster.runFor(ELECTION_INITIAL_TIMEOUT_SETTING.get(Settings.EMPTY).millis() + delayVariabilityMillis
             // two round trips for pre-voting and voting
-                + 4 * delayVariabilityMillis
+                + 4 * delayVariabilityMillis + 300000
                 // and then the election update
                 + clusterStateUpdateDelay, "stabilising");

Ofc that's not the actual fix, we must work out for what this computation is not accounting.

* main: (175 commits) Fix integration test on Windows (elastic#89894) Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842) Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891) Fix typos in audit event types (elastic#89886) Synthetic _source: support histogram field (elastic#89833) [TSDB] Rename rollup public API to downsample (elastic#89809) Format script values access (elastic#89780) [DOCS] Simplifies composite aggregation recommendation (elastic#89878) [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906) Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873) [ML] Add processor autoscaling decider (elastic#89645) Update disk-usage.asciidoc (elastic#89709) (elastic#89874) Add allocation deciders in createComponents (elastic#89836) Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870) Fix typo in get-snapshot-status-api doc (elastic#89865) Picking master eligible node at random in the master stability health indicator (elastic#89841) Do not reuse the client after a disruption elastic#89815 (elastic#89866) [ML] Distribute trained model allocations across availability zones (elastic#89822) Increment clientCalledCount before onResponse (elastic#89858) AwaitsFix for elastic#89867 ...

* main: (175 commits) Fix integration test on Windows (elastic#89894) Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842) Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891) Fix typos in audit event types (elastic#89886) Synthetic _source: support histogram field (elastic#89833) [TSDB] Rename rollup public API to downsample (elastic#89809) Format script values access (elastic#89780) [DOCS] Simplifies composite aggregation recommendation (elastic#89878) [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906) Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873) [ML] Add processor autoscaling decider (elastic#89645) Update disk-usage.asciidoc (elastic#89709) (elastic#89874) Add allocation deciders in createComponents (elastic#89836) Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870) Fix typo in get-snapshot-status-api doc (elastic#89865) Picking master eligible node at random in the master stability health indicator (elastic#89841) Do not reuse the client after a disruption elastic#89815 (elastic#89866) [ML] Distribute trained model allocations across availability zones (elastic#89822) Increment clientCalledCount before onResponse (elastic#89858) AwaitsFix for elastic#89867 ... # Conflicts: # x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/downsample/RollupShardIndexer.java

* main: (283 commits) Fix integration test on Windows (elastic#89894) Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842) Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891) Fix typos in audit event types (elastic#89886) Synthetic _source: support histogram field (elastic#89833) [TSDB] Rename rollup public API to downsample (elastic#89809) Format script values access (elastic#89780) [DOCS] Simplifies composite aggregation recommendation (elastic#89878) [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906) Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873) [ML] Add processor autoscaling decider (elastic#89645) Update disk-usage.asciidoc (elastic#89709) (elastic#89874) Add allocation deciders in createComponents (elastic#89836) Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870) Fix typo in get-snapshot-status-api doc (elastic#89865) Picking master eligible node at random in the master stability health indicator (elastic#89841) Do not reuse the client after a disruption elastic#89815 (elastic#89866) [ML] Distribute trained model allocations across availability zones (elastic#89822) Increment clientCalledCount before onResponse (elastic#89858) AwaitsFix for elastic#89867 ...

kingherc · 2022-09-20T10:50:12Z

This is a hard. I started out to go back in history to figure out when this started appearing. It started appearing with https://github.com/elastic/elasticsearch/pull/86921/files , which removed the MockSinglePrioritizingExecutor in favor of some more java-native thread executor. With the old MockSinglePrioritizingExecutor, the specific test (and random seed) passes. However, in trying to understand the difference, I figured that MockSinglePrioritizingExecutor was flawed in multiple parts, e.g., it didn't emit a good string for the task to be executed, and also threw a KillWorkerError after executing a task, which created a different order of execution than the new java-native approach. I have the impression that during the KillWorkerError, the run() of the MockSinglePrioritizingExecutor was called again, and it might insert an additional "node join" tasks in the DeterministicTaskQueue, which might have been the reason why it could faster result in stabilization. In the end, due to all these differences, I ended up stopping the comparison with MockSinglePrioritizingExecutor. Will try to figure out with the code on main branch.

kingherc · 2022-09-20T10:57:26Z

@DaveCTurner I observe that simply adding another delayVariabilityMillis to the expression you mentioned passes the test. Would this be sufficient, or would we need to further understand the calculation?

DaveCTurner · 2022-09-20T11:02:52Z

I'd like to understand why we need another delay here. It could well be because #86921 added another delay, but how exactly does that feed into the calculation? Does it mean that clusterStateUpdateDelay should now be 8 * delayVariabilityMillis? If so, should we also update the definition of DEFAULT_CLUSTER_STATE_UPDATE_DELAY?

kingherc · 2022-09-20T11:16:52Z

I see, I am not perfectly yet aware of all these delays and what they mean. I will try to figure it out from the logs of comparing with MockSinglePrioritizingExecutor.

DaveCTurner · 2022-09-21T09:26:37Z

Does it mean that clusterStateUpdateDelay should now be 8 * delayVariabilityMillis?

Recording some notes here for future discussion: here's an excerpt of a log showing a commit which took from 5934060ms until 5934860ms, a total of 800ms.

[2022-09-20T08:19:59,812][INFO ][o.e.c.c.CoordinatorTests ] [testCanUpdateClusterStateAfterStabilisation] --> submitting value [2328060914443321389] to [{node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934160, task={node0}{hy7wWQAAQACJS0Cz_____w}: master service scheduling next task]
[2022-09-20T08:19:59,812][INFO ][o.e.c.c.CoordinatorTests ] [testCanUpdateClusterStateAfterStabilisation] --> runFor(700ms) running until [5934760ms]: stabilising

1: master service task

] advanceTime: from [5934060ms] to [5934160ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934160, task={node0}{hy7wWQAAQACJS0Cz_____w}: master service scheduling next task]
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: master service scheduling next task
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934260, task={node0}{hy7wWQAAQACJS0Cz_____w}: publish change of cluster state from version [3] in term [1] to version [4] in term [1]]

2: state publisher task (see FakeThreadPoolMasterService#publish)

] advanceTime: from [5934160ms] to [5934260ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934260, task={node0}{hy7wWQAAQACJS0Cz_____w}: publish change of cluster state from version [3] in term [1] to version [4] in term [1]]
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: publish change of cluster state from version [3] in term [1] to version [4] in term [1]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleAt: adding DeferredTask[executionTimeMillis=5964360, task={node0}{hy7wWQAAQACJS0Cz_____w}: scheduled timeout for Publication{term=1, version=4}] with extra delay of [100ms]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleAt: adding DeferredTask[executionTimeMillis=5944360, task={node0}{hy7wWQAAQACJS0Cz_____w}: scheduled timeout for reporting on Publication{term=1, version=4}] with extra delay of [100ms]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934360, task={node2}{MXR7YQAAQACXXRe5_____w}: [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934360, task={node1}{MhCa3f__T_-UXusCAAAAAA}: [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934360, task={node0}{hy7wWQAAQACJS0Cz_____w}: ActionRunnable#wrap[handling cluster state version [4] locally on [{node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}]]]

3: deliver publish requests

] advanceTime: from [5934260ms] to [5934274ms]
] advanceTime: from [5934274ms] to [5934300ms]
] advanceTime: from [5934300ms] to [5934302ms]
] advanceTime: from [5934302ms] to [5934360ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934360, task={node2}{MXR7YQAAQACXXRe5_____w}: [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934360, task={node1}{MhCa3f__T_-UXusCAAAAAA}: [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934360, task={node0}{hy7wWQAAQACJS0Cz_____w}: ActionRunnable#wrap[handling cluster state version [4] locally on [{node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}]]]
] running task 1 of 3: {node1}{MhCa3f__T_-UXusCAAAAAA}: [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}
{nodeId={node1}{MhCa3f__T_-UXusCAAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934460, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] running task 0 of 2: {node2}{MXR7YQAAQACXXRe5_____w}: [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}
{nodeId={node2}{MXR7YQAAQACXXRe5_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934460, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: ActionRunnable#wrap[handling cluster state version [4] locally on [{node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}]]

4: receive publish responses

] advanceTime: from [5934360ms] to [5934374ms]
] advanceTime: from [5934374ms] to [5934400ms]
] advanceTime: from [5934400ms] to [5934460ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934460, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934460, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
] running task 0 of 2: {node0}{hy7wWQAAQACJS0Cz_____w}: response to [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934560, task={node1}{MhCa3f__T_-UXusCAAAAAA}: [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934560, task={node0}{hy7wWQAAQACJS0Cz_____w}: processing of [489][internal:cluster/coordination/commit_state]: ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: response to [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934560, task={node2}{MXR7YQAAQACXXRe5_____w}: [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]

5: send commit requests

] advanceTime: from [5934460ms] to [5934474ms]
] advanceTime: from [5934474ms] to [5934500ms]
] advanceTime: from [5934500ms] to [5934560ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934560, task={node1}{MhCa3f__T_-UXusCAAAAAA}: [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934560, task={node0}{hy7wWQAAQACJS0Cz_____w}: processing of [489][internal:cluster/coordination/commit_state]: ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934560, task={node2}{MXR7YQAAQACXXRe5_____w}: [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
] running task 0 of 3: {node1}{MhCa3f__T_-UXusCAAAAAA}: [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}
{nodeId={node1}{MhCa3f__T_-UXusCAAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934660, task=DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]]
] running task 0 of 2: {node0}{hy7wWQAAQACJS0Cz_____w}: processing of [489][internal:cluster/coordination/commit_state]: ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934660, task={node0}{hy7wWQAAQACJS0Cz_____w}: delivery of response to [489][internal:cluster/coordination/commit_state]: Empty{}]
] running task 0 of 1: {node2}{MXR7YQAAQACXXRe5_____w}: [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}
{nodeId={node2}{MXR7YQAAQACXXRe5_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934660, task=DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]]

6: apply new state on remote nodes

] advanceTime: from [5934560ms] to [5934660ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934660, task=DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934660, task={node0}{hy7wWQAAQACJS0Cz_____w}: delivery of response to [489][internal:cluster/coordination/commit_state]: Empty{}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934660, task=DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]]
] running task 0 of 3: DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]
{nodeId={node1}{MhCa3f__T_-UXusCAAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934760, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] running task 1 of 2: DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]
{nodeId={node2}{MXR7YQAAQACXXRe5_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934760, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: delivery of response to [489][internal:cluster/coordination/commit_state]: Empty{}

7: receive commit responses

] advanceTime: from [5934660ms] to [5934683ms]
] advanceTime: from [5934683ms] to [5934760ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934760, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934760, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
--> runFor(700ms) completed run until [5934760ms]: stabilising
--> runFor(700ms) running until [5935460ms]: draining task queue
] running task 0 of 2: {node0}{hy7wWQAAQACJS0Cz_____w}: response to [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: response to [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934860, task=DisruptableClusterApplierService[[Publication{term=1, version=4}]]]

8: apply new state on master node

] advanceTime: from [5934760ms] to [5934766ms]
] advanceTime: from [5934766ms] to [5934783ms]
] advanceTime: from [5934783ms] to [5934860ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934860, task=DisruptableClusterApplierService[[Publication{term=1, version=4}]]]
] running task 0 of 1: DisruptableClusterApplierService[[Publication{term=1, version=4}]]

I don't think this was affected by #86921 because even the old MockSinglePrioritizingExecutor uses scheduleNow to queue up the task, so that PR didn't change the number of scheduleNow calls. Instead it looks like we added an extra scheduleNow call in #61455.

Fixes elastic#89867

* Add delay to DEFAULT_CLUSTER_STATE_UPDATE_DELAY Fixes #89867 * Convert to int * Update server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java Co-authored-by: David Turner <david.turner@elastic.co> Co-authored-by: David Turner <david.turner@elastic.co> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

gmarouli · 2022-10-28T14:18:00Z

I am afraid this happened again, see https://gradle-enterprise.elastic.co/s/cwlcjoocvaehg/tests/:server:test/org.elasticsearch.cluster.coordination.CoordinatorTests/testSingleNodeDiscoveryStabilisesEvenWhenDisrupted?top-execution=1. I am reopening this issue, if this is not the right thing to do, let me know and I will create a new one.

kingherc · 2022-10-31T17:34:57Z

Mary's issue is reproducible. Adding another delayVariabilityMillis in the runFor expression makes it pass. So, I delved once more into @DaveCTurner 's approach of running the testCanUpdateClusterStateAfterStabilisation test with fixed delays of 100ms in org.elasticsearch.common.util.concurrent.DeterministicTaskQueue#scheduleNow and scheduleAt functions. And I believe we still have 8 delays/stages in main for the DEFAULT_CLUSTER_STATE_UPDATE_DELAY. Here is an excerpt:

[2022-10-31T16:47:55,184][INFO ][o.e.c.c.CoordinatorTests ] [testCanUpdateClusterStateAfterStabilisation] --&gt; runFor(800ms) running until [2067930ms]: stabilising
[2022-10-31T16:47:55,184][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] running task 0 of 2: {node1}{lSqmpP__T_-qutLN_____w}: timeout handler for [190][internal:coordination/fault_detection/leader_check]
[2022-10-31T16:47:55,185][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] running task 0 of 1: {node0}{m4mHn___T_-dxn9_AAAAAA}: timeout handler for [204][internal:coordination/fault_detection/follower_check]

1: Master service task

[2022-10-31T16:47:55,185][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067130ms] to [2067230ms]
[2022-10-31T16:47:55,185][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067230, task={node0}{m4mHn___T_-dxn9_AAAAAA}: master service scheduling next task]
{m4mHn___T_-dxn9_AAAAAA}}] register 8 [master] [publish_cluster_state_update] [publication of cluster state [12]]
[2022-10-31T16:47:55,187][TRACE][o.e.c.s.MasterService    ] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] cluster state updated, source [new value [0=3964185402078690287]]
DeferredTask[executionTimeMillis=2067330, task={node0}{m4mHn___T_-dxn9_AAAAAA}: publish change of cluster state from version [11] in term [8] to version [12] in term [8]]

2: state publisher task

[2022-10-31T16:47:55,188][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067230ms] to [2067330ms]
[2022-10-31T16:47:55,188][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067330, task={node0}{m4mHn___T_-dxn9_AAAAAA}: publish change of cluster state from version [11] in term [8] to version [12] in term [8]]
[2022-10-31T16:47:55,206][TRACE][o.e.c.c.LeaderChecker    ] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] setCurrentNodes: [{node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}, {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}]
[2022-10-31T16:47:55,207][TRACE][o.e.c.c.C.CoordinatorPublication] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] publishing PublishRequest{term=8, version=12, state=cluster uuid: tYpV1P__T_-u0UMBAAAAAA [committed: true]
[2022-10-31T16:47:55,208][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=2067430, task={node0}{m4mHn___T_-dxn9_AAAAAA}: ActionRunnable#wrap[handling cluster state version [12] locally on [{node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}]]]

3: deliver publish requests

[2022-10-31T16:47:55,209][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067330ms] to [2067430ms]
[2022-10-31T16:47:55,209][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067430, task={node1}{lSqmpP__T_-qutLN_____w}: [212][internal:cluster/coordination/publish_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]
[2022-10-31T16:47:55,209][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067430, task={node0}{m4mHn___T_-dxn9_AAAAAA}: ActionRunnable#wrap[handling cluster state version [12] locally on [{node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}]]]
[2022-10-31T16:47:55,219][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=2067530, task={node0}{m4mHn___T_-dxn9_AAAAAA}: processing of [214][internal:cluster/coordination/commit_state]: ApplyCommitRequest{term=8, version=12, sourceNode={node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}}]

4: receive publish responses

[2022-10-31T16:47:55,220][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067430ms] to [2067530ms]
[2022-10-31T16:47:55,220][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067530, task={node0}{m4mHn___T_-dxn9_AAAAAA}: response to [212][internal:cluster/coordination/publish_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]
[2022-10-31T16:47:55,223][TRACE][o.e.t.TaskManager        ] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] unregister task for id: 213
[2022-10-31T16:47:55,223][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=2067630, task={node1}{lSqmpP__T_-qutLN_____w}: response to [198][internal:coordination/fault_detection/leader_check] from {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw} to {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}]

5: send commit requests

[2022-10-31T16:47:55,223][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067530ms] to [2067630ms]
[2022-10-31T16:47:55,223][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067630, task={node0}{m4mHn___T_-dxn9_AAAAAA}: delivery of response to [214][internal:cluster/coordination/commit_state]: Empty{}]
[2022-10-31T16:47:55,223][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067630, task={node0}{m4mHn___T_-dxn9_AAAAAA}: response to [213][internal:coordination/fault_detection/follower_check] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]
[2022-10-31T16:47:55,224][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067630, task={node1}{lSqmpP__T_-qutLN_____w}: [215][internal:cluster/coordination/commit_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]
[2022-10-31T16:47:55,225][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=2068730, task={node0}{m4mHn___T_-dxn9_AAAAAA}: FollowerChecker{discoveryNode={node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}, failureCountSinceLastSuccess=0, timeoutCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=3}::handleWakeUp] with extra delay of [100ms]

6: apply new state on remote nodes

[2022-10-31T16:47:55,226][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067630ms] to [2067730ms]
[2022-10-31T16:47:55,226][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067730, task=DisruptableClusterApplierService[[ApplyCommitRequest{term=8, version=12, sourceNode={node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}}]]]
[2022-10-31T16:47:55,226][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] running task 0 of 1: DisruptableClusterApplierService[[ApplyCommitRequest{term=8, version=12, sourceNode={node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}}]]
[2022-10-31T16:47:55,228][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node1}{lSqmpP__T_-qutLN_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=2067830, task={node0}{m4mHn___T_-dxn9_AAAAAA}: response to [215][internal:cluster/coordination/commit_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]

7: receive commit responses

[2022-10-31T16:47:55,228][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067730ms] to [2067830ms]
[2022-10-31T16:47:55,228][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067830, task={node0}{m4mHn___T_-dxn9_AAAAAA}: response to [215][internal:cluster/coordination/commit_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]
[2022-10-31T16:47:55,228][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] running task 0 of 1: {node0}{m4mHn___T_-dxn9_AAAAAA}: response to [215][internal:cluster/coordination/commit_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}
[2022-10-31T16:47:55,229][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=2067930, task=DisruptableClusterApplierService[[Publication{term=8, version=12}]]]
[2022-10-31T16:47:55,229][TRACE][o.e.c.c.C.CoordinatorPublication] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] onPossibleCompletion: [Publication{term=8, version=12}] was successful

8: apply new state on master node

[2022-10-31T16:47:55,229][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067830ms] to [2067930ms]
[2022-10-31T16:47:55,229][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067930, task=DisruptableClusterApplierService[[Publication{term=8, version=12}]]]
[2022-10-31T16:47:55,229][INFO ][o.e.c.c.CoordinatorTests ] [testCanUpdateClusterStateAfterStabilisation] --&gt; runFor(800ms) completed run until [2067930ms]: stabilising

So, I believe now this is a different problem. I tried to compare the outputs between the failed one and the successful one (where I added delayVariabilityMillis in the runFor expression).

Here is the failed one:

[2022-10-31T20:55:56,985][INFO ][o.e.c.c.CoordinatorTests ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] --&gt; runFor(3140601ms) running until [3140601ms]: stabilising
[2022-10-31T20:55:56,988][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [0ms] to [91ms]
...
[2022-10-31T20:56:01,076][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3073086ms] to [3080013ms]
[2022-10-31T20:56:01,076][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3080013, task={node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed]
[2022-10-31T20:56:01,076][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed
[2022-10-31T20:56:01,076][WARN ][o.e.c.c.JoinHelper       ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] last failed join attempt was 61ms ago, failed to join {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw} with JoinRequest{sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, minimumTerm=13, optionalJoin=Optional[Join{term=14, lastAcceptedTerm=7, lastAcceptedVersion=1, sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, targetNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}}]}
org.elasticsearch.transport.RemoteTransportException: [][0.0.0.0:1][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: incoming term 14 does not match current term 15
	at org.elasticsearch.cluster.coordination.CoordinationState.handleJoin(CoordinationState.java:218) ~[main/:?]
	at org.elasticsearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:1342) ~[main/:?]
	at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
	at org.elasticsearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:753) ~[main/:?]
	at org.elasticsearch.cluster.coordination.Coordinator$1.lambda$onResponse$1(Coordinator.java:612) ~[main/:?]
        ...
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850) ~[randomizedtesting-runner-2.8.0.jar:?]
	at java.lang.Thread.run(Thread.java:1589) ~[?:?]
[2022-10-31T20:56:01,087][WARN ][o.e.c.c.ClusterFormationFailureHelper] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] master not discovered or elected yet, an election requires a node with id [node0], have discovered possible quorum [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]; discovery will continue using [] from hosts providers and [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] from last-known cluster state; node term 15, last-accepted version 1 in term 7; joining [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] in term [15] has status [waiting for response] after [3.4m/208138ms]
[2022-10-31T20:56:01,088][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=3113554, task={node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed] with extra delay of [23541ms]
[2022-10-31T20:56:01,088][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3080013ms] to [3105607ms]
[2022-10-31T20:56:01,088][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3105607, task={node0}{t71ERgAAQACqOqV0AAAAAA}: PeerFinder handling wakeup]
[2022-10-31T20:56:01,089][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: PeerFinder handling wakeup
[2022-10-31T20:56:01,091][TRACE][o.e.d.PeerFinder         ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] probing master nodes from cluster state: [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]
[2022-10-31T20:56:01,092][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=3321985, task={node0}{t71ERgAAQACqOqV0AAAAAA}: PeerFinder handling wakeup] with extra delay of [215378ms]
[2022-10-31T20:56:01,093][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3105607ms] to [3113554ms]
[2022-10-31T20:56:01,093][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3113554, task={node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed]
[2022-10-31T20:56:01,093][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed
[2022-10-31T20:56:01,095][WARN ][o.e.c.c.ClusterFormationFailureHelper] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] master not discovered or elected yet, an election requires a node with id [node0], have discovered possible quorum [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]; discovery will continue using [] from hosts providers and [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] from last-known cluster state; node term 15, last-accepted version 1 in term 7; joining [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] in term [15] has status [waiting for response] after [4m/241679ms]
[2022-10-31T20:56:01,095][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=3269433, task={node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed] with extra delay of [145879ms]
[2022-10-31T20:56:01,095][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3113554ms] to [3135001ms]
[2022-10-31T20:56:01,096][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3135001, task={node0}{t71ERgAAQACqOqV0AAAAAA}: timeout handler for [91][internal:cluster/formation/info]]
[2022-10-31T20:56:01,096][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: timeout handler for [91][internal:cluster/formation/info]
[2022-10-31T20:56:01,096][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3135001ms] to [3150655ms]
[2022-10-31T20:56:01,097][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3150655, task={node0}{t71ERgAAQACqOqV0AAAAAA}: timeout handler for [92][internal:cluster/formation/info]]
[2022-10-31T20:56:01,097][INFO ][o.e.c.c.CoordinatorTests ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] --&gt; runFor(3140601ms) completed run until [3140601ms]: stabilising

And here's the one from the successful one (which simply runs longer):

[2022-10-31T21:16:40,865][INFO ][o.e.c.c.CoordinatorTests ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] --&gt; runFor(3382178ms) running until [3382178ms]: stabilising
[2022-10-31T21:16:40,866][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [0ms] to [91ms]
...
[2022-10-31T21:16:41,804][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3073086ms] to [3080013ms]
[2022-10-31T21:16:41,804][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3080013, task={node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed]
[2022-10-31T21:16:41,804][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed
[2022-10-31T21:16:41,804][WARN ][o.e.c.c.JoinHelper       ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] last failed join attempt was 13ms ago, failed to join {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw} with JoinRequest{sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, minimumTerm=13, optionalJoin=Optional[Join{term=14, lastAcceptedTerm=7, lastAcceptedVersion=1, sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, targetNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}}]}
org.elasticsearch.transport.RemoteTransportException: [][0.0.0.0:1][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: incoming term 14 does not match current term 15
	at org.elasticsearch.cluster.coordination.CoordinationState.handleJoin(CoordinationState.java:218) ~[main/:?]
	at org.elasticsearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:1342) ~[main/:?]
	at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
	at org.elasticsearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:753) ~[main/:?]
...
[2022-10-31T21:16:41,852][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3306117ms] to [3306416ms]
[2022-10-31T21:16:41,852][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3306416, task={node0}{t71ERgAAQACqOqV0AAAAAA}: delivery of response to [95][internal:cluster/coordination/join/ping]: Empty{}]
[2022-10-31T21:16:41,852][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: delivery of response to [95][internal:cluster/coordination/join/ping]: Empty{}
[2022-10-31T21:16:41,852][DEBUG][o.e.c.c.CoordinationState] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] handleJoin: ignored join due to term mismatch (expected: [18], actual: [15])
[2022-10-31T21:16:41,852][DEBUG][o.e.c.c.Coordinator      ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] failed to add Join{term=15, lastAcceptedTerm=7, lastAcceptedVersion=1, sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, targetNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}} - ignoring
org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: incoming term 15 does not match current term 18
	at org.elasticsearch.cluster.coordination.CoordinationState.handleJoin(CoordinationState.java:218) ~[main/:?]
	at org.elasticsearch.cluster.coordination.Coordinator.handleJoinIgnoringExceptions(Coordinator.java:1352) ~[main/:?]
	at org.elasticsearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:1332) ~[main/:?]
	at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
        ...
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850) ~[randomizedtesting-runner-2.8.0.jar:?]
	at java.lang.Thread.run(Thread.java:1589) ~[?:?]
[2022-10-31T21:16:41,855][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3306416ms] to [3321985ms]
[2022-10-31T21:16:41,856][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3321985, task={node0}{t71ERgAAQACqOqV0AAAAAA}: PeerFinder handling wakeup]
[2022-10-31T21:16:41,856][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: PeerFinder handling wakeup
[2022-10-31T21:16:41,856][TRACE][o.e.d.PeerFinder         ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] not active
[2022-10-31T21:16:41,856][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3321985ms] to [3325256ms]
[2022-10-31T21:16:41,856][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3325256, task={node0}{t71ERgAAQACqOqV0AAAAAA}: async initialization of indices lookup for cluster state 2]
[2022-10-31T21:16:41,856][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: async initialization of indices lookup for cluster state 2
[2022-10-31T21:16:41,857][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3325256ms] to [3342304ms]
[2022-10-31T21:16:41,857][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3342304, task=DisruptableClusterApplierService[[joining {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]]]
[2022-10-31T21:16:41,857][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: DisruptableClusterApplierService[[joining {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]]
[2022-10-31T21:16:41,857][DEBUG][o.e.c.s.ClusterApplierService] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] processing [joining {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]: execute
[2022-10-31T21:16:41,857][DEBUG][o.e.c.s.ClusterApplierService] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] processing [joining {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]: took [0s] no change in cluster state
[2022-10-31T21:16:41,857][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [109][internal:cluster/coordination/join] sent to [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] (timeout: [null])
[2022-10-31T21:16:41,857][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [109][internal:cluster/coordination/join] received request
[2022-10-31T21:16:41,857][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleNow: delaying [205799ms], scheduling DeferredTask[executionTimeMillis=3548103, task={node0}{t71ERgAAQACqOqV0AAAAAA}: processing of [109][internal:cluster/coordination/join]: JoinRequest{sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, minimumTerm=15, optionalJoin=Optional[Join{term=16, lastAcceptedTerm=7, lastAcceptedVersion=1, sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, targetNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}}]}]
[2022-10-31T21:16:41,858][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3342304ms] to [3359474ms]
[2022-10-31T21:16:41,858][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3359474, task={node0}{t71ERgAAQACqOqV0AAAAAA}: org.elasticsearch.cluster.coordination.CoordinationDiagnosticsService$$Lambda$1189/0x000000080148e7a8@78688d8a]
[2022-10-31T21:16:41,858][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: org.elasticsearch.cluster.coordination.CoordinationDiagnosticsService$$Lambda$1189/0x000000080148e7a8@78688d8a
[2022-10-31T21:16:41,859][TRACE][o.e.c.c.CoordinationDiagnosticsService] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] Opened connection to {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, making transport request
[2022-10-31T21:16:41,859][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=3440500, task={node0}{t71ERgAAQACqOqV0AAAAAA}: timeout handler for [110][internal:cluster/formation/info]] with extra delay of [71026ms]
[2022-10-31T21:16:41,859][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [110][internal:cluster/formation/info] sent to [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] (timeout: [10s])
[2022-10-31T21:16:41,859][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [110][internal:cluster/formation/info] received request
[2022-10-31T21:16:41,859][TRACE][o.e.t.TaskManager        ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] register 109 [direct] [internal:cluster/formation/info] []
[2022-10-31T21:16:41,859][TRACE][o.e.t.TaskManager        ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] unregister task for id: 109
[2022-10-31T21:16:41,859][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [110][internal:cluster/formation/info] sent response
[2022-10-31T21:16:41,859][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [110][internal:cluster/formation/info] received response from [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]
[2022-10-31T21:16:41,860][TRACE][o.e.c.c.CoordinationDiagnosticsService] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] Received remote response from {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw} in 44.3ms
[2022-10-31T21:16:41,860][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=3587329, task={node0}{t71ERgAAQACqOqV0AAAAAA}: org.elasticsearch.cluster.coordination.CoordinationDiagnosticsService$$Lambda$1189/0x000000080148e7a8@72067920] with extra delay of [217855ms]
[2022-10-31T21:16:41,860][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3359474ms] to [3402394ms]
[2022-10-31T21:16:41,860][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3402394, task={node0}{t71ERgAAQACqOqV0AAAAAA}: publish change of cluster state from version [1] in term [7] to version [2] in term [18]]
[2022-10-31T21:16:41,860][INFO ][o.e.c.c.CoordinatorTests ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] --&gt; runFor(3382178ms) completed run until [3382178ms]: stabilising

From my rather inexperienced understanding, and looking at the above comparisons, I believe that the specific seed makes the test cluster unstable for much longer than expected in the test's runFor() formula. I see it because we see Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: incoming term 14 does not match current term 15 as the last thing of the failed test, but we also see it in the middle of the successful one (at exactly the same spot of advanceTime: from [3073086ms] to [3080013ms], which hints that the tests have done the same steps so far) which goes on with more failing terms, until it finally succeeds.

I hope @DaveCTurner may be able to shed more light on how to continue on this one.

DaveCTurner · 2022-11-02T12:35:52Z

This seems to be a genuine (but very low-impact) bug. In the failing run the delay variability is set to a massive 241577ms (i.e. just over 4 minutes). The ElectionScheduler is triggering elections much more frequently than this so there's lots of concurrent elections in flight. Many of these elections pass the request_pre_vote stage and increment the term, which causes earlier elections to fail, but are then superseded by a later election before they can complete. It's something of a coincidence that one more delayVariabilityMillis is enough to complete an election. It could take many more rounds in theory.

However when using single-node discovery there should be no need to schedule multiple elections like this because there's only one node involved so the first election should succeed. Maybe something like this is enough?

diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
index 33e3528246d..7d9b0b666e3 100644
--- a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
+++ b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
@@ -1628,6 +1628,10 @@ public class Coordinator extends AbstractLifecycleComponent implements ClusterSt
                         }

                         if (prevotingRound != null) {
+                            if (singleNodeDiscovery) {
+                                logger.debug("skip prevoting as single-node discovery is enabled and prevoting is already in progress");
+                                return;
+                            }
                             prevotingRound.close();
                         }
                         prevotingRound = preVoteCollector.start(lastAcceptedState, getDiscoveredNodes());

kingherc · 2022-11-02T16:27:32Z

@DaveCTurner wow, nice, thanks for taking a look and understanding it so quickly! I just tried that and it works. Will open a PR. I'm a bit confused with the debug message skip prevoting as single-node discovery is enabled and prevoting is already in progress, because it mentions to skip something while it is already in progress. Will take it as is, but feel free to tell me if it could be better formulated.

DaveCTurner · 2022-11-02T16:36:48Z

I didn't put much thought into the message, feel free to adjust as you see fit :)

Fixes elastic#89867

Fixes #89867

idegtiarenko · 2022-11-04T13:35:17Z

I had one of those failures on my feature branch: https://gradle-enterprise.elastic.co/s/lcvafnq7rt4jm/tests/:server:test/org.elasticsearch.cluster.coordination.CoordinatorTests/testSingleNodeDiscoveryStabilisesEvenWhenDisrupted?top-execution=1

I believe I had this fix #91255 when the failure occured

DaveCTurner · 2022-11-04T13:55:19Z

Thanks Ievgen, this looks to be a problem caused by #91255.

kingherc · 2022-11-04T13:59:21Z

This ticket is haunted. You fix one thing, another thing pops up :)
@dct , any reason you believe the PR caused the new failure?
I will try to re-produce it now

DaveCTurner · 2022-11-04T14:07:41Z

Yes, the problem in this failure case is that an election happens within runRandomly but fails due to a (simulated) IO exception, but we don't have a way to remove prevotingRound when this happens so the change we made in #91255 just blocks all future election attempts.

kingherc · 2022-11-04T14:36:24Z

I cannot reproduce the new failure on main (I ran 10 iterations of it). @DaveCTurner , how did you understand all that? :) I cannot see any logs in the test failure.

kingherc · 2022-11-04T14:37:58Z

we don't have a way to remove prevotingRound when this happens

Is this something that can happen in reality and thus we introduced a new bug?

DaveCTurner · 2022-11-04T14:50:52Z

I cannot reproduce the new failure on main

The REPRODUCE WITH line works on the feature/desired-balance-allocator branch (but not on main).

Is this something that can happen in reality and thus we introduced a new bug?

Yes I think so.

DaveCTurner · 2022-11-04T15:26:00Z

I think we should revert #91255 because the failure it introduces seems much more frequent than the one it fixes (a few hundred iterations vs a few thousand)

kingherc · 2022-11-04T15:29:33Z

BTW, I also reproduced it in main. Did not take many iterations until it popped up.

I was thinking of how we go to solve both issues. I think either:

Reverting Skip prevoting if single-node discovery #91255 and maybe configure the elections to not happen so frequently (e.g., adjusting ELECTION_INITIAL_TIMEOUT_SETTING for this test to be equal to the delay variability).
Or keeping Skip prevoting if single-node discovery #91255 and making some special coding for single-node discovery to re-kick prevoting if there was a failure.

@DaveCTurner what do you think? Guessing from your last comment, would the first approach worth trying?

DaveCTurner · 2022-11-04T15:35:59Z

I think the second approach would be more robust, but also much more effort because the plumbing for that failure handling just doesn't exist today. And this is a super-minor bug that almost certainly eventually goes away with enough retries so not really worth the effort.

Therefore I favour the first. I've just pushed bdf3485 which reverts #91255. I'll also push a commit which mutes this test while we continue to work on it.

DaveCTurner · 2022-11-04T15:42:38Z

I think the best setting to adjust would be cluster.election.duration which represents a kind of election timeout after which we start to think about scheduling the next election. Today that cannot be set greater than 5 mins, but with the delay variability in this test going up to 10min sometimes we'd need a much longer max. I think that's ok.

kingherc · 2022-11-04T15:44:19Z

I can test this.

I think that's ok.

You mean it's OK to set the delay variability to up to 5 mins instead of 10mins?

DaveCTurner · 2022-11-04T15:47:55Z

No I mean it should be ok to relax the upper bound on cluster.election.duration. The election takes multiple steps so needs to be much longer than delayVariabilityMillis - I reckon something like delayVariabilityMillis + 4*delayVariabilityMillis + clusterStateUpdateDelay which adds up to 2h10min.

kingherc · 2022-11-04T15:49:35Z

@DaveCTurner , so I should change the max of the cluster.election.duration to 10mins instead of 5mins in production? Or is there a way I can change the max just for the test?

kingherc · 2022-11-04T15:55:24Z

By the way, I notice that cluster.election.max_timeout also has 5 mins as maximum. Is that related to the election duration max value? If yes, should I change the max_timeout's max value to also be 10 mins?

kingherc · 2022-11-04T15:56:48Z

FYI, setting the cluster.election.duration to delayVariabilityMillis if the latter is bigger than the default duration, worked for one example I had. I will re-try now to run a lot of iterations to make sure now it's stable finally.

kingherc · 2022-11-04T17:53:06Z

OK, after many iterations (>1000) it looks good. I will open the PR.

By increasing the voting duration in case of high delays, to avoid the possible endless repetition of voting rounds. Fixes elastic#89867

By increasing the voting duration in case of high delays, to avoid the possible endless repetition of voting rounds. Fixes #89867

DaveCTurner added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Sep 7, 2022

elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 7, 2022

DaveCTurner added a commit that referenced this issue Sep 7, 2022

AwaitsFix for #89867

80e666a

kingherc self-assigned this Sep 13, 2022

kingherc mentioned this issue Sep 20, 2022

[CI] CoordinatorTests testUnhealthyLeaderIsReplaced failing #90158

Closed

kingherc added a commit to kingherc/elasticsearch that referenced this issue Sep 22, 2022

Add delay to DEFAULT_CLUSTER_STATE_UPDATE_DELAY

dcbdbe2

Fixes elastic#89867

kingherc mentioned this issue Sep 22, 2022

Add delay to DEFAULT_CLUSTER_STATE_UPDATE_DELAY #90217

Merged

kingherc closed this as completed in #90217 Sep 25, 2022

gmarouli reopened this Oct 28, 2022

kingherc added a commit to kingherc/elasticsearch that referenced this issue Nov 2, 2022

Skip prevoting if single-node discovery

6db3bcc

Fixes elastic#89867

kingherc mentioned this issue Nov 2, 2022

Skip prevoting if single-node discovery #91255

Merged

kingherc closed this as completed in #91255 Nov 2, 2022

kingherc added a commit that referenced this issue Nov 2, 2022

Skip prevoting if single-node discovery (#91255)

611c9b8

Fixes #89867

DaveCTurner reopened this Nov 4, 2022

kingherc added a commit to kingherc/elasticsearch that referenced this issue Nov 4, 2022

Fix SingleNodeDiscoveryStabilisesEvenWhenDisrupted

9021b78

By increasing the voting duration in case of high delays, to avoid the possible endless repetition of voting rounds. Fixes elastic#89867

kingherc mentioned this issue Nov 4, 2022

Fix SingleNodeDiscoveryStabilisesEvenWhenDisrupted #91325

Merged

kingherc closed this as completed in #91325 Nov 7, 2022

kingherc added a commit that referenced this issue Nov 7, 2022

Fix SingleNodeDiscoveryStabilisesEvenWhenDisrupted (#91325)

d83d19a

By increasing the voting duration in case of high delays, to avoid the possible endless repetition of voting rounds. Fixes #89867

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failure in CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted #89867

Failure in CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted #89867

DaveCTurner commented Sep 7, 2022 •

edited

Loading

elasticsearchmachine commented Sep 7, 2022

DaveCTurner commented Sep 7, 2022

kingherc commented Sep 20, 2022

kingherc commented Sep 20, 2022

DaveCTurner commented Sep 20, 2022

kingherc commented Sep 20, 2022

DaveCTurner commented Sep 21, 2022

gmarouli commented Oct 28, 2022

kingherc commented Oct 31, 2022 •

edited

Loading

DaveCTurner commented Nov 2, 2022 •

edited

Loading

kingherc commented Nov 2, 2022

DaveCTurner commented Nov 2, 2022

idegtiarenko commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022

kingherc commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022

kingherc commented Nov 4, 2022

kingherc commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022 •

edited

Loading

kingherc commented Nov 4, 2022 •

edited

Loading

DaveCTurner commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022

kingherc commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022

kingherc commented Nov 4, 2022

kingherc commented Nov 4, 2022

kingherc commented Nov 4, 2022

kingherc commented Nov 4, 2022

Failure in CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted #89867

Failure in CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted #89867

Comments

DaveCTurner commented Sep 7, 2022 • edited Loading

CI Link

Repro line

Does it reproduce?

Applicable branches

Failure history

Failure excerpt

elasticsearchmachine commented Sep 7, 2022

DaveCTurner commented Sep 7, 2022

kingherc commented Sep 20, 2022

kingherc commented Sep 20, 2022

DaveCTurner commented Sep 20, 2022

kingherc commented Sep 20, 2022

DaveCTurner commented Sep 21, 2022

gmarouli commented Oct 28, 2022

kingherc commented Oct 31, 2022 • edited Loading

DaveCTurner commented Nov 2, 2022 • edited Loading

kingherc commented Nov 2, 2022

DaveCTurner commented Nov 2, 2022

idegtiarenko commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022

kingherc commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022

kingherc commented Nov 4, 2022

kingherc commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022 • edited Loading

kingherc commented Nov 4, 2022 • edited Loading

DaveCTurner commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022

kingherc commented Nov 4, 2022

DaveCTurner commented Nov 4, 2022

kingherc commented Nov 4, 2022

kingherc commented Nov 4, 2022

kingherc commented Nov 4, 2022

kingherc commented Nov 4, 2022

DaveCTurner commented Sep 7, 2022 •

edited

Loading

kingherc commented Oct 31, 2022 •

edited

Loading

DaveCTurner commented Nov 2, 2022 •

edited

Loading

DaveCTurner commented Nov 4, 2022 •

edited

Loading

kingherc commented Nov 4, 2022 •

edited

Loading