
Failure in CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted #89867

Closed
DaveCTurner opened this issue Sep 7, 2022 · 29 comments · Fixed by #90217, #91255 or #91325
Labels
:Distributed Coordination/Cluster Coordination - Cluster formation and cluster state publication, including cluster membership and fault detection.
Team:Distributed (Obsolete) - Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
>test-failure - Triaged test failures from CI

Comments

@DaveCTurner
Contributor

DaveCTurner commented Sep 7, 2022

CI Link

https://gradle-enterprise.elastic.co/s/yckwl5zjomfas

Repro line

./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted" -Dtests.seed=949AB28501F8ED37:BBC20DD1A5D4EF81

Does it reproduce?

Yes

Applicable branches

main

Failure history

https://ci-stats.elastic.co/s/jenkins/goto/da1547f0-2ea6-11ed-8233-4bd40c8511bb - 3 similar-looking failures over the last 90 days

Failure excerpt

    java.lang.AssertionError: leaders
    Expected: not an empty collection
         but: was <[]>
        at __randomizedtesting.SeedInfo.seed([949AB28501F8ED37:BBC20DD1A5D4EF81]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.getAnyLeader(AbstractCoordinatorTestCase.java:791)
        at org.elasticsearch.cluster.coordination.CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted(CoordinatorTests.java:1939)


Suite: Test class org.elasticsearch.cluster.coordination.CoordinatorTests
  2> REPRODUCE WITH: ./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted" -Dtests.seed=949AB28501F8ED37 -Dtests.locale=en-AU -Dtests.timezone=Europe/Vienna -Druntime.java=18
  2> java.lang.AssertionError: leaders
    Expected: not an empty collection
         but: was <[]>
        at __randomizedtesting.SeedInfo.seed([949AB28501F8ED37:BBC20DD1A5D4EF81]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.getAnyLeader(AbstractCoordinatorTestCase.java:791)
        at org.elasticsearch.cluster.coordination.CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted(CoordinatorTests.java:1939)
  2> NOTE: leaving temporary files on disk at: /home/davidturner/src/elasticsearch/server/build/testrun/test/temp/org.elasticsearch.cluster.coordination.CoordinatorTests_949AB28501F8ED37-002
  2> NOTE: test params are: codec=Asserting(Lucene94): {}, docValues:{}, maxPointsInLeafNode=1598, maxMBSortInHeap=6.485414969465074, sim=Asserting(RandomSimilarity(queryNorm=true): {}), locale=en-AU, timezone=Europe/Vienna
  2> NOTE: Linux 5.4.0-125-generic amd64/Oracle Corporation 18.0.2.1 (64-bit)/cpus=8,threads=1,free=465272544,total=536870912
  2> NOTE: All tests run in this JVM: [CoordinatorTests]

Tests with failures:
 - org.elasticsearch.cluster.coordination.CoordinatorTests.testSingleNodeDiscoveryStabilisesEvenWhenDisrupted
DaveCTurner added the >test-failure and :Distributed Coordination/Cluster Coordination labels Sep 7, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine added the Team:Distributed (Obsolete) label Sep 7, 2022
DaveCTurner added a commit that referenced this issue Sep 7, 2022
@DaveCTurner
Contributor Author

Looks like the stabilisation timeout is just too short, because the following patch fixes this failure:

diff --git a/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java b/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
index ca7cb106be6..49cd4a3a15b 100644
--- a/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
+++ b/server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
@@ -1934,7 +1934,7 @@ public class CoordinatorTests extends AbstractCoordinatorTestCase {
             // detection and ongoing publications do not time out
             cluster.runFor(ELECTION_INITIAL_TIMEOUT_SETTING.get(Settings.EMPTY).millis() + delayVariabilityMillis
             // two round trips for pre-voting and voting
-                + 4 * delayVariabilityMillis
+                + 4 * delayVariabilityMillis + 300000
                 // and then the election update
                 + clusterStateUpdateDelay, "stabilising");

Of course that's not the actual fix; we need to work out what this computation is not accounting for.
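
For reference, the budget that runFor expression assembles can be restated as a standalone calculation. This is a minimal sketch, not the test code itself: the names mirror the diff above, and the concrete values are assumptions (the 100ms figure matches the fixed delay variability used in the traces further down, and 800ms is the commit duration observed there).

    // Illustrative sketch of the "stabilising" budget from the diff above; not the actual test code.
    class StabilisationBudgetSketch {
        static long budgetMillis(long electionInitialTimeout, long delayVariability, long clusterStateUpdateDelay) {
            return electionInitialTimeout + delayVariability // election scheduling
                + 4 * delayVariability                       // two round trips for pre-voting and voting
                + clusterStateUpdateDelay;                   // and then the election's cluster state update
        }

        public static void main(String[] args) {
            long delayVariability = 100;        // assumed: the fixed 100ms delay used in the traces below
            long electionInitialTimeout = 100;  // assumed value, for illustration only
            long clusterStateUpdateDelay = 800; // assumed: the 800ms commit observed in the excerpt below
            System.out.println(budgetMillis(electionInitialTimeout, delayVariability, clusterStateUpdateDelay) + "ms");
        }
    }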

weizijun added a commit to weizijun/elasticsearch that referenced this issue Sep 8, 2022
* main: (175 commits)
  Fix integration test on Windows (elastic#89894)
  Avoiding the use of dynamic map keys in the cluster_formation results of the stable master health indicator (elastic#89842)
  Mute org.elasticsearch.tracing.apm.ApmIT.testCapturesTracesForHttpTraffic (elastic#89891)
  Fix typos in audit event types (elastic#89886)
  Synthetic _source: support histogram field (elastic#89833)
  [TSDB] Rename rollup public API to downsample  (elastic#89809)
  Format script values access (elastic#89780)
  [DOCS] Simplifies composite aggregation recommendation (elastic#89878)
  [DOCS] Update CCS compatibility matrix for 8.3 (elastic#88906)
  Fix memory leak when double invoking RestChannel.sendResponse (elastic#89873)
  [ML] Add processor autoscaling decider (elastic#89645)
  Update disk-usage.asciidoc (elastic#89709) (elastic#89874)
  Add allocation deciders in createComponents (elastic#89836)
  Mute flaky H3LatLonGeometryTest.testIndexPoints (elastic#89870)
  Fix typo in get-snapshot-status-api doc (elastic#89865)
  Picking master eligible node at random in the master stability health indicator (elastic#89841)
  Do not reuse the client after a disruption elastic#89815 (elastic#89866)
  [ML] Distribute trained model allocations across availability zones (elastic#89822)
  Increment clientCalledCount before onResponse (elastic#89858)
  AwaitsFix for elastic#89867
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this issue Sep 8, 2022
weizijun added a commit to weizijun/elasticsearch that referenced this issue Sep 8, 2022
kingherc self-assigned this Sep 13, 2022
@kingherc
Contributor

This is a hard one. I started by going back in history to figure out when this started appearing. It started with https://github.com/elastic/elasticsearch/pull/86921/files , which removed the MockSinglePrioritizingExecutor in favor of a more Java-native thread executor. With the old MockSinglePrioritizingExecutor, the specific test (and random seed) passes. However, while trying to understand the difference, I found that MockSinglePrioritizingExecutor was flawed in several ways: for example, it didn't emit a good string for the task to be executed, and it threw a KillWorkerError after executing a task, which produced a different order of execution than the new Java-native approach. I have the impression that during the KillWorkerError the run() of the MockSinglePrioritizingExecutor was called again, and that it might insert an additional "node join" task into the DeterministicTaskQueue, which might be why it reached stabilisation faster. In the end, due to all these differences, I stopped comparing against MockSinglePrioritizingExecutor and will try to figure this out with the code on the main branch.

@kingherc
Contributor

@DaveCTurner I observe that simply adding another delayVariabilityMillis to the expression you mentioned makes the test pass. Would this be sufficient, or do we need to understand the calculation further?

@DaveCTurner
Contributor Author

I'd like to understand why we need another delay here. It could well be because #86921 added another delay, but how exactly does that feed into the calculation? Does it mean that clusterStateUpdateDelay should now be 8 * delayVariabilityMillis? If so, should we also update the definition of DEFAULT_CLUSTER_STATE_UPDATE_DELAY?

@kingherc
Contributor

I see; I'm not yet fully familiar with all these delays and what they mean. I will try to figure it out from the logs by comparing with MockSinglePrioritizingExecutor.

@DaveCTurner
Contributor Author

Does it mean that clusterStateUpdateDelay should now be 8 * delayVariabilityMillis?

Recording some notes here for future discussion: here's an excerpt of a log showing a commit which took from 5934060ms until 5934860ms, a total of 800ms.

[2022-09-20T08:19:59,812][INFO ][o.e.c.c.CoordinatorTests ] [testCanUpdateClusterStateAfterStabilisation] --> submitting value [2328060914443321389] to [{node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934160, task={node0}{hy7wWQAAQACJS0Cz_____w}: master service scheduling next task]
[2022-09-20T08:19:59,812][INFO ][o.e.c.c.CoordinatorTests ] [testCanUpdateClusterStateAfterStabilisation] --> runFor(700ms) running until [5934760ms]: stabilising

1: master service task

] advanceTime: from [5934060ms] to [5934160ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934160, task={node0}{hy7wWQAAQACJS0Cz_____w}: master service scheduling next task]
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: master service scheduling next task
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934260, task={node0}{hy7wWQAAQACJS0Cz_____w}: publish change of cluster state from version [3] in term [1] to version [4] in term [1]]

2: state publisher task (see FakeThreadPoolMasterService#publish)

] advanceTime: from [5934160ms] to [5934260ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934260, task={node0}{hy7wWQAAQACJS0Cz_____w}: publish change of cluster state from version [3] in term [1] to version [4] in term [1]]
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: publish change of cluster state from version [3] in term [1] to version [4] in term [1]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleAt: adding DeferredTask[executionTimeMillis=5964360, task={node0}{hy7wWQAAQACJS0Cz_____w}: scheduled timeout for Publication{term=1, version=4}] with extra delay of [100ms]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleAt: adding DeferredTask[executionTimeMillis=5944360, task={node0}{hy7wWQAAQACJS0Cz_____w}: scheduled timeout for reporting on Publication{term=1, version=4}] with extra delay of [100ms]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934360, task={node2}{MXR7YQAAQACXXRe5_____w}: [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934360, task={node1}{MhCa3f__T_-UXusCAAAAAA}: [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934360, task={node0}{hy7wWQAAQACJS0Cz_____w}: ActionRunnable#wrap[handling cluster state version [4] locally on [{node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}]]]

3: deliver publish requests

] advanceTime: from [5934260ms] to [5934274ms]
] advanceTime: from [5934274ms] to [5934300ms]
] advanceTime: from [5934300ms] to [5934302ms]
] advanceTime: from [5934302ms] to [5934360ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934360, task={node2}{MXR7YQAAQACXXRe5_____w}: [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934360, task={node1}{MhCa3f__T_-UXusCAAAAAA}: [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934360, task={node0}{hy7wWQAAQACJS0Cz_____w}: ActionRunnable#wrap[handling cluster state version [4] locally on [{node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}]]]
] running task 1 of 3: {node1}{MhCa3f__T_-UXusCAAAAAA}: [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}
{nodeId={node1}{MhCa3f__T_-UXusCAAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934460, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] running task 0 of 2: {node2}{MXR7YQAAQACXXRe5_____w}: [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}
{nodeId={node2}{MXR7YQAAQACXXRe5_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934460, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: ActionRunnable#wrap[handling cluster state version [4] locally on [{node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}]]

4: receive publish responses

] advanceTime: from [5934360ms] to [5934374ms]
] advanceTime: from [5934374ms] to [5934400ms]
] advanceTime: from [5934400ms] to [5934460ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934460, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934460, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
] running task 0 of 2: {node0}{hy7wWQAAQACJS0Cz_____w}: response to [486][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934560, task={node1}{MhCa3f__T_-UXusCAAAAAA}: [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934560, task={node0}{hy7wWQAAQACJS0Cz_____w}: processing of [489][internal:cluster/coordination/commit_state]: ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: response to [485][internal:cluster/coordination/publish_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934560, task={node2}{MXR7YQAAQACXXRe5_____w}: [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]

5: send commit requests

] advanceTime: from [5934460ms] to [5934474ms]
] advanceTime: from [5934474ms] to [5934500ms]
] advanceTime: from [5934500ms] to [5934560ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934560, task={node1}{MhCa3f__T_-UXusCAAAAAA}: [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934560, task={node0}{hy7wWQAAQACJS0Cz_____w}: processing of [489][internal:cluster/coordination/commit_state]: ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934560, task={node2}{MXR7YQAAQACXXRe5_____w}: [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
] running task 0 of 3: {node1}{MhCa3f__T_-UXusCAAAAAA}: [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}
{nodeId={node1}{MhCa3f__T_-UXusCAAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934660, task=DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]]
] running task 0 of 2: {node0}{hy7wWQAAQACJS0Cz_____w}: processing of [489][internal:cluster/coordination/commit_state]: ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934660, task={node0}{hy7wWQAAQACJS0Cz_____w}: delivery of response to [489][internal:cluster/coordination/commit_state]: Empty{}]
] running task 0 of 1: {node2}{MXR7YQAAQACXXRe5_____w}: [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}
{nodeId={node2}{MXR7YQAAQACXXRe5_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934660, task=DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]]

6: apply new state on remote nodes

] advanceTime: from [5934560ms] to [5934660ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934660, task=DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934660, task={node0}{hy7wWQAAQACJS0Cz_____w}: delivery of response to [489][internal:cluster/coordination/commit_state]: Empty{}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934660, task=DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]]
] running task 0 of 3: DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]
{nodeId={node1}{MhCa3f__T_-UXusCAAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934760, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] running task 1 of 2: DisruptableClusterApplierService[[ApplyCommitRequest{term=1, version=4, sourceNode={node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw}}]]
{nodeId={node2}{MXR7YQAAQACXXRe5_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934760, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: delivery of response to [489][internal:cluster/coordination/commit_state]: Empty{}

7: receive commit responses

] advanceTime: from [5934660ms] to [5934683ms]
] advanceTime: from [5934683ms] to [5934760ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934760, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934760, task={node0}{hy7wWQAAQACJS0Cz_____w}: response to [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}]
--> runFor(700ms) completed run until [5934760ms]: stabilising
--> runFor(700ms) running until [5935460ms]: draining task queue
] running task 0 of 2: {node0}{hy7wWQAAQACJS0Cz_____w}: response to [488][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node1}{MhCa3f__T_-UXusCAAAAAA}{0.0.0.0}{0.0.0.0:20}{cdfhilmrstw}
] running task 0 of 1: {node0}{hy7wWQAAQACJS0Cz_____w}: response to [490][internal:cluster/coordination/commit_state] from {node0}{hy7wWQAAQACJS0Cz_____w}{0.0.0.0}{0.0.0.0:21}{cdfhilmrstw} to {node2}{MXR7YQAAQACXXRe5_____w}{0.0.0.0}{0.0.0.0:18}{cdfhilmrstw}
{nodeId={node0}{hy7wWQAAQACJS0Cz_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=5934860, task=DisruptableClusterApplierService[[Publication{term=1, version=4}]]]

8: apply new state on master node

] advanceTime: from [5934760ms] to [5934766ms]
] advanceTime: from [5934766ms] to [5934783ms]
] advanceTime: from [5934783ms] to [5934860ms]
] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=5934860, task=DisruptableClusterApplierService[[Publication{term=1, version=4}]]]
] running task 0 of 1: DisruptableClusterApplierService[[Publication{term=1, version=4}]]

I don't think this was affected by #86921 because even the old MockSinglePrioritizingExecutor uses scheduleNow to queue up the task, so that PR didn't change the number of scheduleNow calls. Instead it looks like we added an extra scheduleNow call in #61455.
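
Counting one scheduling delay per numbered stage in the walkthrough above reproduces the 800ms commit duration, which is what motivates the question of whether clusterStateUpdateDelay should now be 8 * delayVariabilityMillis. A hedged restatement of that arithmetic (this is not the literal definition of DEFAULT_CLUSTER_STATE_UPDATE_DELAY in AbstractCoordinatorTestCase):

    // Each of the eight stages traced above costs one 100ms scheduling delay, so a single
    // cluster state update takes 8 * delayVariabilityMillis = 800ms, matching the commit
    // from 5934060ms to 5934860ms in the excerpt. Sketch only, not the constant's definition.
    class ClusterStateUpdateDelaySketch {
        public static void main(String[] args) {
            long delayVariabilityMillis = 100; // fixed delay used in this trace
            String[] stages = {
                "master service task",
                "state publisher task (FakeThreadPoolMasterService#publish)",
                "deliver publish requests",
                "receive publish responses",
                "send commit requests",
                "apply new state on remote nodes",
                "receive commit responses",
                "apply new state on master node"
            };
            long clusterStateUpdateDelay = stages.length * delayVariabilityMillis;
            System.out.println(clusterStateUpdateDelay + "ms"); // prints 800ms
        }
    }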

kingherc added a commit to kingherc/elasticsearch that referenced this issue Sep 22, 2022
kingherc added a commit that referenced this issue Sep 25, 2022
* Add delay to DEFAULT_CLUSTER_STATE_UPDATE_DELAY

Fixes #89867

* Convert to int

* Update server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

Co-authored-by: David Turner <david.turner@elastic.co>

Co-authored-by: David Turner <david.turner@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@gmarouli
Contributor

I am afraid this happened again, see https://gradle-enterprise.elastic.co/s/cwlcjoocvaehg/tests/:server:test/org.elasticsearch.cluster.coordination.CoordinatorTests/testSingleNodeDiscoveryStabilisesEvenWhenDisrupted?top-execution=1. I am reopening this issue; if this is not the right thing to do, let me know and I will create a new one.

gmarouli reopened this Oct 28, 2022
@kingherc
Contributor

kingherc commented Oct 31, 2022

Mary's issue is reproducible. Adding another delayVariabilityMillis to the runFor expression makes it pass. So I delved once more into @DaveCTurner's approach of running the testCanUpdateClusterStateAfterStabilisation test with fixed 100ms delays in the org.elasticsearch.common.util.concurrent.DeterministicTaskQueue#scheduleNow and scheduleAt functions, and I believe we still have 8 delays/stages on main for DEFAULT_CLUSTER_STATE_UPDATE_DELAY. Here is an excerpt:

[2022-10-31T16:47:55,184][INFO ][o.e.c.c.CoordinatorTests ] [testCanUpdateClusterStateAfterStabilisation] --> runFor(800ms) running until [2067930ms]: stabilising
[2022-10-31T16:47:55,184][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] running task 0 of 2: {node1}{lSqmpP__T_-qutLN_____w}: timeout handler for [190][internal:coordination/fault_detection/leader_check]
[2022-10-31T16:47:55,185][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] running task 0 of 1: {node0}{m4mHn___T_-dxn9_AAAAAA}: timeout handler for [204][internal:coordination/fault_detection/follower_check]

1: Master service task

[2022-10-31T16:47:55,185][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067130ms] to [2067230ms]
[2022-10-31T16:47:55,185][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067230, task={node0}{m4mHn___T_-dxn9_AAAAAA}: master service scheduling next task]
{m4mHn___T_-dxn9_AAAAAA}}] register 8 [master] [publish_cluster_state_update] [publication of cluster state [12]]
[2022-10-31T16:47:55,187][TRACE][o.e.c.s.MasterService    ] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] cluster state updated, source [new value [0=3964185402078690287]]
DeferredTask[executionTimeMillis=2067330, task={node0}{m4mHn___T_-dxn9_AAAAAA}: publish change of cluster state from version [11] in term [8] to version [12] in term [8]]

2: state publisher task

[2022-10-31T16:47:55,188][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067230ms] to [2067330ms]
[2022-10-31T16:47:55,188][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067330, task={node0}{m4mHn___T_-dxn9_AAAAAA}: publish change of cluster state from version [11] in term [8] to version [12] in term [8]]
[2022-10-31T16:47:55,206][TRACE][o.e.c.c.LeaderChecker    ] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] setCurrentNodes: [{node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}, {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}]
[2022-10-31T16:47:55,207][TRACE][o.e.c.c.C.CoordinatorPublication] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] publishing PublishRequest{term=8, version=12, state=cluster uuid: tYpV1P__T_-u0UMBAAAAAA [committed: true]
[2022-10-31T16:47:55,208][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=2067430, task={node0}{m4mHn___T_-dxn9_AAAAAA}: ActionRunnable#wrap[handling cluster state version [12] locally on [{node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}]]]

3: deliver publish requests

[2022-10-31T16:47:55,209][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067330ms] to [2067430ms]
[2022-10-31T16:47:55,209][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067430, task={node1}{lSqmpP__T_-qutLN_____w}: [212][internal:cluster/coordination/publish_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]
[2022-10-31T16:47:55,209][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067430, task={node0}{m4mHn___T_-dxn9_AAAAAA}: ActionRunnable#wrap[handling cluster state version [12] locally on [{node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}]]]
[2022-10-31T16:47:55,219][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=2067530, task={node0}{m4mHn___T_-dxn9_AAAAAA}: processing of [214][internal:cluster/coordination/commit_state]: ApplyCommitRequest{term=8, version=12, sourceNode={node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}}]

4: receive publish responses

[2022-10-31T16:47:55,220][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067430ms] to [2067530ms]
[2022-10-31T16:47:55,220][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067530, task={node0}{m4mHn___T_-dxn9_AAAAAA}: response to [212][internal:cluster/coordination/publish_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]
[2022-10-31T16:47:55,223][TRACE][o.e.t.TaskManager        ] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] unregister task for id: 213
[2022-10-31T16:47:55,223][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=2067630, task={node1}{lSqmpP__T_-qutLN_____w}: response to [198][internal:coordination/fault_detection/leader_check] from {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw} to {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}]

5: send commit requests

[2022-10-31T16:47:55,223][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067530ms] to [2067630ms]
[2022-10-31T16:47:55,223][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067630, task={node0}{m4mHn___T_-dxn9_AAAAAA}: delivery of response to [214][internal:cluster/coordination/commit_state]: Empty{}]
[2022-10-31T16:47:55,223][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067630, task={node0}{m4mHn___T_-dxn9_AAAAAA}: response to [213][internal:coordination/fault_detection/follower_check] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]
[2022-10-31T16:47:55,224][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067630, task={node1}{lSqmpP__T_-qutLN_____w}: [215][internal:cluster/coordination/commit_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]
[2022-10-31T16:47:55,225][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=2068730, task={node0}{m4mHn___T_-dxn9_AAAAAA}: FollowerChecker{discoveryNode={node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}, failureCountSinceLastSuccess=0, timeoutCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=3}::handleWakeUp] with extra delay of [100ms]

6: apply new state on remote nodes

[2022-10-31T16:47:55,226][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067630ms] to [2067730ms]
[2022-10-31T16:47:55,226][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067730, task=DisruptableClusterApplierService[[ApplyCommitRequest{term=8, version=12, sourceNode={node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}}]]]
[2022-10-31T16:47:55,226][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] running task 0 of 1: DisruptableClusterApplierService[[ApplyCommitRequest{term=8, version=12, sourceNode={node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw}}]]
[2022-10-31T16:47:55,228][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node1}{lSqmpP__T_-qutLN_____w}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=2067830, task={node0}{m4mHn___T_-dxn9_AAAAAA}: response to [215][internal:cluster/coordination/commit_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]

7: receive commit responses

[2022-10-31T16:47:55,228][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067730ms] to [2067830ms]
[2022-10-31T16:47:55,228][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067830, task={node0}{m4mHn___T_-dxn9_AAAAAA}: response to [215][internal:cluster/coordination/commit_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}]
[2022-10-31T16:47:55,228][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] running task 0 of 1: {node0}{m4mHn___T_-dxn9_AAAAAA}: response to [215][internal:cluster/coordination/commit_state] from {node0}{m4mHn___T_-dxn9_AAAAAA}{0.0.0.0}{0.0.0.0:6}{IScdfhilmrstw} to {node1}{lSqmpP__T_-qutLN_____w}{0.0.0.0}{0.0.0.0:3}{IScdfhilmrstw}
[2022-10-31T16:47:55,229][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] scheduleNow: delaying [100ms], scheduling DeferredTask[executionTimeMillis=2067930, task=DisruptableClusterApplierService[[Publication{term=8, version=12}]]]
[2022-10-31T16:47:55,229][TRACE][o.e.c.c.C.CoordinatorPublication] [testCanUpdateClusterStateAfterStabilisation{nodeId={node0}{m4mHn___T_-dxn9_AAAAAA}}] onPossibleCompletion: [Publication{term=8, version=12}] was successful

8: apply new state on master node

[2022-10-31T16:47:55,229][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: from [2067830ms] to [2067930ms]
[2022-10-31T16:47:55,229][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testCanUpdateClusterStateAfterStabilisation] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=2067930, task=DisruptableClusterApplierService[[Publication{term=8, version=12}]]]
[2022-10-31T16:47:55,229][INFO ][o.e.c.c.CoordinatorTests ] [testCanUpdateClusterStateAfterStabilisation] --> runFor(800ms) completed run until [2067930ms]: stabilising

So I now believe this is a different problem. I compared the output of the failed run with that of a successful one (where I added a delayVariabilityMillis to the runFor expression).

Here is the failed one:

[2022-10-31T20:55:56,985][INFO ][o.e.c.c.CoordinatorTests ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] --> runFor(3140601ms) running until [3140601ms]: stabilising
[2022-10-31T20:55:56,988][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [0ms] to [91ms]
...
[2022-10-31T20:56:01,076][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3073086ms] to [3080013ms]
[2022-10-31T20:56:01,076][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3080013, task={node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed]
[2022-10-31T20:56:01,076][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed
[2022-10-31T20:56:01,076][WARN ][o.e.c.c.JoinHelper       ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] last failed join attempt was 61ms ago, failed to join {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw} with JoinRequest{sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, minimumTerm=13, optionalJoin=Optional[Join{term=14, lastAcceptedTerm=7, lastAcceptedVersion=1, sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, targetNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}}]}
org.elasticsearch.transport.RemoteTransportException: [][0.0.0.0:1][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: incoming term 14 does not match current term 15
	at org.elasticsearch.cluster.coordination.CoordinationState.handleJoin(CoordinationState.java:218) ~[main/:?]
	at org.elasticsearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:1342) ~[main/:?]
	at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
	at org.elasticsearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:753) ~[main/:?]
	at org.elasticsearch.cluster.coordination.Coordinator$1.lambda$onResponse$1(Coordinator.java:612) ~[main/:?]
        ...
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850) ~[randomizedtesting-runner-2.8.0.jar:?]
	at java.lang.Thread.run(Thread.java:1589) ~[?:?]
[2022-10-31T20:56:01,087][WARN ][o.e.c.c.ClusterFormationFailureHelper] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] master not discovered or elected yet, an election requires a node with id [node0], have discovered possible quorum [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]; discovery will continue using [] from hosts providers and [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] from last-known cluster state; node term 15, last-accepted version 1 in term 7; joining [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] in term [15] has status [waiting for response] after [3.4m/208138ms]
[2022-10-31T20:56:01,088][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=3113554, task={node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed] with extra delay of [23541ms]
[2022-10-31T20:56:01,088][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3080013ms] to [3105607ms]
[2022-10-31T20:56:01,088][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3105607, task={node0}{t71ERgAAQACqOqV0AAAAAA}: PeerFinder handling wakeup]
[2022-10-31T20:56:01,089][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: PeerFinder handling wakeup
[2022-10-31T20:56:01,091][TRACE][o.e.d.PeerFinder         ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] probing master nodes from cluster state: [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]
[2022-10-31T20:56:01,092][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=3321985, task={node0}{t71ERgAAQACqOqV0AAAAAA}: PeerFinder handling wakeup] with extra delay of [215378ms]
[2022-10-31T20:56:01,093][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3105607ms] to [3113554ms]
[2022-10-31T20:56:01,093][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3113554, task={node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed]
[2022-10-31T20:56:01,093][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed
[2022-10-31T20:56:01,095][WARN ][o.e.c.c.ClusterFormationFailureHelper] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] master not discovered or elected yet, an election requires a node with id [node0], have discovered possible quorum [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]; discovery will continue using [] from hosts providers and [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] from last-known cluster state; node term 15, last-accepted version 1 in term 7; joining [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] in term [15] has status [waiting for response] after [4m/241679ms]
[2022-10-31T20:56:01,095][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=3269433, task={node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed] with extra delay of [145879ms]
[2022-10-31T20:56:01,095][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3113554ms] to [3135001ms]
[2022-10-31T20:56:01,096][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3135001, task={node0}{t71ERgAAQACqOqV0AAAAAA}: timeout handler for [91][internal:cluster/formation/info]]
[2022-10-31T20:56:01,096][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: timeout handler for [91][internal:cluster/formation/info]
[2022-10-31T20:56:01,096][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3135001ms] to [3150655ms]
[2022-10-31T20:56:01,097][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3150655, task={node0}{t71ERgAAQACqOqV0AAAAAA}: timeout handler for [92][internal:cluster/formation/info]]
[2022-10-31T20:56:01,097][INFO ][o.e.c.c.CoordinatorTests ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] --> runFor(3140601ms) completed run until [3140601ms]: stabilising

And here is the excerpt from the successful run (which simply runs longer):

[2022-10-31T21:16:40,865][INFO ][o.e.c.c.CoordinatorTests ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] --> runFor(3382178ms) running until [3382178ms]: stabilising
[2022-10-31T21:16:40,866][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [0ms] to [91ms]
...
[2022-10-31T21:16:41,804][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3073086ms] to [3080013ms]
[2022-10-31T21:16:41,804][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3080013, task={node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed]
[2022-10-31T21:16:41,804][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: emit warning if cluster not formed
[2022-10-31T21:16:41,804][WARN ][o.e.c.c.JoinHelper       ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] last failed join attempt was 13ms ago, failed to join {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw} with JoinRequest{sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, minimumTerm=13, optionalJoin=Optional[Join{term=14, lastAcceptedTerm=7, lastAcceptedVersion=1, sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, targetNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}}]}
org.elasticsearch.transport.RemoteTransportException: [][0.0.0.0:1][internal:cluster/coordination/join]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: incoming term 14 does not match current term 15
	at org.elasticsearch.cluster.coordination.CoordinationState.handleJoin(CoordinationState.java:218) ~[main/:?]
	at org.elasticsearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:1342) ~[main/:?]
	at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
	at org.elasticsearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:753) ~[main/:?]
...
[2022-10-31T21:16:41,852][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3306117ms] to [3306416ms]
[2022-10-31T21:16:41,852][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3306416, task={node0}{t71ERgAAQACqOqV0AAAAAA}: delivery of response to [95][internal:cluster/coordination/join/ping]: Empty{}]
[2022-10-31T21:16:41,852][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: delivery of response to [95][internal:cluster/coordination/join/ping]: Empty{}
[2022-10-31T21:16:41,852][DEBUG][o.e.c.c.CoordinationState] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] handleJoin: ignored join due to term mismatch (expected: [18], actual: [15])
[2022-10-31T21:16:41,852][DEBUG][o.e.c.c.Coordinator      ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] failed to add Join{term=15, lastAcceptedTerm=7, lastAcceptedVersion=1, sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, targetNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}} - ignoring
org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: incoming term 15 does not match current term 18
	at org.elasticsearch.cluster.coordination.CoordinationState.handleJoin(CoordinationState.java:218) ~[main/:?]
	at org.elasticsearch.cluster.coordination.Coordinator.handleJoinIgnoringExceptions(Coordinator.java:1352) ~[main/:?]
	at org.elasticsearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:1332) ~[main/:?]
	at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
        ...
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850) ~[randomizedtesting-runner-2.8.0.jar:?]
	at java.lang.Thread.run(Thread.java:1589) ~[?:?]
[2022-10-31T21:16:41,855][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3306416ms] to [3321985ms]
[2022-10-31T21:16:41,856][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3321985, task={node0}{t71ERgAAQACqOqV0AAAAAA}: PeerFinder handling wakeup]
[2022-10-31T21:16:41,856][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: PeerFinder handling wakeup
[2022-10-31T21:16:41,856][TRACE][o.e.d.PeerFinder         ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] not active
[2022-10-31T21:16:41,856][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3321985ms] to [3325256ms]
[2022-10-31T21:16:41,856][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3325256, task={node0}{t71ERgAAQACqOqV0AAAAAA}: async initialization of indices lookup for cluster state 2]
[2022-10-31T21:16:41,856][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: async initialization of indices lookup for cluster state 2
[2022-10-31T21:16:41,857][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3325256ms] to [3342304ms]
[2022-10-31T21:16:41,857][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3342304, task=DisruptableClusterApplierService[[joining {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]]]
[2022-10-31T21:16:41,857][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: DisruptableClusterApplierService[[joining {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]]
[2022-10-31T21:16:41,857][DEBUG][o.e.c.s.ClusterApplierService] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] processing [joining {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]: execute
[2022-10-31T21:16:41,857][DEBUG][o.e.c.s.ClusterApplierService] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] processing [joining {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]: took [0s] no change in cluster state
[2022-10-31T21:16:41,857][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [109][internal:cluster/coordination/join] sent to [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] (timeout: [null])
[2022-10-31T21:16:41,857][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [109][internal:cluster/coordination/join] received request
[2022-10-31T21:16:41,857][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleNow: delaying [205799ms], scheduling DeferredTask[executionTimeMillis=3548103, task={node0}{t71ERgAAQACqOqV0AAAAAA}: processing of [109][internal:cluster/coordination/join]: JoinRequest{sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, minimumTerm=15, optionalJoin=Optional[Join{term=16, lastAcceptedTerm=7, lastAcceptedVersion=1, sourceNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, targetNode={node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}}]}]
[2022-10-31T21:16:41,858][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3342304ms] to [3359474ms]
[2022-10-31T21:16:41,858][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3359474, task={node0}{t71ERgAAQACqOqV0AAAAAA}: org.elasticsearch.cluster.coordination.CoordinationDiagnosticsService$$Lambda$1189/0x000000080148e7a8@78688d8a]
[2022-10-31T21:16:41,858][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] running task 0 of 1: {node0}{t71ERgAAQACqOqV0AAAAAA}: org.elasticsearch.cluster.coordination.CoordinationDiagnosticsService$$Lambda$1189/0x000000080148e7a8@78688d8a
[2022-10-31T21:16:41,859][TRACE][o.e.c.c.CoordinationDiagnosticsService] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] Opened connection to {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}, making transport request
[2022-10-31T21:16:41,859][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=3440500, task={node0}{t71ERgAAQACqOqV0AAAAAA}: timeout handler for [110][internal:cluster/formation/info]] with extra delay of [71026ms]
[2022-10-31T21:16:41,859][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [110][internal:cluster/formation/info] sent to [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}] (timeout: [10s])
[2022-10-31T21:16:41,859][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [110][internal:cluster/formation/info] received request
[2022-10-31T21:16:41,859][TRACE][o.e.t.TaskManager        ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] register 109 [direct] [internal:cluster/formation/info] []
[2022-10-31T21:16:41,859][TRACE][o.e.t.TaskManager        ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] unregister task for id: 109
[2022-10-31T21:16:41,859][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [110][internal:cluster/formation/info] sent response
[2022-10-31T21:16:41,859][TRACE][o.e.t.T.tracer           ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] [110][internal:cluster/formation/info] received response from [{node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw}]
[2022-10-31T21:16:41,860][TRACE][o.e.c.c.CoordinationDiagnosticsService] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] Received remote response from {node0}{t71ERgAAQACqOqV0AAAAAA}{0.0.0.0}{0.0.0.0:1}{IScdfhilmrstw} in 44.3ms
[2022-10-31T21:16:41,860][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted{nodeId={node0}{t71ERgAAQACqOqV0AAAAAA}}] scheduleAt: adding DeferredTask[executionTimeMillis=3587329, task={node0}{t71ERgAAQACqOqV0AAAAAA}: org.elasticsearch.cluster.coordination.CoordinationDiagnosticsService$$Lambda$1189/0x000000080148e7a8@72067920] with extra delay of [217855ms]
[2022-10-31T21:16:41,860][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: from [3359474ms] to [3402394ms]
[2022-10-31T21:16:41,860][TRACE][o.e.c.u.c.DeterministicTaskQueue] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] advanceTime: no longer deferred: DeferredTask[executionTimeMillis=3402394, task={node0}{t71ERgAAQACqOqV0AAAAAA}: publish change of cluster state from version [1] in term [7] to version [2] in term [18]]
[2022-10-31T21:16:41,860][INFO ][o.e.c.c.CoordinatorTests ] [testSingleNodeDiscoveryStabilisesEvenWhenDisrupted] --> runFor(3382178ms) completed run until [3382178ms]: stabilising

From my rather inexperienced understanding, and looking at the comparisons above, I believe this specific seed keeps the test cluster unstable for much longer than the test's runFor() formula allows for. I say this because Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: incoming term 14 does not match current term 15 is the last thing logged in the failed run, but the same exception also appears in the middle of the successful run (at exactly the same spot, advanceTime: from [3073086ms] to [3080013ms], which hints that both runs had done the same steps up to that point); the successful run then goes on through more failing terms until it finally succeeds.

I hope @DaveCTurner may be able to shed more light on how to continue on this one.

@DaveCTurner
Contributor Author

DaveCTurner commented Nov 2, 2022

This seems to be a genuine (but very low-impact) bug. In the failing run the delay variability is set to a massive 241577ms (i.e. just over 4 minutes). The ElectionScheduler triggers elections much more frequently than this, so there are lots of concurrent elections in flight. Many of these elections pass the request_pre_vote stage and increment the term, which causes earlier elections to fail, but they are in turn superseded by a later election before they can complete. It's something of a coincidence that one more delayVariabilityMillis is enough to complete an election; in theory it could take many more rounds.
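To make the interaction concrete, here is a hedged toy model (not the real ElectionScheduler or PreVoteCollector code, just the shape of the race): every newly scheduled election bumps the term during pre-voting, which dooms whichever election was still in flight.

// Hedged toy model only; names and structure do not match the real Coordinator.
final class ConcurrentElectionSketch {
    public static void main(String[] args) {
        long currentTerm = 7;
        long inFlightElectionTerm = -1;  // term of the election currently trying to complete
        for (int round = 1; round <= 3; round++) {
            long newElectionTerm = ++currentTerm;  // a new election's pre-vote succeeds and bumps the term
            if (inFlightElectionTerm != -1) {
                // the earlier election's join now carries a stale term and is rejected,
                // e.g. "incoming term 14 does not match current term 15"
                System.out.println("election in term " + inFlightElectionTerm + " fails; current term is " + currentTerm);
            }
            inFlightElectionTerm = newElectionTerm;  // and this one will be superseded next round
        }
    }
}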

However, when using single-node discovery there should be no need to schedule multiple elections like this: only one node is involved, so the first election should succeed. Maybe something like this is enough?

diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
index 33e3528246d..7d9b0b666e3 100644
--- a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
+++ b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
@@ -1628,6 +1628,10 @@ public class Coordinator extends AbstractLifecycleComponent implements ClusterSt
                         }

                         if (prevotingRound != null) {
+                            if (singleNodeDiscovery) {
+                                logger.debug("skip prevoting as single-node discovery is enabled and prevoting is already in progress");
+                                return;
+                            }
                             prevotingRound.close();
                         }
                         prevotingRound = preVoteCollector.start(lastAcceptedState, getDiscoveredNodes());

@kingherc
Contributor

kingherc commented Nov 2, 2022

@DaveCTurner wow, nice, thanks for taking a look and understanding it so quickly! I just tried that and it works. I will open a PR. I'm a bit confused by the debug message skip prevoting as single-node discovery is enabled and prevoting is already in progress, because it says we are skipping something that is already in progress. I will take it as is, but feel free to tell me if it could be formulated better.

@DaveCTurner
Contributor Author

I didn't put much thought into the message, feel free to adjust as you see fit :)

kingherc added a commit to kingherc/elasticsearch that referenced this issue Nov 2, 2022
@idegtiarenko
Contributor

@DaveCTurner
Contributor Author

Thanks Ievgen, this looks to be a problem caused by #91255.

@kingherc
Contributor

kingherc commented Nov 4, 2022

This ticket is haunted. You fix one thing, another thing pops up :)
@dct, is there any reason you believe the PR caused the new failure?
I will try to reproduce it now.

@DaveCTurner
Contributor Author

Yes, the problem in this failure case is that an election happens within runRandomly but fails due to a (simulated) IO exception. We don't have a way to remove prevotingRound when this happens, so the change we made in #91255 just blocks all future election attempts.
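To illustrate why that is a problem, here is a minimal sketch of the failure mode; the names are only loosely based on the real Coordinator, and the guard shown is the one added in #91255:

// Hedged sketch, not the actual Coordinator: nothing clears prevotingRound when an
// election attempt fails, so the single-node-discovery guard short-circuits forever.
final class ElectionGuardSketch {
    private Object prevotingRound;  // non-null while a pre-vote round is considered in flight
    private final boolean singleNodeDiscovery = true;

    void startElection() {
        if (prevotingRound != null) {
            if (singleNodeDiscovery) {
                return;  // guard from #91255: skip, assuming the existing round will finish
            }
            prevotingRound = null;  // multi-node path: drop the stale round and start afresh
        }
        prevotingRound = new Object();  // start a new pre-vote round
        // ... if this round now fails with a (simulated) IOException, prevotingRound is never
        // cleared, so every later startElection() call returns early in the guard above.
    }
}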

@kingherc
Contributor

kingherc commented Nov 4, 2022

I cannot reproduce the new failure on main (I ran 10 iterations of it). @DaveCTurner, how did you understand all that? :) I cannot see any logs in the test failure.

@kingherc
Contributor

kingherc commented Nov 4, 2022

we don't have a way to remove prevotingRound when this happens

Is this something that can happen in reality, meaning we have introduced a new bug?

@DaveCTurner
Contributor Author

I cannot reproduce the new failure on main

The REPRODUCE WITH line works on the feature/desired-balance-allocator branch (but not on main).

Is this something that can happen in reality and thus we introduced a new bug?

Yes I think so.

@DaveCTurner
Contributor Author

DaveCTurner commented Nov 4, 2022

I think we should revert #91255 because the failure it introduces seems much more frequent than the one it fixes (a few hundred iterations vs a few thousand)

@kingherc
Contributor

kingherc commented Nov 4, 2022

BTW, I also reproduced it in main. It did not take many iterations before it popped up.

I was thinking about how to solve both issues. I think either:

    we revert #91255 and instead stabilise the test some other way (for example by adjusting its timeouts), or
    we keep #91255 but clear prevotingRound whenever an election attempt fails, so that later attempts are not blocked.

@DaveCTurner what do you think? Guessing from your last comment, would the first approach be worth trying?

@DaveCTurner
Contributor Author

I think the second approach would be more robust, but also much more effort because the plumbing for that failure handling just doesn't exist today. And this is a super-minor bug that almost certainly goes away eventually with enough retries, so it is not really worth the effort.

Therefore I favour the first. I've just pushed bdf3485 which reverts #91255. I'll also push a commit which mutes this test while we continue to work on it.

@DaveCTurner DaveCTurner reopened this Nov 4, 2022
@DaveCTurner
Contributor Author

I think the best setting to adjust would be cluster.election.duration, which represents a kind of election timeout after which we start thinking about scheduling the next election. Today it cannot be set greater than 5 minutes, but with the delay variability in this test sometimes going up to 10 minutes we'd need a much longer maximum. I think that's ok.
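For reference, a hedged sketch of what relaxing that ceiling might look like; the real definition lives in ElectionSchedulerFactory, and the default, minimum, and relaxed maximum shown here are illustrative rather than the values that were actually merged:

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.core.TimeValue;

final class RelaxedElectionDurationSketch {
    // Hedged sketch: the relaxed maximum below is purely illustrative.
    static final Setting<TimeValue> ELECTION_DURATION = Setting.timeSetting(
        "cluster.election.duration",
        TimeValue.timeValueMillis(500),  // default
        TimeValue.timeValueMillis(1),    // minimum
        TimeValue.timeValueHours(3),     // relaxed maximum (previously capped at 5 minutes)
        Setting.Property.NodeScope
    );
}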

@kingherc
Contributor

kingherc commented Nov 4, 2022

I can test this.

I think that's ok.

You mean it's OK to set the delay variability to up to 5 mins instead of 10 mins?

@DaveCTurner
Contributor Author

No, I mean it should be ok to relax the upper bound on cluster.election.duration. The election takes multiple steps, so it needs to be much longer than delayVariabilityMillis - I reckon something like delayVariabilityMillis + 4*delayVariabilityMillis + clusterStateUpdateDelay, which adds up to 2h10min.

@kingherc
Contributor

kingherc commented Nov 4, 2022

@DaveCTurner, so should I change the maximum of cluster.election.duration from 5 minutes to 10 minutes in production? Or is there a way I can change the maximum just for the test?

@kingherc
Contributor

kingherc commented Nov 4, 2022

By the way, I notice that cluster.election.max_timeout also has a maximum of 5 minutes. Is that related to the election duration's maximum value? If so, should I also change max_timeout's maximum to 10 minutes?

@kingherc
Contributor

kingherc commented Nov 4, 2022

FYI, setting cluster.election.duration to delayVariabilityMillis when the latter is bigger than the default duration worked for the one example I had. I will now re-run a lot of iterations to make sure it is finally stable.
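A hedged sketch of the shape of that adjustment (the helper method and the assumed 500ms default are mine, not the merged test change):

import org.elasticsearch.common.settings.Settings;

final class ElectionDurationAdjustmentSketch {
    // Hedged sketch: only bump cluster.election.duration when the simulated delay
    // variability exceeds the assumed default, so a single election round can finish
    // before the scheduler gives up on it.
    static Settings nodeSettings(long delayVariabilityMillis) {
        long defaultElectionDurationMillis = 500;  // assumed default of cluster.election.duration
        Settings.Builder builder = Settings.builder();
        if (delayVariabilityMillis > defaultElectionDurationMillis) {
            builder.put("cluster.election.duration", delayVariabilityMillis + "ms");
        }
        return builder.build();
    }
}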

@kingherc
Contributor

kingherc commented Nov 4, 2022

OK, after many iterations (>1000) it looks good. I will open the PR.

kingherc added a commit to kingherc/elasticsearch that referenced this issue Nov 4, 2022
By increasing the voting duration in case of high
delays, to avoid the possible endless repetition
of voting rounds.

Fixes elastic#89867
kingherc added a commit that referenced this issue Nov 7, 2022
By increasing the voting duration in case of high
delays, to avoid the possible endless repetition
of voting rounds.

Fixes #89867