[CI] CoordinatorTests testUnhealthyLeaderIsReplaced failing #90158

original-brownbear · 2022-09-20T08:17:00Z

Build scan:
https://gradle-enterprise.elastic.co/s/j4xfvozocqmng/tests/:server:test/org.elasticsearch.cluster.coordination.CoordinatorTests/testUnhealthyLeaderIsReplaced

Reproduction line:
./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testUnhealthyLeaderIsReplaced" -Dtests.seed=3E136AAE5154F966 -Dtests.locale=ro -Dtests.timezone=Asia/Aqtau -Druntime.java=17

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.cluster.coordination.CoordinatorTests&tests.test=testUnhealthyLeaderIsReplaced

Failure excerpt:

java.lang.AssertionError: leaders
Expected: not an empty collection
     but: was <[]>

  at __randomizedtesting.SeedInfo.seed([3E136AAE5154F966:80597ABFDE528E40]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.getAnyLeader(AbstractCoordinatorTestCase.java:791)
  at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:563)
  at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:548)
  at org.elasticsearch.cluster.coordination.CoordinatorTests.testUnhealthyLeaderIsReplaced(CoordinatorTests.java:640)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

The text was updated successfully, but these errors were encountered:

elasticsearchmachine · 2022-09-20T08:17:23Z

Pinging @elastic/es-distributed (Team:Distributed)

kingherc · 2022-09-20T08:26:50Z

This reminds me of #89867 's error, but it's a different test.

Mute flaky test `CoordinatorTests#testUnhealthyLeaderIsReplaced` Relates to #90158

kingherc · 2022-09-25T10:21:01Z

This has been showing frequently in last 7 days. I believe the solution of #89867 which was just merged may solve this as well. I will monitor the related build failures in next 7 days to see if it was solved.

kingherc · 2022-09-25T10:21:40Z

Oh I may need to re-enable the test, will make a PR for it.

kingherc · 2022-09-26T08:02:36Z

I just tried to reproduce this, and it still fails in main. So the fix for #89867 does not seem related. I will handle this in due time.

kingherc · 2022-11-07T18:36:55Z

So, I've managed to figure out which PR introduced the issue by running many iterations of the test. Specifically it seems to be this line of this PR. If I change the line to this.empty = false;, then the test works (which means it passes many iterations >100 successfully, while previously it failed after 10 or so iterations). I've yet to understand if there's some bug in that PR, so far I believe no.

The main difference the PR seems to have introduced is that previously we were making a copy of the Metadata, while now we may be returning the same instance of Metadata, but I am not sure if that's an issue or not. @original-brownbear since it seems you authored that PR, might you understand why it makes the test fail?

original-brownbear · 2022-11-08T09:12:39Z

@kingherc hard to say how this got broken by my change without digging into it deeper. I can see a couple of spots where metadata instance equality is checked in AbstractCoordinatorTestCase. My guess would be that one of those made assumptions about the metadata instance changing that now don't hold true where we don't needlessly rebuild the instance when there's no changes. That's the direction I'd investigate here. Let me know if you want me to take a deeper look.

DaveCTurner · 2022-12-09T15:14:58Z

This looks to be a bad bug in the vicinity of adjusting the voting configuration, which is a pretty dangerous area to have bugs. See ~~#92258~~ #92259 for an explanation.

The cluster coordination consistency layer relies on a couple of fields within `Metadata` which record the last _committed_ values on each node. In contrast, the rest of the cluster state can only be changed at _accept_ time. In the past we would copy these fields over from the master on every publication, but since #90101 we don't copy anything at all if the `Metadata` is unchanged on the master. However, the master computes the diff against the last _committed_ state whereas the receiving nodes apply the diff to the last _accepted_ state, and this means if the master sends a no-op `Metadata` diff then the receiving node will revert its last-committed values to the ones included in the state it last accepted. With this commit we include the last-committed values alongside the cluster state diff so that they are always copied properly. Closes #90158

The cluster coordination consistency layer relies on a couple of fields within `Metadata` which record the last _committed_ values on each node. In contrast, the rest of the cluster state can only be changed at _accept_ time. In the past we would copy these fields over from the master on every publication, but since elastic#90101 we don't copy anything at all if the `Metadata` is unchanged on the master. However, the master computes the diff against the last _committed_ state whereas the receiving nodes apply the diff to the last _accepted_ state, and this means if the master sends a no-op `Metadata` diff then the receiving node will revert its last-committed values to the ones included in the state it last accepted. With this commit we include the last-committed values alongside the cluster state diff so that they are always copied properly. Closes elastic#90158 Backport of elastic#92259 to 8.6

The cluster coordination consistency layer relies on a couple of fields within `Metadata` which record the last _committed_ values on each node. In contrast, the rest of the cluster state can only be changed at _accept_ time. In the past we would copy these fields over from the master on every publication, but since #90101 we don't copy anything at all if the `Metadata` is unchanged on the master. However, the master computes the diff against the last _committed_ state whereas the receiving nodes apply the diff to the last _accepted_ state, and this means if the master sends a no-op `Metadata` diff then the receiving node will revert its last-committed values to the ones included in the state it last accepted. With this commit we include the last-committed values alongside the cluster state diff so that they are always copied properly. Closes #90158 Backport of #92259 to 8.6

original-brownbear added :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI labels Sep 20, 2022

elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 20, 2022

kingherc self-assigned this Sep 20, 2022

csoulios mentioned this issue Sep 21, 2022

[CI] Mute test CoordinatorTests#testUnhealthyLeaderIsReplaced #90180

Merged

elasticsearchmachine pushed a commit that referenced this issue Sep 21, 2022

Mute failing test (#90180)

c39a623

Mute flaky test `CoordinatorTests#testUnhealthyLeaderIsReplaced` Relates to #90158

DaveCTurner mentioned this issue Dec 11, 2022

Include last-committed data in publication #92259

Merged

elasticsearchmachine closed this as completed in #92259 Dec 12, 2022

DaveCTurner mentioned this issue Dec 12, 2022

Include last-committed data in publication #92277

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] CoordinatorTests testUnhealthyLeaderIsReplaced failing #90158

[CI] CoordinatorTests testUnhealthyLeaderIsReplaced failing #90158

original-brownbear commented Sep 20, 2022

elasticsearchmachine commented Sep 20, 2022

kingherc commented Sep 20, 2022

kingherc commented Sep 25, 2022

kingherc commented Sep 25, 2022

kingherc commented Sep 26, 2022

kingherc commented Nov 7, 2022

original-brownbear commented Nov 8, 2022

DaveCTurner commented Dec 9, 2022 •

edited

Loading

[CI] CoordinatorTests testUnhealthyLeaderIsReplaced failing #90158

[CI] CoordinatorTests testUnhealthyLeaderIsReplaced failing #90158

Comments

original-brownbear commented Sep 20, 2022

elasticsearchmachine commented Sep 20, 2022

kingherc commented Sep 20, 2022

kingherc commented Sep 25, 2022

kingherc commented Sep 25, 2022

kingherc commented Sep 26, 2022

kingherc commented Nov 7, 2022

original-brownbear commented Nov 8, 2022

DaveCTurner commented Dec 9, 2022 • edited Loading

DaveCTurner commented Dec 9, 2022 •

edited

Loading