Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] CoordinatorTests testUnhealthyLeaderIsReplaced failing #90158

Closed
original-brownbear opened this issue Sep 20, 2022 · 8 comments · Fixed by #92259
Closed

[CI] CoordinatorTests testUnhealthyLeaderIsReplaced failing #90158

original-brownbear opened this issue Sep 20, 2022 · 8 comments · Fixed by #92259
Assignees
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

@original-brownbear
Copy link
Member

Build scan:
https://gradle-enterprise.elastic.co/s/j4xfvozocqmng/tests/:server:test/org.elasticsearch.cluster.coordination.CoordinatorTests/testUnhealthyLeaderIsReplaced

Reproduction line:
./gradlew ':server:test' --tests "org.elasticsearch.cluster.coordination.CoordinatorTests.testUnhealthyLeaderIsReplaced" -Dtests.seed=3E136AAE5154F966 -Dtests.locale=ro -Dtests.timezone=Asia/Aqtau -Druntime.java=17

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.cluster.coordination.CoordinatorTests&tests.test=testUnhealthyLeaderIsReplaced

Failure excerpt:

java.lang.AssertionError: leaders
Expected: not an empty collection
     but: was <[]>

  at __randomizedtesting.SeedInfo.seed([3E136AAE5154F966:80597ABFDE528E40]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.getAnyLeader(AbstractCoordinatorTestCase.java:791)
  at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:563)
  at org.elasticsearch.cluster.coordination.AbstractCoordinatorTestCase$Cluster.stabilise(AbstractCoordinatorTestCase.java:548)
  at org.elasticsearch.cluster.coordination.CoordinatorTests.testUnhealthyLeaderIsReplaced(CoordinatorTests.java:640)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

@original-brownbear original-brownbear added :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI labels Sep 20, 2022
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 20, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@kingherc
Copy link
Contributor

This reminds me of #89867 's error, but it's a different test.

@kingherc kingherc self-assigned this Sep 20, 2022
elasticsearchmachine pushed a commit that referenced this issue Sep 21, 2022
Mute flaky test `CoordinatorTests#testUnhealthyLeaderIsReplaced`

Relates to #90158
@kingherc
Copy link
Contributor

This has been showing frequently in last 7 days. I believe the solution of #89867 which was just merged may solve this as well. I will monitor the related build failures in next 7 days to see if it was solved.

@kingherc
Copy link
Contributor

Oh I may need to re-enable the test, will make a PR for it.

@kingherc
Copy link
Contributor

I just tried to reproduce this, and it still fails in main. So the fix for #89867 does not seem related. I will handle this in due time.

@kingherc
Copy link
Contributor

kingherc commented Nov 7, 2022

So, I've managed to figure out which PR introduced the issue by running many iterations of the test. Specifically it seems to be this line of this PR. If I change the line to this.empty = false;, then the test works (which means it passes many iterations >100 successfully, while previously it failed after 10 or so iterations). I've yet to understand if there's some bug in that PR, so far I believe no.

The main difference the PR seems to have introduced is that previously we were making a copy of the Metadata, while now we may be returning the same instance of Metadata, but I am not sure if that's an issue or not. @original-brownbear since it seems you authored that PR, might you understand why it makes the test fail?

@original-brownbear
Copy link
Member Author

@kingherc hard to say how this got broken by my change without digging into it deeper. I can see a couple of spots where metadata instance equality is checked in AbstractCoordinatorTestCase. My guess would be that one of those made assumptions about the metadata instance changing that now don't hold true where we don't needlessly rebuild the instance when there's no changes. That's the direction I'd investigate here. Let me know if you want me to take a deeper look.

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Dec 9, 2022

This looks to be a bad bug in the vicinity of adjusting the voting configuration, which is a pretty dangerous area to have bugs. See #92258 #92259 for an explanation.

elasticsearchmachine pushed a commit that referenced this issue Dec 12, 2022
The cluster coordination consistency layer relies on a couple of fields
within `Metadata` which record the last _committed_ values on each node.
In contrast, the rest of the cluster state can only be changed at
_accept_ time.

In the past we would copy these fields over from the master on every
publication, but since #90101 we don't copy anything at all if the
`Metadata` is unchanged on the master. However, the master computes the
diff against the last _committed_ state whereas the receiving nodes
apply the diff to the last _accepted_ state, and this means if the
master sends a no-op `Metadata` diff then the receiving node will revert
its last-committed values to the ones included in the state it last
accepted.

With this commit we include the last-committed values alongside the
cluster state diff so that they are always copied properly.

Closes #90158
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Dec 12, 2022
The cluster coordination consistency layer relies on a couple of fields
within `Metadata` which record the last _committed_ values on each node.
In contrast, the rest of the cluster state can only be changed at
_accept_ time.

In the past we would copy these fields over from the master on every
publication, but since elastic#90101 we don't copy anything at all if the
`Metadata` is unchanged on the master. However, the master computes the
diff against the last _committed_ state whereas the receiving nodes
apply the diff to the last _accepted_ state, and this means if the
master sends a no-op `Metadata` diff then the receiving node will revert
its last-committed values to the ones included in the state it last
accepted.

With this commit we include the last-committed values alongside the
cluster state diff so that they are always copied properly.

Closes elastic#90158
Backport of elastic#92259 to 8.6
DaveCTurner added a commit that referenced this issue Dec 12, 2022
The cluster coordination consistency layer relies on a couple of fields
within `Metadata` which record the last _committed_ values on each node.
In contrast, the rest of the cluster state can only be changed at
_accept_ time.

In the past we would copy these fields over from the master on every
publication, but since #90101 we don't copy anything at all if the
`Metadata` is unchanged on the master. However, the master computes the
diff against the last _committed_ state whereas the receiving nodes
apply the diff to the last _accepted_ state, and this means if the
master sends a no-op `Metadata` diff then the receiving node will revert
its last-committed values to the ones included in the state it last
accepted.

With this commit we include the last-committed values alongside the
cluster state diff so that they are always copied properly.

Closes #90158
Backport of #92259 to 8.6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants