[CI] DedicatedClusterSnapshotRestoreIT testSnapshotWithStuckNode failed #39852
Pinging @elastic/es-distributed
* The test failure in elastic#39852 is caused by a file in the initial repository when there should not be any.
* On a normal, consistent file system no left-over file should ever exist here after the validation finishes, and I can't reproduce or see any other path to a dangling file in the fresh repository => added a more verbose and strict assertion that will log which file is left over next time.
* Relates elastic#39852
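As a rough illustration of what a "more verbose and strict" assertion could look like, here is a minimal sketch using only `java.nio.file`. It is not the code from the actual change: the class `RepoAssertions` and its methods are hypothetical names, and the repository is assumed to be a plain directory on the local file system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Elasticsearch.
class RepoAssertions {

    /** Lists every regular file below root, as sorted paths relative to root. */
    static List<String> listFiles(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                .map(p -> root.relativize(p).toString())
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** Fails if any file exists below root and names the offending files in the message. */
    static void assertNoLeftoverFiles(Path root) throws IOException {
        List<String> files = listFiles(root);
        if (!files.isEmpty()) {
            throw new AssertionError(
                "expected no files in repository [" + root + "] but found " + files);
        }
    }
}
```

Compared with asserting only on a file count, reporting the file names shows on the next CI failure whether the left-over file is a data blob, an index file, or something else.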
It's a little tricky to tell where this is coming from without better logging/asserting => I opened #39893 for that.
This test failed again, https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=openjdk12,nodes=immutable&&linux&&docker/81/console. It does not reproduce locally for me. Reproduce line:
The failure with the additional output added by @original-brownbear is:
Output from testSnapshotWithStuckNode:
Thanks @jaymode! Looks like we're leaving a dangling data file behind here. That's kind of interesting:
shouldn't be there. Looking into it shortly.
I tracked this down now. The problem is that we are deleting a snapshot while a data node is still writing shard files (when it should already be aborted).
It sounds like @original-brownbear has enough logging to identify the issue, so I've muted the test in 1de2a25 while we're working on a fix.
This failed in the 7.0 branch today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+multijob-unix-compatibility/os=centos-6/199/console
Yea, this unfortunately is a pretty fundamental issue with the way the blob store repository works. A fix is incoming in #42189, but until that is in we should probably just mute this test wherever it comes up.
* See comment in the test: the problem is that when the snapshot delete only partially completes on master failover and the retry fails with `SnapshotMissingException`, no repository cleanup is run => we still failed even with the repo cleanup logic now in the delete path.
* Fixed the test by rerunning a create snapshot and delete loop to clean up the repo before verifying file counts.
* Closes elastic#39852
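The failure mode and the workaround described in this commit message can be sketched with a toy in-memory model. This is not Elasticsearch code: `ToyRepository` and all of its methods are hypothetical, `IllegalStateException` stands in for `SnapshotMissingException`, and the model assumes that a successful delete removes every blob no longer referenced by any snapshot.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy in-memory model; every name here is hypothetical.
class ToyRepository {
    /** Repository metadata: snapshot name -> blobs it references. */
    final Map<String, Set<String>> snapshots = new HashMap<>();
    /** Every blob physically present in the repository. */
    final Set<String> blobs = new HashSet<>();

    void createSnapshot(String name, Set<String> files) {
        snapshots.put(name, new HashSet<>(files));
        blobs.addAll(files);
    }

    /** Partial delete: metadata is updated, then the master fails over before blobs are removed. */
    void deleteInterruptedByFailover(String name) {
        snapshots.remove(name);
    }

    /** Full delete: throws for an unknown snapshot, otherwise removes all unreferenced blobs. */
    void deleteSnapshot(String name) {
        if (snapshots.remove(name) == null) {
            // Stands in for SnapshotMissingException: the cleanup below is never reached.
            throw new IllegalStateException("snapshot missing: " + name);
        }
        Set<String> referenced = new HashSet<>();
        snapshots.values().forEach(referenced::addAll);
        blobs.retainAll(referenced);
    }

    /** Workaround from the test fix: one create/delete cycle triggers the cleanup. */
    void cleanupByCreateDeleteCycle() {
        createSnapshot("cleanup-snapshot", Set.of("cleanup-file"));
        deleteSnapshot("cleanup-snapshot");
    }
}
```

In this model, retrying the interrupted delete throws and leaves the stale blobs in place, while a later create/delete cycle reaches the cleanup step and removes them, which is why the test runs such a cycle before verifying file counts.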
org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT testSnapshotWithStuckNode
failed in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java11,nodes=immutable&&linux&&docker/60/console
The error was:
The REPRO command was:
(Note compiled with Java 12.)
The problem did not reproduce locally for me.