Corrupted WAL and snapshot restoring process #10219
Comments
Is this reproducible with current master?
Yes, you can reproduce this with the latest commit 5837632.
I would appreciate it if someone could confirm this and share ideas on how it could be fixed.
I am aware of this issue. Would you like to spend some time to get it fixed?
Sure thing, but if you already have an idea to check or a direction to dig in, I could try that.
Added code that passed my local failpoints test. Could you please check whether this approach makes sense? P.S. Unit tests are broken because of interface changes in
Tests now pass. To summarise what was done:
@brk0v Thanks. I will give this a careful look over the next couple of weeks.
cc @jpbetz
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
etcdserver/*, wal/*: changes to snapshots and wal logic
etcdserver/*: changes to snapshots and wal logic to fix #10219
etcdserver/*, wal/*: add Sync method
etcdserver/*, wal/*: find valid snapshots by cross checking snap files and wal snap entries
etcdserver/*, wal/*: Add comments, clean up error messages and tests
etcdserver/*, wal/*: Remove orphaned .snap.db files during Release
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
ref. #10219 Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Update raftexample to save the snapshot file and the WAL snapshot entry before the hard state, to ensure the snapshot exists during recovery. Otherwise, if there is a failure after storing the hard state, the WAL may contain a reference to a non-existent snapshot. This PR introduces the fix from etcd-io#10219 to the raftexample.
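The ordering described above can be illustrated with a small toy model. This is only a sketch, not the raftexample or etcd code: the `disk` type, `saveReady`, `recoverable`, and the `crashAfter` parameter are hypothetical names introduced here, and log entries are left out to keep the model small. It simulates a crash after each step and checks that the saved hard state never refers to a snapshot that is missing.

```go
package main

import (
	"errors"
	"fmt"
)

// disk is a toy stand-in for durable storage (hypothetical, not an etcd type).
type disk struct {
	snapFile     uint64 // index of the snapshot file on disk
	walSnapEntry uint64 // index recorded by the WAL snapshot entry
	commit       uint64 // Commit from the last saved hard state
}

var errCrash = errors.New("simulated crash")

// saveReady persists an incoming snapshot and hard state in the order the
// fix describes: snapshot file, WAL snapshot entry, then hard state.
// crashAfter stops the sequence once that many steps have completed;
// a value of 3 or more lets every step run.
func saveReady(d *disk, snapIndex, commit uint64, crashAfter int) error {
	steps := []func(){
		func() { d.snapFile = snapIndex },
		func() { d.walSnapEntry = snapIndex },
		func() { d.commit = commit },
	}
	for i, step := range steps {
		if i == crashAfter {
			return errCrash
		}
		step()
	}
	return nil
}

// recoverable reports whether the hard state refers only to data that
// exists on disk. Log entries are ignored in this toy model.
func recoverable(d disk) bool {
	return d.commit <= d.snapFile && d.commit <= d.walSnapEntry
}

func main() {
	for crashAfter := 0; crashAfter <= 3; crashAfter++ {
		d := disk{snapFile: 10, walSnapEntry: 10, commit: 10}
		err := saveReady(&d, 500, 500, crashAfter)
		fmt.Printf("crashAfter=%d err=%v state=%+v recoverable=%v\n",
			crashAfter, err, d, recoverable(d))
	}
}
```

Because the hard state is written last, a crash at any earlier point leaves the old commit index in place, which the existing snapshot still covers.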
Issue
A node that was offline for more than max(SnapshotCount, DefaultSnapshotCatchUpEntries) entries corrupts its WAL with a bad HardState.Commit number if it is killed right after the HardState was saved to non-volatile storage (failpoint: raftBeforeSaveSnap).
Specific
Version: master
Environment: any (tested on Linux, MacOS X)
Steps to reproduce
Procfile with a failpoint raftBeforeSaveSnap for the etcd2 node:
From now on, the WAL on the etcd2 node is corrupted. It was saved with a HardState entry that contains the Commit number from the snapshot, but the snapshot was never saved to the WAL and disk.
WAL is corrupted.
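To make the inconsistency concrete, here is a minimal sketch of the state such a node finds on restart. It is a toy model, not etcd's recovery code: the `persisted` type, its fields, `checkCommit`, and the index values 10 and 500 are all assumptions made for illustration.

```go
package main

import "fmt"

// persisted is a toy model of what a node finds on disk after a restart.
// The type and field names are illustrative, not etcd types.
type persisted struct {
	snapshotIndex uint64 // index covered by the newest snapshot on disk (0 if none)
	lastEntry     uint64 // index of the last log entry in the WAL
	commit        uint64 // Commit field of the last HardState record in the WAL
}

// checkCommit returns an error when the stored commit index points past
// everything that can be recovered from the snapshot plus the WAL entries.
func checkCommit(p persisted) error {
	last := p.lastEntry
	if p.snapshotIndex > last {
		last = p.snapshotIndex
	}
	if p.commit > last {
		return fmt.Errorf("commit %d is out of range, last recoverable index is %d", p.commit, last)
	}
	return nil
}

func main() {
	// The node was at index 10, received a snapshot at index 500, persisted
	// HardState{Commit: 500}, and was killed before the snapshot was saved.
	crashed := persisted{snapshotIndex: 0, lastEntry: 10, commit: 500}
	fmt.Println("crashed:", checkCommit(crashed))

	// The same node if the snapshot had reached disk before the hard state.
	healthy := persisted{snapshotIndex: 500, lastEntry: 10, commit: 500}
	fmt.Println("healthy:", checkCommit(healthy))
}
```

In the crashed case the commit index has nothing behind it, which corresponds to the corrupted WAL described above.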
Error: