Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corrupted WAL and snapshot restoring process #10219

Closed
brk0v opened this issue Oct 26, 2018 · 11 comments · Fixed by #11888
Closed

Corrupted WAL and snapshot restoring process #10219

brk0v opened this issue Oct 26, 2018 · 11 comments · Fixed by #11888
Assignees

Comments

@brk0v
Copy link

brk0v commented Oct 26, 2018

Issue

Node that was offline more than max(SnapshotCount, DefaultSnapshotCatchUpEntries) corrupts its WAL log with bad HardState.Commitnumber if it's killed right after HardState was saved to non-volatile storage (failpoint: raftBeforeSaveSnap).

Specific

Version: master
Environment: any (tested on Linux, MacOS X)

Steps to reproduce

  1. Patch snapshots defaults to trigger the issues easier:
+++ b/etcdserver/server.go
@@ -65,14 +65,14 @@ import (
 )

 const (
-       DefaultSnapshotCount = 100000
+       DefaultSnapshotCount = 10

        // DefaultSnapshotCatchUpEntries is the number of entries for a slow follower
        // to catch-up after compacting the raft storage entries.
        // We expect the follower has a millisecond level latency with the leader.
        // The max throughput is around 10K. Keep a 5K entries is enough for helping
        // follower to catch up.
-       DefaultSnapshotCatchUpEntries uint64 = 5000
+       DefaultSnapshotCatchUpEntries uint64 = 10
  1. Compile with failpoints to emulate power failure during the process of snapshot restoration:
FAILPOINTS="true" make build
  1. Create a Procfile with a failpoint raftBeforeSaveSnap for etcd2 node :
etcd1: bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug 
etcd2: GOFAIL_HTTP="127.0.0.1:1111" GOFAIL_FAILPOINTS='go.etcd.io/etcd/etcdserver/raftBeforeSaveSnap=panic("raftBeforeSaveSnap")' bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug 
etcd3: bin/etcd --name infra3 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug
  1. Start cluster
goreman start
  1. Start write loop:
for i in {0..100000}; do ./bin/etcdctl put key$i value$i; echo $i; done
  1. Stop etcd2 node for a "maintenance":
goreman run stop etcd2
  1. Start etcd2 node after 10 entries have been written to the master to trigger snapshot restore:
sleep 10 && goreman run start etcd2
  1. Should get a failoint panic, that emulates power issue during restoring from a snapshot.

From now WAL on the etcd2 node is corrupted. It was saved with a HardState entry that contains Commit number from the snapshot, but snapshot was never saved to WAL and disk.

WAL is corrupted.

  1. Run etcd2 node manually without fail point and got an error:
bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr --debug

Error:

panic: 91bc3c398fb3c146 state.commit 961 is out of range [572, 579]

goroutine 1 [running]:
go.etcd.io/etcd/vendor/go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0001cc160, 0x0, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/vendor/go.uber.org/zap/zapcore/entry.go:229 +0x515
go.etcd.io/etcd/vendor/go.uber.org/zap.(*SugaredLogger).log(0xc0001ce118, 0x4, 0x1d063df, 0x2b, 0xc000643f40, 0x4, 0x4, 0x0, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/vendor/go.uber.org/zap/sugar.go:234 +0xf6
go.etcd.io/etcd/vendor/go.uber.org/zap.(*SugaredLogger).Panicf(0xc0001ce118, 0x1d063df, 0x2b, 0xc000643f40, 0x4, 0x4)
        /Users/brk0v/go/src/go.etcd.io/etcd/vendor/go.uber.org/zap/sugar.go:159 +0x79
go.etcd.io/etcd/pkg/logutil.(*zapRaftLogger).Panicf(0xc0000601f0, 0x1d063df, 0x2b, 0xc000643f40, 0x4, 0x4)
        /Users/brk0v/go/src/go.etcd.io/etcd/pkg/logutil/zap_raft.go:96 +0x61
go.etcd.io/etcd/raft.(*raft).loadState(0xc0000f8000, 0x2, 0x8211f1d0f64f3269, 0x3c1, 0x0, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/raft/raft.go:1459 +0x1b1
go.etcd.io/etcd/raft.newRaft(0xc000450360, 0x4)
        /Users/brk0v/go/src/go.etcd.io/etcd/raft/raft.go:368 +0xd52
go.etcd.io/etcd/raft.RestartNode(0xc000450360, 0x1e98e20, 0xc0000601f0)
        /Users/brk0v/go/src/go.etcd.io/etcd/raft/node.go:232 +0x43
go.etcd.io/etcd/etcdserver.restartNode(0x7fff5fbff840, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0001aae80, 0x1, 0x1, 0xc0001ab180, ...)
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdserver/raft.go:548 +0x7ad
go.etcd.io/etcd/etcdserver.NewServer(0x7fff5fbff840, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0001aae80, 0x1, 0x1, 0xc0001ab180, ...)
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdserver/server.go:464 +0x3038
go.etcd.io/etcd/embed.StartEtcd(0xc0002b4000, 0xc0002b4500, 0x0, 0x0)
        /Users/brk0v/go/src/go.etcd.io/etcd/embed/etcd.go:203 +0x8ed
go.etcd.io/etcd/etcdmain.startEtcd(0xc0002b4000, 0x1cde6b6, 0x6, 0xc0001ab401, 0x2)
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdmain/etcd.go:304 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdmain/etcd.go:146 +0x2f37
go.etcd.io/etcd/etcdmain.Main()
        /Users/brk0v/go/src/go.etcd.io/etcd/etcdmain/main.go:47 +0x37
main.main()
        /Users/brk0v/go/src/go.etcd.io/etcd/main.go:28 +0x20
@gyuho
Copy link
Contributor

gyuho commented Oct 26, 2018

Is this reproducible with current master?

@brk0v
Copy link
Author

brk0v commented Oct 26, 2018

Yes, you could reproduce this with the latest commit 5837632

@brk0v
Copy link
Author

brk0v commented Dec 21, 2018

Appreciate if someone could confirm this and share ideas how it could be fixed.
I tried rearrange data saving steps to the WAL with some code changes, but it seems like we need to do all saves in one transaction like dgraph does https://github.com/dgraph-io/dgraph/blob/master/raftwal/storage.go#L574

@xiang90
Copy link
Contributor

xiang90 commented Dec 21, 2018

@brk0v

I am aware of this issue. Would you like to spend some time to get it fixed?

@brk0v
Copy link
Author

brk0v commented Dec 24, 2018

Sure thing, but if you already have an idea to check or a direction to dig, I could try to do that.

@brk0v
Copy link
Author

brk0v commented Dec 27, 2018

Added code that passed my local failpoints test. Could you please check if this approach makes sense.
Thank you!

P.S. Unit tests are broken because of interface changes in etcdserver.Storage code.

@brk0v
Copy link
Author

brk0v commented Jan 9, 2019

Tests now pass.

To summarise what were done:

  • Added snapshotter.LoadIndex() to load arbitrary snapshot;
  • Added etcdserver.storage.checkWALSnap() to check that snapshot could be used to load raft state from the WAL;
  • Added etcdserver.storage.Release() and etcdserver.storage. Sync() to provide the safe order of saving operations;
  • Changed logic how raft.Ready handles not emptyrd.Snapshot:
    • save snapshot to the disk;
    • save hardstate;
    • fsync WAL to make hardstate persistent;
    • release all WAL entries up to snapshot index.

@xiang90
Copy link
Contributor

xiang90 commented Jan 9, 2019

@brk0v Thanks. I will give this a careful look over the next couple of weeks.

@jingyih
Copy link
Contributor

jingyih commented Jan 28, 2019

cc @jpbetz

@stale
Copy link

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 28, 2020
@jpbetz jpbetz reopened this May 13, 2020
@stale stale bot removed the stale label May 13, 2020
@jpbetz
Copy link
Contributor

jpbetz commented May 14, 2020

I'm working on rebasing #10356 on master and making a couple adjustments to it (retaining the commits from @brk0v). I'll send out a PR shortly.

gyuho pushed a commit that referenced this issue May 15, 2020
etcdserver/*, wal/*: changes to snapshots and wal logic
etcdserver/*: changes to snapshots and wal logic to fix #10219
etcdserver/*, wal/*: add Sync method
etcdserver/*, wal/*: find valid snapshots by cross checking snap files and wal snap entries
etcdserver/*, wal/*:Add comments, clean up error messages and tests
etcdserver/*, wal/*: Remove orphaned .snap.db files during Release

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho pushed a commit that referenced this issue May 15, 2020
etcdserver/*, wal/*: changes to snapshots and wal logic
etcdserver/*: changes to snapshots and wal logic to fix #10219
etcdserver/*, wal/*: add Sync method
etcdserver/*, wal/*: find valid snapshots by cross checking snap files and wal snap entries
etcdserver/*, wal/*:Add comments, clean up error messages and tests
etcdserver/*, wal/*: Remove orphaned .snap.db files during Release

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho pushed a commit that referenced this issue May 15, 2020
etcdserver/*, wal/*: changes to snapshots and wal logic
etcdserver/*: changes to snapshots and wal logic to fix #10219
etcdserver/*, wal/*: add Sync method
etcdserver/*, wal/*: find valid snapshots by cross checking snap files and wal snap entries
etcdserver/*, wal/*:Add comments, clean up error messages and tests
etcdserver/*, wal/*: Remove orphaned .snap.db files during Release

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho pushed a commit that referenced this issue May 15, 2020
etcdserver/*, wal/*: changes to snapshots and wal logic
etcdserver/*: changes to snapshots and wal logic to fix #10219
etcdserver/*, wal/*: add Sync method
etcdserver/*, wal/*: find valid snapshots by cross checking snap files and wal snap entries
etcdserver/*, wal/*:Add comments, clean up error messages and tests
etcdserver/*, wal/*: Remove orphaned .snap.db files during Release

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho pushed a commit that referenced this issue May 15, 2020
etcdserver/*, wal/*: changes to snapshots and wal logic
etcdserver/*: changes to snapshots and wal logic to fix #10219
etcdserver/*, wal/*: add Sync method
etcdserver/*, wal/*: find valid snapshots by cross checking snap files and wal snap entries
etcdserver/*, wal/*:Add comments, clean up error messages and tests
etcdserver/*, wal/*: Remove orphaned .snap.db files during Release

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho added a commit that referenced this issue May 18, 2020
ref. #10219

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho added a commit that referenced this issue May 18, 2020
ref. #10219

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho added a commit that referenced this issue May 18, 2020
ref. #10219

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
wyrobnik added a commit to wyrobnik/etcd that referenced this issue Apr 11, 2022
Update raftexample to save the snapshot file and WAL snapshot entry
before hardstate to ensure the snapshot exists during recovery.
Otherwise if there is a failure after storing the hard state there may
be reference to a non-existent snapshot.
This PR introduces the fix from etcd-io#10219 to the raftexample.
wyrobnik added a commit to wyrobnik/etcd that referenced this issue Apr 11, 2022
Update raftexample to save the snapshot file and WAL snapshot entry
before hardstate to ensure the snapshot exists during recovery.
Otherwise if there is a failure after storing the hard state there may
be reference to a non-existent snapshot.
This PR introduces the fix from etcd-io#10219 to the raftexample.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

5 participants