etcd 3.5.3 panic: tocommit(587192) is out of range [lastIndex(587189)]. Was the raft log corrupted, truncated, or lost? #15699

Closed
Tejaswini5327 opened this issue Apr 11, 2023 · 2 comments

Comments

Tejaswini5327 commented Apr 11, 2023

What happened?

We have a deployment of 3 pods that had been running for 90 days when pod-0 suddenly started restarting. We applied a workaround: removing the pod-0 member ID, deleting pod-0, and then deleting the WAL file of pod-0. After this workaround, we found the following error in the pod-0 log:

{"level":"info","ts":"2023-03-06T13:07:12.235-0500","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"1c192aed65dd7938","remote-peer-id":"f92d0a91a53265ea"}
panic: tocommit(587192) is out of range [lastIndex(587189)]. Was the raft log corrupted, truncated, or lost?

goroutine 77 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000884000, 0x0, 0x0, 0x0)
/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*SugaredLogger).log(0xc000b120b8, 0x9696db4a832504, 0x124ecb9, 0x5d, 0xc0025a4080, 0x2, 0x2, 0x0, 0x0, 0x0)
/go/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:227 +0x111
go.uber.org/zap.(*SugaredLogger).Panicf(...)
/go/pkg/mod/go.uber.org/zap@v1.17.0/sugar.go:159
go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0xc00003c110, 0x124ecb9, 0x5d, 0xc0025a4080, 0x2, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/zap_raft.go:101 +0x7d
go.etcd.io/etcd/raft/v3.(*raftLog).commitTo(0xc000802c40, 0x8f5b8)
/go/src/go.etcd.io/etcd/release/etcd/raft/log.go:237 +0x135
go.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat(0xc0004be2c0, 0x8, 0x1c192aed65dd7938, 0x9f7835abc7c54201, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1508 +0x54
go.etcd.io/etcd/raft/v3.stepFollower(0xc0004be2c0, 0x8, 0x1c192aed65dd7938, 0x9f7835abc7c54201, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:1434 +0x478
go.etcd.io/etcd/raft/v3.(*raft).Step(0xc0004be2c0, 0x8, 0x1c192aed65dd7938, 0x9f7835abc7c54201, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/go.etcd.io/etcd/release/etcd/raft/raft.go:975 +0xa55
go.etcd.io/etcd/raft/v3.(*node).run(0xc00081b080)
/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:356 +0x798
created by go.etcd.io/etcd/raft/v3.RestartNode
/go/src/go.etcd.io/etcd/release/etcd/raft/node.go:244 +0x330 

What did you expect to happen?

The pod should run without restarting and without the panic: tocommit(587192) is out of range [lastIndex(587189)] error.

How can we reproduce it (as minimally and precisely as possible)?

  1. Delete the member ID of pod-0.
  2. Delete pod-0.
  3. Delete the WAL file of pod-0.

After the steps above, the panic: tocommit(587192) is out of range [lastIndex(587189)] error appeared in the pod-0 logs.

Anything else we need to know?

We found a similar issue, #13509, but the workaround/solution described there did not work for us.

Etcd version (please run commands below)

bash-4.4$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452fee
Go Version: go1.16.15
Go OS/Arch: linux/amd64
bash-4.4$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

Same log and stack trace as shown under "What happened?" above.

@serathius
Member

You deleted files from database directory and expect it to run? It doesn't work like that.

Using v3.5.3 etcd version is super duper not recommended. https://groups.google.com/g/etcd-dev/c/8S7u6NqW6C4/m/_uy9Dv7XBwAJ
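
For context on why a deleted WAL ends in this particular panic: the trace stops in raftLog.commitTo, called from handleHeartbeat with 0x8f5b8 (587192). A heartbeat from the leader carries a commit index, and the follower refuses to mark as committed an index that is beyond the last entry it has locally. The sketch below is a simplified stand-alone model of that check, not the etcd source; the raftLog struct here is a toy stand-in, and tryCommit is a hypothetical helper added only so the example can show the panic message without crashing.

```go
package main

import "fmt"

// raftLog is a toy stand-in for a follower's log state. The real type in
// go.etcd.io/etcd/raft/v3 has more fields; this one is an assumption made
// for illustration only.
type raftLog struct {
	lastIndex uint64 // index of the last entry present in the local log
	committed uint64 // highest index this node has marked as committed
}

// commitTo models the check seen in the stack trace: the commit index may
// only advance to an index that exists in the local log.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed >= tocommit {
		return // never move the commit index backwards
	}
	if l.lastIndex < tocommit {
		panic(fmt.Sprintf(
			"tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
			tocommit, l.lastIndex))
	}
	l.committed = tocommit
}

// tryCommit is a hypothetical helper: it calls commitTo and turns a panic
// into a returned message so both cases can be printed side by side.
func tryCommit(l *raftLog, tocommit uint64) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	l.commitTo(tocommit)
	return "ok"
}

func main() {
	// Follower whose log still contains every entry up to the commit index.
	intact := &raftLog{lastIndex: 587192, committed: 587189}
	fmt.Println(tryCommit(intact, 587192))

	// Follower whose log ends at 587189, as in this issue, receiving a
	// heartbeat that says entries up to 587192 are committed.
	truncated := &raftLog{lastIndex: 587189, committed: 587189}
	fmt.Println(tryCommit(truncated, 587192))
}
```

The first call advances the commit index; the second reproduces the message from the issue, because the local log is shorter than the commit index the rest of the cluster has recorded for it. That is the state a member is left in when its WAL is removed while the cluster still remembers its earlier progress.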

@jmhbnz
Member

jmhbnz commented Apr 22, 2023

Hey @Tejaswini5327 - Please refer to our etcd operations guide for guidance on disaster recovery for failing members: https://etcd.io/docs/v3.5/op-guide/recovery

Essentially you can use the etcdctl snapshot functionality from one of your two live members to then restore the third failing member with etcdutl snapshot restore.

I'm going to close this issue as I don't believe there is an etcd bug here, and the etcd operations guide I've linked should assist you to resolve the problematic member.

You're welcome to reply below with new information if you would like this to be re-opened. If you do run into any issues with the operations guide I linked, please let us know by raising an issue on the etcd-io/website repository.
