
tocommit(3730) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost #16220

Closed
qixiaoyang0 opened this issue Jul 11, 2023 · 5 comments · Fixed by #17078

Comments


qixiaoyang0 commented Jul 11, 2023

What would you like to be added?

The heartbeat sent by the leader contains the leader's committed log index. If that index is higher than the follower's last log index, the follower panics.
etcd should recheck the log instead of exiting the process.
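
For context, here is a minimal, self-contained Go sketch of the check that produces this panic. It paraphrases `raftLog.commitTo` from the vendored raft package (log.go in the stack trace below); the `raftLog` struct and its `last` field here are simplified stand-ins for illustration, not the library's actual types.

```go
package main

import "fmt"

// raftLog is a stand-in for the raft library's log type, keeping only the
// fields needed to illustrate the commit check.
type raftLog struct {
	committed uint64 // highest index known to be committed locally
	last      uint64 // index of the last entry actually stored in the log
}

func (l *raftLog) lastIndex() uint64 { return l.last }

// commitTo mirrors the check behind the reported panic: a follower may only
// advance its commit index up to the last entry it has actually stored.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit { // never decrease the commit index
		if l.lastIndex() < tocommit {
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex()))
		}
		l.committed = tocommit
	}
}

func main() {
	// A follower with an empty log (lastIndex == 0) applies a heartbeat that
	// carries the leader's commit index 3730 -- this reproduces the panic text.
	l := &raftLog{committed: 0, last: 0}
	l.commitTo(3730)
}
```

In the real code path, `handleHeartbeat` passes the commit index carried in the leader's heartbeat into this check, which is why a heartbeat alone can crash a follower whose log is far behind.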

Why is this needed?

We deployed a 3-node cluster and tested it. One of the test cases injects a 65% packet error rate on the network between nodes. About 10 minutes into the test, a follower node restarted. The log shows the panic stack:

```
msg:
tocommit(3730) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
stacktrace:
vendor/go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf
	vendor/go.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:101
vendor/go.etcd.io/etcd/raft/v3.(*raftLog).commitTo
	vendor/go.etcd.io/etcd/raft/v3/log.go:237
vendor/go.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat
	vendor/go.etcd.io/etcd/raft/v3/raft.go:1509
vendor/go.etcd.io/etcd/raft/v3.stepFollower
	vendor/go.etcd.io/etcd/raft/v3/raft.go:1435
vendor/go.etcd.io/etcd/raft/v3.(*raft).Step
	vendor/go.etcd.io/etcd/raft/v3/raft.go:975
vendor/go.etcd.io/etcd/raft/v3.(*node).run
	vendor/go.etcd.io/etcd/raft/v3/node.go:356
```

We have found similar issues: #13509, #15699.


chaochn47 commented Jul 11, 2023

Duplicate of etcd-io/raft#18. Please refer to the issue in the raft repo. Thanks!

Closing current issue.

@CabinfeverB

Hi @qixiaoyang0, I would like to know whether the follower that experienced the panic had restarted before the panic. I also encountered a similar problem, and from the logs I can confirm that the followers did not restart. But I don't think this could have happened without a restart. PTAL @chaochn47

@qixiaoyang0 (Author)

> Hi @qixiaoyang0, I would like to know whether the follower that experienced the panic had restarted before the panic. I also encountered a similar problem, and from the logs I can confirm that the followers did not restart. But I don't think this could have happened without a restart. PTAL @chaochn47

The follower process exits because of the panic. The process may exit too quickly to output the panic message in your system.

@CabinfeverB

What I meant is, if it's just a network issue, it should not cause a panic. I want to confirm with you if there were any other indications before the panic occurred.

@qixiaoyang0 (Author)

> What I meant is, if it's just a network issue, it should not cause a panic. I want to confirm with you if there were any other indications before the panic occurred.

There were no other indications in my test case.
