
[Third solution] Fix the potential data loss for clusters with only one member (raft layer change) #14407

Closed
wants to merge 1 commit

Conversation

ahrtr
Member

@ahrtr ahrtr commented Aug 31, 2022

Third solution to fix #14370

For a cluster with only one member, raft always sends identical
unstable entries and committed entries to etcdserver, and etcd
responds to the client once it finishes (actually only partially
finishes) the apply workflow.

When the client receives the response, it doesn't mean etcd has already
successfully saved the data to both BoltDB and the WAL, because:

  1. etcd commits the BoltDB transaction periodically instead of on each request;
  2. etcd saves WAL entries in parallel with applying the committed entries.

Accordingly, data may be lost if etcd crashes immediately after
responding to the client but before BoltDB and the WAL have persisted
the data to disk.
Note that this issue can only happen in clusters with only one member.

For clusters with multiple members, this isn't an issue, because etcd will
not commit & apply the data before it has been replicated to a majority of
members. When the client receives the response, the data must have been
applied, which in turn means it must have been committed.
Note: for clusters with multiple members, raft never sends identical
unstable entries and committed entries to etcdserver.

Signed-off-by: Benjamin Wang wachao@vmware.com


@ahrtr ahrtr changed the title [Third solution] Fix the potential data loss for clusters with only one member [Third solution] Fix the potential data loss for clusters with only one member (raft layer change) Aug 31, 2022
@ahrtr ahrtr force-pushed the one_member_data_loss_raft_protocol branch from 7a53bab to 97e54b2 Compare August 31, 2022 01:33
@ahrtr ahrtr force-pushed the one_member_data_loss_raft_protocol branch from 97e54b2 to 87f0a76 Compare August 31, 2022 01:39
@ahrtr
Member Author

ahrtr commented Sep 5, 2022

Closing this PR as we agreed that we will not proceed with this PR.

@ahrtr ahrtr closed this Sep 5, 2022

Successfully merging this pull request may close these issues.

Durability API guarantee broken in single node cluster