[Help Wanted] Possibility of data loss when server restarts immediately after key-value put with explained conditions #14364
Comments
@ahrtr: The workflow may not be correct. Refer to https://github.com/ahrtr/etcd-issues/blob/master/docs/cncf_storage_tag_etcd.md. Please open a new issue if you can reproduce the issue in your test environment.
@ahrtr what I see from the code is different from that document. Can I provide the code diff and the necessary screenshots? I hope you also went through the steps I gave in the bug description.
@ahrtr Please find the code changes and steps to repro. Please note that these changes are just a sleep and an exit to simulate the condition I explained in the bug description (a sketch of the kind of injected change follows after the steps).
1. Start the etcd server with the changes.
2. Add a key-value. Allow etcdserver to acknowledge and exit immediately (with just the sleep and exit simulating the explanation).
3. Remove the control flag file and restart the etcd server.
4. Check if the key is present. We can see no key-value.
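A minimal sketch of what such an injected hook could look like is below. The flag-file path, function name, and placement are assumptions made for illustration; this is not actual etcd code, only a model of the "sleep then exit" change described above.

```go
// Illustrative sketch only: a hypothetical hook that simulates the crash window
// described in this issue. If a control flag file exists, it pauses long enough
// for the client to receive its OK response and then exits abruptly, before the
// periodic bbolt commit (and, per the scenario described in this issue, before
// the WAL write) happens. The flag path and placement are assumptions.
package main

import (
	"os"
	"time"
)

const crashFlagPath = "/tmp/etcd-crash-flag" // assumed control flag file

func maybeSimulateCrash() {
	if _, err := os.Stat(crashFlagPath); err == nil {
		time.Sleep(2 * time.Second) // let the client see its acknowledgement first
		os.Exit(1)                  // simulate an abrupt crash / power failure
	}
}

func main() {
	// In the repro, this check would be injected after the server acknowledges
	// the PUT; here it is called directly to keep the sketch self-contained.
	maybeSimulateCrash()
}
```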
Note that this is expected behavior by design. If you really want high availability, then you need to set up a cluster with at least 3 members. Due to performance concerns with bboltDB, etcd commits the transaction periodically instead of on each request (a simplified sketch of this batching follows below). So in theory it is possible that the bboltDB commit actually fails because of some system or hardware issue. But it isn't a problem if either of the following conditions is true:
1. The local WAL entries were successfully persisted; in that case etcd replays the WAL entries on startup.
2. There are other healthy members; in that case the leader will sync the missing data to the other members, including the previously problematic one.
In your case you intentionally created a situation in which both conditions are false, so eventually it caused data loss. Please note that it is beyond etcd's capacity to resolve such an extreme catastrophic situation, and I believe it is also beyond the capacity of any single project. You need to think about/resolve this at a higher level of the system architecture, for example with backup & restore.
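To make the batching point concrete, here is a minimal, self-contained sketch of the idea, not etcd's actual backend code: writes land in an in-memory batch that is flushed to disk only periodically, so anything still in the batch when the process dies is recoverable only via WAL replay. All names and the interval are illustrative.

```go
// Simplified model (not etcd's real backend) of why a disk commit can lag the
// client response: puts go into an in-memory pending batch, and the batch is
// flushed only on a periodic tick. A crash before the next tick loses the
// batch, unless the entry can be replayed from the WAL on startup.
package main

import (
	"fmt"
	"sync"
	"time"
)

type batchedBackend struct {
	mu      sync.Mutex
	pending map[string]string // writes not yet committed to disk
}

func (b *batchedBackend) Put(k, v string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[k] = v // visible to reads immediately, but not durable yet
}

func (b *batchedBackend) commitLoop(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			b.mu.Lock()
			if len(b.pending) > 0 {
				fmt.Printf("committing %d pending writes to disk\n", len(b.pending))
				b.pending = map[string]string{} // pretend these were fsynced
			}
			b.mu.Unlock()
		case <-stop:
			return
		}
	}
}

func main() {
	b := &batchedBackend{pending: map[string]string{}}
	stop := make(chan struct{})
	go b.commitLoop(100*time.Millisecond, stop)

	b.Put("k", "v")                    // acknowledged to the caller right away
	time.Sleep(250 * time.Millisecond) // a crash before the tick would lose "k" absent WAL replay
	close(stop)
}
```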
I have observed a possibility of data loss, and I would like the community to comment or to correct me if I am wrong.
Before explaining that, I would like to explain the happy path when a user does a PUT <key, value>. I have tried to include only the necessary steps to keep this issue focused, and I have considered a single etcd instance (a condensed code sketch of this hand-off appears right after the block below).
====================================================================================
----------api thread --------------
1. The user calls etcdctl PUT k v.
2. It lands in v3_server.go::put with the message about k,v.
3. The call delegates through a series of function calls and enters v3_server.go::processInternalRaftRequestOnce.
4. It registers for a signal with the wait utility against this key id.
5. The call delegates further through a series of function calls and enters raft/node.go::stepWithWaitOption(..message..).
6. It wraps this message together with a result channel (msgResult) and sends it to the propc channel.
7. After sending, it waits on the msgResult result channel.
----------api thread waiting --------------
8. On seeing a message in the propc channel, raft/node.go::run() wakes up, and a sequence of calls adds the message.Entries to the raftLog.
9. It notifies the msgResult result channel.
----------api thread wakes--------------
10. Upon seeing the notification on the msgResult result channel, the api thread wakes, returns down the stack to v3_server.go::processInternalRaftRequestOnce, and waits for the signal it registered at step #4.
----------api thread waiting --------------
----------api thread wakes--------------
18. The user thread wakes here and sends back the acknowledgement.
----------user sees ok--------------
====================================================================================
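The block above describes a register/propose/wait hand-off between the api thread and the raft goroutine. Below is a condensed, self-contained model of that pattern, assuming simplified names (wait, proposal, propc); it is an illustration of the flow, not etcd's actual code, and it collapses the msgResult wait and the wait-utility signal into a single channel for brevity.

```go
// Condensed model of the api-thread / raft-thread hand-off: the api side
// registers a waiter keyed by request id, sends the proposal to a channel,
// and blocks; the raft side picks it up, appends it to an in-memory log,
// and triggers the waiter. In real etcd several more steps (WAL write,
// commit, apply) sit between appending and triggering the waiter.
package main

import (
	"fmt"
	"sync"
)

type wait struct {
	mu sync.Mutex
	m  map[uint64]chan struct{}
}

func (w *wait) Register(id uint64) <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan struct{})
	w.m[id] = ch
	return ch
}

func (w *wait) Trigger(id uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if ch, ok := w.m[id]; ok {
		close(ch)
		delete(w.m, id)
	}
}

type proposal struct {
	id   uint64
	k, v string
}

func main() {
	w := &wait{m: map[uint64]chan struct{}{}}
	propc := make(chan proposal)

	// Raft-side goroutine: receive proposals, "append" them, then trigger the waiter.
	go func() {
		for p := range propc {
			fmt.Printf("appended entry for %s=%s to the in-memory log\n", p.k, p.v)
			w.Trigger(p.id) // equivalent of notifying the waiting api thread
		}
	}()

	// Api thread: register, propose, wait, then acknowledge the user.
	ch := w.Register(1)
	propc <- proposal{id: 1, k: "k", v: "v"}
	<-ch
	fmt.Println("OK") // what the user sees
}
```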
Here, if the thread at step #13 is pre-empted and rescheduled by the underlying operating system to run only after step #18 has completed, and there is a power failure at the end of step 18 (right after the user has received the response), then the kv is written neither to the WAL nor to the database file.
I think this is not seen today because the window is small: the server has to restart immediately after step 18, and immediately after step 12 the underlying OS must have pre-empted the thread running etcdserver/raft.go::start and placed it at the end of the runnable queue. Given that these multiple conditions must coincide, it appears that we don't normally see data loss.
But from the code it appears to be possible. To simulate it, I added a sleep (and an exit) after step 12 and after step 19. I was able to see ok, but the data is in neither the WAL nor the db.
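For completeness, the post-restart check (step 4 of the repro) can also be done programmatically with the Go client instead of etcdctl. The endpoint and key name below are assumptions matching the steps above; this only shows whether the server still serves the key after the restart.

```go
// Sketch of the post-restart check: read back the key that was acknowledged
// before the simulated crash and report whether it survived. Endpoint and key
// name are assumptions for this example.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	resp, err := cli.Get(ctx, "k")
	if err != nil {
		log.Fatal(err)
	}
	if len(resp.Kvs) == 0 {
		fmt.Println("key \"k\" is missing after restart (the scenario described above)")
	} else {
		fmt.Printf("key survived: %s=%s\n", resp.Kvs[0].Key, resp.Kvs[0].Value)
	}
}
```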
If I am not correct, my apologies; please correct my understanding.