Periodic corruption check doesn’t do anything after compaction #14325
Comments
Are you looking into it, @justinhellreich-stripe?
Hey @jbml -- I've done some digging and found that the leader is essentially just getting the latest revision, and then getting the hash key at that revision for each peer: https://git.corp.stripe.com/stripe-internal/gocode/blob/fb2be906edbbfcf0f732b0d69431059d129a5ea1/vendor/go.etcd.io/etcd/server/v3/etcdserver/corrupt.go#L154
It's a surprise to me that, after compaction, the latest revision cannot be queried. Does that seem like the expected behavior? Based on testing with just
I might be a bit out of my depth, so if you are able to look into it or suggest a possible fix, that would be great!
Thanks, I will take a look @justinhellreich-stripe
When the leader requests the hash for a particular revision from its peers (as part of the corruption check), each peer first checks whether the requested revision has already been compacted: etcd/server/storage/mvcc/kvstore.go Lines 180 to 182 in e61d431
I think the problem is that this check (the first "if" above) returns an error when the requested revision is equal to the compacted one, because it uses a "<=" comparator. I think it should use a "<" comparator instead. That would be consistent with what a normal range request (i.e. etcdctl get) does to verify whether the requested revision has already been compacted, as below: etcd/server/storage/mvcc/kvstore_txn.go Lines 74 to 76 in fff5d00
I will do some tests to verify this.
When a key-value store corruption check happens immediately after a compaction, the revision at which the key-value store hash is computed is the compacted revision itself. In that case, the hash computation logic was incorrect because it returned an ErrCompacted error; this error should instead be returned only when the revision at which the key-value store is hashed is strictly lower than the compacted revision.
Fixes etcd-io#14325
Signed-off-by: Jeremy Leach <44558776+jbml@users.noreply.github.com>
What happened?
After turning on --experimental-corrupt-check-time, I’ve noticed that any time there has been a compaction and no recent write to etcd, the corruption check fails to actually run.
The output on the leader is
Looking at a follower, we see the requested revision has been compacted
I then wrote a key to etcd and observed the corruption check working correctly on the next run.
What did you expect to happen?
The corruption check should work regardless of whether or not a compaction has happened very recently.
How can we reproduce it (as minimally and precisely as possible)?
Set --experimental-corrupt-check-time to something very short like 1m
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Leaving these off. Happy to share more details if required. It is a healthy five node cluster, no learners, all running v3.5.3.
Relevant log output
No response