Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auth/store: save consistentIndex to fix a data corruption bug #11652

Merged
merged 3 commits into from
Mar 1, 2020

Conversation

tangcong
Copy link
Contributor

fix issue #11651

@tangcong tangcong requested review from mitake and jpbetz February 24, 2020 16:44
@codecov-io
Copy link

codecov-io commented Feb 24, 2020

Codecov Report

Merging #11652 into master will decrease coverage by 0.42%.
The diff coverage is 78.84%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #11652      +/-   ##
==========================================
- Coverage   66.66%   66.24%   -0.43%     
==========================================
  Files         401      402       +1     
  Lines       36629    36671      +42     
==========================================
- Hits        24420    24291     -129     
- Misses      10714    10862     +148     
- Partials     1495     1518      +23
Impacted Files Coverage Δ
etcdserver/server.go 75.79% <100%> (-0.16%) ⬇️
mvcc/watchable_store.go 86.47% <100%> (+2.24%) ⬆️
etcdserver/backend.go 56.25% <100%> (ø) ⬆️
auth/store.go 71.3% <68.75%> (-4.1%) ⬇️
auth/metrics.go 87.5% <87.5%> (ø)
pkg/transport/timeout_conn.go 60% <0%> (-40%) ⬇️
client/keys.go 60.3% <0%> (-28.15%) ⬇️
auth/simple_token.go 74.78% <0%> (-14.29%) ⬇️
clientv3/ordering/kv.go 72.05% <0%> (-13.24%) ⬇️
proxy/grpcproxy/register.go 72.5% <0%> (-10%) ⬇️
... and 32 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 085b19b...7de3be9. Read the comment docs.

@tangcong tangcong force-pushed the fix-auth-store-corruption-bug branch from d97c8b0 to 7de3be9 Compare February 25, 2020 03:52
@tangcong tangcong requested review from gyuho and jingyih and removed request for jpbetz February 25, 2020 03:59
Copy link
Contributor

@mitake mitake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your very detailed investigation and bugfix!! I think the PR is almost ready to be merged after fixing some style in the commit logs.

  • The log of the 1st commit should be: *: fix auth revision corruption bug
  • The log of the 3rd commit should be: auth: add new metric 'etcd_debugging_auth_revision'
  • The 4th commit should be merged to the 1st one

auth/store.go Outdated Show resolved Hide resolved
@mitake
Copy link
Contributor

mitake commented Feb 25, 2020

BTW I think lessor.Grant() (technically lease.persistTo()) would have the same problem of not updating the consistent_index. Replaying Grant multiple times wouldn't be harmful like the case of auth operations, but it might make the stats like leaseTotalTTLs and leaseGranted manipulated incorrectly. How do you think @jpbetz ?

^ on the second thought the counters aren't persisted so the only side effect caused by replaying would be granting again.

@tangcong
Copy link
Contributor Author

tangcong commented Feb 25, 2020

@mitake thanks. i will optimize pr based on your suggestions.
do you have better advice that how to call mvcc.saveIndex function in auth/store? it is a private method in mvcc/store. it does not seem very good.
it is possible to cause data inconsistency when we restart etcd. it is better if we can backport this to 3.2, 3.3 and 3.4, as well as changelog note.
i think it is necessary to improve code testability about auth revision. there is no e2e test about auth revision because client can not get auth revision. can we submit another pr to add auth revision field in StatusResponse?
how do you think? @mitake @jingyih @gyuho

@tangcong tangcong force-pushed the fix-auth-store-corruption-bug branch from 7de3be9 to fc44b73 Compare February 25, 2020 15:56
@tangcong
Copy link
Contributor Author

Thanks a lot for your very detailed investigation and bugfix!! I think the PR is almost ready to be merged after fixing some style in the commit logs.

  • The log of the 1st commit should be: *: fix auth revision corruption bug
  • The log of the 3rd commit should be: auth: add new metric 'etcd_debugging_auth_revision'
  • The 4th commit should be merged to the 1st one

done, PTAL.

@codecov-io
Copy link

codecov-io commented Feb 25, 2020

Codecov Report

Merging #11652 into master will decrease coverage by 0.66%.
The diff coverage is 75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #11652      +/-   ##
==========================================
- Coverage   66.66%      66%   -0.67%     
==========================================
  Files         401      402       +1     
  Lines       36629    36672      +43     
==========================================
- Hits        24420    24204     -216     
- Misses      10714    10966     +252     
- Partials     1495     1502       +7
Impacted Files Coverage Δ
etcdserver/server.go 75.34% <100%> (-0.61%) ⬇️
mvcc/watchable_store.go 84.34% <100%> (+0.11%) ⬆️
etcdserver/backend.go 56.25% <100%> (ø) ⬆️
auth/store.go 55.05% <60.6%> (-20.35%) ⬇️
auth/metrics.go 87.5% <87.5%> (ø)
client/members.go 51.61% <0%> (-33.88%) ⬇️
pkg/transport/timeout_conn.go 80% <0%> (-20%) ⬇️
proxy/httpproxy/director.go 52.43% <0%> (-15.86%) ⬇️
proxy/grpcproxy/register.go 72.5% <0%> (-10%) ⬇️
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 085b19b...f14d2a0. Read the comment docs.

Copy link
Contributor

@jingyih jingyih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for digging into this issue and fixing it!

@@ -69,11 +70,11 @@ type watchableStore struct {
// cancel operations.
type cancelFunc func()

func New(lg *zap.Logger, b backend.Backend, le lease.Lessor, ig ConsistentIndexGetter, cfg StoreConfig) ConsistentWatchableKV {
return newWatchableStore(lg, b, le, ig, cfg)
func New(lg *zap.Logger, b backend.Backend, le lease.Lessor, as auth.AuthStore, ig ConsistentIndexGetter, cfg StoreConfig) ConsistentWatchableKV {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think logically we want to pass ConsistentIndexGetter to NewAuthStore(), so that the auth store can get consistent index from etcdserver, and writes it to backend meta bucket at the end of applying a raft entry (such as adding a user, deleting a user, etc). Not sure if this means a lot of extra complexity in terms of implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. passing ConsistentIndexGetter to NewAuthStore will cause the saveIndex funtion to repeat.it looks a little ugly and violates the principle of function reuse. this is what we did in the first version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the second version, we define a public method SaveIndex in mvcc/store and define a interface ConsistentIndexSyncer which has a SaveIndex method that implemented by mvcc/store.SaveIndex. then, we want to pass ConsistentIndexSyncer interface to NewAuthStore().
What you see is the third version. After weighing the pros and cons, we choose the third version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for trying multiple implementations, and I understand maybe the current implementation makes the most sense coding wise. However here is my push back to the current implementation: auth store's correctness should not depend on passing the store pointer to mvcc. Auth store should be relatively independent of mvcc, although they share the same backend.
cc @mitake

Copy link
Contributor Author

@tangcong tangcong Feb 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, consistentIndex is a global concept. It has nothing to do with mvcc. It is better to separate it from mvcc/store and encapsulate it in a separate package. When mvcc, lessor, auth and other modules need to prevent repeated execution of commands, call consistentIndex package Save methods to store. What do you think of this solution? but I'm not sure how much complexity this approach will cause. which solution do you prefer? it is a cricital bug for us,so we take a simple and safe approach to fix it.
thanks. @jingyih @mitake

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree with you! @tangcong

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. For short term we can get this merged and backported. Could you add a TODO for encapsulating consistentindex into a separate package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. @jingyih

@tangcong tangcong force-pushed the fix-auth-store-corruption-bug branch from fc44b73 to beb8d39 Compare February 26, 2020 02:34
@mitake
Copy link
Contributor

mitake commented Feb 26, 2020

@tangcong I think the ideal way of updating consistent index would be writing data through mvcc, not directly access to backend. e.g. storeTxnWrite.End() calls saveIndex() internally. But your change works for the issue so merging this first would be fine.

Copy link
Contributor

@mitake mitake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! defer to @jingyih

Copy link
Contributor

@jingyih jingyih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for fixing this! I added few comments. LGTM after fix.

auth/metrics.go Show resolved Hide resolved
auth/store.go Outdated Show resolved Hide resolved
auth/store.go Outdated Show resolved Hide resolved
auth/store.go Show resolved Hide resolved
@tangcong tangcong force-pushed the fix-auth-store-corruption-bug branch 2 times, most recently from 6ae860f to 2c99a83 Compare February 29, 2020 02:43
@tangcong
Copy link
Contributor Author

Thanks again for fixing this! I added few comments. LGTM after fix.

done. thanks.

tangcong added a commit to tangcong/etcd that referenced this pull request Apr 5, 2020
tangcong added a commit to tangcong/etcd that referenced this pull request Apr 5, 2020
tangcong added a commit to tangcong/etcd that referenced this pull request Apr 5, 2020
jingyih added a commit that referenced this pull request Apr 10, 2020
jingyih added a commit that referenced this pull request Apr 10, 2020
jingyih added a commit that referenced this pull request Apr 10, 2020
@tangcong tangcong deleted the fix-auth-store-corruption-bug branch February 26, 2021 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants