-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
meta/autoid: make autoid client ResetConn operation concurrency-safe #50522
Conversation
Hi @tiancaiamao. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
func (d *ClientDiscover) ResetConn(reason error) { | ||
func (d *ClientDiscover) ResetConn(version uint64, reason error) { | ||
// Avoid repeated Reset operation | ||
if !atomic.CompareAndSwapUint64(&d.version, version, version+1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introduce a version
variable to avoid repeated ResetConn()
and this also avoid a stale ResetConn()
reset the newly created grpcConn mistakenly.
go func() { | ||
// Doen't close the conn immediately, in case the other sessions are still using it. | ||
time.Sleep(200 * time.Millisecond) | ||
err := grpcConn.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delay the Close() operation, so the other session that still reference this grpcConn do not get the "grpc: the client connection is closing" error.
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #50522 +/- ##
=================================================
- Coverage 71.9131% 55.7175% -16.1956%
=================================================
Files 1449 1564 +115
Lines 347105 588757 +241652
=================================================
+ Hits 249614 328041 +78427
- Misses 77205 237861 +160656
- Partials 20286 22855 +2569
Flags with carried forward coverage won't be shown. Click here to find out more.
|
/test mysql-test |
@tiancaiamao: Cannot trigger testing until a trusted user reviews the PR and leaves an In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test mysql-test |
@tiancaiamao: Cannot trigger testing until a trusted user reviews the PR and leaves an In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bb7133, xhebox, zimulala The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@@ -56,6 +57,8 @@ type ClientDiscover struct { | |||
// See https://github.com/grpc/grpc-go/issues/5321 | |||
*grpc.ClientConn | |||
} | |||
// version is increased in every ResetConn() to make the operation safe. | |||
version uint64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use atomic.Uint64
?
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
In response to a cherrypick label: new pull request created to branch |
In response to a cherrypick label: new pull request created to branch |
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
In response to a cherrypick label: new pull request created to branch |
In response to a cherrypick label: new pull request created to branch |
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
What problem does this PR solve?
Issue Number: close #50519
Problem Summary:
What changed and how does it work?
The old logic is not concurrency-safe. Although some of the fields are protected by mutex, the logic is not correct.
Imagine that there are 7K concurrent Alloc() operation, and one of them meet rpc error.
Then one
ResetConn()
is called, which invokesgrpcConn.Close()
.Note, there are many ongoing calling of
Alloc()
and still using the conn, so close the grpcConn cause this error:This error in turn cause
ResetConn()
calling.Even though
GetClient()
get a new client, the new client might be reset again by theResetConn()
mistakenly!So the client can not recover from the error automatically some times (when the concurrent contention is high enough).
To summarize, there are too things we should avoid:
grpcConn.Close()
inResetConn()
when the grpcConn might still been using by some other sessions.ResetConn()
could be called multiple times and mistakenly reset the new connection.Check List
Tests
Rename this to cmd/autoid/main.go and go run it (modify tidb/pkg/autoid_service/autoid.go and mock 1/1000 percent error rate also).
main.txt
Use the test described in #50519, after the fix, the QPS become stable:
And the error message like this is gone:
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.