concurrent.Session closed immediately after creation #13675
Comments
Thanks @liberize for the finding. It's a real issue, and I reproduced it in my local environment on both 3.5.2 and the main branch. It took me a couple of hours to figure out the root cause: the leader hasn't finished applying the LeaseGrantRequest when it receives the following LeaseKeepAliveRequest. The rough workflow is something like the following (assuming there are 3 members in the cluster):
Since the 3 members are all running on one local server, the communication between members is really fast. The leader finishes applying the LeaseGrantRequest only about 1-2 microseconds after it receives the LeaseKeepAliveRequest. This rarely happens in a real production environment, but it is indeed a real issue, and I think it is a generic one: when etcdserver processes a message coming from raft, it applies the change locally in parallel with replicating it to the followers (see L212-L224). So in theory it's possible that the leader hasn't finished applying the request before the client receives a SUCCESS response and sends the next request. Any thoughts on the solution, or do you think there is no need to fix this since it is unlikely to be reproduced in a real production environment? cc @ptabor @serathius @spzala @xiang90 @gyuho
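To make the ordering concrete, here is a minimal sketch of the client-side call sequence that can hit this race. It uses the standard clientv3 API; the helper name and TTL value are illustrative, and the timing comments describe the assumption above, not guaranteed behavior.

```go
import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// grantThenKeepAlive is a hypothetical helper showing the two requests
// involved in the race; cli is assumed to point at the 3-member cluster.
func grantThenKeepAlive(ctx context.Context, cli *clientv3.Client) error {
	// LeaseGrantRequest: returns once the entry is committed by quorum.
	resp, err := cli.Grant(ctx, 5)
	if err != nil {
		return err
	}
	// The leader applies the grant locally in parallel with replicating it,
	// so at this instant its lessor may not know the lease yet. An immediate
	// keep-alive can then be answered with TTL = 0, which the client treats
	// as "lease expired".
	_, err = cli.KeepAliveOnce(ctx, resp.ID) // LeaseKeepAliveRequest
	return err
}
```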
Fixed by #13690.
Let's keep this issue open until the PR is reviewed & merged.
The fix was merged and will be released in v3.5.3.
What happened?
I have an etcd cluster of 3 nodes. Each node is started like this:
Then I used this code to create a lot of clients:
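The exact snippet from the report wasn't preserved in this copy; the following is a minimal Go sketch matching the described behavior. The endpoint (172.25.21.36:2379, the leader address mentioned below) and the timeout values are assumptions; each client opens a concurrency.Session and panics if it is closed right after creation.

```go
package main

import (
	"fmt"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // the report says it panics with 10-100 clients
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			cli, err := clientv3.New(clientv3.Config{
				Endpoints:   []string{"172.25.21.36:2379"}, // assumed endpoint
				DialTimeout: 5 * time.Second,
			})
			if err != nil {
				panic(err)
			}
			defer cli.Close()

			sess, err := concurrency.NewSession(cli)
			if err != nil {
				panic(err)
			}
			defer sess.Close()

			select {
			case <-sess.Done():
				// keep-alive channel closed right after creation: the bug
				panic(fmt.Sprintf("client %d: session closed immediately", id))
			case <-time.After(10 * time.Second):
				// session stayed alive long enough; treat as success
			}
		}(i)
	}
	wg.Wait()
}
```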
It panics every time with only 10-100 clients:
What did you expect to happen?
The code above runs without problem.
How can we reproduce it (as minimally and precisely as possible)?
Use the code pasted above.
Anything else we need to know?
It seems the server returned a keep-alive response with TTL = 0, and the channel returned by Client.KeepAlive was then closed:
https://github.com/etcd-io/etcd/blob/main/client/v3/lease.go#L512
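For reference, the relevant client behavior at the linked line is roughly the following. This is a paraphrase, not a verbatim excerpt; field and method names are approximate.

```go
// Inside the lessor's keep-alive response handling (client/v3/lease.go):
if karesp.TTL <= 0 {
	// The server no longer knows the lease (here, the leader's lessor has
	// not applied the grant yet), so the client treats the lease as expired
	// and closes all keep-alive channels for it. That closed channel is
	// what makes the concurrency.Session appear "closed immediately".
	delete(l.keepAlives, karesp.ID)
	ka.close()
	return
}
```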
Wireshark confirms this:
172.25.21.36 is the leader node:
Seems like a consistency issue?
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
node-0:
node-1:
node-2:
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response