*: LeaseTimeToLive returns error if leader changed #17642
When the old leader demotes its lessor, the expiry time of every lease is updated. Instead of returning an incorrect remaining TTL, we should return an error to force the client to retry. Signed-off-by: Wei Fu <fuweid89@gmail.com>
/lgtm
/lgtm
It's also a minor bug; we should backport the fix to 3.4 and 3.5.
Not sure about a solution here, what are we trying to fix? The flaking If it's the first problem, then we should just retry in the test. If it's the second, we need to make the request linearizable in a similar way to what we did for Lease expiration #16822: we need to do a quorum read. I think the presented solution just gives an illusion of the problem being solved. What are the chances that a leader is changed between checking
I think we need to discuss the issue more.
I would say it's a data race issue. The // from test case log
2024-02-28T11:10:09.1624683Z logger.go:130: 2024-02-28T11:05:04.811Z INFO m1.raft 62d1ff821e702f1 became follower at term 3 {"member": "m1"}
2024-02-28T11:10:09.1626973Z logger.go:130: 2024-02-28T11:05:04.811Z INFO m1.raft raft.node: 62d1ff821e702f1 lost leader 62d1ff821e702f1 at term 3 {"member": "m1"}
2024-02-28T11:10:09.1628294Z lease_test.go:180:
2024-02-28T11:10:09.1629600Z Error Trace: /home/runner/actions-runner/_work/etcd/etcd/tests/common/lease_test.go:180
2024-02-28T11:10:09.1631871Z /home/runner/actions-runner/_work/etcd/etcd/tests/framework/testutils/execute.go:38
2024-02-28T11:10:09.1634051Z /home/runner/actions-runner/_work/_tool/go/1.21.6/arm64/src/runtime/asm_arm64.s:1197
2024-02-28T11:10:09.1635263Z Error: "2" is not greater than "9223372036"
2024-02-28T11:10:09.1636251Z Test: TestLeaseGrantKeepAliveOnce/PeerAutoTLS
Well, the CI runner is a resource-limited VM and I remember we hit this many times. The window is small, but it is actually a data race issue.
It's hard to write a regression test case for a data race issue. The sleep is only used to create the race timing for test purposes. The race is possible in production because two goroutines run in the background. The old leader's raft node processes a message and demotes all the leases. For the lease's remaining TTL (the granted TTL is 10s), there are several possible responses:
It should not be 9223372036. I don't think we should ignore it. It's a minor bug, but it doesn't make sense to just rerun the failed cases.
So this is a test issue, we should retry. As you said, the leader can change at any moment; by adding a second check you just increase the window where we would detect a leader change, but the leader can still change after What I'm trying to say is that, you can add as many checks to
If we choose to retry, what is the condition for retry? I don't think it should be Based on the current design, the lease's remaining TTL will be reset after the leader changes. Even if the leader changed after
IMO, we just need to ensure that the response is valid before returning.
Would you mind sharing an idea of how to fix it? In this patch, I force the server to return a retryable error to the client. Thanks
OK, so the issue you are fixing is the race between reading the lease TTL and a leader change causing a reset of TTLs. Makes sense. Thanks @ahrtr
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
Fixes: #17506