tests: Ensure healthy cluster before and after robustness failpoint #15604
Conversation
(branch force-pushed: e4286cf → 5252324 → bd63c0a → 3f249c9 → f292ebd; 90c74f8 → e83703b)
I don't understand why we need this (thanks for @jmhbnz's effort anyway).
Please read #15595: we injected the failpoint on one member, but other members crashed. This is unexpected and should be detected by the failpoint code, as we cannot say that failpoint injection succeeded if the cluster was unhealthy before or after it.
ping @ptabor
Based on the discussion in #15595, it's because the proxy layer has an issue. Shouldn't the proxy layer be fixed? Whether in a production or test environment, if a member crashes unexpectedly it is a critical or major issue and we should fix it. Adding more protection may not be good, because we may regard it as a flaky case and just retry, thereby hiding the real issue.
No, the trigger was the proxy blackholing, but for the robustness tests the problem was that etcd followers crashed and the test didn't notice it, because the tests do not expect the whole cluster to be down.
This is an unexpected error, so the tests should not retry it but exit immediately, and this is what @jmhbnz implemented. We mark the test as failed. Please ask about the code instead of making an incorrect assumption. The design was also discussed in #15596 (comment).
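For illustration only (a sketch, not the PR's exact code; the helper and variable names here are assumptions, and the health-verification helper is like the one sketched at the end of this page), the intended behavior around injection is roughly:

```go
// Hypothetical sketch (names assumed): an unhealthy cluster before or after
// failpoint injection fails the test immediately rather than being retried.
if err := verifyClusterHealth(ctx, memberEndpoints); err != nil {
	t.Fatalf("failed to verify cluster health before failpoint injection: %v", err)
}
injectFailpoint(ctx, t) // hypothetical injection step
if err := verifyClusterHealth(ctx, memberEndpoints); err != nil {
	t.Fatalf("failed to verify cluster health after failpoint injection: %v", err)
}
```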
LGTM. Thank you.
I think it's better to have even redundant sources of signal and fail the tests early if anything is not going as expected.
```go
defer clusterClient.Close()

cli := healthpb.NewHealthClient(clusterClient.ActiveConnection())
resp, err := cli.Check(ctx, &healthpb.HealthCheckRequest{})
```
Potentially we should have a 'helper' retrier (3 attempts) around such semi-gRPC code, in case of connection flakiness.
We might monitor it for flakes... but intuitively there will be some (even though it's 'localhost' communication).
If there are flakes we are sure to discover them in nightly tests. We can consider it as a followup.
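If flakes do show up, a small retrier along the lines @ptabor suggests could wrap the check. A minimal sketch, assuming a plain helper function; the name, attempt count, and backoff are illustrative and not part of this PR:

```go
package robustness

import (
	"context"
	"fmt"
	"time"

	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// checkWithRetry retries the gRPC health check a few times to ride out transient
// connection flakiness before treating the member as unhealthy.
func checkWithRetry(ctx context.Context, cli healthpb.HealthClient, attempts int) (*healthpb.HealthCheckResponse, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := cli.Check(ctx, &healthpb.HealthCheckRequest{})
		if err == nil {
			return resp, nil
		}
		lastErr = err
		time.Sleep(100 * time.Millisecond) // brief pause between attempts
	}
	return nil, fmt.Errorf("health check failed after %d attempts: %w", attempts, lastErr)
}
```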
(branch force-pushed: f3c61ab → 3682955 → 1227754; commit "…nts." Signed-off-by: James Blair <mail@jamesblair.net>)
We need a way to verify that the cluster is healthy before and after injecting failpoints in robustness tests, so we can surface these errors and ensure the watch does not wait indefinitely, causing the robustness suite to fail.
Fixes: #15596
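For reference, a minimal sketch of the kind of health verification discussed above. The function name, signature, and member-endpoint plumbing are assumptions for illustration; the gRPC calls mirror the reviewed snippet.

```go
package robustness

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// verifyClusterHealth dials every member and issues a gRPC health check, returning
// an error (rather than retrying) if any member is unreachable or not SERVING.
func verifyClusterHealth(ctx context.Context, memberEndpoints [][]string) error {
	for i, endpoints := range memberEndpoints {
		err := func() error {
			clusterClient, err := clientv3.New(clientv3.Config{
				Endpoints:   endpoints,
				DialTimeout: 5 * time.Second,
			})
			if err != nil {
				return fmt.Errorf("creating client: %w", err)
			}
			defer clusterClient.Close()

			cli := healthpb.NewHealthClient(clusterClient.ActiveConnection())
			resp, err := cli.Check(ctx, &healthpb.HealthCheckRequest{})
			if err != nil {
				return fmt.Errorf("health check RPC failed: %w", err)
			}
			if resp.Status != healthpb.HealthCheckResponse_SERVING {
				return fmt.Errorf("unexpected health status: %s", resp.Status)
			}
			return nil
		}()
		if err != nil {
			return fmt.Errorf("member %d unhealthy: %w", i, err)
		}
	}
	return nil
}
```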