Opened #11913 to persist gofail failpoints across restarts, since it seems like something we should be doing. I don't think it matters for anything urgent: for #11888, the problem with testing was that there was no failpoint with "Snap" in the name between the operations that were causing a problem (which has since been fixed). But I figured I'd add this for completeness.
As part of #11888, one question we had was: why wasn't this caught by our failpoint functional testing?
I looked over the code and I think there is a simple answer. We inject failpoint failures by enabling them on a running etcd member via HTTP requests. See `failpointFailures` in etcd/functional/tester/case_failpoints.go (line 94 in f1179fd).
This works great for many of the failpoints, but for failpoints that are in the startup phase of the etcd lifecycle, there is a problem: When etcd is restarted, any failpoints set on that etcd member via HTTP requests are cleared.
This leaves only a very narrow window, between when etcd starts up and when the first few startup failpoints are reached, in which the functional tester can enable them via an HTTP request and have them exercised.
I think we didn't catch #11888 with failure testing because hitting that window is sufficiently unlikely.
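Possible ways to fix this:
1. Keep track of which failure points have been enabled on each etcd member, and when they restart, enable them via the `GOFAIL_FAILPOINTS` environment variable. (Also, when we "recover" that failure point, remember to stop including it in the environment variable.)
2. Whenever starting an etcd, inject a failure point via environment variable via some probability function.

I like #1 but am curious what others think.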
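
To make option 1 a bit more concrete, here is a rough sketch of the bookkeeping it would need. This is not code from the functional tester: `failpointTracker`, its methods, and the failpoint names are all made up for illustration, and it assumes `GOFAIL_FAILPOINTS` takes `name=term` pairs separated by `;`, which should be checked against the gofail docs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// failpointTracker is a hypothetical helper; nothing like it exists in the
// functional tester today. It remembers, per etcd member, which failpoints
// are currently enabled and the gofail term each one was enabled with.
type failpointTracker struct {
	enabled map[string]map[string]string // member -> failpoint -> term
}

func newFailpointTracker() *failpointTracker {
	return &failpointTracker{enabled: make(map[string]map[string]string)}
}

// Enable records a failpoint for a member. It would be called alongside
// the existing HTTP request that enables the failpoint.
func (t *failpointTracker) Enable(member, failpoint, term string) {
	if t.enabled[member] == nil {
		t.enabled[member] = make(map[string]string)
	}
	t.enabled[member][failpoint] = term
}

// Recover forgets a failpoint so it is not re-enabled on the next restart.
func (t *failpointTracker) Recover(member, failpoint string) {
	delete(t.enabled[member], failpoint) // no-op if the member is unknown
}

// Env returns the "GOFAIL_FAILPOINTS=..." entry to add to the member's
// environment when it is restarted, or "" if nothing is enabled.
// Names are sorted so the result is deterministic.
func (t *failpointTracker) Env(member string) string {
	fps := t.enabled[member]
	if len(fps) == 0 {
		return ""
	}
	names := make([]string, 0, len(fps))
	for name := range fps {
		names = append(names, name)
	}
	sort.Strings(names)
	pairs := make([]string, 0, len(names))
	for _, name := range names {
		pairs = append(pairs, name+"="+fps[name])
	}
	// Assumption: gofail accepts ';'-separated name=term pairs here.
	return "GOFAIL_FAILPOINTS=" + strings.Join(pairs, ";")
}

func main() {
	t := newFailpointTracker()
	// Placeholder names, not real etcd failpoints.
	t.Enable("member-1", "exampleStartupFailpoint", "panic")
	t.Enable("member-1", "exampleOtherFailpoint", "sleep(100)")
	t.Recover("member-1", "exampleOtherFailpoint")
	fmt.Println(t.Env("member-1"))
}
```

The tester would append the returned string to the environment it passes when restarting that member, so a startup failpoint is already armed before any HTTP request could reach the process.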
@gyuho @jingyih @YoyinZyc @wenjiaswe