failpoint functional testing not covering etcd startup well #11898

jpbetz · 2020-05-15T21:37:40Z

As part of #11888, one question we had was why this wasn't caught by our failpoint functional testing.

I looked over the code and I think there is a simple answer. The way we inject failpoint failures is:

1. Get a list of all failpoints from etcd via an HTTP request
2. Create functional testing cases for them
3. Inject them into the cluster by doing an HTTP PUT to enable the failpoint
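
For illustration, the first and third steps above could look roughly like the sketch below. It is not the tester's actual code: it assumes a gofail-style endpoint that returns one `name=terms` line per failpoint on GET and accepts the terms as the body of a PUT to `/<name>`. The helpers `listFailpoints`, `enableFailpoint`, and `newFakeMember` are hypothetical, the fake server only stands in for a member so the example runs on its own, and the failpoint names and terms are illustrative.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// listFailpoints reads "name=terms" lines from the endpoint and returns the names.
func listFailpoints(base string) ([]string, error) {
	resp, err := http.Get(base + "/")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var names []string
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		names = append(names, strings.SplitN(line, "=", 2)[0])
	}
	return names, sc.Err()
}

// enableFailpoint PUTs the failpoint terms to /<name>.
func enableFailpoint(base, name, terms string) error {
	req, err := http.NewRequest(http.MethodPut, base+"/"+name, strings.NewReader(terms))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("enable %s: %s", name, resp.Status)
	}
	return nil
}

// newFakeMember is a hypothetical in-memory stand-in for a member's failpoint
// endpoint. It returns the server and a getter for the stored terms.
func newFakeMember(names ...string) (*httptest.Server, func(string) string) {
	var mu sync.Mutex
	state := map[string]string{}
	for _, n := range names {
		state[n] = ""
	}
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		switch r.Method {
		case http.MethodGet:
			for n, t := range state {
				fmt.Fprintf(w, "%s=%s\n", n, t)
			}
		case http.MethodPut:
			body, _ := io.ReadAll(r.Body)
			state[strings.TrimPrefix(r.URL.Path, "/")] = string(body)
			w.WriteHeader(http.StatusNoContent)
		}
	}))
	get := func(n string) string {
		mu.Lock()
		defer mu.Unlock()
		return state[n]
	}
	return srv, get
}

func main() {
	srv, terms := newFakeMember("raftBeforeSave", "raftAfterSave")
	defer srv.Close()

	fps, err := listFailpoints(srv.URL)
	if err != nil {
		panic(err)
	}
	for _, fp := range fps {
		if err := enableFailpoint(srv.URL, fp, `panic("etcd-tester")`); err != nil {
			panic(err)
		}
	}
	fmt.Println(len(fps), "failpoints enabled; raftBeforeSave =", terms("raftBeforeSave"))
}
```

Note that everything enabled this way lives only in the running process, which is what makes the restart behavior described below a problem.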

See failpointFailures:

etcd/functional/tester/case_failpoints.go

Line 94 in f1179fd

inject := makeInjectFailpoint(fp, fcmd)

This works great for many of the failpoints, but for failpoints that are in the startup phase of the etcd lifecycle, there is a problem: When etcd is restarted, any failpoints set on that etcd member via HTTP requests are cleared.

This means the functional tester has only a very narrow window, between etcd starting up and the first few startup failpoints being reached, in which it can enable them via an HTTP request and have them exercised.

I think we didn't catch #11888 with failure testing because hitting that window is sufficiently unlikely.

Possible ways to fix this:

1. Keep track of which failpoints have been enabled on each etcd member and, when a member restarts, enable them via the GOFAIL_FAILPOINTS environment variable. (Also, when we "recover" a failpoint, remember to stop including it in the environment variable.)
2. Whenever starting an etcd member, inject a failpoint via the environment variable according to some probability function.
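
A rough sketch of the bookkeeping the first option implies, under some assumptions: `failpointTracker` and its methods are hypothetical names, not part of the functional tester, and the environment value is assumed to be a `;`-separated list of `name=terms` pairs, which is my understanding of how gofail reads GOFAIL_FAILPOINTS. The failpoint names and terms are illustrative.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// failpointTracker remembers which failpoints are enabled on each member.
// Hypothetical type for illustration only.
type failpointTracker struct {
	enabled map[string]map[string]string // member -> failpoint name -> terms
}

func newFailpointTracker() *failpointTracker {
	return &failpointTracker{enabled: map[string]map[string]string{}}
}

// Enable records that a failpoint was turned on for a member.
func (t *failpointTracker) Enable(member, name, terms string) {
	if t.enabled[member] == nil {
		t.enabled[member] = map[string]string{}
	}
	t.enabled[member][name] = terms
}

// Recover forgets a failpoint so it is no longer re-applied on restart.
func (t *failpointTracker) Recover(member, name string) {
	delete(t.enabled[member], name) // deleting from a nil map is a no-op
}

// Env returns the GOFAIL_FAILPOINTS assignment to pass to the member's
// process on restart, or "" if nothing is enabled for that member.
func (t *failpointTracker) Env(member string) string {
	fps := t.enabled[member]
	if len(fps) == 0 {
		return ""
	}
	names := make([]string, 0, len(fps))
	for name := range fps {
		names = append(names, name)
	}
	sort.Strings(names) // stable output regardless of map iteration order
	pairs := make([]string, 0, len(names))
	for _, name := range names {
		pairs = append(pairs, name+"="+fps[name])
	}
	return "GOFAIL_FAILPOINTS=" + strings.Join(pairs, ";")
}

func main() {
	tr := newFailpointTracker()
	tr.Enable("member-1", "raftBeforeSave", `panic("etcd-tester")`)
	tr.Enable("member-1", "raftAfterSave", "sleep(100)")
	fmt.Println(tr.Env("member-1"))

	tr.Recover("member-1", "raftBeforeSave")
	fmt.Println(tr.Env("member-1"))
}
```

The returned string could be appended to the environment of the restarted etcd process, so startup failpoints are armed before the member executes any code.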

I like #1 but am curious what others think.

@gyuho @jingyih @YoyinZyc @wenjiaswe


jpbetz · 2020-05-18T21:13:09Z

Hmm, actually, it might be much simpler. case_failpoints.go only waits for a snapshot to occur after enabling failpoints whose names contain "Snap", and we only just added one between the snapshot file save and the WAL entry writes (https://github.com/etcd-io/etcd/blob/master/etcdserver/raft.go#L237) in #11888.
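
To make the name-based selection concrete, here is a minimal sketch of the check as I read it. `needsSnapshotWait` is a hypothetical helper, not the function in case_failpoints.go, and the failpoint names are only examples.

```go
package main

import (
	"fmt"
	"strings"
)

// needsSnapshotWait mirrors the name-based rule described above: only
// failpoints with "Snap" in their name get a "wait for snapshot" test case.
// Hypothetical helper for illustration.
func needsSnapshotWait(fp string) bool {
	return strings.Contains(fp, "Snap")
}

func main() {
	for _, fp := range []string{"raftBeforeSaveSnap", "raftAfterSaveSnap", "raftBeforeSave"} {
		fmt.Printf("%s waits for snapshot: %t\n", fp, needsSnapshotWait(fp))
	}
}
```

A failpoint that sits on the snapshot path but lacks "Snap" in its name would be enabled without the tester ever waiting for a snapshot, so it may never fire.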

jpbetz · 2020-05-18T21:23:42Z

Opened #11913 to persist gofail failpoints across restarts, since it seems like something we should be doing. I don't think it matters for anything urgent: for #11888 the problem with testing was that there was no failpoint with "Snap" in its name between the operations that were causing a problem (which has since been fixed). But I figured I'd add this for completeness.

jpbetz mentioned this issue May 18, 2020

Persist failpoints across member restart #11913

Merged

gyuho closed this as completed in #11913 Jun 4, 2020
