Restore e2e test failures on cluster health retrieval #1805
Conversation
We disabled error reporting (which was leading to e2e test failures) when we cannot retrieve the health. A "valid case" for not being able to retrieve the cluster health is when we're restarting/removing the master node of a v6 cluster. During leader election, the cluster is temporarily unavailable to such requests. This is much better with v7 clusters and zen2, where the unavailability window is a lot smaller. Let's restore that check, but make sure we ignore any errors resulting from a v6 cluster upgrade.
Thank you, forgot that one 😳
@@ -147,8 +154,13 @@ func (hc *ContinuousHealthCheck) Start() {
			defer cancel()
			health, err := hc.esClient.GetClusterHealth(ctx)
			if err != nil {
				// TODO: Temporarily account only red clusters, see https://github.com/elastic/cloud-on-k8s/issues/614
				// hc.AppendErr(err)
				if IsMutationFromV6Cluster(hc.b) {
I think technically the same thing can happen on a 7.x cluster (as you mentioned in the issue description), especially under load. It is just that the likelihood of such an event is much lower, because master elections are typically sub-second in 7.x, if I understand correctly.
I wonder if we will be able to observe this short unavailability in the tests.
I don't know. Maybe worth waiting to see if that ever happens?
I ran the test a few times and it has always failed for ES 7.x:
Thanks for testing @barkbay! I cannot reproduce the failure locally :(
I guess we'll have to simply ignore errors then? From the e2e tests' point of view, it's pretty hard to detect when the master was killed and ignore errors only at that moment.
Can we not derive from the type of mutation whether or not a short downtime is to be expected? E.g. rolling master nodes: downtime OK; rolling data nodes: not OK.
Indeed that sounds feasible in theory (see cloud-on-k8s/test/e2e/test/elasticsearch/checks_data.go, lines 162 to 176 at 6156d47).
In practice, I think most of our rolling upgrade E2E tests use master+data nodes, for convenience. Not sure the data vs. master distinction brings much, but still worth doing it right :)
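To sketch the suggestion above: a helper could inspect the mutation and report whether a master-eligible node set changes, in which case a short leader election (and a transient health check error) would be acceptable. The NodeSet type and the function below are assumptions for illustration only, not the actual e2e test builder API.

package e2e

// NodeSet is a simplified stand-in for a node set in the e2e test builder
// (hypothetical type, for illustration only).
type NodeSet struct {
	Name   string
	Master bool // master-eligible
	Data   bool
}

// masterDowntimeExpected reports whether the mutation removes or modifies
// any master-eligible node set, in which case a short leader election
// (and a transient health check failure) would be expected.
func masterDowntimeExpected(initial, mutated []NodeSet) bool {
	byName := map[string]NodeSet{}
	for _, ns := range mutated {
		byName[ns.Name] = ns
	}
	for _, ns := range initial {
		after, stillExists := byName[ns.Name]
		if ns.Master && (!stillExists || after != ns) {
			// a master-eligible node set is removed or changed
			return true
		}
	}
	return false
}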
So it has just worked 3 times in a row with 7.3 (I have been testing with 7.1 and 7.2 so far), maybe there are some improvements in ES 7.3. Anyway, IIRC it is expected to have a red status when a master leaves the cluster. It is supposed to affect the cluster for a very short time, but it happens.
Discussed outside this PR: maybe change the implementation so we allow a maximum of 1 minute of contiguous errors (corresponding to a maximum allowed leader election duration). If it lasts longer, return the error.
Only if they don't happen contiguously for more than 60sec, which should allow unavailability during leader election, but not if it lasts too long.
425d0a7 to 22d8369
I changed the implementation to allow http requests to return errors on the health check, but only if those errors do not occur contiguously for more than 60 seconds. The goal here is to allow normal leader elections to happen, but catch leader elections that take too long (> 60sec). A race condition may still occur if we kill e.g. the master node 3 times in a row during a rolling upgrade, in such a way that the e2e tests don't have time to catch up in between.
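To make that concrete, here is a minimal sketch of such a contiguous-error tracker. The start/threshold fields mirror the snippet reviewed below; the type and method names are assumptions for illustration, not necessarily the ones used in the PR.

package e2e

import "time"

// contiguousUnavailability tracks how long health check requests have been
// failing without interruption (type and method names are hypothetical).
type contiguousUnavailability struct {
	start     time.Time     // zero while the cluster is reachable
	threshold time.Duration // max allowed contiguous unavailability, e.g. 1 minute
}

// markUnavailable records the start of an unavailability window,
// unless one is already in progress.
func (cu *contiguousUnavailability) markUnavailable() {
	if cu.start.IsZero() {
		cu.start = time.Now()
	}
}

// markAvailable resets the tracker: errors are no longer contiguous.
func (cu *contiguousUnavailability) markAvailable() {
	cu.start = time.Time{}
}

// hasExceededThreshold returns true once errors have been happening
// contiguously for longer than the allowed threshold.
func (cu *contiguousUnavailability) hasExceededThreshold() bool {
	if cu.start.IsZero() {
		return false
	}
	return time.Since(cu.start) >= cu.threshold
}

In the health check loop, a failed request would call markUnavailable() and only report the error once hasExceededThreshold() returns true, while a successful request would call markAvailable() to reset the window.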
LGTM!
if cu.start.IsZero() {
	return false
}
return time.Now().Sub(cu.start) >= cu.threshold
linter says time.Since is better :-)
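For what it's worth, the two forms are equivalent: time.Since(t) is shorthand for time.Now().Sub(t). A trivial sketch (the function name is just for illustration):

package e2e

import "time"

// elapsedSince shows the two equivalent forms; the second is the one the
// linter recommends.
func elapsedSince(start time.Time) time.Duration {
	_ = time.Now().Sub(start) // flagged by the linter
	return time.Since(start)  // equivalent, preferred
}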
Jenkins test this please.
Jenkins test this please.
We disabled error reporting (leading to e2e test failure) when we cannot
retrieve the health.
A "valid case" to not being able to retrieve the cluster health is when
we're restarting/removing the master node of a cluster. During leader
election, the cluster is temporarily unavailable to such requests.
This is way better with v7 clusters and zen2, for which unavailability
is made a lot smaller, but can still happen.
To solve that, let's only report health check requests failures that
contiguously happen for more than a threshold (1 minute here).
Fixes #614.