Restore e2e test failures on cluster health retrieval #1805

Merged
sebgl merged 6 commits into elastic:master on Oct 1, 2019

Conversation

sebgl
Contributor

@sebgl sebgl commented Sep 26, 2019

We disabled error reporting (which leads to an e2e test failure) when we cannot retrieve the cluster health.
A valid case for not being able to retrieve the cluster health is when we're restarting or removing the master node of a cluster: during leader election, the cluster is temporarily unavailable to such requests. This is much better with v7 clusters and zen2, where the unavailability window is a lot smaller, but it can still happen.

To solve that, let's only report health check request failures that happen contiguously for more than a threshold (1 minute here).

Fixes #614.
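For illustration, here is a minimal sketch (in Go) of what such a contiguous-failure tracker could look like. The package, type and method names are hypothetical; the actual code in this PR may differ.

package e2e // hypothetical package name, for illustration only

import "time"

// contiguousUnavailability tracks how long cluster health requests have been failing in a row.
type contiguousUnavailability struct {
    start     time.Time     // zero while the cluster is reachable
    threshold time.Duration // e.g. 1 * time.Minute
}

// markUnavailable records the beginning of a failure streak (first failure only).
func (cu *contiguousUnavailability) markUnavailable() {
    if cu.start.IsZero() {
        cu.start = time.Now()
    }
}

// markAvailable resets the streak as soon as a health request succeeds again.
func (cu *contiguousUnavailability) markAvailable() {
    cu.start = time.Time{}
}

// hasExceededThreshold reports whether failures have been happening contiguously
// for at least the configured threshold.
func (cu *contiguousUnavailability) hasExceededThreshold() bool {
    if cu.start.IsZero() {
        return false
    }
    return time.Since(cu.start) >= cu.threshold
}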

The first commit's message described the initial approach: restore that check, but make sure we ignore any errors resulting from a v6 cluster upgrade.
@sebgl sebgl added the >test (Related to unit/integration/e2e tests) label on Sep 26, 2019
@barkbay
Contributor

barkbay commented Sep 26, 2019

Thank you, forgot that one 😳
I will try to run it a few times on a test cluster.

@@ -147,8 +154,13 @@ func (hc *ContinuousHealthCheck) Start() {
defer cancel()
health, err := hc.esClient.GetClusterHealth(ctx)
if err != nil {
    // TODO: Temporarily account only red clusters, see https://github.com/elastic/cloud-on-k8s/issues/614
    // hc.AppendErr(err)
    if IsMutationFromV6Cluster(hc.b) {
Collaborator

I think technically the same thing can happen on a 7.x cluster (as you mentioned in the issue description), especially under load. It's just that the likelihood of such an event is much, much lower because master elections are typically sub-second in 7.x, iiuc.

I wonder if we will be able to observe this short unavailability in the tests.

Contributor Author

I don't know. Maybe worth waiting to see if that ever happens?

@barkbay
Contributor

barkbay commented Sep 27, 2019

I ran the test a few times and it has always failed for ES 7.x:

    --- FAIL: TestMutationMdiToDedicated/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (0.00s)
        steps_mutation.go:95:
            	Error Trace:	steps_mutation.go:95
            	Error:      	Not equal:
            	            	expected: 0
            	            	actual  : 1
            	Test:       	TestMutationMdiToDedicated/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
        steps_mutation.go:97: Elasticsearch cluster health check failure at 2019-09-27 12:41:32.771121224 +0000 UTC m=+105.114006423: Get https://test-mutation-mdi-to-dedicated-trcf-es-http.e2e-xna3f-mercury.svc:9200/_cluster/health: dial tcp 10.59.247.229:9200: i/o timeout
FAIL
FAIL	github.com/elastic/cloud-on-k8s/test/e2e/es	109.030s
    --- FAIL: TestMutationMdiToDedicated/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (0.00s)
        steps_mutation.go:95:
            	Error Trace:	steps_mutation.go:95
            	Error:      	Not equal:
            	            	expected: 0
            	            	actual  : 1
            	Test:       	TestMutationMdiToDedicated/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
        steps_mutation.go:97: Elasticsearch cluster health check failure at 2019-09-27 14:09:14.06069233 +0000 UTC m=+86.036044375: Get https://test-mutation-mdi-to-dedicated-4fgv-es-http.e2e-dbenc-mercury.svc:9200/_cluster/health: dial tcp 10.59.252.175:9200: connect: connection refused

@sebgl
Contributor Author

sebgl commented Sep 27, 2019

Thanks for testing @barkbay! I cannot reproduce the failure locally :(

@sebgl
Contributor Author

sebgl commented Sep 27, 2019

I guess we'll have to simply ignore errors then? It's pretty hard to detect when the master was killed and ignore errors only at that moment from the e2e tests' point of view.

@pebrc
Collaborator

pebrc commented Sep 27, 2019

I guess we'll have to simply ignore errors then?

Can we not derive from the type of mutation whether or not a short downtime is to be expected? E.g. rolling master nodes: downtime OK; rolling data nodes: not OK.

@sebgl
Contributor Author

sebgl commented Sep 27, 2019

Can we not derive from the type of mutation whether or not a short downtime is to be expected?

Indeed, that sounds feasible in theory.
We already have an imperfect version of this check here:

// attempt to detect a rolling upgrade scenario
// Important: this only checks ES version and spec, other changes such as secure settings updates
// are tricky to capture and ignored here.
isVersionUpgrade := initial.Elasticsearch.Spec.Version != b.Elasticsearch.Spec.Version
httpOptionsChange := reflect.DeepEqual(initial.Elasticsearch.Spec.HTTP, b.Elasticsearch.Spec.HTTP)
for _, initialNs := range initial.Elasticsearch.Spec.Nodes {
    for _, mutatedNs := range b.Elasticsearch.Spec.Nodes {
        if initialNs.Name == mutatedNs.Name &&
            (isVersionUpgrade || httpOptionsChange || !reflect.DeepEqual(initialNs, mutatedNs)) {
            // a rolling upgrade is scheduled for that NodeSpec
            // we need at least 1 replica per shard for the cluster to remain green during the operation
            return 1
        }
    }
}

In practice, I think most of our rolling upgrade E2E tests have master+data nodes though, for convenience. Not sure the data vs. master distinction brings much, but still worth doing it right :)

@barkbay
Contributor

barkbay commented Sep 27, 2019

So it has just worked 3 times in a row with 7.3 (I have been testing with 7.1 and 7.2 so far); maybe there are some improvements in ES 7.3.

Anyway, IIRC it is expected to have a red status when a master leaves the cluster. It is supposed to affect the cluster for a very short time, but it happens.
One solution would be to tolerate a few errors: as you can see it is a matter of 1 error, so maybe tolerate 2 or 3 errors and fail if there are more?

@sebgl
Contributor Author

sebgl commented Sep 27, 2019

Discussed outside this PR: maybe change the implementation so we allow at most 1 minute of contiguous errors (corresponding to a maximum allowed leader election duration). If they last longer, report the error.

@sebgl sebgl force-pushed the re-enable-health-errors branch from 425d0a7 to 22d8369 on September 30, 2019 13:10
@sebgl
Contributor Author

sebgl commented Sep 30, 2019

I changed the implementation to allow health check HTTP requests to return errors, as long as those errors do not happen continuously for more than 60 seconds. The goal here is to allow normal leader elections to happen, but to catch leader elections that take too long (> 60 sec).

A race condition may still occur if we kill the master node e.g. 3 times in a row during a rolling upgrade, in such a way that the e2e tests don't have time to catch up in between.
I changed the default timeout to 5 seconds to minimise the chances that this occurs.
Since we recently changed the readiness check to be much more permissive, I observed the cluster being unavailable for a shorter period of time (no need to wait for pods to see the master before they end up in the service endpoints). So while the race can definitely still happen, I'd say it's rather unlikely. I'm open to suggestions to improve that.
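For reference, a rough sketch of how the per-tick health check could gate error reporting on such a contiguous-failure tracker. GetClusterHealth, AppendErr and the 5-second timeout come from this PR; checkOnce, the cu field and the red-status comparison are illustrative assumptions, and the usual context/errors/time imports are assumed.

// checkOnce is a hypothetical per-tick body of the continuous health check,
// showing how a contiguous-unavailability tracker could gate error reporting.
func (hc *ContinuousHealthCheck) checkOnce() {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    health, err := hc.esClient.GetClusterHealth(ctx)
    if err != nil {
        // the cluster may legitimately be unreachable during a leader election:
        // only report the failure once errors have lasted longer than the threshold
        hc.cu.markUnavailable()
        if hc.cu.hasExceededThreshold() {
            hc.AppendErr(err)
        }
        return
    }
    hc.cu.markAvailable()
    if health.Status == "red" { // exact status type depends on the ES client used
        hc.AppendErr(errors.New("cluster health is red"))
    }
}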

@sebgl sebgl requested review from pebrc and barkbay September 30, 2019 13:10
Collaborator

@pebrc pebrc left a comment

LGTM!

if cu.start.IsZero() {
    return false
}
return time.Now().Sub(cu.start) >= cu.threshold
Collaborator

linter says time.Since is better :-)
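(In other words, the last line of the snippet above could become return time.Since(cu.start) >= cu.threshold; time.Since(t) is shorthand for time.Now().Sub(t).)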

@sebgl
Contributor Author

sebgl commented Oct 1, 2019

Jenkins test this please.

1 similar comment

@sebgl sebgl merged commit 5275cf8 into elastic:master Oct 1, 2019
Labels
>test (Related to unit/integration/e2e tests)
Development

Successfully merging this pull request may close these issues.

Investigating TestMutationMdiToDedicated failure
3 participants