Restore e2e test failures on cluster health retrieval #1805

Merged
sebgl merged 6 commits into elastic:master on Oct 1, 2019

Conversation

sebgl
Contributor

@sebgl sebgl commented Sep 26, 2019

We disabled error reporting (which leads to an e2e test failure) when we cannot retrieve the cluster health.
A valid case for not being able to retrieve the cluster health is when we're restarting or removing the master node of a cluster: during leader election, the cluster is temporarily unavailable to such requests. This is much better with v7 clusters and zen2, where the unavailability window is a lot smaller, but it can still happen.

To solve that, let's only report health check request failures that happen contiguously for more than a threshold (1 minute here).

Fixes #614.
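For illustration, here is a minimal sketch (in Go) of what such a contiguous-failure tracker could look like. The package, type and method names are hypothetical; the actual code in this PR may differ.

package e2e // hypothetical package name, for illustration only

import "time"

// contiguousUnavailability tracks how long cluster health requests have been failing in a row.
type contiguousUnavailability struct {
    start     time.Time     // zero while the cluster is reachable
    threshold time.Duration // e.g. 1 * time.Minute
}

// markUnavailable records the beginning of a failure streak (first failure only).
func (cu *contiguousUnavailability) markUnavailable() {
    if cu.start.IsZero() {
        cu.start = time.Now()
    }
}

// markAvailable resets the streak as soon as a health request succeeds again.
func (cu *contiguousUnavailability) markAvailable() {
    cu.start = time.Time{}
}

// hasExceededThreshold reports whether failures have been happening contiguously
// for at least the configured threshold.
func (cu *contiguousUnavailability) hasExceededThreshold() bool {
    if cu.start.IsZero() {
        return false
    }
    return time.Since(cu.start) >= cu.threshold
}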

The first commit's message described the initial approach: restore that check, but make sure we ignore any errors resulting from a v6 cluster upgrade.
@sebgl sebgl added the >test (Related to unit/integration/e2e tests) label on Sep 26, 2019
@barkbay
Contributor

barkbay commented Sep 26, 2019

Thank you, forgot that one 😳
I will try to run it a few times on a test cluster.

@@ -147,8 +154,13 @@ func (hc *ContinuousHealthCheck) Start() {
defer cancel()
health, err := hc.esClient.GetClusterHealth(ctx)
if err != nil {
    // TODO: Temporarily account only red clusters, see https://github.com/elastic/cloud-on-k8s/issues/614
    // hc.AppendErr(err)
    if IsMutationFromV6Cluster(hc.b) {
Collaborator

I think technically the same thing can happen on a 7.x cluster (as you mentioned in the issue description), especially under load. It's just that the likelihood of such an event is much, much lower because master elections are typically sub-second in 7.x, iiuc.

I wonder if we will be able to observe this short unavailability in the tests.

Contributor Author

I don't know. Maybe worth waiting to see if that ever happens?

@barkbay
Contributor

barkbay commented Sep 27, 2019

I ran the test a few times and it has always failed for ES 7.x:

    --- FAIL: TestMutationMdiToDedicated/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (0.00s)
        steps_mutation.go:95:
            	Error Trace:	steps_mutation.go:95
            	Error:      	Not equal:
            	            	expected: 0
            	            	actual  : 1
            	Test:       	TestMutationMdiToDedicated/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
        steps_mutation.go:97: Elasticsearch cluster health check failure at 2019-09-27 12:41:32.771121224 +0000 UTC m=+105.114006423: Get https://test-mutation-mdi-to-dedicated-trcf-es-http.e2e-xna3f-mercury.svc:9200/_cluster/health: dial tcp 10.59.247.229:9200: i/o timeout
FAIL
FAIL	github.com/elastic/cloud-on-k8s/test/e2e/es	109.030s
    --- FAIL: TestMutationMdiToDedicated/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (0.00s)
        steps_mutation.go:95:
            	Error Trace:	steps_mutation.go:95
            	Error:      	Not equal:
            	            	expected: 0
            	            	actual  : 1
            	Test:       	TestMutationMdiToDedicated/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
        steps_mutation.go:97: Elasticsearch cluster health check failure at 2019-09-27 14:09:14.06069233 +0000 UTC m=+86.036044375: Get https://test-mutation-mdi-to-dedicated-4fgv-es-http.e2e-dbenc-mercury.svc:9200/_cluster/health: dial tcp 10.59.252.175:9200: connect: connection refused

@sebgl
Contributor Author

sebgl commented Sep 27, 2019

Thanks for testing @barkbay! I cannot reproduce the failure locally :(

@sebgl
Contributor Author

sebgl commented Sep 27, 2019

I guess we'll have to simply ignore errors then? It's pretty hard to detect when the master was killed and ignore errors only at that moment from the e2e tests' point of view.

@pebrc
Collaborator

pebrc commented Sep 27, 2019

I guess we'll have to simply ignore errors then?

Can we not derive from the type of mutation whether or not a short downtime is to be expected? E.g. rolling master nodes: downtime OK; rolling data nodes: not OK.

@sebgl
Contributor Author

sebgl commented Sep 27, 2019

Can we not derive from the type of mutation whether or not a short downtime is to be expected?

Indeed, that sounds feasible in theory.
We already have an imperfect version of this check here:

// attempt to detect a rolling upgrade scenario
// Important: this only checks ES version and spec, other changes such as secure settings updates
// are tricky to capture and ignored here.
isVersionUpgrade := initial.Elasticsearch.Spec.Version != b.Elasticsearch.Spec.Version
httpOptionsChange := reflect.DeepEqual(initial.Elasticsearch.Spec.HTTP, b.Elasticsearch.Spec.HTTP)
for _, initialNs := range initial.Elasticsearch.Spec.Nodes {
    for _, mutatedNs := range b.Elasticsearch.Spec.Nodes {
        if initialNs.Name == mutatedNs.Name &&
            (isVersionUpgrade || httpOptionsChange || !reflect.DeepEqual(initialNs, mutatedNs)) {
            // a rolling upgrade is scheduled for that NodeSpec
            // we need at least 1 replica per shard for the cluster to remain green during the operation
            return 1
        }
    }
}

In practice, I think most of our rolling upgrade E2E tests have master+data nodes though, for convenience. Not sure the data vs. master distinction brings much, but still worth doing it right :)

@barkbay
Contributor

barkbay commented Sep 27, 2019

So it has just worked 3 times in a row with 7.3 (I have been testing with 7.1 and 7.2 so far); maybe there are some improvements in ES 7.3.

Anyway, IIRC it is expected to have a red status when a master leaves the cluster. It is supposed to affect the cluster for a very short time, but it happens.
One solution would be to tolerate a few errors: as you can see it is a matter of 1 error, so maybe tolerate 2 or 3 errors and fail if there are more?

@sebgl
Contributor Author

sebgl commented Sep 27, 2019

Discussed outside this PR: maybe change the implementation so we allow at most 1 minute of contiguous errors (corresponding to a maximum allowed leader election duration). If they last longer, report the error.

@sebgl sebgl force-pushed the re-enable-health-errors branch from 425d0a7 to 22d8369 on September 30, 2019 13:10
@sebgl
Contributor Author

sebgl commented Sep 30, 2019

I changed the implementation to allow health check HTTP requests to return errors, as long as those errors do not happen continuously for more than 60 seconds. The goal here is to allow normal leader elections to happen, but to catch leader elections that take too long (> 60 sec).

A race condition may still occur if we kill the master node e.g. 3 times in a row during a rolling upgrade, in such a way that the e2e tests don't have time to catch up in between.
I changed the default timeout to 5 seconds to minimise the chances that this occurs.
Since we recently changed the readiness check to be much more permissive, I observed the cluster being unavailable for a shorter period of time (no need to wait for pods to see the master before they end up in the service endpoints). So while the race can definitely still happen, I'd say it's rather unlikely. I'm open to suggestions to improve that.
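For reference, a rough sketch of how the per-tick health check could gate error reporting on such a contiguous-failure tracker. GetClusterHealth, AppendErr and the 5-second timeout come from this PR; checkOnce, the cu field and the red-status comparison are illustrative assumptions, and the usual context/errors/time imports are assumed.

// checkOnce is a hypothetical per-tick body of the continuous health check,
// showing how a contiguous-unavailability tracker could gate error reporting.
func (hc *ContinuousHealthCheck) checkOnce() {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    health, err := hc.esClient.GetClusterHealth(ctx)
    if err != nil {
        // the cluster may legitimately be unreachable during a leader election:
        // only report the failure once errors have lasted longer than the threshold
        hc.cu.markUnavailable()
        if hc.cu.hasExceededThreshold() {
            hc.AppendErr(err)
        }
        return
    }
    hc.cu.markAvailable()
    if health.Status == "red" { // exact status type depends on the ES client used
        hc.AppendErr(errors.New("cluster health is red"))
    }
}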

@sebgl sebgl requested review from pebrc and barkbay September 30, 2019 13:10
Collaborator

@pebrc pebrc left a comment

LGTM!

if cu.start.IsZero() {
    return false
}
return time.Now().Sub(cu.start) >= cu.threshold
Collaborator

linter says time.Since is better :-)
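(In other words, the last line of the snippet above could become return time.Since(cu.start) >= cu.threshold; time.Since(t) is shorthand for time.Now().Sub(t).)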

@sebgl
Contributor Author

sebgl commented Oct 1, 2019

Jenkins test this please.

1 similar comment

@sebgl sebgl merged commit 5275cf8 into elastic:master Oct 1, 2019
Labels
>test (Related to unit/integration/e2e tests)
Development

Successfully merging this pull request may close these issues.

Investigating TestMutationMdiToDedicated failure
3 participants