Investigating TestMutationMdiToDedicated failure #614
Comments
I think it was at some point, but then we added more checks (like making sure the cluster stayed green) and the tests started failing for other reasons that needed some time and investigation. It's nice that you're investigating this 👍 Looks like we cannot avoid the master election downtime?
As a side note, I don't understand why the minimum master nodes value is always updated to 1, since at some point there may be 2 masters in flight. Is there not a risk of a split brain?
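For reference, a minimal sketch of the usual zen1 quorum rule, floor(n/2) + 1: with 2 master-eligible nodes the safe value is 2, so a value of 1 would let each node form its own majority.

```go
package main

import "fmt"

// minimumMasterNodes returns the zen1 quorum for a given number of
// master-eligible nodes: floor(n/2) + 1.
func minimumMasterNodes(masterEligible int) int {
	return masterEligible/2 + 1
}

func main() {
	fmt.Println(minimumMasterNodes(1)) // 1
	fmt.Println(minimumMasterNodes(2)) // 2 -- a value of 1 here allows two independent majorities (split brain)
	fmt.Println(minimumMasterNodes(3)) // 2
}
```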
It's possible we could have a bug here! Looking at where we update `minimum_master_nodes` in the code:
What we don't do is update the `minimum_master_nodes` setting dynamically for nodes that are already running. Trying to fit this to the e2e test where we mutate from 1 MDI (master+data+ingest) node to 1 dedicated master and 1 data node:
I think we should set `minimum_master_nodes` to 2 as soon as the second master-eligible node is created. @nkvoll @pebrc what do you think? I think we did have this right at some point, then made some changes in the code, and now I'm not sure it is behaving correctly.
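As an illustration of the dynamic update being discussed, a sketch (not the operator's actual code; the esURL parameter and function name are made up for the example) of changing the setting at runtime through the cluster settings API:

```go
package esutil

import (
	"fmt"
	"net/http"
	"strings"
)

// setMinimumMasterNodes updates the dynamic discovery.zen.minimum_master_nodes
// setting through the Elasticsearch cluster settings API.
func setMinimumMasterNodes(esURL string, n int) error {
	body := fmt.Sprintf(`{"persistent": {"discovery.zen.minimum_master_nodes": %d}}`, n)
	req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings", strings.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("cluster settings update failed: %s", resp.Status)
	}
	return nil
}
```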
But if we do it too late I think this can lead to a split-brain situation.
Regarding the election process, I have enabled trace logs; they are available here. For information:
If we have a closer look at what is happening on the second master-eligible node:
From what I understand, a new election is triggered quickly, but it takes some time to apply the new cluster state because it looks like the new master is waiting for an ACK from the former one.
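For context, these are, to the best of my knowledge, the zen1 settings that bound how long a departed master can hold up cluster state changes (the values shown are the commonly cited 6.x defaults, not necessarily what this cluster used):

```go
package esutil

// Zen1 (6.x) discovery settings that bound how long the cluster waits on a
// master that has gone away. Values are the commonly cited defaults.
var zenTimeouts = map[string]string{
	"discovery.zen.fd.ping_interval": "1s",  // how often nodes and master ping each other
	"discovery.zen.fd.ping_timeout":  "30s", // how long each fault-detection ping waits for a reply
	"discovery.zen.fd.ping_retries":  "3",   // failed pings tolerated before the node is declared gone
	"discovery.zen.publish_timeout":  "30s", // how long a cluster state publication waits for acks
}
```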
Summary of the discussion I had with @boaz on that case:
TODO:
If there is nothing we can really do about that, we could keep the test but flag it as "slow".
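If we go that way, one conventional pattern for flagging a Go test as slow (a sketch, assuming the e2e suite runs through plain `go test`) is to skip it when `-short` is set:

```go
package e2e

import "testing"

func TestMutationMdiToDedicated(t *testing.T) {
	if testing.Short() {
		// Flagged as slow: skipped when the suite is run with `go test -short`.
		t.Skip("skipping slow e2e test in short mode")
	}
	// ... the actual mutation steps would run here ...
}
```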
I think that "something" is that when a container (pod?) shuts down its network interface goes away pretty much immediately, meaning that reconnection attempts receive no response. Contrast this to a process running outside of a container where the process can shut down but the network interface will live on, at least for long enough to reject any further traffic to the shut-down node. I think this is consistent with what we've seen elsewhere (see e.g. elastic/helm-charts#63 or helm/charts#8785). There are a few workarounds that I know of right now:
Re. #614 (comment), unfortunately there's no 100% safe way to add or remove master-eligible nodes from a 6.x cluster with one or two master nodes. You can make it slightly safer than the process described there by starting the new master with `minimum_master_nodes: 2` in its configuration. Then I agree that it'd be good to update the dynamic `minimum_master_nodes` setting to 2 as well. Similarly, when shutting the old master node down: reduce the dynamic `minimum_master_nodes` setting back to 1 first.
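A sketch of that ordering; the three function parameters are hypothetical placeholders for operator actions, not APIs from this repository:

```go
package esutil

// replaceSingleMaster sketches the grow-then-shrink ordering described above
// for replacing the only master-eligible node of a 6.x cluster.
func replaceSingleMaster(
	startNewMaster func() error, // start the new master with minimum_master_nodes: 2 in its own config
	setMinimumMasterNodes func(n int) error, // dynamic cluster settings update
	stopOldMaster func() error,
) error {
	// Grow: the new master joins with the static setting already at 2,
	// then the dynamic setting is raised to 2 for the whole cluster.
	if err := startNewMaster(); err != nil {
		return err
	}
	if err := setMinimumMasterNodes(2); err != nil {
		return err
	}
	// Shrink: lower the dynamic setting back to 1 *before* the old master
	// leaves, so the remaining master can still win an election on its own.
	if err := setMinimumMasterNodes(1); err != nil {
		return err
	}
	return stopOldMaster()
}
```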
The e2e test `TestMutationMdiToDedicated` seems to consistently fail; on my laptop I have dozens of errors like this one during the step `Cluster_health_should_not_have_been_red_during_mutation_process` (the name of the test is misleading, as any error, e.g. a timeout, will trigger a failure):
I have tried to understand why the test is failing; here is what I found:
The first interesting thing is that the operator may remove the last pod in a `Ready` state (from a K8s point of view), leaving the Kubernetes service without any endpoint to route a request to:
I guess it is because the K8s state is not taken into account when the operator checks whether it can delete a pod:
https://github.com/elastic/k8s-operators/blob/bac6e037efac300d51c0f68c71942121ba7627ef/operators/pkg/controller/elasticsearch/mutation/podrestrictions.go#L50-L55
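For illustration, a minimal sketch (not the operator's actual code; the helper names are made up) of a check that would refuse to delete the last pod in a `Ready` state:

```go
package example

import corev1 "k8s.io/api/core/v1"

// isReady reports whether the pod's Ready condition is True.
func isReady(pod corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// canDelete refuses the deletion if the candidate is the only pod currently
// Ready, so the Kubernetes service always keeps at least one endpoint.
func canDelete(candidate corev1.Pod, allPods []corev1.Pod) bool {
	if !isReady(candidate) {
		return true // removing a non-ready pod cannot remove the last endpoint
	}
	ready := 0
	for _, p := range allPods {
		if isReady(p) {
			ready++
		}
	}
	return ready > 1
}
```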
A first question is: should we remove a node that is the last one in a `Ready` state?
Note that this behavior is mitigated when some data is present: the data migration gives a node some time to move into a `Ready` state.
But the real problem is that when the former master leaves the cluster, it can take more than one minute for the cluster to recover:
I'm not sure I understand how this test has been successful in the past.
I think we should stop increasing the timeout and instead try to evaluate the downtime. If the recovery of the cluster can't be optimized, we should decide that, for instance, it is OK for this test to have a downtime of 90 seconds.
If after 90 seconds there are still timeouts or the cluster is red, the test should be considered failed.
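A sketch of how the step could evaluate the downtime instead of relying on a single timeout (the polling interval, URL, and budget are assumptions for illustration): poll the cluster health during the mutation, count the polls that saw a red cluster or no answer at all, and fail only if that count exceeds roughly 90 seconds' worth.

```go
package e2e

import (
	"encoding/json"
	"net/http"
	"time"
)

// redOrUnreachablePolls polls /_cluster/health roughly once per second for the
// given duration and counts how many polls saw a red cluster or no answer.
func redOrUnreachablePolls(esURL string, during time.Duration) int {
	client := &http.Client{Timeout: 2 * time.Second}
	bad := 0
	for deadline := time.Now().Add(during); time.Now().Before(deadline); time.Sleep(time.Second) {
		resp, err := client.Get(esURL + "/_cluster/health")
		if err != nil {
			bad++ // timeout or connection error counts as downtime
			continue
		}
		var health struct {
			Status string `json:"status"`
		}
		decodeErr := json.NewDecoder(resp.Body).Decode(&health)
		resp.Body.Close()
		if decodeErr != nil || health.Status == "red" {
			bad++
		}
	}
	return bad
}
```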