Investigating TestMutationMdiToDedicated failure #614
Comments
I think it was at some point, but then we added more checks (like making sure the cluster stayed green) and the tests started failing for other reasons that needed some time and investigation. It's nice that you're investigating this 👍 Looks like we cannot avoid the master election downtime?
As a side note, I don't understand why the minimum master nodes value is always updated to 1, since at some point there may be 2 masters in flight. Is there not a risk of a split brain?
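For reference, a minimal sketch of the usual zen1 quorum rule, floor(n/2) + 1: with 2 master-eligible nodes the safe value is 2, so a value of 1 would let each node form its own majority.

```go
package main

import "fmt"

// minimumMasterNodes returns the zen1 quorum for a given number of
// master-eligible nodes: floor(n/2) + 1.
func minimumMasterNodes(masterEligible int) int {
	return masterEligible/2 + 1
}

func main() {
	fmt.Println(minimumMasterNodes(1)) // 1
	fmt.Println(minimumMasterNodes(2)) // 2 -- a value of 1 here allows two independent majorities (split brain)
	fmt.Println(minimumMasterNodes(3)) // 2
}
```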
It's possible we could have a bug here! Looking at where we update `minimum_master_nodes` in the code:
What we don't do is update the `minimum_master_nodes` setting dynamically for nodes that are already running. Trying to fit this to the e2e test where we mutate from 1 MDI (master+data+ingest) node to 1 dedicated master and 1 data node:
I think we should set `minimum_master_nodes` to 2 as soon as the second master-eligible node is created. @nkvoll @pebrc what do you think? I think we did have this right at some point, then made some changes in the code, and now I'm not sure it is behaving correctly.
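As an illustration of the dynamic update being discussed, a sketch (not the operator's actual code; the esURL parameter and function name are made up for the example) of changing the setting at runtime through the cluster settings API:

```go
package esutil

import (
	"fmt"
	"net/http"
	"strings"
)

// setMinimumMasterNodes updates the dynamic discovery.zen.minimum_master_nodes
// setting through the Elasticsearch cluster settings API.
func setMinimumMasterNodes(esURL string, n int) error {
	body := fmt.Sprintf(`{"persistent": {"discovery.zen.minimum_master_nodes": %d}}`, n)
	req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings", strings.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("cluster settings update failed: %s", resp.Status)
	}
	return nil
}
```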
But if we do it too late I think this can lead to a split-brain situation.
Regarding the election process, I have enabled trace logs; they are available here. For information:
If we have a closer look at what is happening on the second master-eligible node:
From what I understand, a new election is triggered quickly, but it takes some time to apply the new cluster state because it looks like the new master is waiting for an ACK from the former one.
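For context, these are, to the best of my knowledge, the zen1 settings that bound how long a departed master can hold up cluster state changes (the values shown are the commonly cited 6.x defaults, not necessarily what this cluster used):

```go
package esutil

// Zen1 (6.x) discovery settings that bound how long the cluster waits on a
// master that has gone away. Values are the commonly cited defaults.
var zenTimeouts = map[string]string{
	"discovery.zen.fd.ping_interval": "1s",  // how often nodes and master ping each other
	"discovery.zen.fd.ping_timeout":  "30s", // how long each fault-detection ping waits for a reply
	"discovery.zen.fd.ping_retries":  "3",   // failed pings tolerated before the node is declared gone
	"discovery.zen.publish_timeout":  "30s", // how long a cluster state publication waits for acks
}
```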
Summary of the discussion I had with @boaz on that case:
TODO:
If there is nothing we can really do about that, we could keep the test but flag it as "slow".
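If we go that way, one conventional pattern for flagging a Go test as slow (a sketch, assuming the e2e suite runs through plain `go test`) is to skip it when `-short` is set:

```go
package e2e

import "testing"

func TestMutationMdiToDedicated(t *testing.T) {
	if testing.Short() {
		// Flagged as slow: skipped when the suite is run with `go test -short`.
		t.Skip("skipping slow e2e test in short mode")
	}
	// ... the actual mutation steps would run here ...
}
```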
I think that "something" is that when a container (pod?) shuts down its network interface goes away pretty much immediately, meaning that reconnection attempts receive no response. Contrast this to a process running outside of a container where the process can shut down but the network interface will live on, at least for long enough to reject any further traffic to the shut-down node. I think this is consistent with what we've seen elsewhere (see e.g. elastic/helm-charts#63 or helm/charts#8785). There are a few workarounds that I know of right now:
Re. #614 (comment), unfortunately there's no 100% safe way to add or remove master-eligible nodes from a 6.x cluster with one or two master nodes. You can make it slightly safer than the process described there by starting the new master with `minimum_master_nodes: 2` in its configuration. Then I agree that it'd be good to update the dynamic `minimum_master_nodes` setting to 2 as well. Similarly, when shutting the old master node down: reduce the dynamic `minimum_master_nodes` setting back to 1 first.
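A sketch of that ordering; the three function parameters are hypothetical placeholders for operator actions, not APIs from this repository:

```go
package esutil

// replaceSingleMaster sketches the grow-then-shrink ordering described above
// for replacing the only master-eligible node of a 6.x cluster.
func replaceSingleMaster(
	startNewMaster func() error, // start the new master with minimum_master_nodes: 2 in its own config
	setMinimumMasterNodes func(n int) error, // dynamic cluster settings update
	stopOldMaster func() error,
) error {
	// Grow: the new master joins with the static setting already at 2,
	// then the dynamic setting is raised to 2 for the whole cluster.
	if err := startNewMaster(); err != nil {
		return err
	}
	if err := setMinimumMasterNodes(2); err != nil {
		return err
	}
	// Shrink: lower the dynamic setting back to 1 *before* the old master
	// leaves, so the remaining master can still win an election on its own.
	if err := setMinimumMasterNodes(1); err != nil {
		return err
	}
	return stopOldMaster()
}
```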
The e2e test `TestMutationMdiToDedicated` seems to consistently fail; on my laptop I have dozens of errors like this one during the step `Cluster_health_should_not_have_been_red_during_mutation_process` (the name of the test is misleading, as any error, e.g. a timeout, will trigger a failure):
I have tried to understand why the test is failing; here is what I found:
The first interesting thing is that the operator may remove the last pod in a `Ready` state (from a K8s point of view), leaving the Kubernetes service without any endpoint to route a request to:
I guess it is because the K8s state is not taken into account when the operator checks whether it can delete a pod:
https://github.com/elastic/k8s-operators/blob/bac6e037efac300d51c0f68c71942121ba7627ef/operators/pkg/controller/elasticsearch/mutation/podrestrictions.go#L50-L55
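For illustration, a minimal sketch (not the operator's actual code; the helper names are made up) of a check that would refuse to delete the last pod in a `Ready` state:

```go
package example

import corev1 "k8s.io/api/core/v1"

// isReady reports whether the pod's Ready condition is True.
func isReady(pod corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// canDelete refuses the deletion if the candidate is the only pod currently
// Ready, so the Kubernetes service always keeps at least one endpoint.
func canDelete(candidate corev1.Pod, allPods []corev1.Pod) bool {
	if !isReady(candidate) {
		return true // removing a non-ready pod cannot remove the last endpoint
	}
	ready := 0
	for _, p := range allPods {
		if isReady(p) {
			ready++
		}
	}
	return ready > 1
}
```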
A first question is: should we remove a node that is the last one in a `Ready` state?
Note that this behavior is mitigated when some data is present: the data migration gives a node some time to move into a `Ready` state.
But the real problem is that when the former master leaves the cluster, it can take more than one minute for the cluster to recover:
I'm not sure I understand how this test has been successful in the past.
I think we should stop increasing the timeout and instead try to evaluate the downtime. If the recovery of the cluster can't be optimized, we should decide that, for instance, it is OK for this test to have a downtime of 90 seconds.
If after 90 seconds there are still timeouts or the cluster is red, the test should be considered failed.
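A sketch of how the step could evaluate the downtime instead of relying on a single timeout (the polling interval, URL, and budget are assumptions for illustration): poll the cluster health during the mutation, count the polls that saw a red cluster or no answer at all, and fail only if that count exceeds roughly 90 seconds' worth.

```go
package e2e

import (
	"encoding/json"
	"net/http"
	"time"
)

// redOrUnreachablePolls polls /_cluster/health roughly once per second for the
// given duration and counts how many polls saw a red cluster or no answer.
func redOrUnreachablePolls(esURL string, during time.Duration) int {
	client := &http.Client{Timeout: 2 * time.Second}
	bad := 0
	for deadline := time.Now().Add(during); time.Now().Before(deadline); time.Sleep(time.Second) {
		resp, err := client.Get(esURL + "/_cluster/health")
		if err != nil {
			bad++ // timeout or connection error counts as downtime
			continue
		}
		var health struct {
			Status string `json:"status"`
		}
		decodeErr := json.NewDecoder(resp.Body).Decode(&health)
		resp.Body.Close()
		if decodeErr != nil || health.Status == "red" {
			bad++
		}
	}
	return bad
}
```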