Slow re-election when elected master pod is deleted #63
So it doesn't look like I'm alone! I found issue helm/charts#8785 that looks identical to this issue. For that issue, PR helm/charts#10687 was proposed and also submitted to this repo in PR #41. Unfortunately, I didn't have success when deploying the multi-node example with those PRs. Here's a summary of the problem.
Whereas issue helm/charts#8785 talks about a total cluster outage where even reads are not possible, I'm thankfully not seeing that; read-only calls keep working while the cluster has no elected master.

The easiest solution to the connection timeout would be to just keep the pod running for a while after shutting down Elasticsearch. I tried to do this with a preStop lifecycle hook, but that runs before the container is even sent its SIGTERM, so it can't run anything after Elasticsearch stops. So, how can we actually run code after Elasticsearch stops? We can modify the container's entrypoint to send Elasticsearch to the background, trap the SIGTERM upon pod termination, forward that to Elasticsearch, and then sleep! For example, inside of the container spec:
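A minimal sketch of the idea, assuming the stock image entrypoint at `/usr/local/bin/docker-entrypoint.sh` and an arbitrary 5-second linger (the image tag and paths are illustrative, not taken from this thread):

```yaml
containers:
  - name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:6.6.1  # illustrative tag
    command:
      - /bin/bash
      - -c
      - |
        # Run the image's normal entrypoint in the background so we can trap signals.
        /usr/local/bin/docker-entrypoint.sh eswrapper &
        pid=$!
        # On pod deletion, forward SIGTERM to Elasticsearch, wait for it to exit,
        # then linger so peers see "connection refused" instead of a 30s timeout.
        trap 'kill -TERM "$pid"; wait "$pid"; sleep 5' TERM
        wait "$pid"
```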
Deploying our master nodes like this allows the outer pod to sit in a terminating state for 5 seconds after Elasticsearch stops, so that other masters properly get a refused connection rather than timing out. As a result, writes and the `_cat/master` call only hang for a few seconds while the new master is elected.

An alternative approach is to deploy a dummy sidecar container alongside the master nodes that just waits indefinitely. To keep the pod alive for a bit after Elasticsearch stops, we add a preStop lifecycle hook to the sidecar that simply sleeps for a few seconds.
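A sketch of that sidecar, with the container name, image, and 5-second sleep as placeholder assumptions:

```yaml
# appended to the master pods' containers list, next to the elasticsearch container
- name: keep-alive
  image: busybox
  command: ["sh", "-c", "tail -f /dev/null"]   # the sidecar itself just waits forever
  lifecycle:
    preStop:
      exec:
        # preStop runs before this container receives its SIGTERM, so the
        # sidecar (and with it the pod and its IP) outlives Elasticsearch.
        command: ["sh", "-c", "sleep 5"]
```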
This allows for the Elasticsearch containers to terminate before the sidecar container can, as documented in the pod termination flow. For fun, we can also avoid the lifecycle hook altogether and trap the SIGTERM like we did above:
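The trap-based variant of the same sidecar might look like this (again just a sketch; the image and sleep length are placeholders):

```yaml
- name: keep-alive
  image: busybox
  command:
    - sh
    - -c
    - |
      # Catch the SIGTERM sent on pod deletion and linger before exiting.
      trap 'sleep 5; exit 0' TERM
      while true; do sleep 1; done
```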
Note that in this case the ordering of the two containers' shutdowns is no longer guaranteed by the termination flow: both containers receive SIGTERM at the same time, so we're relying on Elasticsearch shutting down well within the sidecar's sleep window.

Apart from these workarounds, another possible fix is the currently open issue kubernetes/kubernetes#28969. If I'm understanding the sticky IPs proposal correctly, pod IPs in a StatefulSet would remain the same for each replica. In our case the eligible masters should then see no connection timeouts at all, since the old master pod would come back with the same IP address.
I haven't tried to reproduce, but it's possible that lowering the ping timeout will help here. If not, we should look into what's causing this problem and pair up with the Elasticsearch team to find a solution. I spent a very short time googling for similar issues and found this forum post where docker/docker-compose shows similar behavior: the network interface is destroyed and the next pings wait for 30 seconds.
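For reference, the 30-second figure matches the zen fault-detection defaults in 6.x, so the tweak would presumably be along these lines (the setting names are my reading of the suggestion, and the values are arbitrary):

```yaml
# elasticsearch.yml on the master-eligible nodes (illustrative values)
discovery.zen.fd.ping_timeout: 5s   # default is 30s
discovery.zen.fd.ping_retries: 2    # default is 3
```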
Firstly, thank you so much for writing up such a detailed, thorough issue! The fact that things work properly when you kill the process directly is the most interesting part here. It is very possible that Kubernetes is doing something beyond a plain SIGTERM when the pod is deleted. I also have a couple of questions about your setup.
@Crazybus Rolling updates seem to have the same issue. After setting an extra env var and deploying, my requests still hang when the rolling update gets to the elected master pod. Killing the process directly still takes less than a second, so I don't think the problem is in how the pod is signaled.

I think the forum post @jordansissel linked is identical to this issue. For convenience, here's David's response:
In both environments, the old master's network is gone before re-election can finish. I tried lowering the ping timeouts, but that didn't reliably help. Looking carefully through the previous Helm chart issue again, I found @kimxogus linked to an issue he opened, elastic/elasticsearch#36822, which has the extra suggestion of tweaking the discovery timeouts as well.

So far the only reliable workaround for me seems to be keeping the pod alive for a few extra seconds after Elasticsearch terminates, so the other eligible masters can send their final ping.
@andreykaipov thanks again for all of the investigation and information you are adding! This is super useful and I feel like I now understand what is going on. From an Elasticsearch point of view things are actually working as expected: the master disappears, and the remaining nodes wait for the 30 second ping timeout. Elasticsearch is configured by default to wait 30 seconds for a ping to time out, which is different from reaching a host that immediately refuses the connection. This issue is not unique to this helm-chart or to Kubernetes; it will apply to any immutable infrastructure setup where the active master (or at least its IP address) is deleted directly after stopping it. Even with a workaround in place there is still going to be around 3 seconds of write downtime while a new master is being elected. There are some ideal "perfect world" fixes to this problem.
I'm going to sync with the Elasticsearch team to see how feasible they would be. I'm also making a note to test this in Elasticsearch 7, because it now uses a different discovery method which may or may not be affected by this. Out of all of the workarounds you suggested, I think the easiest to maintain is going to be having a dummy sidecar container. Instead of doing a 5 second sleep, it could actually wait for the pod to no longer be the master when shutting down. Or, even better, it could wait until a new master has been elected before allowing the pod to be deleted. Note: none of the below is tested, just an idea of how to solve this without relying on a hardcoded sleep time.
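A rough sketch of that idea follows; the container name, image, env wiring, `_cat/master` parsing, and timeout are all assumptions for illustration, not a tested implementation:

```yaml
- name: master-termination-fix
  image: centos:7                        # anything with bash and curl works
  command: ["bash", "-c", "sleep infinity"]
  env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name       # assumes node.name is set to the pod name
  lifecycle:
    preStop:
      exec:
        command:
          - bash
          - -c
          - |
            # Hold up deletion of this pod until _cat/master reports some other
            # node, or until we run out of attempts. Once the local Elasticsearch
            # stops answering, this simply waits out the window, which still keeps
            # the pod's IP reachable while the rest of the cluster re-elects.
            # Keep the total below terminationGracePeriodSeconds (30s by default).
            for i in $(seq 1 20); do
              MASTER="$(curl -s http://localhost:9200/_cat/master | awk '{print $NF}')"
              if [ -n "$MASTER" ] && [ "$MASTER" != "$NODE_NAME" ]; then
                exit 0
              fi
              sleep 1
            done
```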
It is still affected by this. The problem is that the new master attempts to get all the nodes in the cluster to reconnect to the old master as part of winning its first election, and waits for this to time out before proceeding; that wait is the problem described in elastic/elasticsearch#29025. There's a related discussion here. This is not really affected by the changes to how discovery works in 7.0.

In 6.x the best solution we have is to reduce the connection timeout. If your cluster is not talking to remote systems then the connect timeout can reasonably be very short.

In 7.x the same advice is basically true (at the time of writing), but there is another option too: if you want to shut the master down then you can trigger an election first by excluding the current master from the voting configuration.
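In chart terms, the 6.x tweak would presumably be a one-line elasticsearch.yml override; the setting name and the one-second value below are assumptions for illustration, not values specified in this thread:

```yaml
# elasticsearch.yml (6.x). Only sensible when the cluster has no slow links,
# e.g. no remote-cluster connections that need a longer connect timeout.
transport.tcp.connect_timeout: 1s
```

In 7.x the voting exclusion mentioned above is a REST call, roughly `POST /_cluster/voting_config_exclusions/<node-name>` before shutting the master down and `DELETE /_cluster/voting_config_exclusions` afterwards, which a preStop hook could issue.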
Here is what worked for me, based on the above suggestion. I mounted a script in the container and run it instead of the docker entrypoint, shown below. It still takes 30s to time out the old master, but the new master seemed to be operational within 4 seconds of the old master shutting down.

```bash
if [[ -z $NODE_MASTER || "$NODE_MASTER" = "true" ]] ; then
  # NOTE: the branch bodies are an assumed reconstruction (entrypoint path and
  # 5-second linger are guesses), not necessarily the author's exact script.
  /usr/local/bin/docker-entrypoint.sh eswrapper &
  pid=$!
  trap 'kill -TERM "$pid"; wait "$pid"; sleep 5' TERM
  wait "$pid"
else
  exec /usr/local/bin/docker-entrypoint.sh eswrapper   # other roles: stock entrypoint
fi
```
I really appreciate all the attention this issue has received! The workaround I decided to go with was wrapping the base Elasticsearch image with an entrypoint that traps some common exit signals and allows a handler to run after Elasticsearch stops (see https://github.com/qoqodev/elasticsearch-hooks-docker). In this case, the "post-stop" handler would just sleep, or wait until a new master has been elected. However, it looks like @DaveCTurner recently closed out the upstream issue elastic/elasticsearch#29025 with PR elastic/elasticsearch#31547 that should fix the slow re-election, so no workarounds should be necessary! Whenever those changes make it into an Elasticsearch release, whether it's 6.x or 7.x, I'll be glad to test it out! 😄
Indeed, we've had a few failed attempts to fix elastic/elasticsearch#29025, and this very thread prompted us to look at it again. The fix is elastic/elasticsearch#39629, which has been backported.
I think this is the same problem as helm/charts#8785.
This has been merged into master but not yet released. I'm leaving this open until it is released and others have confirmed that this solution properly resolves the issue.
This has been merged and released. Thanks everyone for all of the help investigating and for contributing the fix!
This commit removes the `masterTerminationFix` sidecar container introduced in elastic#63 to fix slow election issues when the master node is deleted. This workaround is no longer needed since Elasticsearch 7.2.
First of all - thank you guys for the chart!
I was playing around with the multi-node example and experienced some odd behavior. Here's how I'm reproducing the issue.
After the multi-node example is deployed, open up the multi-data service to your local machine in one terminal, watch the call to `/_cat/master` in another terminal, and in a third terminal delete whichever pod is the elected master:
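Something along these lines reproduces the setup (the service name comes from the multi-node example; the pod in the last command is whichever one `_cat/master` reports):

```sh
# terminal 1: expose the data nodes locally
kubectl port-forward svc/multi-data 9200:9200

# terminal 2: poll the elected master once a second
watch -n 1 'curl -s localhost:9200/_cat/master'

# terminal 3: delete the currently elected master pod, e.g.
kubectl delete pod multi-master-1
```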
The API call in the second terminal will now hang. After 30 seconds the request times out, and we might see an error for a split second. Soon after, the cluster recovers and the API call from the second terminal starts responding again. The logs from another master node before and after the re-election show the same 30-second gap before a new master takes over.
I figured Kubernetes might be killing the pods too abruptly, so I followed the instructions at https://www.elastic.co/guide/en/elasticsearch/reference/6.6/stopping-elasticsearch.html to stop Elasticsearch. Sure enough, if we kill the Elasticsearch process inside the elected master pod directly, the re-election is quick!
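An approximation of that experiment, assuming Elasticsearch runs as PID 1 in the container and using an illustrative pod name:

```sh
# Send SIGTERM to the Elasticsearch process itself; the pod and its IP stay up
# while the container restarts, so peers get an immediate connection refusal.
kubectl exec multi-master-0 -- bash -c 'kill -TERM 1'
```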
Assuming `multi-master-2` is the new master, notice how the API call from the second terminal only hangs for around 3 seconds this time!
Reading through the docs for the termination of pods (https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods), Kubernetes does in fact send a SIGTERM to the container, so I'm guessing deleting a pod does something more than just sending a SIGTERM, and that extra something is what Elasticsearch doesn't like.