
[BUG] gravity 6.1.x - can't communicate with pods on nodes after master reboot #1436

Closed
snirkatriel opened this issue Apr 23, 2020 · 2 comments

snirkatriel commented Apr 23, 2020

Describe the bug
If a reboot is forced on a master node of a gravity 6.1.x cluster (reboot -f or a power failure), all pods that are scheduled and become ready on the worker nodes are unreachable.

The only way to work around this issue is to restart flanneld on every node in the cluster after the master node becomes ready again.

It is important to mention that this issue cannot be reproduced on gravity 6.3.x.

To Reproduce

  1. Set up a gravity 6.1.x cluster
  2. Create a dummy app and scale it to several replicas
  3. Force a reboot on the master node ("/sbin/reboot -f")
  4. Try to reach any of the pods on a non-master node (using curl against a "Hello World" style app)

Expected behavior
After a master restart, all pods in the cluster become available and ready to serve requests.

Logs
The only "indicative" logs are seems to be from the flanneld in the gravity shell, but important to mention that these logs are also showed when trying to reproduce this bug on gravity 6.3.x even though it doesn't affect the cluster at this version

    Apr 19 14:30:19 ip-10-18-132-86 flanneld[1066]: E0419 14:30:19.031215 1066 watch.go:176] Subnet watch failed: client: etcd cluster is unavailable or mi
    Apr 19 14:30:19 ip-10-18-132-86 flanneld[1066]: E0419 14:30:19.031231 1066 watch.go:44] Watch subnets: client: etcd cluster is unavailable or misconfig
    Apr 19 14:30:20 ip-10-18-132-86 flanneld[1066]: E0419 14:30:20.031910 1066 watch.go:176] Subnet watch failed: client: etcd cluster is unavailable or mi
    Apr 19 14:30:20 ip-10-18-132-86 flanneld[1066]: E0419 14:30:20.031920 1066 watch.go:44] Watch subnets: client: etcd cluster is unavailable or misconfig
    Apr 19 14:30:21 ip-10-18-132-86 flanneld[1066]: E0419 14:30:21.032536 1066 watch.go:44] Watch subnets: client: etcd cluster is unavailable or misconfig

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Gravity: 6.1.22
  • Platform: PC, AWS, GCP

Additional context
We made sure that this commit, which appears to be an attempt to solve the problem, is included in our gravity cluster: gravitational/planet@faa4fae
It looks like it does not fix this issue.

A "regular" soft reboot doesn't reproducing the problem, but only forced reboot.

knisbet (Contributor) commented Apr 25, 2020

@snirkatriel thanks for the report; I managed to reproduce the issue.

This particular issue is different from the one in gravitational/planet@faa4fae / gravitational/flannel#5.

Although the symptoms appear to be the same, this is a distinct problem. I cannot explain why you're unable to reproduce on gravity 6.3.x and I haven't tried myself, but there may be some luck involved in triggering the issue.

What the problem boils down to is that in etcd v2 there is no protocol-level ping or watchdog on a watch created for changes in the datastore. At the protocol level it is perfectly normal for a connection to go completely idle for long periods after a watch is set up. Looking through the etcd code base, many of its services and clients enable TCP keepalives, which are a TCP-layer test of whether the remote TCP endpoint is still available and responding.

However, it doesn't look like the etcd gateway enables keepalives. We use the etcd gateway in gravity to create a stable endpoint on each node that connects to a potentially changing list of master servers. Because the etcd gateway doesn't use TCP keepalives, there is no application- or protocol-level indication that the server has disappeared from the network.
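
For illustration only, here is a minimal Go sketch of what enabling TCP keepalives on a gateway-style proxied connection looks like; the dialUpstream helper and the endpoint address are hypothetical and not the actual etcd gateway code:

    // Illustrative sketch: enable TCP keepalives on a connection dialed to an
    // upstream endpoint, so the kernel probes the peer even when the stream
    // (e.g. an etcd v2 watch) is otherwise completely idle.
    package main

    import (
        "log"
        "net"
        "time"
    )

    // dialUpstream is a hypothetical helper, not gravity/etcd code.
    func dialUpstream(addr string) (net.Conn, error) {
        conn, err := net.Dial("tcp", addr)
        if err != nil {
            return nil, err
        }
        tcpConn := conn.(*net.TCPConn)
        // Without these two calls there is no TCP-layer probing, so a peer
        // that disappears from the network is never detected on an idle watch.
        if err := tcpConn.SetKeepAlive(true); err != nil {
            conn.Close()
            return nil, err
        }
        if err := tcpConn.SetKeepAlivePeriod(30 * time.Second); err != nil {
            conn.Close()
            return nil, err
        }
        return tcpConn, nil
    }

    func main() {
        conn, err := dialUpstream("127.0.0.1:2379") // placeholder etcd member address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        log.Printf("connected to %s with TCP keepalives enabled", conn.RemoteAddr())
    }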

As I understand it, part of the etcd v3 design attempts to avoid these types of issues by employing application-layer keepalives, provided the client is set up to send them. This gets past this particular issue and similar issues that are known to occur in etcd v2. Etcd v3 introduces significant changes, so we haven't yet tackled how to make this migration in all of our clients.
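
As a point of comparison, here is a hedged sketch of how an etcd v3 client opts into those application-layer keepalives via the clientv3 package; the endpoint and durations below are placeholders, not gravity's actual configuration, and the import path may differ between etcd releases:

    // Illustrative sketch: etcd v3 client with application-level keepalive pings.
    package main

    import (
        "log"
        "time"

        "go.etcd.io/etcd/clientv3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
            DialTimeout: 5 * time.Second,
            // Application-layer pings: if the server stops responding within
            // the timeout, the client notices and can fail over instead of
            // sitting on a dead watch indefinitely, as happens with v2.
            DialKeepAliveTime:    10 * time.Second,
            DialKeepAliveTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()
    }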

Anyway, I was able to hack keepalives into the gateway, and the change appears to be effective in addressing this set of symptoms. I'll see about submitting the fix upstream and will discuss with the rest of the gravity team this week whether we want to temporarily build a fork or wait for upstream releases.

Thanks,

knisbet (Contributor) commented Apr 26, 2020

OK, while preparing to submit a PR upstream to the etcd project, I managed to figure out why the discrepancy between 6.1.x and 6.3.x exists.

It turns out Go 1.12 changed the language defaults to enable TCP keepalives, where available, on all connections. In 6.3.x we ship etcd 3.3.15, while 6.1.x ships etcd 3.3.12; somewhere between these versions etcd adopted Go 1.12.9, which brings in the new defaults. So while etcd still does not set keepalives within the gateway daemon, the new Go defaults are what cause the difference in behaviour.
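
As a rough illustration (not gravity or etcd code): since Go 1.12, a zero net.Dialer.KeepAlive means keepalives are enabled with a 15-second period, whereas it previously meant they were disabled, and a negative value disables them explicitly. Assuming Linux's default of 9 unanswered probes, detection of a dead peer then takes roughly 15s + 9 × 15s ≈ 2.5 minutes, which lines up with the multi-minute figure mentioned below.

    // Illustrative sketch of the Go 1.12 default change in net.Dialer.
    package main

    import (
        "log"
        "net"
        "time"
    )

    func main() {
        // KeepAlive == 0: since Go 1.12 this means "enabled, 15s period";
        // before Go 1.12 it meant keepalives were not enabled. This is the
        // behaviour etcd 3.3.15 picked up by being built with a newer Go.
        modernDialer := &net.Dialer{Timeout: 5 * time.Second}

        // KeepAlive < 0 disables keepalives explicitly, roughly matching how
        // the gateway behaved when built with the older Go used for etcd 3.3.12.
        legacyDialer := &net.Dialer{Timeout: 5 * time.Second, KeepAlive: -1}

        for name, d := range map[string]*net.Dialer{"modern": modernDialer, "legacy": legacyDialer} {
            conn, err := d.Dial("tcp", "127.0.0.1:2379") // placeholder address
            if err != nil {
                log.Printf("%s dial failed: %v", name, err)
                continue
            }
            conn.Close()
        }
    }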

With this in mind, this should be a fairly easy fix: I just need to bump etcd to 3.3.15 or later on the gravity 6.1.x branch.

Please keep in mind, though, that with the default Go settings it will still take several minutes for this failure mode to be detected (around 3 minutes according to upstream). Also, I think I forgot to mention in my previous comment that restarting only flannel may not fully restore gravity, since other components also use etcd-based watches. So it's likely safer to drain the node and restart all of planet, to avoid underlying issues that may not be apparent if other etcd watches have completely stalled.

Golang PR: golang/go@5bd7e9c
