clients flap after upgrading to 0.4.1 from 0.3.2 #1641
Comments
Two additional points that I find interesting, but which may be red-herrings:
Yeah 0.3.2 had no auto-clustering.
I had a "home-brew" auto-clustering based on consul, watches, and a formula to generate the configs so clients found servers through consul. I disabled my auto-clustering method when deploying 0.4.1, and the client/server find each other, but the client flaps.
@ketzacoatl If you get a chance, could you dump consul's view of the world while clients are having this problem? Is it possible there are stale entries in consul causing clients to flap? http://127.0.0.1:8500/v1/catalog/service/nomad?pretty on a node with consul should do it.
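For anyone following along, a minimal sketch of that check (assuming consul's HTTP API on its default port; jq is only used for the optional filtering):

```
# Ask the local consul agent which nodes have registered the "nomad" service.
# Stale or duplicate entries here could explain clients flapping.
curl -s "http://127.0.0.1:8500/v1/catalog/service/nomad?pretty"

# Optional: show just the node names and advertised addresses/ports.
curl -s "http://127.0.0.1:8500/v1/catalog/service/nomad" | jq '.[] | {Node, Address, ServicePort}'
```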
I'm having this problem also on a completely clean consul/nomad setup. I just ran terraform to set up a consul/nomad cluster and the nomad clients are flapping. consul 0.6.4, nomad 0.4.1, Ubuntu 14.04.
I've found the problem in my setup. The example docs https://www.nomadproject.io/docs/cluster/bootstrapping.html explain how to use nomad with consul. The server section is
server.hcl
and the client section is
client.hcl
The server section does not contain a base.hcl.
I should note that
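To make the layout concrete, here is a rough sketch (file names, paths, and the datacenter value are illustrative, not the docs' exact contents) of keeping the shared settings in a base.hcl that every agent loads, so servers and clients end up in the same datacenter and clients can discover servers through consul:

```
# Hypothetical provisioning snippet for a client node.
sudo mkdir -p /etc/nomad.d

sudo tee /etc/nomad.d/base.hcl > /dev/null <<'EOF'
# Shared by servers and clients; if only one side gets this file,
# the datacenters diverge and clients treat the servers as remote.
datacenter = "dc1"
data_dir   = "/var/lib/nomad"
EOF

sudo tee /etc/nomad.d/client.hcl > /dev/null <<'EOF'
client {
  enabled = true
}
EOF

# Nomad merges every config file under the -config directory, so both apply.
nomad agent -config /etc/nomad.d
```

A server node would get the same base.hcl plus a server.hcl (with the server stanza and bootstrap_expect) instead of client.hcl.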
@magiconair Could you share the TF when it was flapping? We are trying to reproduce this issue.
@magiconair Very interesting; thanks for the info! We treat servers from other datacenters as backups in the client and use them only if there are no servers in the same DC. I'm guessing the flapping you're seeing is a bug in this backup server code. I'm actually working on that now, so that should work better in 0.5. Fixed docs in #1712. Sadly, it looks like @ketzacoatl is experiencing another issue, though, as the DCs in his confs match.
Maybe the issue is between
@schmichael @ketzacoatl maybe the
Interesting find @magiconair, I will make some time this weekend to retest and see what I can find based on that.
@schmichael, would you be able to share the configs, TF, or other code you are using in an attempt to reproduce?
@magiconair TF -> Terraform.
@ketzacoatl It would be great if you could keep us in the loop. Would be good to know if your issue was due to the same thing magiconair was experiencing!
I shall @dadgar!
My weekend was a bit too overloaded; I will update when I can run some tests.
@magiconair I just merged #1735, which fixes the issue you were seeing where clients flap when servers are in another datacenter. @ketzacoatl If you have a chance to test master, I'd be interested if it happened to fix your issue too. Bootstrapping/discovery got reworked a bit. I'm going to close this, but please feel free to reopen if any flapping issues continue.
OK, I can run some tests today. @schmichael, is there a .zip I can download, or should I build from master and host that?
I would build from master.
OK, thanks for confirming, I'll work on that.
It looks like there is a build failure on
Though I see
@ketzacoatl you need to compile with Go 1.7 (see: https://golang.org/doc/go1.7#context).
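In case it saves someone else a round trip, a rough sketch of building from source (paths assume a standard GOPATH layout; the repo's README has the authoritative steps):

```
# Confirm the toolchain first -- master at this point needs Go 1.7+
# for the standard library's context package.
go version

# Hypothetical from-source build; adjust paths to your environment.
mkdir -p "$GOPATH/src/github.com/hashicorp"
cd "$GOPATH/src/github.com/hashicorp"
git clone https://github.com/hashicorp/nomad.git
cd nomad
go build -o nomad-dev .
./nomad-dev version
```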
ohhhh... well thanks for that, I didn't even realize 1.7 was out.. and I totally ought to have seen that detail in the README.. sorry for the noise.
OK, initial tests confirm this as resolved. I am going to scale up these tests to be a little more sure. To satisfy my curiosity, does anyone know what the issue may have been? Was it the bootstrap refactor, or the RPC retry fix? Also, @dadgar, how close is 0.5.x to an RC or release? I'm wondering if I should hold out for that, or start running the build I have from yesterday's master.
@ketzacoatl Scaling up the tests would be great. I'm afraid I can't point to a specific line that broke it in 0.4.1 or fixed it in master -- too much changed. The old code allowed consul discovery and heartbeating from the servers to interleave in unintended ways. The new code ensures that discovering servers via Consul only happens when there are no known servers to heartbeat (startup bootstrapping is the most common case, but there are outage situations that could fall back to Consul as well).
I'm pretty certain this is resolved in my deployments!
Fantastic! 0.5 should be out in October.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
I have Terraform modules to deploy consul/nomad servers and clients as auto-scaling groups on AWS. When I run nomad 0.3.2 with consul 0.6.4, and my own form of auto-discovery for the nomad servers and clients, nomad is stable and has run well. After upgrading to 0.4.1, I updated my config templates to ensure nomad would use its native auto-discovery through consul (and to disable my previous form of auto-discovery with consul). With that fiddling, I found the nomad servers to be stable, reliably booting and forming their quorum. However, I have been unable to get the clients to stabilize: I see them flap up and down (status going from down to ready and back).

Nomad version
0.4.1

Operating system and Environment details
Linux Ubuntu 14.04 AMD64, kernel 3.13.0-74-generic, consul 0.6.4, upstart, nomad linux executable (not docker).

Issue
I expect the nomad clients to auto-discover their servers through consul, and I see log messages from the consul agent to confirm that is happening:
At that point, the client can report out about jobs:
and the node can be seen on a nomad server:
(These nodes ^ are flapping, with some up and some down; eventually they'll all be down for a while before coming back online for short bits of time.)
Eventually, the client goes down, and nomad status looks like:
and the client logs say:
If I wait a little while, the client will re-register with the servers auto-discovered via consul (same logs as shown above), and nomad status will work. The clients seem to be up for 30 seconds or so, and down for about the same, or maybe a little longer. Eventually, you'll catch them all down:
But then they'll start re-registering.
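A quick way to watch the flapping from any node that can reach a server (a sketch; it assumes the nomad binary is on the PATH and, on this release line, the hyphenated node-status command name):

```
# Poll node status every few seconds; flapping shows up as nodes
# cycling between "ready" and "down" in the Status column.
while true; do
  date
  nomad node-status
  sleep 5
done
```

Correlating those timestamps with the server-side view (nomad server-members) and with consul's catalog helps show whether the client is losing its server list or the servers are dropping heartbeats.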
Reproduction steps
I am running the executable on the host directly, with ubuntu 14.04 and upstart:
Nomad Server config
Nomad Client config
Nomad Client logs
Here is a client starting up, searching consul for nomad, finding nomad, registering with the servers, and then losing that connection and failing to find any servers: