Nomad kills off healthy allocations on deploy when new allocations fail #6864
Comments
Hi @kaspergrubbe, thanks for taking the time to file this issue. I'm trying to reproduce locally and am having some difficulties. Do you mind sharing any relevant logs or allocation details for the healthy allocations that got killed? Are you able to share the relevant parts of your nginx config for how it relates to the rails task and health checks? |
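(For readers following along: the allocation details and logs being asked for can typically be pulled with the standard Nomad CLI; the alloc ID and task name below are placeholders, not values from this thread.)

# show status and recent task events for one allocation
nomad alloc status -verbose <alloc-id>

# fetch a task's logs from that allocation (add -stderr for the error stream)
nomad alloc logs <alloc-id> <task-name>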
@drewbailey This happened in our production environment. I'm currently setting up a local version on my Linux machine to replicate the behaviour here; give me a day to see if it happens in another environment as well. |

@kaspergrubbe thanks! Here is the job file I was using to try to reproduce. I have a very simple server with a docker tag:

job "web" {
datacenters = ["dc1"]
type = "service"
update {
max_parallel = 4
min_healthy_time = "10s"
healthy_deadline = "3m"
progress_deadline = "10m"
auto_revert = false
auto_promote = false
canary = 4
}
migrate {
max_parallel = 3
health_check = "checks"
min_healthy_time = "10s"
healthy_deadline = "5m"
}
group "group" {
count = 4
restart {
attempts = 0
interval = "30m"
delay = "15s"
mode = "fail"
}
ephemeral_disk {
size = 300
}
task "nginx" {
driver = "docker"
config {
image = "nginx"
port_map {
http = 80
}
volumes = [
"local:/etc/nginx/conf.d",
]
}
template {
data = <<EOF
upstream backend {
{{ range service "demo-webapp" }}
server {{ .Address }}:{{ .Port }};
{{ else }}server 127.0.0.1:65535; # force a 502
{{ end }}
}
server {
listen {{ env "NOMAD_PORT_nginx" }};
location /nginx_health {
return 200 'nginx health';
}
location / {
proxy_pass http://backend;
}
}
EOF
destination = "local/load-balancer.conf"
change_mode = "signal"
change_signal = "SIGHUP"
}
resources {
network {
mbits = 10
mode = "host"
port "nginx" {
# static = 8080
}
}
}
service {
name = "nginx"
tags = []
port = "nginx"
check {
name = "nginx healthcheck"
type = "http"
protocol = "http"
path = "/healthcheck"
interval = "5s"
timeout = "3s"
}
}
}
task "server" {
env {
HTTP_PORT = "${NOMAD_PORT_http}"
}
driver = "docker"
config {
image = "drewbailey/simple-server:1"
}
resources {
network {
mbits = 10
port "http"{}
}
}
service {
name = "demo-webapp"
port = "http"
check {
type = "http"
path = "/healthcheck"
interval = "2s"
timeout = "2s"
}
}
}
}
}

For my deployment I was changing the image name to:

config {
-  image = "drewbailey/simple-server:1"
+  image = "drewbailey/simple-server:fail"
} |
@drewbailey Ok, you win the first round! I'm simply unable to replicate the issue locally either, both when using the terminal command. I did, however, manage to recreate some of it on production: as you can see in my first post we specify a count of 4, yet something has happened where 2 of the healthy containers have been killed off. When I check the logs it seems like Nomad did ask them to shut down. Nomad alloc status says "alloc not needed due to job update"; maybe that's a clue?
Can you tell me how to debug further and where to look? |
@kaspergrubbe Do you mind sharing the full output of |
@drewbailey I will try to dig into the log files as well; in the meantime, here's the full output of the command:
I've just now realized that my nodes are resource-exhausted; could this cause what I am seeing? |
Hey @kaspergrubbe, still looking into this. I don't think that's the cause; those placement metrics show why, and how many, nodes were ruled out from being a viable placement. |
@drewbailey Ok, I will try to replicate it again and fetch the logs this time. Do I need to run it with DEBUG logging enabled for a better trace? In that case I might need to redeploy some things. |
@kaspergrubbe DEBUG or TRACE would be ideal, if you are on. Sorry that I'm unable to reproduce this; I'll be slow to respond over the next two weeks. |
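For reference, raising the agent's verbosity is a standard Nomad agent setting; a minimal sketch, assuming the agent reads its config from /etc/nomad.d:

# /etc/nomad.d/nomad.hcl: raise the log level, then restart the agent
log_level = "DEBUG"   # or "TRACE"

The same can be set with the -log-level flag when starting the agent.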
Hi @drewbailey, I've finally had some time to test this out, really useful to know about the
This time we ended up with only one healthy allocation. I've attached the full log from

Here is the job-file with our secret removed:

job "billetto-production-web" {
datacenters = ["eu-west-1"]
type = "service"
update {
max_parallel = 4
min_healthy_time = "10s"
healthy_deadline = "3m"
progress_deadline = "10m"
auto_revert = false
auto_promote = false
canary = 4
}
migrate {
max_parallel = 3
health_check = "checks"
min_healthy_time = "10s"
healthy_deadline = "5m"
}
group "group" {
count = 4
restart {
attempts = 0
interval = "30m"
delay = "15s"
mode = "fail"
}
ephemeral_disk {
size = 300
}
task "nginx" {
driver = "docker"
template {
data = <<EOF
I_AM_FAILING
upstream unicorn {
# fail_timeout=0 means we always retry an upstream even if it failed
# to return a good HTTP response (in case the unicorn master nukes a
# single worker for timing out).
# for UNIX domain socket setups:
server unix:{{ env "NOMAD_ALLOC_DIR" }}/unicorn.sock fail_timeout=0;
}
server {
listen 80;
proxy_set_header Host $http_host;
proxy_redirect off;
client_max_body_size 20m;
error_page 500 502 503 504 /500.html;
root /app;
gzip on; # enable gzip
gzip_http_version 1.1; # turn on gzip for http 1.1 and higher
gzip_disable "msie6"; # IE 6 had issues with gzip
gzip_comp_level 5; # inc compresion level, and CPU usage
gzip_min_length 100; # minimal weight to gzip file
gzip_proxied any; # enable gzip for proxied requests (e.g. CDN)
gzip_buffers 16 8k; # compression buffers (if we exceed this value, disk will be used instead of RAM)
gzip_vary on; # add header Vary Accept-Encoding (more on that in Caching section)
# define files which should be compressed
gzip_types text/plain;
gzip_types text/css;
gzip_types text/xml;
gzip_types application/xhtml+xml;
gzip_types application/xml;
gzip_types text/javascript;
gzip_types application/x-javascript;
gzip_types application/javascript;
gzip_types application/json;
gzip_types image/svg+xml;
gzip_types image/x-icon;
gzip_types application/rss+xml;
gzip_types application/vnd.ms-fontobject;
gzip_types font/opentype;
gzip_types application/x-font;
gzip_types application/x-font-opentype;
gzip_types application/x-font-otf;
gzip_types application/x-font-truetype;
gzip_types application/x-font-ttf;
gzip_types font/opentype;
gzip_types font/otf;
gzip_types font/ttf;
location / {
server_tokens off;
try_files $uri @app;
}
location @app {
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
proxy_redirect off;
proxy_pass http://unicorn;
proxy_connect_timeout 30;
proxy_read_timeout 30;
}
}
EOF
destination = "local/rails.conf"
change_mode = "signal"
change_signal = "SIGHUP"
}
config {
image = "billetto/billetto-rails:4.2.415-nginx"
volumes = [
"local/rails.conf:/etc/nginx/conf.d/rails.conf",
]
port_map {
nginx = 80
}
dns_servers = ["172.17.0.1"]
}
resources {
cpu = 200
memory = 100
network {
mbits = 50
port "nginx" {}
}
}
service {
name = "billetto-production-nginx"
tags = []
port = "nginx"
check {
name = "billetto-production-web nginx healthcheck"
type = "http"
protocol = "http"
path = "/debug/healthcheck"
interval = "5s"
timeout = "3s"
}
}
}
task "logging-nginx" {
driver = "docker"
logs {
max_files = 1
max_file_size = 1
}
config {
image = "billetto/datadog-rsyslog:0.0.16"
}
env {
DATADOG_TOKEN = ""
LOGGED_TASK_NAME = "nginx"
SYSLOG_SERVICE_NAME = "${NOMAD_JOB_NAME}-${NOMAD_GROUP_NAME}-nginx"
SYSLOG_HOST_NAME = "${attr.unique.hostname}"
SYSLOG_SOURCE_NAME = "nginx"
SYSLOG_TAGS = "allocid:${NOMAD_ALLOC_ID}"
}
resources {
cpu = 100
memory = 32
}
}
task "rails" {
driver = "docker"
env {}
config {
image = "billetto/billetto-rails:4.2.415"
command = "/usr/local/bin/thpoff"
args = ["bin/bundle", "exec", "unicorn", "-c", "config/unicorn.rb", "--no-default-middleware"]
port_map {
web = 3000
}
dns_servers = ["172.17.0.1"]
}
resources {
cpu = 750 # 500 MHz
memory = 1650 # MB
network {
mbits = 50
port "web" {}
}
}
service {
name = "billetto-production-web"
tags = []
port = "web"
check {
name = "billetto-production-web rails healthcheck"
type = "http"
protocol = "http"
path = "/debug/healthcheck"
interval = "5s"
timeout = "3s"
}
}
}
task "logging-rails" {
driver = "docker"
logs {
max_files = 1
max_file_size = 1
}
config {
image = "billetto/datadog-rsyslog:0.0.16"
}
env {
DATADOG_TOKEN = ""
LOGGED_TASK_NAME = "nginx"
SYSLOG_SERVICE_NAME = "${NOMAD_JOB_NAME}-${NOMAD_GROUP_NAME}-rails"
SYSLOG_HOST_NAME = "${attr.unique.hostname}"
SYSLOG_SOURCE_NAME = "rails"
SYSLOG_TAGS = "allocid:${NOMAD_ALLOC_ID}"
}
resources {
cpu = 100
memory = 32
}
}
}
} |
I hope you had a nice holiday season and a fantastic New Year's party! I've tried (and failed) yet again to replicate this on my one-node Nomad cluster, but I can make it happen on our production cluster every time now. I've had a look at the log and I don't see the culprit there; I guess I'm not well-versed enough in the Nomad universe. @drewbailey Let me know if I can do anything to help you debug it more. We could perhaps do a screenshare session if that would be easier for you; I'm in the UTC timezone, but I am flexible. |
@kaspergrubbe thanks for the additional info. Unfortunately I didn't see anything in the logs; curiously, I also didn't see eval

Are you able to monitor Consul at all during the deploy? Are the checks ever passing and then failing, or always failing? Just to try, could you change the nginx healthcheck to something different from its upstream rails endpoint? Something like what I have below, to rule out any sort of weirdness with the two tasks using the same check.
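Drew's snippet did not survive in this copy of the thread; the following is an illustrative reconstruction of the idea, reusing the /nginx_health location from the example job earlier in the thread, so nginx's check no longer depends on the upstream rails endpoint:

# illustrative only: in the nginx template, answer the health check locally instead of proxying to rails
location /nginx_health {
  return 200 'nginx health';
}

# in the nginx service stanza, point the check at the nginx-only path
check {
  name     = "nginx healthcheck"
  type     = "http"
  path     = "/nginx_health"
  interval = "5s"
  timeout  = "3s"
}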
|
@kaspergrubbe if you'd be willing, we have an unofficial debugging tool which may be helpful here; if you are open to trying it, it should automatically collect some helpful output related to the job. The files are located here: https://gist.github.com/drewbailey/cb1421fb4aed50fbea15554df7c7fa48. It has dependencies on

From the output above, if the info is still there |
Good catch; however, that didn't seem to solve the issue. I've changed it to that now. I just noticed something funky: it's also possible for our Nomad servers to schedule too many containers, we're having a |
I checked the logs on a Consul server during a failing deploy, but I don't see anything suspicious (not even the deregistering). Is there a smarter way to log the state of a Consul cluster than individually on each Consul server and client? Here's the Consul log:

The jobs that are deregistering are called |
https://download.gnome.org is having issues, so I can't obtain Graphviz through Homebrew; I will try again later this evening. |
I've updated to Consul 1.6.2 (from 1.6.0) in the hope that there was a fix there, but no. @drewbailey I've run the script, but the Graphviz tool says there are errors in the file:
The file is here: |
Here's a new, freshly failing deploy:

Healthy allocations running before failing deploy
Job run
Evaluation status
Deployment status
Healthy allocation that was removed
Info from one of the failed allocations:
|
@drewbailey Is it this you need?
|
@kaspergrubbe thanks, do you have info on the previous deployment? Also, if you don't mind uploading the job-history output for a failed deploy that matches up with the jobs/deploys pasted above, that'd be great. Thanks in advance; we have a few ideas that we are trying to reproduce now. |
Hi @drewbailey, thank you so much for getting back to me; it is very much appreciated, as we have halted our Nomad rollout because of this issue!
This sounds promising. I can't replicate it locally on my own one-node cluster, but I can replicate it every time on our production cluster, so tell me if there is anything else you want me to test. This is the flow (newest first), so the one you ask for is further down:

4507daa1-9030-4150-1d86-5a00688fdc26
ad47ca9a-715b-e120-6b2a-2c31efb87c9d
482e62f2-a1d5-b602-42eb-501c54962a8e
82e00fca-df2b-5c1f-c63f-ade120ea1f4e
(It is a bit strange that it says fd05cdee-9b0b-f2dc-8d2d-db49dab57b5d
f97ecb5e-4441-a9de-6bde-a1b0f49416d9
f5911776-dead-6371-794b-b50bdd9ed7c2
|
Here's the output for the
|
@kaspergrubbe is

If a canary deploy has

If you want to coordinate a time to chat more synchronously, I set up a gitter channel: https://gitter.im/hashicorp-nomad/issues-6864#
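For context, the two promotion paths for a canary deploy look roughly like this (the deployment ID below is a placeholder); with auto_promote = false, as in the job files above, healthy canaries wait for a manual promote:

# manual promotion once the canaries are healthy
nomad deployment promote <deployment-id>

# or let Nomad promote automatically when all canaries pass their health checks
update {
  canary       = 4
  auto_promote = true
}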
No, there are newer deploys as I did more testing. I can reset the test and focus on.
Oh, that does make sense, but not all of these deploys have been promoted, because it looked like the system took care of it itself.

So you're saying even though the interface says |
There are a few scenarios when a canary deployment won't (can't) be promoted. The initial deploy and any healthy deploy after a previously failed deploy won't require promotion. Does a failing job kill the allocations from deploy |
Hi @drewbailey Sorry I can't answer that, but I did a full test again, and I think I got all the information that you need:

0. Intro

Our cluster is located in AWS.

Consul-servers:
Nomad-servers:
Nomad-clients:
Healthy job (
|
2. Deploy first healthy job

$ nomad job run -verbose test.hcl
2.1 Get deployment status after self-promotion

$ nomad deployment status -verbose 731d8446-c70e-ad5e-fede-9ba8194c1eb1
2.2 Note down allocation IDs
2.3 Get evaluation status

$ nomad eval status -verbose 536dd256-e1c4-9090-113f-c4a6ced42f2a
|
3. Deploy failing job

$ nomad job run -verbose test-fail.hcl
2.1 Fail deployment

$ nomad deployment fail f8ef8532-75c3-1685-21b8-bde47d87beee
3.1 Note down allocation IDs
3.2 Get evaluation status

$ nomad eval status -verbose 377eecea-a038-6ad4-29f6-05ceb0f51858
3.3 Get deployment status

$ nomad deployment status f8ef8532-75c3-1685-21b8-bde47d87beee
3.4 Notes

At this point there is still one healthy allocation running from the first deploy:
I also find it strange that Nomad starts 11 allocations when the job specifies |
4. Re-deploy healthy job

$ nomad job run -verbose test.hcl
4.1 Note down allocation IDs
4.2 Get evaluation status

$ nomad eval status -verbose 81de9158-44f1-f2b9-5a93-1dd1d0972b75
4.3 Get deployment status

$ nomad deployment status a7641e73-922b-9149-7fe9-f21e56c4b8d8
|
5. Aftermath

5.1 Print job status

$ nomad job status -verbose failing-nomad-test01
|
5.2 Print all allocation statuses

24d6277f-dd1a-8935-d842-eb9431255d81
d74a3b70-f071-2755-d9fa-d366874ea12e
e3a2470a-d88a-c730-59b6-edef0e787556
0056be34-cec2-4b20-dfd0-91d65a9f4310
0a43fa24-61e8-d261-5301-d24f792a5f1d
1169e48d-3cf1-acff-f094-e54a0d566e4f
1a6c612d-8aa4-09f5-a228-de4907d9fd6b
2387bd9c-c05f-8ccf-e873-438992645cfe
3abfc083-c586-fb55-e490-0b9622aa1005
8b90ca29-810b-3181-ab2b-b54307faf998
b98dec9b-332f-fb59-04d6-5848505202f3
c2ade919-0f84-2795-dc76-110794f75fa4
d2657214-7019-c7e1-28fb-96f52ea1bef8
d9d10e6e-8253-64fc-1bf6-619bd32a6827
e9cc157e-4f56-0b39-62e0-ec8ffc805859
24d6277f-dd1a-8935-d842-eb9431255d81
4c448781-6d59-22fc-cc54-f62d5278b8cd
5ce27208-d182-cb21-12e5-bf4755812aee
e8f15755-9a49-94bb-69b4-e9d9546a09a4
|
And here are all the logfiles collected: leader.log |
And hey! Look! The job-history script worked (added |
Thanks for all of this @kaspergrubbe, it's very helpful. Digging into this now. For your question about the 11 allocs: the restart stanza you have dictates the group's behavior on task failure; 0 attempts with mode "fail" means the allocation itself will not restart, but it will reschedule, which you can read more about here: https://www.nomadproject.io/docs/job-specification/reschedule.html

Something that is interesting and I'll be digging into is alloc |
@kaspergrubbe alright, I've finally got this reproduced on a deployed cluster in AWS; it will be interesting to dig in and see why it wasn't happening locally. Now that I have it reproduced I'll hopefully have something for you soon. Thanks for your patience and debugging help on this! |
@drewbailey I'm really happy that you can reproduce it! That means that it (hopefully) wasn't me who misconfigured something. |
Will do, I think I should be ok now that I can reproduce it locally. In the meantime, depending on how you want to configure your deploys, disabling rescheduling will fix the immediate issue, though if you end up with a failed task the deploy will halt and essentially fail immediately. |
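For anyone applying that workaround, a minimal sketch of disabling rescheduling at the group level (per the reschedule documentation linked above); adapt it to your own job:

group "group" {
  # stop Nomad from rescheduling failed allocations onto other nodes
  reschedule {
    attempts  = 0
    unlimited = false
  }
}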
@kaspergrubbe I created #6936 with reproduction steps and will be working on the issue from there, thanks again for the reproduction help! |
@drewbailey Thanks Drew, I will follow that other issue. Do you know how far back this has been broken? I am currently looking at my options of either downgrading or disabling rescheduling. |
It looks like it could potentially be as far back as 0.8.4 https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#084-june-11-2018 |
@drewbailey Scary that it hasn't been noticed earlier, but maybe other users just have very stable jobs and haven't noticed. I will be disabling rescheduling to mitigate the issue for now. Thanks again! :) |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Operating system and Environment details
Amazon Linux 2 running on AWS.
Issue
When we deploy using the blue/green canary deployment model we have 4 healthy allocations running, and we try to boot up 4 new allocations. They fail, Nomad retries them, and somehow Nomad kills off the healthy, running allocations, leaving us with no healthy allocations running:
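For context, this is the update stanza from the job file shared earlier in the thread; the canary count matching the group count of 4 is what makes this a blue/green deploy:

update {
  max_parallel      = 4
  min_healthy_time  = "10s"
  healthy_deadline  = "3m"
  progress_deadline = "10m"
  auto_revert       = false
  auto_promote      = false
  canary            = 4   # equal to count = 4, so a full replacement set is started
}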
Reproduction steps
Deploy a healthy job, after that deploy a failing job, observe what happens.
Job file