Nomad Client becomes unresponsive after time #8085
Comments
Looks like the same issue as #8038.
Just happened again. Notes on #7987.
I'm gonna close this out in favor of #7987, @idrennanvmware. Definitely let us know if there are still problems after upgrading to Nomad v0.11.3+!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Output from nomad version:
Nomad v0.11.1 (b434570)

Operating system and Environment details
Issue
We have a cluster that clients join and leave with regularity. After some time a client can become unresponsive, even though the nomad agent is still running.
When this happens, the following is observed:
systemctl status nomad shows the service as running
systemctl status nomad
● nomad.service - "HashiCorp Nomad - Workload orchestrator"
Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2020-05-12 17:43:33 UTC; 2 weeks 5 days ago
Docs: https://nomadproject.io/docs/
Main PID: 122615 (nomad)
Tasks: 105
Memory: 379.1M
CGroup: /system.slice/nomad.service
├─ 50060 /opt/nomad/0.11.1/nomad logmon
├─ 50061 /opt/nomad/0.11.1/nomad logmon
├─ 50177 /opt/nomad/0.11.1/nomad docker_logger
├─ 50212 /opt/nomad/0.11.1/nomad docker_logger
├─ 50277 /opt/nomad/0.11.1/nomad logmon
├─ 50337 /opt/nomad/0.11.1/nomad docker_logger
├─107085 /opt/nomad/0.11.1/nomad logmon
├─107086 /opt/nomad/0.11.1/nomad logmon
├─107197 /opt/nomad/0.11.1/nomad docker_logger
├─107230 /opt/nomad/0.11.1/nomad docker_logger
└─122615 /opt/nomad/0.11.1/nomad agent -config /etc/nomad.d
The health check on the node is failing for the nomad client:
Get http://0.0.0.0:4646/v1/agent/health?type=client: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
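The failing check can be reproduced by hand against the agent's health endpoint (a minimal sketch, assuming the agent listens on the default port 4646 shown in the error above); a healthy client returns JSON immediately, while a wedged client just times out:

# Query the client health endpoint with a short timeout
curl -sS --max-time 5 "http://127.0.0.1:4646/v1/agent/health?type=client"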
The nomad agent has not written any logs for days; the last few entries are:
2020/05/24 20:30:01.028143 [INFO] (runner) creating new runner (dry: false, once: false)
2020/05/24 20:30:01.029012 [INFO] (runner) creating watcher
2020/05/24 20:30:01.032749 [INFO] (runner) starting
2020/05/24 20:30:01.152990 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/alloc/51241bf4-c37b-eced-2c43-308cd03c4808/run-consul-backup/secrets/file.env"
2020-05-24T20:30:01.156Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/services/consul/assets/scripts/backup.sh Args:[]}"
2020-05-24T20:30:16.655Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=51241bf4-c37b-eced-2c43-308cd03c4808 task=run-consul-backup reason="Restart unnecessary as task terminated successfully"
2020-05-24T20:30:16.658Z [INFO] client.gc: marking allocation for GC: alloc_id=51241bf4-c37b-eced-2c43-308cd03c4808
2020/05/24 20:30:16.661027 [INFO] (runner) stopping
2020/05/24 20:30:16.661518 [INFO] (runner) received finish
2020-05-24T20:51:14.883Z [INFO] client.gc: garbage collecting allocation: alloc_id=914fce0e-0996-4e48-aa44-0c87ca4434ce reason="forced collection"
2020-05-24T21:00:00.444Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup @module=logmon path=/opt/nomad/alloc/1523fc28-b77b-9043-2857-99a3e7f9c021/alloc/logs/.run-consul-backup.stdout.fifo timestamp=2020-05-24T21:00:00.444Z
2020-05-24T21:00:00.447Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup @module=logmon path=/opt/nomad/alloc/1523fc28-b77b-9043-2857-99a3e7f9c021/alloc/logs/.run-consul-backup.stderr.fifo timestamp=2020-05-24T21:00:00.447Z
2020/05/24 21:00:01.141007 [INFO] (runner) creating new runner (dry: false, once: false)
2020/05/24 21:00:01.142009 [INFO] (runner) creating watcher
2020/05/24 21:00:01.143019 [INFO] (runner) starting
2020/05/24 21:00:01.248536 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/alloc/1523fc28-b77b-9043-2857-99a3e7f9c021/run-consul-backup/secrets/file.env"
2020-05-24T21:00:01.253Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/services/consul/assets/scripts/backup.sh Args:[]}"
2020-05-24T21:00:16.806Z [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 driver=raw_exec task_name=run-consul-backup error="rpc error: code = Unavailable desc = transport is closing"
2020-05-24T21:00:16.806Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup error="rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin525470496: connect: no such file or directory""
2020-05-24T21:00:16.807Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup reason="Restart unnecessary as task terminated successfully"
2020-05-24T21:00:16.809Z [INFO] client.gc: marking allocation for GC: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021
2020/05/24 21:00:16.812760 [INFO] (runner) stopping
2020/05/24 21:00:16.812991 [INFO] (runner) received finish
2020-05-24T21:21:14.885Z [INFO] client.gc: garbage collecting allocation: alloc_id=a4870502-4cd3-4404-c977-c2b18fd2f31b reason="forced collection"
2020-05-24T21:30:00.409Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa task=run-consul-backup @module=logmon path=/opt/nomad/alloc/2f327d5b-3f89-ace6-3463-71acd31241aa/alloc/logs/.run-consul-backup.stdout.fifo timestamp=2020-05-24T21:30:00.409Z
2020-05-24T21:30:00.411Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa task=run-consul-backup path=/opt/nomad/alloc/2f327d5b-3f89-ace6-3463-71acd31241aa/alloc/logs/.run-consul-backup.stderr.fifo @module=logmon timestamp=2020-05-24T21:30:00.410Z
2020/05/24 21:30:01.014612 [INFO] (runner) creating new runner (dry: false, once: false)
2020/05/24 21:30:01.015174 [INFO] (runner) creating watcher
2020/05/24 21:30:01.015990 [INFO] (runner) starting
2020/05/24 21:30:01.148086 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/alloc/2f327d5b-3f89-ace6-3463-71acd31241aa/run-consul-backup/secrets/file.env"
2020-05-24T21:30:01.152Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/services/consul/assets/scripts/backup.sh Args:[]}"
2020-05-24T21:30:16.795Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa task=run-consul-backup reason="Restart unnecessary as task terminated successfully"
2020-05-24T21:30:16.797Z [INFO] client.gc: marking allocation for GC: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa
2020/05/24 21:30:16.800719 [INFO] (runner) stopping
2020/05/24 21:30:16.801398 [INFO] (runner) received finish
2020-05-24T21:36:14.880Z [INFO] client.gc: garbage collecting allocation: alloc_id=70a4b1f4-0d38-5eeb-05c8-85e379ad047e reason="forced collection"
Allocations appear to still be running on the node: previously allocated tasks show as active and working, and hitting one of them (a fabio allocation) gets a response.
Consul and other services are functioning as expected
systemctl restart nomad also hangs indefinitely; a kill of the process is required. After killing the process, the Nomad client functions as expected again.
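For reference, the recovery amounts to something like the following (a sketch, assuming the nomad unit name from the systemctl output above; equivalent to killing the main PID directly):

# systemctl restart hangs, so SIGKILL the agent and start the unit fresh
sudo systemctl kill --signal=SIGKILL nomad
sudo systemctl start nomad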
Thoughts
This looks like a possible deadlock. We wonder whether the way we call GC fairly regularly could be triggering it. We see the problem occur regularly, and simply restarting the nomad agent fixes it.
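For context, the "forced collection" entries in the log above are the kind of collections that can be triggered through the agent's client GC endpoints. A sketch of how such a forced GC can be issued (assuming the default local address, with ALLOC_ID as a placeholder):

# Force garbage collection of all stopped allocations on this client
curl -sS "http://127.0.0.1:4646/v1/client/gc"

# Or force collection of a single stopped allocation
curl -sS "http://127.0.0.1:4646/v1/client/allocation/ALLOC_ID/gc"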