Restarting nginx hangs forever #33172

Open
nh2 opened this issue Dec 29, 2017 · 8 comments
Labels
2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md

Comments

@nh2
Contributor

nh2 commented Dec 29, 2017

Issue description

On one of my servers, deploying with nixops, or simply running systemctl restart nginx.service (which nixops also calls), hangs without any output.

systemctl status nginx.service says:

● nginx.service - Nginx Web Server
  Loaded: loaded (/nix/store/0047j5y9c7g7dfmq8y5ikqfrzdk0ijyc-unit-nginx.service/nginx.service; enabled; vendor preset: enabled)
  Active: inactive (dead) since Fri 2017-12-29 15:34:44 UTC; 1min 37s ago
Main PID: 20703 (code=exited, status=0/SUCCESS)

Dec 29 15:34:43 node-1 systemd[1]: nginx.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Dec 29 15:34:43 node-1 nginx[20703]: 2017/12/29 15:34:43 [error] 20712#20712: *435807 connect() failed (111: Connection refused) while connecting to upstream, cli
Dec 29 15:34:43 node-1 nginx[20703]: 2017/12/29 15:34:43 [error] 20712#20712: *435808 connect() failed (111: Connection refused) while connecting to upstream, cli
Dec 29 15:34:44 node-1 systemd[1]: Stopping Nginx Web Server...
Dec 29 15:34:44 node-1 systemd[1]: Stopped Nginx Web Server.

The weird thing here is that there's no Starting ... line after the Stopped ... line, so I'm wondering whether this is a systemd bug, a bug in my config, or a problem with how NixOS configures nginx in this case.
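The symptom described above (a Stopped line with no Starting/Started line after it) can be checked mechanically. The following is a minimal sketch, assuming journal lines in the short syslog-style format shown in this issue; the function name `restart_completed` and the regular expression are illustrative and not part of systemd or NixOS.

```python
import re

# Matches systemd's own state-change messages, for example:
#   "Dec 29 15:34:44 node-1 systemd[1]: Stopped Nginx Web Server."
#   "Dec 29 15:34:44 node-1 systemd[1]: Stopping Nginx Web Server..."
STATE_RE = re.compile(
    r"systemd\[1\]: (Starting|Started|Stopping|Stopped) (.+?)(?:\.\.\.|\.)$"
)


def restart_completed(journal_lines, description):
    """Return True if the last 'Stopped' for the unit is followed by 'Started'.

    `description` is the unit's Description= text, e.g. "Nginx Web Server".
    """
    states = []
    for line in journal_lines:
        match = STATE_RE.search(line.strip())
        if match and match.group(2) == description:
            states.append(match.group(1))
    if "Stopped" not in states:
        return False
    # Index of the last "Stopped" entry in the collected state list.
    last_stop = len(states) - 1 - states[::-1].index("Stopped")
    return "Started" in states[last_stop + 1:]
```

Fed the two systemd lines from node-1 above, this would report an incomplete restart, because nothing follows the final Stopped entry.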

I'm also not sure why I'm getting Current command vanished from the unit file here.

Steps to reproduce

Not sure yet.

Technical details

On top of nixpkgs commit a2845aa.

@nh2
Contributor Author

nh2 commented Dec 29, 2017

On similar machines that are deployed the same way but are not in the middle of a nixops deploy right now, systemctl restart nginx.service succeeds immediately with:

Dec 29 16:02:19 node-3 systemd[1]: Stopping Nginx Web Server...
Dec 29 16:02:19 node-3 systemd[1]: Stopped Nginx Web Server.
Dec 29 16:02:19 node-3 systemd[1]: Starting Nginx Web Server...
Dec 29 16:02:19 node-3 nginx-pre-start[1217]: nginx: the configuration file /nix/store/qlcqn1k6ra0i6azp6n0gb4a2ldckqciz-nginx.conf syntax is ok
Dec 29 16:02:19 node-3 nginx-pre-start[1217]: nginx: configuration file /nix/store/qlcqn1k6ra0i6azp6n0gb4a2ldckqciz-nginx.conf test is successful
Dec 29 16:02:19 node-3 systemd[1]: Started Nginx Web Server.

@nh2
Contributor Author

nh2 commented Dec 29, 2017

I suspect a race condition in nixops or switch-to-configuration that leaves the systemd unit file empty (or without an ExecStart line, or something like that), maybe because of non-atomic file writes. As a result, systemd logs Current command vanished from the unit file, execution of the command list won't be resumed. and doesn't start the unit again, even if the complete unit file is written a little later.
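To make the "non-atomic file writes" hypothesis concrete: a writer that truncates a file and then fills it leaves a window in which a reader sees an empty or partial file, whereas writing to a temporary file and renaming it over the target does not. The following is a generic sketch of the rename approach. It is not taken from nixops or the NixOS activation scripts, and whether they write unit files this way is not established in this thread; `write_atomically` is a hypothetical helper name.

```python
import os
import tempfile


def write_atomically(path, text):
    """Replace `path` with `text` so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk before renaming
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern a concurrent reader that opens the path at any moment gets either the complete old contents or the complete new contents.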

Here is the systemd code that issues this warning.

@nh2
Contributor Author

nh2 commented Dec 29, 2017

The commit message of the commit that introduced this logic in systemd also gives some hints:

The problem might be this: if a systemctl daemon-reload makes systemd discover a new nginx unit file (put there by switch-to-configuration) whose ExecStart line has different contents, that warning case gets triggered.

The commit message says

However, in most cases (we assume that in most common case unit file command list is not changed while some other command is running for the same unit) it should cause that systemd does the right thing, which is restoring execution exactly at the point we were before daemon-reload.

I'm not sure if this is problematic or not: we certainly "change the unit file command list" while a command (nginx) is running, but we have only one ExecStart= command, so "while some other command is running" is not true for us.
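The matching behaviour discussed above can be modelled roughly as follows. This is a simplified sketch of my understanding of the logic, not systemd's actual code: the command that was running before daemon-reload is looked up in the reloaded command list, first at its old position and then anywhere in the list, and if it is not found the warning is logged and execution is not resumed. The function name and the list-of-argv representation are assumptions made for illustration.

```python
def find_command_to_resume(serialized_argv, serialized_index, new_commands):
    """Return the index of the command to resume in the reloaded unit, or None.

    serialized_argv:  argv of the command that was running before the reload
    serialized_index: its position in the old command list
    new_commands:     list of argv lists parsed from the reloaded unit file
    """
    # Preferred case: the same command is still at the same position.
    if (0 <= serialized_index < len(new_commands)
            and new_commands[serialized_index] == serialized_argv):
        return serialized_index
    # Fallback: the same command exists, but at a different position.
    for index, argv in enumerate(new_commands):
        if argv == serialized_argv:
            return index
    # No match: this corresponds to "Current command vanished from the unit file".
    return None
```

Under this model, a single ExecStart= whose arguments change (for example because it points at a different nginx.conf path after a rebuild) has no match in the new list, so the function returns None even though only one command is involved.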

@nh2
Contributor Author

nh2 commented Dec 29, 2017

Urgh it's getting really strange here.

On the machine that is not in the middle of a nixops deploy (and the corresponding nginx restart hang), I can run systemctl restart nginx, and journalctl -f -t systemd shows:

Dec 29 16:51:34 node-3 systemd[1]: Stopping Nginx Web Server...
Dec 29 16:51:34 node-3 systemd[1]: Stopped Nginx Web Server.
Dec 29 16:51:34 node-3 systemd[1]: Starting Nginx Web Server...
Dec 29 16:51:34 node-3 systemd[1]: Started Nginx Web Server.

as expected.

But on the problematic machine, running that command reliably gives me this instead:

Dec 29 16:45:36 node-1 systemd[1]: Starting consulReady.service...
Dec 29 16:45:36 node-1 systemd[1]: Started consulReady.service.

which is a completely different service that I wrote and that seems to have no dependency on nginx whatsoever.

What's going on? Is this some weird offset error in systemd that gets triggered during the deserialisation of daemon-reload, with systemd starting the wrong unit after that, or is there some dependency that I just don't spot?

@brainrake
Contributor

brainrake commented Jun 19, 2018

Is this still a problem with 18.03? Did you manage to find a NixOS config that reliably reproduces this issue?

@nh2
Contributor Author

nh2 commented Jun 24, 2018

Is this still a problem with 18.03?

I can't tell yet, I have only upgraded these servers to 18.03 very recently and thus haven't collected much data on it yet.

Did you manage to find a NixOS config that reliably reproduces this issue?

No.

I did get Current command vanished from the unit file just last month though (note this is also 17.09), across all machines:

Jun 14 07:57:13 node-1 systemd[1]: nginx.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jun 14 10:05:49 node-1 systemd[1]: nginx.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jun 14 07:57:13 node-2 systemd[1]: nginx.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jun 14 10:05:49 node-2 systemd[1]: nginx.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jun 14 07:57:13 node-3 systemd[1]: nginx.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Jun 14 10:05:50 node-3 systemd[1]: nginx.service: Current command vanished from the unit file, execution of the command list won't be resumed.

Only nginx.service seems to emit this; this is the output of journalctl | grep 'Current command' on all machines in that setup.
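For comparing machines, the grep above can be extended to count the warning per host and unit. This is a minimal sketch that assumes the short syslog-style journal format shown in this comment (month, day, time, host, tag, then the unit name followed by a colon); `count_warnings` is a hypothetical helper, not an existing tool.

```python
from collections import Counter

WARNING = "Current command vanished from the unit file"


def count_warnings(journal_lines):
    """Count occurrences of the warning, keyed by (host, unit)."""
    counts = Counter()
    for line in journal_lines:
        if WARNING not in line:
            continue
        # Fields: Mon DD HH:MM:SS host systemd[1]: unit: message...
        fields = line.split()
        host = fields[3]
        unit = fields[5].rstrip(":")
        counts[(host, unit)] += 1
    return counts
```

The input could be the lines of journalctl output collected from each machine; any unit other than nginx.service showing up in the result would contradict the observation above.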

@asymmetric
Contributor

Should this be closed? Only one report so far in 1 year, we can reopen if it turns out to still be an issue.

@stale

stale bot commented Jun 2, 2020

Thank you for your contributions.

This has been automatically marked as stale because it has had no activity for 180 days.

If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity.

Here are suggestions that might help resolve this more quickly:

  1. Search for maintainers and people that previously touched the related code and @ mention them in a comment.
  2. Ask on the NixOS Discourse.
  3. Ask on the #nixos channel on irc.freenode.net.

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 2, 2020