Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch-to-configuration.pl: Fix timers resulting in old unit running #307562

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nh2
Copy link
Contributor

@nh2 nh2 commented Apr 28, 2024

Fixes #49415.

See the added comment for a detail description.

See also the related #49528.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

CC:

@nh2 nh2 self-assigned this Apr 28, 2024
@nh2 nh2 requested a review from dasJ as a code owner April 28, 2024 21:07
@github-actions github-actions bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` labels Apr 28, 2024
…ixes NixOS#49415

See the added comment for a detail description.

See also the related NixOS#49528.
@nh2 nh2 force-pushed the issue-49415-fix-systemd-timers-interrupting-stops branch from c65e2f9 to 3f896ed Compare April 28, 2024 21:12
}
print STDERR "ensuring the following units and targets are started: ", join(", ", sort(keys(%units_to_start))), "\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not potentially a quite long list. I thought this is what the filter above was for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a medium long list on my servers, e.g. 3-4 lines in my 100-char terminal.

I believe it is worth showing, because:

  • It's useful for debugging what's going on. The systemctl start done here is one of the 3 key actions done in configuration switching (the other ones being stop and restart), and depending on what your units/targets do, it may be the last line you see before your server loses the network or desktop loses the GPU output. In those cases, it can be extremely helpful to see what's being started.
  • This not being an invisble start invocation will hopefully reduce the number of issues people can't diagnose/solve by themselves, and thus number of issues they file. Most NixOS users understand systemctl and can reproduce issues when being told what systemctl commands to run, but much fewer users are able/willing to read/modify Perl to be able see the invocation.

I added it as a separate log line in addition to the previous one that lists only the stopped units, so that the short, most important line stays easy to read for the user. Then the longer line follows, which the user can ignore if their system didn't go down.

@@ -806,6 +806,7 @@ sub filter_units {
print STDERR "stopping the following units: ", join(", ", @units_to_stop_filtered), "\n";
}
# Use current version of systemctl binary before daemon is reexeced.
# The stop may be cancelled by a timer, see note [systemd-timers-cancelling-stop-commands].
Copy link
Member

@arianvp arianvp Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this stopping units that are getting removed from the NixOS configuration right? So this specific case here is probably very rare (I hope)

@arianvp
Copy link
Member

arianvp commented Apr 30, 2024

Wouldn't changing start to a restart cause ExecStop to run twice? Once in the old generation through the systemctl stop and once in the new generation through the systemctl restart. This seems like surprising behaviour to me and I wonder if it will break things subtly

@arianvp
Copy link
Member

arianvp commented Apr 30, 2024

Oh: but if it's stopped already I guess it isn't stopped a second time by restart... Okay perhaps this approach actually works

@symphorien
Copy link
Member

Since this is a tricky problem I wrote a regression test, feel free to include it

import ./make-test-python.nix ({ pkgs, ... }: {
  name = "nixos-rebuild-timers";

  nodes = {
    machine = { lib, pkgs, ... }: {
      imports = [
        ../modules/profiles/installation-device.nix
        ../modules/profiles/base.nix
      ];

      nix.settings = {
        substituters = lib.mkForce [ ];
        hashed-mirrors = null;
        connect-timeout = 1;
      };

      system.includeBuildDependencies = true;

      system.extraDependencies = [
        # Not part of the initial build apparently?
        pkgs.grub2
      ];

      virtualisation = {
        cores = 2;
        memorySize = 4096;
      };

      systemd.services.dostuff = {
        script = "while true; do echo a > /tmp/a; sleep 1; done";
        startAt = "*:*"; # every second, in practice it seems rate limited to 30s
        serviceConfig.Type="simple";
      };

      # this service ensures that switching takes at least 60s, enough time for
      # dostuff.timer to fire
      systemd.services.delaySwitch = {
        script = "echo delaySwitch started";
        postStop = "echo start sleeping; sleep 60; echo stop sleeping";
        serviceConfig.Type="oneshot";
        serviceConfig.RemainAfterExit=true;
        wantedBy = [ "multi-user.target" ];
      };
    };
  };

  testScript =
    let
      configFile = pkgs.writeText "configuration.nix" ''
        { lib, pkgs, ... }: {
          imports = [
            ./hardware-configuration.nix
            <nixpkgs/nixos/modules/testing/test-instrumentation.nix>
          ];

          boot.loader.grub = {
            enable = true;
            device = "/dev/vda";
            forceInstall = true;
          };

          documentation.enable = false;

      systemd.services.dostuff = {
        script = "echo b > /tmp/a";
        startAt = "*:*"; # every second
        serviceConfig.Type="oneshot";
      };

      systemd.services.delaySwitch = {
        script = "echo delaySwitch changed";
        postStop = "echo start sleeping; sleep 60; echo stop sleeping";
        serviceConfig.Type="oneshot";
        serviceConfig.RemainAfterExit=true;
        wantedBy = [ "multi-user.target" ];
      };

        }
      '';

    in
    ''
      machine.start()
      machine.succeed("udevadm settle")
      machine.wait_for_unit("multi-user.target")

      machine.succeed("nixos-generate-config")
      machine.copy_from_host(
          "${configFile}",
          "/etc/nixos/configuration.nix",
      )

      with subtest("Switch to the base system"):
          machine.succeed("nixos-rebuild switch")

      machine.wait_until_succeeds("grep b /tmp/a")

    '';
})

@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 1-10 10.rebuild-linux: 1-10
Projects
None yet
Development

Successfully merging this pull request may close these issues.

systemd timers can result in changed services not being restarted reliably
5 participants