Broken NixOS setup as a result of now removed mirrored boot setup docs #531

Open
FrostKiwi opened this issue Oct 8, 2024 · 6 comments

Comments

@FrostKiwi
Contributor

FrostKiwi commented Oct 8, 2024

I followed this repo's Root on ZFS guide for NixOS, back during NixOS 22.11, to create a NixOS setup where 2 SSDs are mirrored for a redundant boot drive. This resulted in very weird issues at that time ( NixOS/nixpkgs#214871 ), which were resolved by updates in 1211e98 by @gmelikov, as reported in #383. This setup ran fine for me for a very long time, with the following config, as per the Root on ZFS docs:

{ config, pkgs, ... }:

{
  networking.hostId = "XXXX";
  boot = {
    supportedFilesystems = [ "zfs" ];
    kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages;
    loader = {
      efi = {
        efiSysMountPoint = "/boot/efis/nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966XXXX-part1";
        canTouchEfiVariables = true;
      };
      generationsDir.copyKernels = true;
      grub = {
        efiInstallAsRemovable = false;
        enable = true;
        copyKernels = true;
        efiSupport = true;
        zfsSupport = true;
        extraInstallCommands = ''
ESP_MIRROR=$(${pkgs.coreutils}/bin/mktemp -d)
${pkgs.coreutils}/bin/cp -r ${config.boot.loader.efi.efiSysMountPoint}/EFI $ESP_MIRROR
for i in /boot/efis/*; do
 ${pkgs.coreutils}/bin/cp -r $ESP_MIRROR/EFI $i
done
${pkgs.coreutils}/bin/rm -rf $ESP_MIRROR
'';
        devices = [
          "/dev/disk/by-id/nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966XXXX"
          "/dev/disk/by-id/nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966XXXX"
        ];
      };
    };
  };
  users.users.root.initialHashedPassword = "XXXX";
}

During the update to NixOS 24.05, this setup broke the update process. The update finished with all packages rebuilt and restarted, but failed at the final steps, leaving what I would guess is an unbootable state, though I haven't tried to reboot yet.

$ nixos-rebuild switch --upgrade
unpacking channels...
building Nix...
building the system configuration...
updating GRUB 2 menu...
/nix/store/mr63za5vkxj0yip6wj3j9lya2frdm3zc-coreutils-9.5/bin/cp: cannot stat '/boot/efis/nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966732E-part1/BOOT': Too many levels of symbolic links
/nix/store/mr63za5vkxj0yip6wj3j9lya2frdm3zc-coreutils-9.5/bin/cp: cannot stat '/boot/efis/nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966732E-part1/NixOS-boot-efis-nvme-Samsung_SSD_980_PRO_1TB_S5GXNX0T966731K-part1': Too many levels of symbolic links
warning: error(s) occurred while switching to the new configuration

Via cc6d72c and 1211e98 these instructions were deleted with commit messages:

Previously we used a bind mount from /boot/efis/*-part1 to /boot/efi to facilitate bootloader configuration. Recent reports indicate that this bind mount prevents the system from booting. This pull request removes the bind mount.

Now the Root on ZFS docs just say "Format and mount ESP. Only one of them is used as /boot, you need to set up mirroring afterwards", with no new documentation to take the place of the removed instructions. The documentation also says:

If you have a bug report or feature request related to this HOWTO, please file a new issue and mention @ne9z.

But that user is deleted, so I assume this was the handle of Maurice Zhou <ja@apvc.uk>

What would be appropriate steps to migrate this? The NixOS Discord recommended that I look into boot.loader.grub.mirroredBoots, which seems to support the mirroring previously implemented by the bash snippet in extraInstallCommands.
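
For reference, a minimal and untested sketch of what a boot.loader.grub.mirroredBoots setup might look like for two ESPs. The mount points containing DISK1 and DISK2 are hypothetical placeholders, not paths from the guide, and "nodev" assumes an EFI-only install with no BIOS boot sector. This is not a verified migration path, so try it on a non-production machine first.

```nix
{ ... }:

{
  boot.loader.grub = {
    enable = true;
    efiSupport = true;
    zfsSupport = true;
    copyKernels = true;

    # No top-level `devices` here: each mirroredBoots entry carries its own.
    mirroredBoots = [
      {
        # Hypothetical mount point of the first ESP.
        path = "/boot/efis/DISK1-part1";
        efiSysMountPoint = "/boot/efis/DISK1-part1";
        # "nodev" asks for an EFI-only install (assumption: no BIOS boot).
        devices = [ "nodev" ];
      }
      {
        # Hypothetical mount point of the second ESP.
        path = "/boot/efis/DISK2-part1";
        efiSysMountPoint = "/boot/efis/DISK2-part1";
        devices = [ "nodev" ];
      }
    ];
  };
}
```

With this shape, GRUB is installed once per entry, so no copy commands in extraInstallCommands are involved. Both ESPs still have to be mounted at rebuild time.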

What are good next steps to take, to make the system viable again? How should I migrate away from the now deleted extraInstallCommands script? I have a rough plan in my head, but since this concerns a live system, I would love some input.

@gmelikov
Member

Unfortunately I don't use NixOS, so I can't help you much, and we don't have an active NixOS doc contributor. If there are more problems with this guide, we'll have to deprecate it.

FWIW maybe as a workaround you may use only one boot disk as a start.

@FrostKiwi
Contributor Author

Solved my issue by going back to a single drive bootloader.

Unfortunately I don't use NixOS, so I can't help you much, and we don't have an active NixOS doc contributor. If there are more problems with this guide, we'll have to deprecate it.

FWIW maybe as a workaround you may use only one boot disk as a start.

Yeah, I think it should be. Keeping the documentation, but conceding that the bootloader lives on just one of the drives, may be something I can contribute. From the standpoint of Nix, all previous implementations were hacks using copy commands, something that NixOS Discord members were horrified to read in my config, which I had taken from previous iterations of the setup guide here.

I tried to switch to boot.loader.grub.mirroredBoots, but NixOS exploded and became unbootable, even though the NixOS rebuild was successful. I read all the documentation around all the GRUB settings, made sure the right things were mounted, and cross-checked with efibootmgr and lsblk for consistency, and it didn't matter. The bootloaders as produced by Nix weren't even perfect copies; something around boot.loader.generationsDir.copyKernels doesn't play well with mirroredBoots. Nix's idea of purity and the inherently impure way in which UEFI, EFI and the mounting process work mix like oil and water, at least at my Nix skill level.

I had to create a NixOS rescue USB stick and chroot into the original system. Rerunning nixos-rebuild switch didn't work. Nothing described in https://nixos.wiki/wiki/Change_root worked either; NixOS refused a rebuild following PAM authentication errors, even though the chrooted user was root.
https://nixos.wiki/wiki/Bootloader#From_an_installation_media didn't work at first either, because it would restore a bootloader broken by the mirroredBoots configuration. In the end https://nixos.wiki/wiki/Bootloader#From_an_installation_media was the correct answer though, as I could make NixOS recreate the bootloader from a previous generation by running /nix/var/nix/profiles/<SUBSTITUTING THIS>/bin/switch-to-configuration boot. What a mess, not touching any of that again. Running with a mirrored ZFS root but unmirrored bootloaders now. If the day comes that one half of the mirror dies, then I'll have to do the bootloader thing again, but for now at least it works.

@nipsy

nipsy commented Jan 15, 2025

Saw your comment over on HN. Not sure if this will help at all, but here's how I'm configuring all of my ZFS based mirrored systems:

boot = {
  initrd.kernelModules = [ "zfs" ];
  kernelPackages = pkgs.linuxPackages_6_12;
  loader = {
    efi = {
      canTouchEfiVariables = true;
      efiSysMountPoint = "/efiboot/efi1";
    };
    systemd-boot = {
      enable = true;
      extraInstallCommands = ''
        ${pkgs.rsync}/bin/rsync -av --delete /efiboot/efi1/ /efiboot/efi2
      '';
    };
    timeout = 3;
  };
  supportedFilesystems = [ "zfs" ];
  #zfs.package = pkgs.master.zfs;
};

I'm using disko for the underlying partition definitions:

disk = {
  nvme0n1 = {
    type = "disk";
    device = "/dev/disk/by-id/nvme-WD_BLACK_SN850X_4000GB_23162P800014";
    content = {
      type = "gpt";
      partitions = {
        ESP = {
          size = "1G";
          type = "EF00";
          content = {
            type = "filesystem";
            format = "vfat";
            mountpoint = "/efiboot/efi1";
            mountOptions = [ "X-mount.mkdir" "umask=0077" ];
            extraArgs = [ "-nESP1" ];
          };
        };
        swap = {
          size = "32G";
          type = "8200";
          content = {
            type = "swap";
            extraArgs = [ "-L swap1" ];
          };
        };
        zfs = {
          size = "100%";
          content = {
            type = "zfs";
            pool = "rpool";
          };
        };
      };
    };
  };
  nvme1n1 = {
    type = "disk";
    device = "/dev/disk/by-id/nvme-WD_BLACK_SN850X_4000GB_23162P800005";
    content = {
      type = "gpt";
      partitions = {
        ESP = {
          size = "1G";
          type = "EF00";
          content = {
            type = "filesystem";
            format = "vfat";
            mountpoint = "/efiboot/efi2";
            mountOptions = [ "X-mount.mkdir" "umask=0077" ];
            extraArgs = [ "-nESP2" ];
          };
        };
        swap = {
          size = "32G";
          type = "8200";
          content = {
            type = "swap";
            extraArgs = [ "-L swap2" ];
          };
        };
        zfs = {
          size = "100%";
          content = {
            type = "zfs";
            pool = "rpool";
          };
        };
      };
    };
  };
};

I'm not including the subsequent zpool definition for rpool which follows all of that because I don't think it's directly relevant to the discussion.

@FrostKiwi
Contributor Author

FrostKiwi commented Jan 15, 2025

Many thanks for your config. Thanks for introducing me to Disko, seems super useful!

As for the mirrored bootloaders, you aren't using the built-in boot.loader.grub.mirroredBoots either, and instead rely on a self-made ${pkgs.rsync}/bin/rsync -av --delete /efiboot/efi1/ /efiboot/efi2.

This is similar to how the Root on ZFS guide did it before its approach was removed, in this line:

boot.loader.grub.extraInstallCommands = ''
though your commands and their invocation through systemd-boot are different. Super interesting, will totally look into it and what it means to do this in the context of systemd-boot.

On the NixOS Discord, when I showed my config following the Root on ZFS instructions, I was warned in ALL CAPS not to use such copy commands, as they are a ticking time bomb: they rely on mounts being present and correct, which a mere copy command does not guarantee.

So I'm still unsure what the Nix-approved correct way is, one that won't blow up if the hardware does things that the NixOS state didn't expect.

@nipsy

nipsy commented Jan 15, 2025

Yeah, disko is great: it lets you declare all of your partitions ahead of time during installation, and it automatically translates its entries to the corresponding fileSystems entries, so you don't have to duplicate them elsewhere.

Sorry, I didn't read closely enough to realize you were specifically looking for an officially supported option. It would be really nice to have such a thing again though.

While I'm certainly not sanity checking anything with my current rsync invocation, it seems like that would be trivial to add if it were a major concern for someone. It's not entirely clear to me how this wouldn't be a problem for just about any solution really. It seems like it would only potentially be a problem if you were doing automated updates or some such, and even then, if you have partitions failing and no monitoring to catch that, I think you have bigger architecture problems.
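
A sanity check along those lines could look roughly like the sketch below. It is untested and only guards against one failure mode, an ESP that is not mounted. It reuses the /efiboot/efi1 and /efiboot/efi2 mount points from the config above and assumes mountpoint from pkgs.util-linux. Skipping with a warning instead of exiting non-zero is a deliberate choice here, so that a missing mirror does not fail the whole bootloader install.

```nix
{ pkgs, ... }:

{
  boot.loader.systemd-boot.extraInstallCommands = ''
    # Only sync when both ESPs are really mounted; otherwise rsync would
    # write into a plain directory on the root filesystem.
    if ${pkgs.util-linux}/bin/mountpoint -q /efiboot/efi1 \
      && ${pkgs.util-linux}/bin/mountpoint -q /efiboot/efi2; then
      ${pkgs.rsync}/bin/rsync -a --delete /efiboot/efi1/ /efiboot/efi2
    else
      # Warn and carry on, so the primary ESP install still counts.
      echo "ESP mirror skipped: /efiboot/efi1 or /efiboot/efi2 is not mounted" >&2
    fi
  '';
}
```

The warning only lands in the nixos-rebuild output, so a stale second ESP would still need to be noticed by whoever runs the update.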

But I'd love to hear how this is wrong, and if so, what the better way to solve the problem would be. I figure with this, even if I need to go into the BIOS to change the boot disk or some such, I'll still at least be able to boot my system from the other half of the mirror regardless, as ultimately this is simply a FAT partition with some EFI boot loader bits and the kernel/initrd(s) on it.

Oh, I was also going to mention a link to my own Git repository is on my profile page should you wish to peruse any of that.

@FrostKiwi
Contributor Author

a link to my own Git

Awesome, thanks!

But I'd love to hear how this is wrong

I don't think it's wrong, had a very similar setup, until the major 24.05 update when it blew up ;__;

and if so, what the better way to solve the problem would be.

A mystery I'd love to know myself.

It seems like that would be trivial to add if it were a major concern for someone.
It seems like it would only potentially be a problem if you were doing automated updates

I don't do automated updates. It's an issue for manual updates as well, and sanity checks aren't that simple.

The way GRUB installs in NixOS are set up, a failed install prevents services that don't support hot reloading like nginx from launching again. Take Nextcloud as an example: nixos-rebuild switch runs, succeeds, stops services, performs a grub-install and relaunches the services. A failure while installing GRUB leaves NixOS in a state where it never finishes relaunching many services. A failure in nixos-rebuild switch shouldn't kill your system, but it does if the switch part isn't clean. That's nothing a simple check can solve, I think, though I don't know Nix well enough to judge that.

For instance, the recent bigger NixOS update that happened this week caused a re-derivation of many services. Nextcloud and the mail server stopped, GRUB failed to install because nixos-rebuild switch unmounted /boot for a reason I'm still investigating, and the services never started back up. Even with the update being manual, this led to a downtime of 5 minutes for my users and several emails being delayed.
