dracut/30ignition: Explicitly add OnFailure=emergency.target #61
Right now, we enable our services by adding a link in `initrd.target.requires`. However, it seems like systemd doesn't necessarily fail the boot even if `initrd.target` fails. That unit has:

    OnFailure=emergency.target
    OnFailureJobMode=replace-irreversibly

But looking at the logs, it seems like that can get overridden by the `systemctl isolate initrd-switch-root.target` call that `initrd-cleanup.service` does.

Let's just be really explicit here and tell systemd we want to immediately switch to `emergency.target` if any of our units fail.
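For illustration, the explicit form could look like the fragment below. This is a minimal sketch, not the actual diff: the unit name `ignition-example.service`, its description, and its `ExecStart=` command are hypothetical, and reusing `OnFailureJobMode=replace-irreversibly` from `initrd.target` is an assumption.

```ini
# ignition-example.service (hypothetical unit, for illustration only)
[Unit]
Description=Example Ignition stage
# If this unit fails, ask systemd to start emergency.target directly
# instead of relying on initrd.target's own OnFailure= handling.
OnFailure=emergency.target
# Assumption: the same job mode that initrd.target itself declares.
OnFailureJobMode=replace-irreversibly

[Service]
Type=oneshot
RemainAfterExit=yes
# Placeholder command that always fails, to exercise the OnFailure= path.
ExecStart=/usr/bin/false
```

With something like this in each unit, a failure should no longer depend on `initrd.target` winning against the `isolate` job queued by `initrd-cleanup.service`.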
This is the master version of #60. |
cc @bgilbert |
Hmm, CL doesn't have this problem. @dm0-, can you think of a reason |
I don't know why there would be a difference with |
OK, did more digging into this. So on both CL and FCOS, systemd correctly queues up the
But on CL, the call to
Whereas on FCOS it just works:
Nice. You should be able to reproduce this with the latest FCOS from the pipeline: http://artifacts.ci.centos.org/fedora-coreos/prod/builds/latest/. (And just passing an Ignition that will make |
are you saying this is a bug in Fedora/FCOS? |
Yes, we should be failing instead of continuing. I'll note though that it doesn't immediately continue. It seems like systemd retries the whole transaction a few times before finally continuing, which means that e.g. |
@dm0- Were you able to look into this? While it would be good to eventually get to the bottom of this and figure out what's going on, I wonder if we should just get this in for now. On RHCOS for example, I saw a boot just keep looping forever in the initrd. This is also biting other people and making it harder to debug failures. :( |
Sorry, no, I lost track of this ticket doing CL release work today. I'll look into it tomorrow, but merging the workaround sounds okay until there's a "proper" fix. |
One note I forgot to add here is that adding an explicit |
I'm okay merging this for now to avoid the bug, but let's also file an issue to get to the bottom of it. |
Cool, let's merge this then? Planning to do a respin soon, and might as well pick it up. |
👍 |
This was merged as part of #47. |
I think this was the missing piece in the issues described in coreos#61. Essentially, `ignition-complete.target` wasn't explicitly ordered with respect to `initrd.target`, so its completion was racing against systemd's `isolate` to `initrd-switch-root.target`. Ordering ourselves before `initrd.target` (which is also the unit that pulls us in through `initrd.target.requires`) ensures that the target can only be reached successfully if we're successful. And if our target fails, then we immediately trigger `emergency.target`.
We shouldn't actually need to do `OnFailure=emergency.target` in `ignition-complete.service` here. Our unit is already a requirement of `initrd.target`, and so once we fail, `initrd.target` should fail, which in turn should trigger *its* `OnFailure=emergency.target`.

However, this doesn't work in f29/el8: https://bugzilla.redhat.com/show_bug.cgi?id=1696796

Add a comment about that.

See also: coreos#61
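A rough sketch of how the ordering and the commented workaround might sit together in one unit. It assumes the unit in question is the `ignition-complete.target` named in the earlier commit message; the description text, the comment wording, and the `OnFailureJobMode=` value are assumptions, not the merged file.

```ini
# ignition-complete.target (sketch under the assumptions above)
[Unit]
Description=Ignition Complete
# Order ourselves before initrd.target, which pulls us in through
# initrd.target.requires, so initrd.target cannot be reached before we finish.
Before=initrd.target

# This should not be necessary: we are a requirement of initrd.target, so
# our failure should fail initrd.target and trigger its own
# OnFailure=emergency.target. That does not work in f29/el8; see
# https://bugzilla.redhat.com/show_bug.cgi?id=1696796
OnFailure=emergency.target
# Assumption: mirror the job mode that initrd.target declares.
OnFailureJobMode=replace-irreversibly
```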
OK, I filed https://bugzilla.redhat.com/show_bug.cgi?id=1696796 about this. |