This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

dracut/30ignition: Explicitly add OnFailure=emergency.target #61

Closed
wants to merge 1 commit into from


jlebon
Member

@jlebon jlebon commented Mar 20, 2019

Right now, we enable our services by adding a link in
`initrd.target.requires`. However, it seems like systemd doesn't
necessarily fail the boot even if `initrd.target` fails. That unit has:

    OnFailure=emergency.target
    OnFailureJobMode=replace-irreversibly

But looking at the logs, it seems like that can get overridden by the
`systemctl isolate initrd-switch-root.target` call that
`initrd-cleanup.service` does.

Let's just be really explicit here and tell systemd we want to
immediately switch to `emergency.target` if any of our units fail.
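
As a minimal sketch of what "being explicit" could look like, the fragment below adds the failure handler directly to a unit. The drop-in file name is hypothetical and `ignition-files.service` is used only because it appears in the logs in this thread; the actual change may edit the unit files themselves rather than use a drop-in.

```ini
# Hypothetical drop-in: ignition-files.service.d/10-onfailure.conf
# (file name and location are assumptions, not taken from this PR)
[Unit]
# If this unit fails, enqueue emergency.target directly instead of
# relying on initrd.target's own OnFailure= to take effect.
OnFailure=emergency.target
```

`OnFailureJobMode=` is left at its default here; whether it should mirror the `replace-irreversibly` setting that `initrd.target` uses is a separate choice this sketch does not make.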
@jlebon
Member Author

jlebon commented Mar 20, 2019

This is the master version of #60.

@dustymabe
Member

cc @bgilbert

@bgilbert
Contributor

Hmm, CL doesn't have this problem. @dm0-, can you think of a reason initrd.target failures would behave differently on FCOS?

@dm0-
Contributor

dm0- commented Mar 22, 2019

I don't know why there would be a difference with initrd.target specifically, but if you have an image and a way to reproduce this, I can take a look at it.

@jlebon
Member Author

jlebon commented Mar 22, 2019

OK, did more digging into this. So on both CL and FCOS, systemd correctly queues up the emergency.target job:

ignition-files.service: Main process exited, code=exited, status=1/FAILURE
ignition-files.service: Failed with result 'exit-code'.
[FAILED] Failed to start Ignition (files).
See 'systemctl status ignition-files.service' for details.
[DEPEND] Dependency failed for Initrd Default Target.
initrd.target: Job initrd.target/start failed with result 'dependency'.
initrd.target: Triggering OnFailure= dependencies.
emergency.target: Trying to enqueue job emergency.target/start/replace-irreversibly
emergency.service: Installed new job emergency.service/start as 179
emergency.target: Installed new job emergency.target/start as 178
emergency.target: Enqueued job emergency.target/start as 178

But on CL, the call to systemctl isolate --no-block initrd-switch-root.target in initrd-cleanup.service correctly fails:

initrd-cleanup.service: Executing: /usr/bin/systemctl --no-block isolate initrd-switch-root.target
...
[   11.052847] systemctl[593]: Failed to start initrd-switch-root.target: Transaction is destructive.
[   11.055565] systemctl[593]: See system logs and 'systemctl status initrd-switch-root.target' for details.
Failed to process message type=method_call sender=n/a destination=org.freedesktop.systemd1 path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=StartUnit cookie=1 reply_cookie=0 signature=ss error-name=n/a error-message=n/a: Transaction is destructive.
Received SIGCHLD from PID 593 (systemctl).
Child 593 (systemctl) died (code=exited, status=4/NOPERMISSION)
initrd-cleanup.service: Child 593 belongs to initrd-cleanup.service.
initrd-cleanup.service: Main process exited, code=exited, status=4/NOPERMISSION

Whereas on FCOS it just works:

initrd-cleanup.service: Executing: /usr/bin/systemctl --no-block isolate initrd-switch-root.target
...
[   36.910964] systemctl[1066]: Executing dbus call org.freedesktop.systemd1.Manager StartUnit(initrd-switch-root.target, isolate)
Bus private-bus-connection: changing state AUTHENTICATING → RUNNING
Got message type=method_call sender=n/a destination=org.freedesktop.systemd1 path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=StartUnit cookie=1 reply_cookie=0 signature=ss error-name=n/a error-message=n/a
initrd-switch-root.target: Trying to enqueue job initrd-switch-root.target/start/isolate
...
emergency.target: Installed new job emergency.target/stop as 145
...
Received SIGCHLD from PID 1066 (systemctl).
Child 1066 (systemctl) died (code=exited, status=0/SUCCESS)
initrd-cleanup.service: Child 1066 belongs to initrd-cleanup.service.
initrd-cleanup.service: Main process exited, code=exited, status=0/SUCCESS
initrd-cleanup.service: Changed start -> dead
initrd-cleanup.service: Job initrd-cleanup.service/start finished, result=done

> if you have an image and a way to reproduce this, I can take a look at it.

Nice. You should be able to reproduce this with the latest FCOS from the pipeline: http://artifacts.ci.centos.org/fedora-coreos/prod/builds/latest/. Just pass an Ignition config that makes ignition-files.service fail, e.g. by adding a non-existent group to the core user.

@dustymabe
Member


> Whereas on FCOS it just works:

are you saying this is a bug in Fedora/FCOS?

@jlebon
Member Author

jlebon commented Mar 22, 2019

> are you saying this is a bug in Fedora/FCOS?

Yes, we should be failing instead of continuing. I'll note though that it doesn't immediately continue. It seems like systemd retries the whole transaction a few times before finally continuing, which means that e.g. ignition-files runs more than once. 😿

@jlebon
Member Author

jlebon commented Mar 25, 2019

@dm0- Were you able to look into this?

While it would be good to eventually get to the bottom of this and figure out what's going on, I wonder if we should just get this in for now. On RHCOS for example, I saw a boot just keep looping forever in the initrd. This is also biting other people and making it harder to debug failures. :(

@dm0-
Contributor

dm0- commented Mar 25, 2019

> @dm0- Were you able to look into this?

Sorry, no, I lost track of this ticket doing CL release work today. I'll look into it tomorrow, but merging the workaround sounds okay until there's a "proper" fix.

@jlebon
Member Author

jlebon commented Mar 25, 2019

One note I forgot to add here: adding an explicit Before=initrd.target to our units also seemed to work, even though AFAICT all our units already end up running before initrd.target. Maybe systemd needs the ordering to be made explicit?
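
For illustration, the explicit ordering mentioned here could be written as the fragment below. This is a sketch only; which units it would go into is not spelled out in this comment. In systemd, requirement dependencies (such as the link in `initrd.target.requires`) and ordering dependencies (`Before=`/`After=`) are separate settings, so a unit can be required by a target without being explicitly ordered against it.

```ini
[Unit]
# Explicit ordering: this unit has to finish starting (or fail) before
# systemd goes on to start initrd.target.
Before=initrd.target
```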

@bgilbert
Contributor

I'm okay merging this for now to avoid the bug, but let's also file an issue to get to the bottom of it.

@jlebon
Member Author

jlebon commented Mar 26, 2019

Cool, let's merge this then? Planning to do a respin soon, and might as well pick it up.

@dustymabe
Member

> Cool, let's merge this then?

👍

@jlebon
Member Author

jlebon commented Mar 26, 2019

This was merged as part of #47.

@jlebon jlebon closed this Mar 26, 2019
jlebon added a commit to jlebon/ignition-dracut that referenced this pull request Apr 3, 2019
I think this was the missing piece in the issues described in coreos#61.
Essentially, `ignition-complete.target` wasn't explicitly ordered with
respect to `initrd.target`, so its completion was racing against systemd's
`isolate` to `initrd-switch-root.target`.

Ordering ourselves before `initrd.target` (which is also the unit that
pulls us in through `initrd.target.requires`) ensures that the target
can only be reached successfully if we're successful. And if our target
fails, then we immediately trigger `emergency.target`.
jlebon added a commit to jlebon/ignition-dracut that referenced this pull request Apr 4, 2019
I think this was the missing piece in the issues described in coreos#61.
Essentially, `ignition-complete.target` wasn't explicitly ordered with
respect to `initrd.target`, so its completion was racing against systemd's
`isolate` to `initrd-switch-root.target`.

Ordering ourselves before `initrd.target` (which is also the unit that
pulls us in through `initrd.target.requires`) ensures that the target
can only be reached successfully if we're successful. And if our target
fails, then we immediately trigger `emergency.target`.
jlebon added a commit to jlebon/ignition-dracut that referenced this pull request Apr 5, 2019
We shouldn't actually need to do `OnFailure=emergency.target` in
`ignition-complete.service` here. Our unit is already a requirement of
`initrd.target`, and so once we fail, `initrd.target` should fail, which
in turn should trigger *its* `OnFailure=emergency.target`.

However, this doesn't work in f29/el8:
https://bugzilla.redhat.com/show_bug.cgi?id=1696796

Add a comment about that.

See also: coreos#61
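
A sketch of how such a comment might sit next to the setting in the unit file. The wording is illustrative and not copied from the commit; only the directive and the bug URL come from the message above.

```ini
[Unit]
# In principle redundant: this unit is a requirement of initrd.target,
# whose own OnFailure=emergency.target should fire when we fail.
# That does not work on f29/el8, so keep the explicit setting for now:
# https://bugzilla.redhat.com/show_bug.cgi?id=1696796
OnFailure=emergency.target
```

The comments are kept on their own lines because systemd unit files do not support trailing comments after a value.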
@jlebon
Member Author

jlebon commented Apr 5, 2019

> > are you saying this is a bug in Fedora/FCOS?
>
> Yes, we should be failing instead of continuing

OK, I filed https://bugzilla.redhat.com/show_bug.cgi?id=1696796 about this.
