initrd: nm-initrd.service fails to spawn depending on console setup #943
Comments
This may be fallout from #842 (mainly because it was first implemented in |
Without OKD, a basic standalone VM with an otherwise similar but purely local Ignition file (entirely within guestinfo.ignition, without the merged-in remote/secondary Ignition file) comes up fine. I would need to construct a case where I configure a standalone VM with an additional remote Ignition file merged in; I have not had that case yet. |
Yep. We have some logic in there that won't bring up networking at all if there are no remote references.
+1 - it may be easy enough just to provide |
@kai-uwe-rommel is this node set up via DHCP (specifically, for the initrd and Ignition fetch), or does it have some custom networking configuration via guestinfo or kargs? |
This is via DHCP. |
As I tried to express above: this seems to be a timing issue. When I attach a serial console to debug, the problem vanishes. Does this ring a bell for any of you? |
Thank you @kai-uwe-rommel, it "solved" my problems. |
??? I am having a problem. In what way could this have solved your problem?? |
well - at the very least, using the serial console (as you suggested) would allow someone to work around the issue and get unblocked. So maybe that's what they meant. |
Yes, sorry, it did not solve the root issue, but at least I am able to get my cluster working again by attaching a serial console (as a workaround). |
Back to the topic - what next? |
Yep. It would still be nice to get a simple reproducer without OKD. |
I did manage to reproduce this outside of OKD on latest I'm having some hard time catching the real error. |
After a few tries, I did manage to capture the failure (cut for brevity, this was the 4th restart-on-failure of that unit): I think this is due to I believe this isn't vmware-specific, but may affect any setup with a mismatched console configuration. |
Ok. This sounds like this can/will be fixed in near future? |
Yes, I think this could be simply fixed by dropping the Sidenote: I won't be able to push this forward anytime soon, as I'm about to go offline for a few days. |
@bengal ack. In that case I think you can try to see whether |
I reproduced the issue by adding
Perhaps we should add a configuration option (or command line switch) so that NM can write only to stdout (and then the service will use |
Yes, that makes the most sense to me, then systemd handles everything (and presumably does so in a race-free way). |
How will I know when and with which release of FCOS the fix will be delivered? |
Seems like I hit this issue as well, thanks @kai-uwe-rommel for pointing here. |
The network-manager module also writes logs to the console, so that it's easier to debug network-related boot issues. If systemd can't open the console, the service fails and network doesn't get configured. Add a check to disable tty output when the console is not present or not usable. coreos/fedora-coreos-tracker#943
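To illustrate the idea behind that check (this is not the actual dracut change, which is not quoted in this thread), here is a minimal Python sketch. It assumes that "usable" means the console device can be opened for writing. The function names `console_usable` and `choose_log_targets` and the target labels `"journal"` and `"console"` are hypothetical, chosen only for this example.

```python
import os


def console_usable(path="/dev/console"):
    """Return True if `path` can be opened for writing, False otherwise.

    Assumption for this sketch: a console that cannot be opened
    (missing device, I/O error, no permission) counts as unusable.
    """
    try:
        # O_NOCTTY: do not let the device become our controlling terminal.
        fd = os.open(path, os.O_WRONLY | os.O_NOCTTY)
    except OSError:
        return False
    os.close(fd)
    return True


def choose_log_targets(path="/dev/console"):
    """Always log to the journal; add the console only when it is usable."""
    targets = ["journal"]  # hypothetical label for the always-available sink
    if console_usable(path):
        targets.append("console")  # hypothetical label for tty output
    return targets


if __name__ == "__main__":
    print(choose_log_targets())
```

The point of the sketch is only the ordering: probe the console first, and fall back to a sink that is always available instead of letting the service fail when the console cannot be opened.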
Any news about when a release is to be expected? |
waiting on @haraldh in dracutdevs/dracut#1611 (comment) |
Now that the dracut fix has been merged, what are the next steps? Backporting the fix in Fedora's dracut rpm package and fast-tracking the update in fcos? |
yep.. any chance you want to do the backport to the rpm? See https://src.fedoraproject.org/rpms/dracut/pull-request/14# for an example. |
I've successfully tested the fix (I also encountered the issue) 🎉 |
Thank you so much @olivierlemasle - you're amazing! |
In case anyone else wants to try out the fix here is a link to a rawhide stream build: https://dustymabe.fedorapeople.org/fedora-coreos-36.20211027.dev.0-vmware.x86_64.ova |
PR merged, builds and updates done for F34, F35 and F36. |
The fix for this went into next stream release |
The fix for this went into testing stream release |
I just did a cluster installation with 34.20211031.2.0 on vSphere UPI and the fix seems to work properly. All nodes started fine. |
The fix for this went into stable stream release |
The network-manager module also writes logs to the console, so that it's easier to debug network-related boot issues. If systemd can't open the console, the service fails and network doesn't get configured. Add a check to disable tty output when the console is not present or not usable. coreos/fedora-coreos-tracker#943 (cherry picked from commit f6e6be245d0cda14d90a0442b688c8dca1410a2e)
The network-manager module also writes logs to the console, so that it's easier to debug network-related boot issues. If systemd can't open the console, the service fails and network doesn't get configured. Add a check to disable tty output when the console is not present or not usable. coreos/fedora-coreos-tracker#943 (cherry picked from commit f6e6be2)
The network-manager module also writes logs to the console, so that it's easier to debug network-related boot issues. If systemd can't open the console, the service fails and network doesn't get configured. Add a check to disable tty output when the console is not present or not usable. coreos/fedora-coreos-tracker#943 (cherry picked from commit f6e6be2) bsc#1201975
Describe the bug
I regularly deploy OKD clusters with FCOS on vSphere.
The deployment process worked fine up to FCOS 34.2021-06-26, started failing with 2021-07-11, and still fails with 2021-07-25 and 2021-08-08.
The OKD cluster VMs start from the plain FCOS OVA and get an Ignition file passed via guestinfo.ignition from vSphere.
Starting with FCOS 34.2021-07-11, the VMs simply fail to start their network.
Reproduction steps
Steps to reproduce the behavior:
Expected behavior
As usual, the VM should initialize its network (with DHCP) and process the Ignition file, which includes merging an additional remote Ignition file.
Actual behavior
The VM does not start its network: it is not pingable and can neither resolve nor connect to the remote HTTP server from which the initial Ignition file (passed from vSphere via guestinfo.ignition) tries to merge a remote Ignition file.
System details
All on a vSphere 7.0u2 cluster.
FCOS versions: see above.
Ignition config
File attached here as text file.
Additional information
I tried to gather more details about why the boot process fails to start networking by connecting a virtual serial console to the VM, so that I could view and record the serial console output.
Unfortunately, when a serial console is connected to the VM, networking starts successfully.
So it looks like a timing issue, and the serial console slows the boot down enough to let it succeed.
initial-ignition.txt