[Bug]: Podman machine fails to start with exit status 255 on Mac #17403
Here is what I have tried so far without success:
Is recreating the machine the only way forward at this point? I guess there is no way to recover containers from the VM if the machine cannot be connected to? |
I see this error quite frequently too. Not sure if this is related, but I feel like I started seeing this error more often after starting to use Podman for dev containers, with named volumes that might have more IO operations than other containers, and additionally needing to add swap space to the VM. I also always seem to see this error after Podman has (inexplicably) crashed. I have no idea why Podman crashed; the only symptom is that my containers stop responding. Killing the stuck processes and restarting seems to be the only way to recover. |
@smakinen I think you are correct about this being a race condition and some bug internal to Podman. I'm able to get Podman to start a machine that was previously failing using the following shell script, which manually pauses the podman and gvproxy processes until QEMU has had time to boot:
#!/bin/bash
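# Kill any leftover podman/qemu processes from the previous failed start.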
pkill podman
pkill qemu
podman machine start --log-level debug &
PID=$!
sleep 2
pkill -STOP podman
pkill -STOP gvproxy
echo "^^^^"
echo "^^^^"
echo "^^^^"
echo "If the above says the VM already running or started"
echo "then edit the json file located at ~/.config/containers/podman/machine/qemu/"
echo "and change the line"
echo "\"Starting\": true"
echo "to be"
echo "\"Starting\": false"
echo ""
echo "dont forget to save, and rerun this script."
echo ""
echo "Else, continue with instructions below"
echo ""
echo "Qemu will open in another window (likely in the background)"
echo "wait until you see a login prompt on that window"
read -p "then return to THIS terminal and hit enter"
pkill -CONT podman
pkill -CONT gvproxy
Unfortunately, this is only a band-aid. We need something that addresses whatever the underlying root cause is. |
It happened on an Intel MacBook (MacBookPro16,1) too, not just the M1. @jamesmikesell's script works for me, as a workaround. |
I just started having this problem after restarting my M1 Mac. Running with --log-level debug shows that it dies after trying to ssh into the VM. For the record, I am using:
|
Removing the vm and recreating it fixed it for me. |
A brilliant idea and move @jamesmikesell! I'm happy to say that with your script, I can see my containers again :). So halting the podman and gvproxy processes until QEMU has time to finish initialization works. I had actually tried earlier to renice the QEMU and Podman processes to make the QEMU process run faster, but I could not figure out how to halt the other processes altogether. Thanks a lot. To @berndlosert: also good to hear that recreating the VM works, but I did not want to lose my VM since I had a bunch of containers in it that I use for development. So I was not keen to remove the existing VM (and the problem could also reappear after a while). For Podman and the QEMU machine initialization code, I think there should be a function such as isSSHRunning that checks the SSH status before SSH is used (in the 'Waiting for VM...' phase). It's a good question what the best way is to check the availability of SSH in the guest OS if the machine state and the listening state of gvproxy cannot be relied on. |
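As a rough illustration of what such an isSSHRunning check could look like, and not Podman's actual implementation: one could wait for the SSH banner on the forwarded port before issuing any SSH commands. The host, port, and timeout below are hypothetical placeholders; on a real setup the port comes from podman machine inspect.
#!/bin/bash
# Hypothetical sketch: wait until the guest's sshd answers on the gvproxy-forwarded port.
HOST=127.0.0.1
PORT=50810   # example value from this thread; check `podman machine inspect` for yours
for i in $(seq 1 30); do
  # A reachable OpenSSH server sends its banner (e.g. "SSH-2.0-OpenSSH_8.8") on connect,
  # while gvproxy alone accepts the connection but stays silent.
  if nc -w 1 "$HOST" "$PORT" </dev/null | grep -q '^SSH-'; then
    echo "SSH is up"
    exit 0
  fi
  sleep 1
done
echo "Timed out waiting for SSH" >&2
exit 1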
I observed the same: the podman VM starts successfully only once every 4 times. See the script I use: podmac.sh start. Example:
$ ./podmac.sh start intel_64
Starting machine "intel_64"
Waiting for VM ...
Pausing podman
Waiting for SSH
SSH-2.0-OpenSSH_8.8
Resuming podman
...
Machine "intel_64" started successfully |
I reproduce it from time to time as well with the latest version
|
Hitting the same issue with the following:
The workarounds mentioned here (removing and recreating the VM, and the script) did not resolve the issue. I'm running into the same failure location as what berndlosert is seeing. |
I just ran into this issue as well with podman 4.4.4 on a 2020 MBP. |
For me, James' script has worked well so far. Waiting until QEMU is ready for SSH logins helps. I'm certain the waiting approach could be forged into a PR for Podman, but perhaps it won't work in all cases. I'm a bit curious: what do you @itsthejoker and others see in the QEMU window when you run |
This error 255 seems to be a catch-all, so users may have various issues under the hood. It is also a tricky one. In my case, I spotted the issue using:
The SSH connection failed for some reason.
shows where the ssh key is located. I then deleted the keys and the machine, recreated the machine, and everything was back on track:
My 2 cents if you have this error 255 issue: first try to get things working with a default podman machine (i.e. no custom memory, CPU, volumes, etc.) and then see if the fancy options keep working once the core problem is solved. I would also advise not using podman-desktop (for the duration of the test) as it gives a false sense of success. |
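If you want to check the SSH path by hand before deleting anything, here is a small sketch. The key path, port, and user below are typical examples only; verify them against the output of podman machine inspect on your own setup:
# Show the machine's SSH identity path and port.
podman machine inspect podman-machine-default
# Then try the connection manually with the values reported above
# (the key path, port, and user here are placeholders).
ssh -i ~/.ssh/podman-machine-default -p 50810 core@127.0.0.1 'echo ssh ok'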
I kept on having issues, and I am seeing some success thanks to this comment in this issue. I think the title of this issue could be edited and the M1 part removed; I don't think the issues discussed here are M1 specific. I run into the same on an Intel Mac. The issue is, however, likely related to the fact that users here use a Mac. The podman troubleshooting guide mentions some extra cleanup, so the following may help some:
then your regular:
AFAIK, the fix for me was to add the following:
to my ssh config. I was consistently getting this error before. With the ssh config fix above and yet another cleanup, I was finally able to get things back in order. |
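The exact ssh config snippet was not captured above. Purely as an assumption, and in line with the SSH-agent comments below, one common fix of this kind is to stop the client from offering every agent key to the machine:
# Hypothetical ~/.ssh/config entry; the original poster's snippet was not preserved in this thread.
Host 127.0.0.1 localhost
    IdentitiesOnly yes
    IdentityFile ~/.ssh/podman-machine-default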
For me, the issue was that my SSH agent already had a lot of keys loaded. My fix:
ssh-add -D # clear the SSH keys from my SSH agent
podman machine stop
podman machine start |
Same here, I just ran into this issue. I temporarily unset SSH_AUTH_SOCK. @samuel-phan I didn't know about |
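For reference, a minimal sketch of doing that for a single start (machine name assumed to be the default):
# Start the machine with no SSH agent visible in the environment.
env -u SSH_AUTH_SOCK podman machine start podman-machine-default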
Had the same issue and the start script fixed it. I also had to change "Starting" to false in the JSON file (as @jamesmikesell mentioned). Without that, any attempt to start a machine was returning "VM already running or starting". |
I have updated the script here: https://github.com/laurent-martin/podman_x86_64_on_apple_aach64
|
@ashley-cui @baude PTAL |
This happens to me sometimes too, even though the machine starts anyway. I don't know if this is related.
|
Yes, the VM may have started, but the mount failed. The test is rather (/Users represents the mounted volume in the VM):
$ podman run -it -v /Users:/Users debian
root@a1f31cce35f5:/# ls /Users
It should show the contents of /Users on macOS... when the mount did not fail. |
I am aware that the mount didn't fail, unlike with the OP. I was just reporting the error message in case it might be related to this issue, since it is part of the same output. |
I've come up with a slightly improved version of my script (read: hack) above, which waits for the VM to finish booting before letting podman mount directories. The above script ran podman/qemu in debug mode, which slowed down the performance of the running containers. This improved version runs podman/qemu in normal mode and relies on the user waiting for the CPU utilization of the qemu process to settle before continuing.
#!/bin/bash
pkill podman
pkill qemu
podman machine start &
PID=$!
sleep 2
pkill -STOP podman
pkill -STOP gvproxy
echo "^^^^"
echo "^^^^"
echo "^^^^"
echo "If the above says the VM already running or started"
echo "then edit the json file located at ~/.config/containers/podman/machine/qemu/"
echo "and change the line"
echo "\"Starting\": true"
echo "to be"
echo "\"Starting\": false"
echo ""
echo "don't forget to save, and rerun this script."
echo ""
echo "Else, continue with instructions below"
echo ""
echo "Wait until the displayed CPU utilization lowers and stabilizes to 1% or less"
echo "Then hit enter"
PID_QEMU=$(pgrep qemu)
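# Show qemu's CPU utilization once per second; `read -s -t 1` succeeds only when
# the user presses Enter, which ends this wait loop.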
while true; do
CPU=$(ps -p $PID_QEMU -o %cpu | awk 'NR>1 {print $1}')
printf "\rCPU utilization: %s%% " $CPU
read -s -t 1
if [ $? -eq 0 ]; then
break
fi
done
pkill -CONT podman
pkill -CONT gvproxy |
@jamesmikesell thanks for your script. I have the same problem and using it reliably works. |
During the exponential backoff waiting for the machine to be fully up and running, also make sure that SSH is ready. The systemd dependencies of the ready.service include the sshd.service among others but that is not enough. Other CoreOS users reported the same issue on IRC, so I feel fairly confident to use the pragmatic approach of making sure SSH works on the client side. containers#17403 is quite old and there are other pressing machine issues that need attention. [NO NEW TESTS NEEDED] Fixes: containers#17403 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Make sure that starting a qemu machine uses proper exponential backoffs and that a single variable isn't shared across multiple backoffs. DO NOT BACKPORT: I want to avoid backporting this PR to the upcoming 4.6 release as it increases the flakiness of machine start (see containers#17403). On my M2 machine, the flake rate seems to have increased with this change and I strongly suspect that additional/redundant sleep after waiting for the machine to be running and listening reduced the flakiness. My hope is to have more predictable behavior and find the sources of the flakes soon. [NO NEW TESTS NEEDED] - still too flaky to add a test to CI. Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
I am still experiencing this issue - on
|
The bug is not fixed in 4.5.1, but will be part of the 4.6.0 release (see changelog) |
Yes, that's it. The fixes will ship with the upcoming 4.6 release. The plan is to release 4.6 by the end of this week, so the fix will reach you soon. |
Awesome, thank you!! |
Can someone confirm v4.6 solves the issue? In my case I upgraded 1 week ago and had the issue multiple times despite a full reboot of my macOS... #17403 (comment) is still my only viable solution 😢 |
@sneko did you recreate your podman machine? The Ignition script has been changed, so you'll need to recreate the machine. |
I seem to have stopped experiencing the issue after I upgraded, even without recreating the machine |
Thanks for the changes to the Ignition script and for shipping the fix. For a previously created Podman machine that was failing in machine startup, I still got the exit status 255 error with Podman 4.6.0. As Florent mentioned, taking advantage of the changed Ignition script won't happen without recreating the machine.
I think the situation is fairly good now. If all future machines start ok and at least some of the currently failing legacy machines can be started with the delayed startup scripts found here, Podman should be good to go in most scenarios 😊. Thank you all and have a nice autumn (at least in the northern hemisphere) everyone 🍁. |
Starting an x86 VM on an M1 MacBook works now with 4.6.0:
% podman --version
podman version 4.6.0
% podman machine start intel_64
Starting machine "intel_64"
Waiting for VM ...
Mounting volume... /Users/laurent:/Users/laurent
This machine is currently configured in rootless mode. If your containers
require root permissions (e.g. ports < 1024), or if you run into compatibility
issues with non-podman clients, you can switch using the following command:
podman machine set --rootful intel_64
API forwarding listening on: /var/run/docker.sock
Docker API clients default to this address. You do not need to set DOCKER_HOST.
Machine "intel_64" started successfully |
I am still facing the same issue on macOS Ventura.
bash-3.2$ podman machine start podman-machine-default --log-level debug |
@shrishs thanks for reporting. I don't think it's the same issue but a different one. Feel free to create a new issue on GitHub. |
Issue Description
There seems to be an issue when trying to start Podman with
podman machine start
when using macOS and QEMU. I created a Podman machine about two months ago, but now the machine fails to start. Somehow, starting the machine got gradually worse over time before failing completely at startup. Here is what happens when I try to start the machine.
From the
podman machine start --log-level debug
output, I can see that the last statement executed is the SSH command related to creating the mount point directories (maybe line 669 in qemu/machine.go), which fails. Previous issues have considered this to be a sign of e.g. an invalid SSH configuration (e.g. #14237), but maybe there is something more to it. When running with --log-level debug
and the QEMU window open, the exit status 255 error shows up before all the Fedora services have started. Could this be a race condition, i.e. the SSH-related services have not yet started when the SSH mount commands are executed? I found one closed and apparently fixed issue where a race condition was suggested (#11532).
I tried connecting to the QEMU QMP monitor socket with
nc -U qmp_podman-machine-default.sock
and running the following queries just after QEMU has started and before all the services are running. So the VM is in a running state, and the gvproxy port (50810 here) is in a listening state early on.
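The exact queries are not reproduced above. As a sketch, the standard QMP exchange for this information looks like the following, run from the directory containing the socket; QMP requires a capabilities handshake before other commands are accepted:
# Query the VM state over the QMP socket; keep stdin open briefly so nc can print the replies.
( printf '{ "execute": "qmp_capabilities" }\n{ "execute": "query-status" }\n'; sleep 1 ) | nc -U qmp_podman-machine-default.sock
# The query-status reply should contain something like: {"return": {"running": true, "status": "running"}}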
Is it possible that the condition in qemu/machine.go (line 645)
for state != machine.Running || !listening {
cannot hold back the execution of the SSH statement until the machine is fully initialized?
Steps to reproduce the issue
Steps to reproduce the issue (happens on an existing Podman machine)
podman machine start
Describe the results you received
Describe the results you expected
The Podman machine should be able to start. Thinking more broadly, maybe there should be additional guarantees that SSH on the QEMU machine is up and running before issuing commands to the machine. Perhaps the SSH connection should be polled a couple of times with a sensible timeout, or other events from QEMU could be used for the purpose? For instance, the QMP monitor emits a NIC_RX_FILTER_CHANGED event towards the end of the initialization (not sure of its purpose, though).
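A rough sketch of such client-side polling (not Podman's implementation; the machine name, retry count, and delay are arbitrary, and it assumes podman already considers the machine running):
#!/bin/bash
MACHINE=podman-machine-default
for attempt in $(seq 1 10); do
  # `podman machine ssh <name> <command>` runs a command in the guest over SSH.
  if podman machine ssh "$MACHINE" true 2>/dev/null; then
    echo "SSH became ready after $attempt attempt(s)"
    exit 0
  fi
  sleep 3
done
echo "SSH never became ready" >&2
exit 1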
podman info output
The machine (or gvproxy) is not up and running, so I cannot get the info.
Some supplemental environment details to podman info.
Podman in a container
No
Privileged Or Rootless
Rootless
Upstream Latest Release
Yes
Additional environment details
Here are the results of podman machine inspect.
Additional information
It appears that the problem affects Podman machines that have been in use for some time (i.e. a few months). Why fresh Podman machines work straight out of the box and what causes the slow decay remain a mystery.