REST API is failing with errors when listing containers after being in an inconsistent state #15526
Comments
Here is a small script to reproduce the issue on macOS. It creates 10 pods; at the end, my podman machine is no longer working.
Here is the script:
#!/bin/bash
for i in {1..10}
do
    podman run --name "mariadb${i}" --pod "new:apps-${i}" -e MYSQL_RANDOM_ROOT_PASSWORD=yes -d docker.io/library/mariadb:10
done
# try to remove all containers
all_containers=$(podman ps -a -q)
for containerId in $all_containers
do
    podman rm "${containerId}"
done
# remove all pods
all_pods=$(podman pod ls -q)
for podId in $all_pods
do
    podman pod rm "${podId}"
done
# now, list all containers calling REST API
echo "Call REST API"
curl --unix-socket "$HOME/.local/share/containers/podman/machine/podman-machine-default/podman.sock" "http:/v1.41/containers/json?all=true"
At the end it should display: |
We've handled this race condition already in CLI |
I can also reproduce with one pod using instructions in a shell sequentially.
Now, everything is broken |
Removing good first issue and self-assigning, that seems very serious. |
Looks like #15367 |
Probably unrelated @edsantiago - no pods involved there. Remote |
It's removing the infra container despite dependencies on it being present. Serious bug, possibly present in non-remote Podman. |
Alright, identified the cause. It's 384c235. Container removal is unordered, and the normal checks that keep dependency containers and the infra container from being removed until the pod itself is removed are not enforced while we are attempting to remove the pod. The solution here is probably not fun: pod removal will need to be restructured to work in a graph-traversal fashion. |
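To make the missing check concrete, here is a minimal sketch in Python (not Podman's actual Go code) of a removal routine that refuses to delete a container while something still depends on it. The names `dependents_of`, `remove_container`, and the `pod` dictionary are hypothetical and exist only for illustration.

```python
def dependents_of(target, depends_on):
    """Return the containers that still depend on `target`."""
    return sorted(name for name, deps in depends_on.items() if target in deps)


def remove_container(target, depends_on):
    """Remove `target` from the (hypothetical) state, refusing if it is still needed."""
    blockers = dependents_of(target, depends_on)
    if blockers:
        raise RuntimeError(f"cannot remove {target}: required by {', '.join(blockers)}")
    del depends_on[target]


# Hypothetical pod: one infra container and one workload that depends on it.
pod = {"infra": set(), "mariadb": {"infra"}}

try:
    remove_container("infra", pod)   # wrong order: infra still has a dependent
except RuntimeError as err:
    print(err)

remove_container("mariadb", pod)     # dependents first...
remove_container("infra", pod)       # ...then the infra container
print(pod)
```

An unordered loop that skips this kind of check can delete the infra container first, which matches the inconsistent state described in this issue.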
my only workaround is to call |
It is possible that a |
#15757 should fix this, but testing would be appreciated. |
Originally, during pod removal, we locked every container in the pod at once, did a number of validity checks to ensure everything was safe, and then removed all the containers in the pod.

A deadlock was recently discovered with this approach. In brief, we cannot lock the entire pod (or much more than a single container at a time) without causing a deadlock. As such, we converted to an approach where we just looped over each container in the pod, removing them individually. Unfortunately, this removed a lot of the validity checking of the earlier approach, allowing for a lot of unintended bad things. Infra containers could be removed while containers in the pod still depended on them, for example.

There's no easy way to do validity checks while in a simple loop, so I implemented a version of our graph-traversal logic that currently handles pod start. This version acts in the reverse order of startup: startup starts from containers which depend on nothing and moves outwards, while removal acts on containers which have nothing depend on them and moves inwards. By doing graph traversal, we can guarantee that nothing is removed while something that depends on it still exists - so the infra container should be the last thing in a pod that is removed, for example.

In the (unlikely) case that a graph of the pod's containers cannot be built (most likely impossible without database editing) the old method of pod removal has been retained to ensure that even misbehaving pods can be forcibly evicted from the state.

I'm fairly confident that this resolves the problem, but there are a lot of assumptions around dependency structure built into the original pod removal code and I am not 100% sure I have captured all of them.

Fixes containers#15526

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
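The removal ordering described in the commit message can be illustrated with a short Python sketch. This is not the code from the pull request (Podman is written in Go); `removal_order`, the `pod` dictionary, and the `sidecar` container are hypothetical names, and the sketch only assumes that each container records which other containers it depends on.

```python
def removal_order(depends_on):
    """Return container names in an order that is safe for removal.

    `depends_on` maps each container to the set of containers it depends on.
    A container is emitted only after everything that depends on it has been
    emitted, so the infra container comes out last.
    """
    # Count, for every container, how many containers still depend on it.
    dependents = {name: 0 for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            if dep not in dependents:
                raise ValueError(f"{name} depends on unknown container {dep}")
            dependents[dep] += 1

    # Start from containers that nothing depends on and move inwards.
    ready = sorted(name for name, count in dependents.items() if count == 0)
    order = []
    while ready:
        name = ready.pop(0)
        order.append(name)
        for dep in sorted(depends_on[name]):
            dependents[dep] -= 1
            if dependents[dep] == 0:
                ready.append(dep)

    # If some containers were never released, the graph could not be built.
    if len(order) != len(depends_on):
        raise ValueError("dependency cycle: cannot build a removal graph")
    return order


# Hypothetical pod: both workloads depend on infra, sidecar also on mariadb.
pod = {
    "infra": set(),
    "mariadb": {"infra"},
    "sidecar": {"infra", "mariadb"},
}
print(removal_order(pod))  # ['sidecar', 'mariadb', 'infra']
```

The `ValueError` branch corresponds to the case where no graph can be built; in the commit that case falls back to the old removal method, which this sketch does not attempt to reproduce.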
Is this fully solved?
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
After starting, stopping, and deleting containers, I am now in an inconsistent state.
When listing containers, I get the error:
error getting container from store
Steps to reproduce the issue:
I don't know how to reproduce it reliably, but it happened just by doing start, stop, and delete operations on containers and pods.
Note: I'm using a UI that sends multiple events at the same time, so start/stop/delete actions are occurring concurrently.
Describe the results you received:
Error
Describe the results you expected:
No error
Additional information you deem important (e.g. issue happens only occasionally):
While the REST API is not working (it throws an error), podman container ps -a still works, and if I try to inspect the infra container, I get:
Output of podman version:
Output of podman info:
Package info (e.g. output of rpm -q podman or apt list podman):
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)
Yes/No
Additional environment details (AWS, VirtualBox, physical, etc.):