Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

podman rm: fails, and leaves everything hosed #15367

Closed
edsantiago opened this issue Aug 17, 2022 · 42 comments · Fixed by #17126
Closed

podman rm: fails, and leaves everything hosed #15367

edsantiago opened this issue Aug 17, 2022 · 42 comments · Fixed by #17126
Labels
flakes Flakes from Continuous Integration kube locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. remote Problem is in podman-remote

Comments

@edsantiago
Copy link
Member

Another hard-to-isolate flake. I'm not convinced that it's truly podman-remote-only, because I've actually seen this one on my laptop. Two symptoms (possibly unrelated):

# podman-remote --url unix:/tmp/podman_tmp_loGI rm -t 0 -f cpcontainer
Error: cannot remove container 5d183a4ab677c5b65a78ebf87720153aee486145b24631f08edb7e20671498b7 as it could not be stopped: error sending SIGKILL to container 5d183a4ab677c5b65a78ebf87720153aee486145b24631f08edb7e20671498b7: %!w(<nil>)
[ rc=125 (** EXPECTED 0 **) ]

and

# podman-remote --url unix:/tmp/podman_tmp_fYnD pod rm -f -t 0 -a
Error: error freeing lock for container 38ecd608db0c5616563d5b268e85fea2fbada7a9b7d57eabf705c4f8f92aca92: no such file or directory
[ rc=125 (** EXPECTED 0 **) ]

With the second one, once it happens, the entire system becomes completely unusable.

Here's an example of the cannot-remove one

Here's an example of the error-freeing one

@edsantiago edsantiago added flakes Flakes from Continuous Integration remote Problem is in podman-remote labels Aug 17, 2022
@github-actions github-actions bot removed the remote Problem is in podman-remote label Aug 17, 2022
@edsantiago
Copy link
Member Author

Here's a cannot-remove one, int remote ubuntu root on August 2

Aha, and here it is int podman f36 root, confirming that it's not a remote-only bug.

@edsantiago edsantiago changed the title remote: podman rm: fails, and leaves everything hosed podman rm: fails, and leaves everything hosed Aug 17, 2022
@edsantiago
Copy link
Member Author

Here's the error freeing one, sys remote f36 rootless, possibly connected to #11594

@edsantiago
Copy link
Member Author

The error freeing one, int podman f35 root, May 27.

@rhatdan
Copy link
Member

rhatdan commented Aug 17, 2022

In runtime_ctr I see:

	// For pod removal, the container is already locked by the caller
	if !removePod {
		c.lock.Lock()
		defer c.lock.Unlock()
	}
...
	// Deallocate the container's lock
	if err := c.lock.Free(); err != nil {
		if cleanupErr == nil {
			cleanupErr = fmt.Errorf("error freeing lock for container %s: %w", c.ID(), err)
		} else {
			logrus.Errorf("Free container lock: %v", err)
		}
	}

Could this be a situation where the c.lock is never Locked?

@edsantiago
Copy link
Member Author

Here's one with possibly useful context. sys remote f36 rootless:

$ podman-remote --url unix:/tmp/podman_tmp_r7Ne kube down /tmp/podman_bats.DD7B1h/test.yaml
Error: json: cannot unmarshal string into Go struct field PodRmReport.RmReport.Err of type error

Then, immediately after that, kablooie: error freeing on all subsequent tests.

@edsantiago
Copy link
Member Author

Interesting. This one sys remote ubuntu-2204 root goes into the error freeing death spiral, but the trigger seems to be the unlinkat/EBUSY flake (#11594):

# podman-remote --url unix:/tmp/podman_tmp_9X4G pod rm yky40edmyy
Error: error removing container c460399c59e569bd25127913cecc01c7de058a07c0aacde4bc9fa2ae2f2c3128 root filesystem: 1 error occurred:
	* unlinkat /var/lib/containers/storage/overlay/a6b8a4bc3e3ab118ed01e0b3e8d16f7b41a7b50176f6f9eb713c5baf9bbf47e0/merged: device or resource busy
[ rc=125 (** EXPECTED 0 **) ]
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#| FAIL: exit code is 125; expected 0
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# podman-remote --url unix:/tmp/podman_tmp_9X4G pod rm -f -t 0 -a
Error: error freeing lock for container c460399c59e569bd25127913cecc01c7de058a07c0aacde4bc9fa2ae2f2c3128: no such file or directory

@apinter
Copy link

apinter commented Aug 24, 2022

Not sure if at all related, but got a similar - likely storage related - issue as well on openSUSE MicroOS:

adathor@vegas:~> podman pod --log-level debug rm -i -f hedgedoc 
INFO[0000] podman filtering at log level debug          
DEBU[0000] Called rm.PersistentPreRunE(podman pod --log-level debug rm -i -f hedgedoc) 
DEBU[0000] Merged system config "/usr/share/containers/containers.conf" 
DEBU[0000] Using conmon: "/usr/bin/conmon"              
DEBU[0000] Initializing boltdb state at /home/podman_vol/home-vol/podman/containers/storage/libpod/bolt_state.db 
DEBU[0000] systemd-logind: Unknown object '/'.          
DEBU[0000] Using graph driver btrfs                     
DEBU[0000] Using graph root /home/podman_vol/home-vol/podman/containers/storage 
DEBU[0000] Using run root /run/user/1000/containers     
DEBU[0000] Using static dir /home/podman_vol/home-vol/podman/containers/storage/libpod 
DEBU[0000] Using tmp dir /run/user/1000/libpod/tmp      
DEBU[0000] Using volume path /home/podman_vol/home-vol/podman/containers/storage/volumes 
DEBU[0000] Set libpod namespace to ""                   
DEBU[0000] [graphdriver] trying provided driver "btrfs" 
DEBU[0000] Initializing event backend journald          
DEBU[0000] Configured OCI runtime crun initialization failed: no valid executable found for OCI runtime crun: invalid argument 
DEBU[0000] Configured OCI runtime runj initialization failed: no valid executable found for OCI runtime runj: invalid argument 
DEBU[0000] Configured OCI runtime kata initialization failed: no valid executable found for OCI runtime kata: invalid argument 
DEBU[0000] Configured OCI runtime runsc initialization failed: no valid executable found for OCI runtime runsc: invalid argument 
DEBU[0000] Configured OCI runtime krun initialization failed: no valid executable found for OCI runtime krun: invalid argument 
DEBU[0000] Using OCI runtime "/usr/bin/runc"            
INFO[0000] Setting parallel job count to 13             
DEBU[0000] Removing container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb 
DEBU[0000] Cleaning up container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb 
DEBU[0000] Network is already cleaned up, skipping...   
DEBU[0000] Container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb storage is already unmounted, skipping... 
DEBU[0000] Removing all exec sessions for container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb 
DEBU[0000] Container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb storage is already unmounted, skipping... 
INFO[0000] Storage for container 052bcd6e2d129159e7f8871310c06cc485a124bf52699db8ee41cb26e3c3e0fb already removed 
DEBU[0000] Removing container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b 
DEBU[0000] Cleaning up container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b 
DEBU[0000] Failed to reset unit file: "Unit 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b.service not loaded." 
DEBU[0000] Container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b storage is already unmounted, skipping... 
DEBU[0000] Removing all exec sessions for container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b 
DEBU[0000] Container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b storage is already unmounted, skipping... 
INFO[0000] Storage for container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b already removed 
ERRO[0000] Removing container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b from pod ad97dd32896ba8af8f4a76144f9a6f6c26cab1b352ef47a11e91af48cbbdf2d4: error freeing lock for container 200ddd92e43b4a2090682ab7534ec126fec80000dfe2f87041ed5236f3a8459b: no such file or directory 
adathor@vegas:~> podman info
host:
  arch: amd64
  buildahVersion: 1.27.0
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.2-1.1.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.2, commit: unknown'
  cpuUtilization:
    idlePercent: 93.86
    systemPercent: 2.51
    userPercent: 3.64
  cpus: 4
  distribution:
    distribution: '"opensuse-microos"'
    version: "20220822"
  eventLogger: journald
  hostname: vegas
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.19.2-1-default
  linkmode: dynamic
  logDriver: journald
  memFree: 16723730432
  memTotal: 20513906688
  networkBackend: cni
  ociRuntime:
    name: runc
    package: runc-1.1.3-2.1.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.3
      commit: v1.1.3-0-ga916309fff0f
      spec: 1.0.2-dev
      go: go1.18.3
      libseccomp: 2.5.4
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.11-1.6.x86_64
    version: |-
      slirp4netns version 1.1.11
      commit: unknown
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.4
  swapFree: 0
  swapTotal: 0
  uptime: 0h 13m 0.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.opensuse.org
  - registry.suse.com
  - docker.io
store:
  configFile: /home/adathor/.config/containers/storage.conf
  containerStore:
    number: 10
    paused: 0
    running: 7
    stopped: 3
  graphDriverName: btrfs
  graphOptions: {}
  graphRoot: /home/podman_vol/home-vol/podman/containers/storage
  graphRootAllocated: 10000830279680
  graphRootUsed: 247550713856
  graphStatus:
    Build Version: Btrfs v5.18.1
    Library Version: "102"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 28
  runRoot: /run/user/1000/containers
  volumePath: /home/podman_vol/home-vol/podman/containers/storage/volumes
version:
  APIVersion: 4.2.0
  Built: 1660176000
  BuiltTime: Thu Aug 11 07:00:00 2022
  GitCommit: ""
  GoVersion: go1.16.15
  Os: linux
  OsArch: linux/amd64

@edsantiago
Copy link
Member Author

Here's another unlinkat (#11594) followed by systemsnafu: sys remote ubuntu 2204 root.

@containers/podman-maintainers since this is now in the wild, I'd say it's higher priority.

@edsantiago
Copy link
Member Author

f36 aarch64 root

@edsantiago
Copy link
Member Author

No podman rm that I can see, but here we go again, remote f36 root

@edsantiago
Copy link
Member Author

remote f36-aarch64 root

@edsantiago
Copy link
Member Author

remote f36 root

@edsantiago
Copy link
Member Author

remote f36-aarch64 root

@edsantiago
Copy link
Member Author

Here's another one with the unmarshal error happening first, then everything going bonkers. remote ubuntu root

# podman-remote --url unix:/tmp/podman_tmp_wtPS inspect --format {{.HostConfig.LogConfig.Type}} test_pod-test
journald
# podman-remote --url unix:/tmp/podman_tmp_wtPS kube down /tmp/podman_bats.LQsRLR/test.yaml
Error: json: cannot unmarshal string into Go struct field PodRmReport.RmReport.Err of type error
...
# podman-remote --url unix:/tmp/podman_tmp_wtPS pod rm -t 0 -f -a
Error: removing container d4d4bc7ee1505bd7738922c686fd35788c67e4c7a936e62044287dde0abaf6a1 root filesystem: 1 error occurred:
	* unlinkat /var/lib/containers/storage/overlay/c01ffe8ae42779fe4737be67fb350cabbe5f0c0a309ffbcd318fd584ff7af701/merged: device or resource busy

@edsantiago
Copy link
Member Author

Interesting variation on remote f36-aarch64 root: "cannot remove container" but subsequent tests actually pass:

# podman-remote --url unix:/tmp/podman_tmp_IPZS kill cpcontainer
cpcontainer
# podman-remote --url unix:/tmp/podman_tmp_IPZS rm -t 0 -f cpcontainer
Error: cannot remove container 380fcd0fd9264c2daf9157c7dfe07a8f7e31a84bd1c47e14a3ad01e76d4a7682 as it could not be stopped: sending SIGKILL to container 380fcd0fd9264c2daf9157c7dfe07a8f7e31a84bd1c47e14a3ad01e76d4a7682: container state improper
[ rc=2 (** EXPECTED 0 **) ]

@edsantiago
Copy link
Member Author

Another one like yesterday's, again on remote f36-aarch64 root, again just one rm failure without the cascade of death:

# podman-remote --url unix:/tmp/podman_tmp_BY5k rm -t 0 -f cpcontainer
Error: cannot remove container X as it could not be stopped: sending SIGKILL to container X: container state improper

@edsantiago
Copy link
Member Author

Here's the unlinkat/ebusy one on f36 remote rootless

OK, f2f over, we need serious attention on this one.

@edsantiago
Copy link
Member Author

f36 remote rootless again

@edsantiago edsantiago added the remote Problem is in podman-remote label Oct 3, 2022
@edsantiago
Copy link
Member Author

f36 remote root

Re-adding remote tag which was "helpfully" removed by bot.

@rhatdan rhatdan added the kube label Oct 12, 2022
@edsantiago
Copy link
Member Author

Another one starting with podman images, not podman rm (AFAICT). In remote f36 aarch64. The title of this issue seems more and more misleading. But there's a podman rm just before the podman images, so I'm not changing the title yet.

And here's another maybe-related-maybe-not, it happens in kube down, in almost the very last test, so I can't quickly tell if the system gets hosed or not. podman ubuntu-2204 rootless. This could be the unlinkat bug instead. They may be related. I really have no idea.

@vrothberg
Copy link
Member

vrothberg commented Oct 13, 2022

The symptoms look similar to #11594, don't they?

@edsantiago
Copy link
Member Author

Sometimes, but not enough for me to combine the two issues. #11594 is a one-shot: it happens, and life goes on. This issue is a catastrophe: once it happens, podman never works again.

@edsantiago
Copy link
Member Author

Potentially interesting correlation: #16154 (unmarshal blah blah, podman-remote only) seems to be highly correlated with this one. My gut feeling is that "unmarshal" is its own bug; that what's happening is that THIS issue (the forever-hose) is triggering, and sending an error that podman-remote cannot deal with. So the underlying error is actually this one. But please take this with a grain of salt.

@edsantiago
Copy link
Member Author

@edsantiago
Copy link
Member Author

Hard to know if this one (ubuntu root) is the hosed flake or the unlinkat/EBUSY one, but I'm going to add it here.

@edsantiago
Copy link
Member Author

Another case where the first failure presents with "blah blah is mounted", then cascades from there into disaster:

not ok 361 podman run with slirp4ns adds correct dns address to resolv.conf
...
# podman-remote run --rm  \
      --network slirp4netns:cidr=172.24.205.0/24 \
      quay.io/libpod/testimage:20221018 \
      grep 172.24.205 /etc/resolv.conf
nameserver 172.24.205.3
time="2022-10-25T13:47:23Z" level=error msg="Removing container SHA: 
         container \"SHA\" is mounted and cannot be removed without using force: container state improper"
...
[all subsequent tests fail]

ubuntu 2204 root remote

Can we please, please get some attention on this? It's a bad one, flaking at least once per day.

@edsantiago
Copy link
Member Author

@vrothberg
Copy link
Member

@containers/podman-maintainers PTAL

@rhatdan
Copy link
Member

rhatdan commented Nov 14, 2022

The issue is the timesMounted, err := r.store.Mounted(ctr.ID) call below could be getting screwed up or have a race condition.

	if !force {
		timesMounted, err := r.store.Mounted(ctr.ID)
		if err != nil {
			if errors.Is(err, storage.ErrContainerUnknown) {
				// Container was removed from under us.
				// It's gone, so don't bother erroring.
				logrus.Infof("Storage for container %s already removed", ctr.ID)
				return nil
			}
			logrus.Warnf("Checking if container %q is mounted, attempting to delete: %v", idOrName, err)
		}
		if timesMounted > 0 {
			return fmt.Errorf("container %q is mounted and cannot be removed without using force: %w", idOrName, define.ErrCtrStateInvalid)
		}

It is possible to loose an unmount on a killed process. This couner is attempting to keep track of the number of

podman mount XYZ
podman mount XYZ
podman run XYZ ...

timesMounted would show 3. If for some reason we had a race where someone else was executing a mount while executing this code, it could cause issues. Or if a podman container cleanup failed and the number of mounts was not decremented.

@edsantiago
Copy link
Member Author

Now seen in f37 gating tests, too

@Johnpc123
Copy link

...also can't remove a pod after, being lazy, and attempting rm -f on a running pod.

"Error: error freeing lock for container 95de07970b6cd967f1587294ffe05db767cd2f20bfd061633b010bb3654e2d73: no such file or directory"
How can I help?

@maisatanel
Copy link

@Johnpc123 to help you out, i was able to remove the pod that i borked like you did by removing it through the cockpit-podman web interface. it threw that error at me but was able to remove it completely nonetheless, while podman pod rm gitea -f couldn't

@vrothberg
Copy link
Member

@mtrmac I think your recent c/storage fixes may have resolved this issue. It seems to happen only with remote which somehow made me think about your in-process fixes.

WDYT?

@mtrmac
Copy link
Collaborator

mtrmac commented Dec 16, 2022

This is quite long, and might well mix different root causes:

  • error sending SIGKILL to container 5d183a4ab677c5b65a78ebf87720153aee486145b24631f08edb7e20671498b7: %!w(<nil>):
    • the nil part was fixed in 6bf8670
    • The error itself seems unlikely to be directly c/storage-related ; c/storage does not record PIDs or process state.
  • freeing lock for container: These are libpod/lock.Manager locks
  • unlinkat /var/lib/containers/storage/overlay/…/merged: device or resource busy: something is missing an unmount, or otherwise mismanaging recorded state vs. running processes??
  • podman-remote images … retrieving parent of image … unlinkat /var/lib/containers/storage/overlay/SHA2/merged: device or resource busy: Why is obtaining a parent (pure metadata) doing any overlay operations at all?
  • “The issue is the timesMounted, err := r.store.Mounted(ctr.ID) call below could be getting screwed up”: That API is fundamentally racy; as far as c/storage cares, the caller is not holding any lock and the data can be out-of date already at the moment it is returned by c/storage. Compare containers/storage@cbc2ca8 . It may well be the case that Podman is doing more synchronization, I wouldn’t know.

In summary nothing of the above is obviously related to the c/storage work I was doing. The container state cleanup (unlike layer state cleanup) is very minimal in c/storage.

My hunch would be to follow the define.ErrCtrStateInvalid inconsistencies. Not that I know how.

@vrothberg
Copy link
Member

Thanks for taking a look, @mtrmac !

@vrothberg
Copy link
Member

@edsantiago fresh occurrences of the flake since #16886?

@edsantiago
Copy link
Member Author

edsantiago commented Jan 2, 2023

@vrothberg it's hard to say: there have been a lot of flakes in the past few weeks, even despite a low number of CI runs, and it'll take me half a day to evaluate/categorize them all. From a quick overview, though, I don't see the classic "everything hosed" signature. I'll keep my eyes open for it.

PS isn't today a holiday for you?

@vrothberg
Copy link
Member

Thanks, Ed!

PS isn't today a holiday for you?

Not in France. But in the US, so please don't let me bother you today. There's plenty of time for that this year :)

@edsantiago
Copy link
Member Author

Here goes. I'm not 100% sure that these are the same bug, because they don't exhibit the "everything-hosed" symptom, but they do look like the same "podman rm fails" symptom. Is it possible that #16886 fixed the everything-hosed bug, but we're still seeing the "could not remove container, could not be stopped" bug?

vrothberg added a commit to vrothberg/libpod that referenced this issue Jan 16, 2023
The container lock is released before stopping/killing which implies
certain race conditions with, for instance, the cleanup process changing
the container state to stopped, exited or other states.

The (remaining) flakes seen in containers#16142 and containers#15367 strongly indicate a
race in between the stopping/killing a container and the cleanup
process.  To fix the flake make sure to ignore invalid-state errors.
An alternative fix would be to change `KillContainer` to not return such
errors at all but commit c77691f indicates an explicit desire to
have these errors being reported in the sig proxy.

[NO NEW TESTS NEEDED] as it's a race already covered by the system
tests.

Fixes: containers#16142
Fixes: containers#15367
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 4, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration kube locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. remote Problem is in podman-remote
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants