
Remove leftover autoremove containers during refresh #21541

Merged: 2 commits merged into containers:main on Feb 8, 2024

Conversation

mheon (Member) commented Feb 6, 2024

During system shutdown, Podman should go down gracefully, meaning that we have time to spawn cleanup processes which remove any containers set to autoremove. Unfortunately, this isn't always the case. If we get a SIGKILL because the system is going down immediately, we can't recover from this, and the autoremove containers are not removed.

However, we can pick up any leftover autoremove containers when we refresh the DB state, which is the first thing Podman does after a reboot. By detecting any autoremove containers that have actually run (a container that was created but never run doesn't need to be removed) at that point and removing them, we keep the fresh boot clean, even if Podman was terminated abnormally.
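For illustration, a minimal sketch of that idea with simplified stand-in types (`container`, `state`, and `refreshCleanup` are hypothetical names, not libpod's actual API):

```go
package main

import "fmt"

// Simplified stand-ins for libpod's container states; illustrative only.
type state int

const (
	stateConfigured state = iota // created but never run
	stateExited                  // ran at some point during the last boot
)

type container struct {
	id         string
	autoRemove bool
	state      state
}

// refreshCleanup removes leftover autoremove containers during the
// post-reboot DB refresh. A container that never ran is kept.
func refreshCleanup(ctrs []container) []container {
	kept := ctrs[:0]
	for _, c := range ctrs {
		if c.autoRemove && c.state == stateExited {
			fmt.Printf("removing leftover --rm container %s\n", c.id)
			continue
		}
		kept = append(kept, c)
	}
	return kept
}

func main() {
	ctrs := []container{
		{id: "a1b2", autoRemove: true, state: stateExited},     // removed
		{id: "c3d4", autoRemove: true, state: stateConfigured}, // kept
	}
	refreshCleanup(ctrs)
}
```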

Fixes #21482

Does this PR introduce a user-facing change?

Fixed a bug where containers started with `--rm` were sometimes not removed after a reboot.

@openshift-ci bot added the release-note and approved labels on Feb 6, 2024
mheon (Member, Author) commented Feb 6, 2024

Oops. Fixes #21482

@mheon force-pushed the refresh_rm_autoremove branch from 1030bc3 to 805351e on February 6, 2024, with the commit message:
During system shutdown, Podman should go down gracefully, meaning
that we have time to spawn cleanup processes which remove any
containers set to autoremove. Unfortunately, this isn't always
the case. If we get a SIGKILL because the system is going down
immediately, we can't recover from this, and the autoremove
containers are not removed.

However, we can pick up any leftover autoremove containers when
we refresh the DB state, which is the first thing Podman does
after a reboot. By detecting any autoremove containers that have
actually run (a container that was created but never run doesn't
need to be removed) at that point and removing them, we keep the
fresh boot clean, even if Podman was terminated abnormally.

Fixes containers#21482

[NO NEW TESTS NEEDED] This requires a reboot to realistically
test.

Signed-off-by: Matt Heon <mheon@redhat.com>
@mheon force-pushed the refresh_rm_autoremove branch from 805351e to 9983e87 on February 6, 2024.
```diff
@@ -856,6 +859,22 @@ func (r *Runtime) refresh(alivePath string) error {
 		if err := ctr.refresh(); err != nil {
 			logrus.Errorf("Refreshing container %s: %v", ctr.ID(), err)
 		}
+		// This is the only place it's safe to use ctr.state.State unlocked.
+		// We're holding the alive lock, guaranteed to be the only Libpod on the system right now.
+		if ctr.AutoRemove() && ctr.state.State == define.ContainerStateExited {
```
A reviewer (Member) commented on this hunk:
I don't see how this can work. resetContainerState() seems to set the state to ContainerStateConfigured if the state was not ContainerStateExited, which may be fine, but if you consider that this is supposed to fix the case of a hard shutdown, I would think the previous state could have been Running, Stopped, or Stopping.

The only case where the state is ContainerStateExited is likely when podman container cleanup has already run successfully, at which point it should be safe to assume the container was already deleted.

I really think the state check must happen before resetContainerState(), and it should delete for all states except ContainerStateConfigured.
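A minimal sketch of the ordering being suggested, where `reset` loosely mimics the described `resetContainerState()` behavior (all names are illustrative, not libpod's actual code):

```go
package main

import "fmt"

type state int

const (
	configured state = iota
	running
	stopped
	exited
)

type container struct {
	autoRemove bool
	state      state
}

// reset loosely mimics the described resetContainerState() behavior:
// anything that is not Exited is forced back to Configured.
func reset(c *container) {
	if c.state != exited {
		c.state = configured
	}
}

func main() {
	// Hard shutdown while the container was running.
	c := &container{autoRemove: true, state: running}

	// Capture the state *before* resetting; otherwise a container that
	// was Running at shutdown looks freshly Configured and escapes removal.
	prev := c.state
	reset(c)
	if c.autoRemove && prev != configured {
		fmt.Println("would remove leftover autoremove container")
	}
}
```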

mheon (Member, Author) replied:
The refresh logic should be forcing anything except ContainerStateConfigured and ContainerStateInitialized to ContainerStateExited. If we're not forcing Running and Stopped to Exited, that's a problem.

We should probably also catch ContainerStateRemoving now that I think about it. Pass it through unchanged and remove immediately if true.

mheon (Member, Author) replied:

Yeah, our refresh logic is definitely wrong, we should be catching more than just Exited.

mheon (Member, Author) replied:

Updated. We now catch anything that was running and convert it into an Exited container.

The reviewer (Member) replied:

Indeed this looks much better.

The second commit's message:

We were preserving ContainerStateExited, which is better than
nothing, but definitely not correct. A container that ran at any
point during the last boot should be moved to Exited state to
preserve the fact that it ran at least once. This means we have
to convert Running, Stopped, Stopping, and Paused containers to
Exited as well.

Signed-off-by: Matt Heon <mheon@redhat.com>
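A sketch of the normalization this commit describes, with illustrative names rather than the actual libpod states and helpers: every state implying the container ran during the previous boot maps to Exited, while pre-run states pass through unchanged.

```go
package main

import "fmt"

type state string

const (
	configured  state = "configured"
	initialized state = "initialized"
	running     state = "running"
	stopping    state = "stopping"
	stopped     state = "stopped"
	paused      state = "paused"
	exited      state = "exited"
)

// normalizeAfterReboot maps any state that implies the container ran
// during the previous boot to Exited; pre-run states pass through.
func normalizeAfterReboot(s state) state {
	switch s {
	case running, stopping, stopped, paused, exited:
		return exited
	default: // configured, initialized
		return s
	}
}

func main() {
	for _, s := range []state{configured, running, paused, exited} {
		fmt.Printf("%-11s -> %s\n", s, normalizeAfterReboot(s))
	}
}
```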
Luap99 (Member) left a review:

LGTM. I am somewhat concerned that this critical part of system refresh after boot is pretty much untested.
I wonder if we can do an e2e test: start a container, then SIGKILL conmon, delete the rundir/tmpdir (this should simulate a reboot, since that directory holds the alive file), then run podman inspect and we should see the Exited state. WDYT?

mheon (Member, Author) commented Feb 7, 2024

You'd probably want to force-kill the container process and unmount the container as well before you removed the rundir... Is there a solid way of unmounting a container from the CLI without using a Podman command, I wonder?

Luap99 (Member) commented Feb 7, 2024

We show the overlayfs MergedDir in podman inspect, so a simple umount on that should work, I assume? And AFAIK killing just conmon should kill the container process as well, but yes, we could do an explicit kill on the container pid too.
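A rough sketch of that simulation driven from Go via os/exec (the scratch `--root`/`--runroot` paths, the image, and the exact sequence are assumptions for illustration, not the e2e test being proposed):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// run executes a command and prints its combined output; errors are
// reported but not fatal, since some steps may legitimately be no-ops.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	fmt.Printf("$ %s %s\n%s", name, strings.Join(args, " "), out)
	if err != nil {
		fmt.Println("  (error:", err, ")")
	}
}

func main() {
	// Scratch storage (assumed paths) so the "reboot" can be simulated
	// by deleting the runroot, which holds the alive file.
	pod := []string{"--root", "/tmp/pod-root", "--runroot", "/tmp/pod-run"}

	run("podman", append(pod, "run", "-d", "--rm", "--name", "t", "alpine", "sleep", "600")...)

	// Simulate a hard shutdown: SIGKILL conmon. (Coarse: this kills
	// every conmon on the system; fine only on an isolated test VM.)
	run("pkill", "-9", "conmon")

	// Unmount the container's overlay mount, using the MergedDir that
	// podman inspect reports.
	merged, _ := exec.Command("podman", append(pod, "inspect", "--format",
		"{{.GraphDriver.Data.MergedDir}}", "t")...).Output()
	run("umount", strings.TrimSpace(string(merged)))

	// Removing the runroot removes the alive file, so the next podman
	// invocation performs the post-"reboot" refresh.
	run("rm", "-rf", "/tmp/pod-run")

	// With this PR, the refresh should have removed the --rm container.
	run("podman", append(pod, "ps", "-a")...)
}
```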

mheon (Member, Author) commented Feb 7, 2024

Do you want me to do that in this PR, or should that be a follow-on?

Luap99 (Member) commented Feb 7, 2024

I am fine with a follow-on.

rhatdan (Member) commented Feb 7, 2024

/approve
/lgtm
/hold

@openshift-ci bot added the do-not-merge/hold label on Feb 7, 2024
@openshift-ci bot added the lgtm label on Feb 7, 2024
openshift-ci bot commented Feb 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon, rhatdan

Luap99 (Member) commented Feb 8, 2024

/hold cancel

@openshift-ci bot removed the do-not-merge/hold label on Feb 8, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 3aa413f into containers:main Feb 8, 2024
88 of 89 checks passed
Successfully merging this pull request may close these issues.

podman machine on Windows WSL2 does not stop "autoremove" containers depending on Windows machine