
Remove leftover autoremove containers during refresh #21541

Merged: 2 commits merged into containers:main on Feb 8, 2024

Conversation

mheon (Member) commented Feb 6, 2024

During system shutdown, Podman should go down gracefully, meaning that we have time to spawn cleanup processes which remove any containers set to autoremove. Unfortunately, this isn't always the case. If we get a SIGKILL because the system is going down immediately, we can't recover from this, and the autoremove containers are not removed.

However, we can pick up any leftover autoremove containers when we refresh the DB state, which is the first thing Podman does after a reboot. By detecting any autoremove containers that have actually run (a container that was created but never run doesn't need to be removed) at that point and removing them, we keep the fresh boot clean, even if Podman was terminated abnormally.
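For illustration, a minimal sketch of that idea with simplified stand-in types (`container`, `state`, and `refreshCleanup` are hypothetical names, not libpod's actual API):

```go
package main

import "fmt"

// Simplified stand-ins for libpod's container states; illustrative only.
type state int

const (
	stateConfigured state = iota // created but never run
	stateExited                  // ran at some point during the last boot
)

type container struct {
	id         string
	autoRemove bool
	state      state
}

// refreshCleanup removes leftover autoremove containers during the
// post-reboot DB refresh. A container that never ran is kept.
func refreshCleanup(ctrs []container) []container {
	kept := ctrs[:0]
	for _, c := range ctrs {
		if c.autoRemove && c.state == stateExited {
			fmt.Printf("removing leftover --rm container %s\n", c.id)
			continue
		}
		kept = append(kept, c)
	}
	return kept
}

func main() {
	ctrs := []container{
		{id: "a1b2", autoRemove: true, state: stateExited},     // removed
		{id: "c3d4", autoRemove: true, state: stateConfigured}, // kept
	}
	refreshCleanup(ctrs)
}
```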

Fixes #21482

Does this PR introduce a user-facing change?

Fixed a bug where containers started with `--rm` were sometimes not removed after a reboot.

@openshift-ci bot added the release-note and approved labels on Feb 6, 2024
mheon (Member, Author) commented Feb 6, 2024

Oops. Fixes #21482

@mheon force-pushed the refresh_rm_autoremove branch from 1030bc3 to 805351e on February 6, 2024, with the commit message:
During system shutdown, Podman should go down gracefully, meaning
that we have time to spawn cleanup processes which remove any
containers set to autoremove. Unfortunately, this isn't always
the case. If we get a SIGKILL because the system is going down
immediately, we can't recover from this, and the autoremove
containers are not removed.

However, we can pick up any leftover autoremove containers when
we refresh the DB state, which is the first thing Podman does
after a reboot. By detecting any autoremove containers that have
actually run (a container that was created but never run doesn't
need to be removed) at that point and removing them, we keep the
fresh boot clean, even if Podman was terminated abnormally.

Fixes containers#21482

[NO NEW TESTS NEEDED] This requires a reboot to realistically
test.

Signed-off-by: Matt Heon <mheon@redhat.com>
@mheon force-pushed the refresh_rm_autoremove branch from 805351e to 9983e87 on February 6, 2024.
```diff
@@ -856,6 +859,22 @@ func (r *Runtime) refresh(alivePath string) error {
 		if err := ctr.refresh(); err != nil {
 			logrus.Errorf("Refreshing container %s: %v", ctr.ID(), err)
 		}
+		// This is the only place it's safe to use ctr.state.State unlocked.
+		// We're holding the alive lock, guaranteed to be the only Libpod on the system right now.
+		if ctr.AutoRemove() && ctr.state.State == define.ContainerStateExited {
```
A reviewer (Member) commented on this hunk:
I don't see how this can work. resetContainerState() seems to set the state to ContainerStateConfigured if the state was not ContainerStateExited, which may be fine, but if you consider that this is supposed to fix the case of a hard shutdown, I would think the previous state could have been Running, Stopped, or Stopping.

The only case where the state is ContainerStateExited is likely when podman container cleanup has already run successfully, at which point it should be safe to assume the container was already deleted.

I really think the state check must happen before resetContainerState(), and it should delete for all states except ContainerStateConfigured.
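A minimal sketch of the ordering being suggested, where `reset` loosely mimics the described `resetContainerState()` behavior (all names are illustrative, not libpod's actual code):

```go
package main

import "fmt"

type state int

const (
	configured state = iota
	running
	stopped
	exited
)

type container struct {
	autoRemove bool
	state      state
}

// reset loosely mimics the described resetContainerState() behavior:
// anything that is not Exited is forced back to Configured.
func reset(c *container) {
	if c.state != exited {
		c.state = configured
	}
}

func main() {
	// Hard shutdown while the container was running.
	c := &container{autoRemove: true, state: running}

	// Capture the state *before* resetting; otherwise a container that
	// was Running at shutdown looks freshly Configured and escapes removal.
	prev := c.state
	reset(c)
	if c.autoRemove && prev != configured {
		fmt.Println("would remove leftover autoremove container")
	}
}
```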

mheon (Member, Author) replied:
The refresh logic should be forcing anything except ContainerStateConfigured and ContainerStateInitialized to ContainerStateExited. If we're not forcing Running and Stopped to Exited, that's a problem.

We should probably also catch ContainerStateRemoving now that I think about it. Pass it through unchanged and remove immediately if true.

mheon (Member, Author) replied:

Yeah, our refresh logic is definitely wrong, we should be catching more than just Exited.

mheon (Member, Author) replied:

Updated. We now catch anything that was running and convert it into an Exited container.

The reviewer (Member) replied:

Indeed this looks much better.

The second commit's message:

We were preserving ContainerStateExited, which is better than
nothing, but definitely not correct. A container that ran at any
point during the last boot should be moved to Exited state to
preserve the fact that it ran at least once. This means we have
to convert Running, Stopped, Stopping, and Paused containers to
Exited as well.

Signed-off-by: Matt Heon <mheon@redhat.com>
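A sketch of the normalization this commit describes, with illustrative names rather than the actual libpod states and helpers: every state implying the container ran during the previous boot maps to Exited, while pre-run states pass through unchanged.

```go
package main

import "fmt"

type state string

const (
	configured  state = "configured"
	initialized state = "initialized"
	running     state = "running"
	stopping    state = "stopping"
	stopped     state = "stopped"
	paused      state = "paused"
	exited      state = "exited"
)

// normalizeAfterReboot maps any state that implies the container ran
// during the previous boot to Exited; pre-run states pass through.
func normalizeAfterReboot(s state) state {
	switch s {
	case running, stopping, stopped, paused, exited:
		return exited
	default: // configured, initialized
		return s
	}
}

func main() {
	for _, s := range []state{configured, running, paused, exited} {
		fmt.Printf("%-11s -> %s\n", s, normalizeAfterReboot(s))
	}
}
```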
Luap99 (Member) left a review:

LGTM. I am somewhat concerned that this critical part of system refresh after boot is pretty much untested.
I wonder if we can do an e2e test: start a container, then SIGKILL conmon, delete the rundir/tmpdir (this should simulate a reboot, since that directory holds the alive file), then run podman inspect and we should see the Exited state. WDYT?

mheon (Member, Author) commented Feb 7, 2024

You'd probably want to force-kill the container process and unmount the container as well before you removed the rundir... Is there a solid way of unmounting a container from the CLI without using a Podman command, I wonder?

Luap99 (Member) commented Feb 7, 2024

We show the overlayfs MergedDir in podman inspect, so a simple umount on that should work, I assume? And AFAIK killing just conmon should kill the container process as well, but yes, we could do an explicit kill on the container pid too.
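A rough sketch of that simulation driven from Go via os/exec (the scratch `--root`/`--runroot` paths, the image, and the exact sequence are assumptions for illustration, not the e2e test being proposed):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// run executes a command and prints its combined output; errors are
// reported but not fatal, since some steps may legitimately be no-ops.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	fmt.Printf("$ %s %s\n%s", name, strings.Join(args, " "), out)
	if err != nil {
		fmt.Println("  (error:", err, ")")
	}
}

func main() {
	// Scratch storage (assumed paths) so the "reboot" can be simulated
	// by deleting the runroot, which holds the alive file.
	pod := []string{"--root", "/tmp/pod-root", "--runroot", "/tmp/pod-run"}

	run("podman", append(pod, "run", "-d", "--rm", "--name", "t", "alpine", "sleep", "600")...)

	// Simulate a hard shutdown: SIGKILL conmon. (Coarse: this kills
	// every conmon on the system; fine only on an isolated test VM.)
	run("pkill", "-9", "conmon")

	// Unmount the container's overlay mount, using the MergedDir that
	// podman inspect reports.
	merged, _ := exec.Command("podman", append(pod, "inspect", "--format",
		"{{.GraphDriver.Data.MergedDir}}", "t")...).Output()
	run("umount", strings.TrimSpace(string(merged)))

	// Removing the runroot removes the alive file, so the next podman
	// invocation performs the post-"reboot" refresh.
	run("rm", "-rf", "/tmp/pod-run")

	// With this PR, the refresh should have removed the --rm container.
	run("podman", append(pod, "ps", "-a")...)
}
```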

mheon (Member, Author) commented Feb 7, 2024

Do you want me to do that in this PR, or should that be a follow-on?

Luap99 (Member) commented Feb 7, 2024

I am fine with a follow-on.

rhatdan (Member) commented Feb 7, 2024

/approve
/lgtm
/hold

@openshift-ci bot added the do-not-merge/hold label on Feb 7, 2024
@openshift-ci bot added the lgtm label on Feb 7, 2024
openshift-ci bot commented Feb 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon, rhatdan

Luap99 (Member) commented Feb 8, 2024

/hold cancel

@openshift-ci bot removed the do-not-merge/hold label on Feb 8, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 3aa413f into containers:main Feb 8, 2024
88 of 89 checks passed
Successfully merging this pull request may close these issues.

podman machine on Windows WSL2 does not stop "autoremove" containers depending on Windows machine