libct/cg/sd: return error from stopUnit #2946
Failure on Fedora:
I have no explanation for this (other than recvtty suddenly died?)
Historically, we never returned an error from a failed startUnit or stopUnit. The startUnit case was fixed by commit 3844789. It is time to fix stopUnit, too. The reasons are:

1. Ignoring an error from stopUnit means unexpected trouble down the road, for example a failure to create a container with the same name:

   > time="2021-05-07T19:51:27Z" level=error msg="container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: Unit runc-test_busybox.scope already exists."

2. A somewhat short timeout of 1 second means the cgroup might actually be removed a few seconds later, but we might have a race between removing the cgroup and creating another one with the same name, resulting in the same error as above.

So, return an error if removal failed, and increase the timeout.

Now, modify the systemd cgroup v1 manager to not mask the error from stopUnit (stopErr) with the subsequent one from cgroups.RemovePaths, as stopErr is most probably the reason why RemovePaths failed.

Note that for v1 we do want to remove the paths even in case of a failure from stopUnit, as some were not created by systemd. There's no need to do that for v2, thanks to unified hierarchy, so no changes there.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
472cb0c to 33c9f8b
close/reopen to kick CI
Tentatively added this to the 1.0 milestone as I think it makes sense to have it (as said earlier, ignoring an error means more trouble later).
LGTM.
// Both on success and on error, cleanup all the cgroups
// we are aware of, as some of them were created directly
// by Apply() and are not managed by systemd.
if err := cgroups.RemovePaths(m.paths); err != nil && stopErr == nil {
I feel this change could've been a bit neater but it works so w/e.
You mean something like this?

rmErr := cgroups.RemovePaths(m.paths)
// stopErr should prevail as it might be the reason for rmErr.
if stopErr != nil {
	return stopErr
}
return rmErr
Fixes the source of the error seen at #2944 (comment)
Historically, we never returned an error from a failed startUnit or stopUnit. The startUnit case was fixed by commit 3844789 (PR #2614).

It is time to fix stopUnit, too. The reasons are:

1. Ignoring an error from stopUnit means unexpected trouble down the road, for example a failure to create a container with the same name:

   > time="2021-05-07T19:51:27Z" level=error msg="container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: Unit runc-test_busybox.scope already exists."

2. A somewhat short timeout of 1 second means the cgroup might actually be removed a few seconds later, but we might have a race between removing the cgroup and creating another one with the same name, resulting in the same error as above.

So, return an error if removal failed, and increase the timeout.

Now, modify the systemd cgroup v1 manager to not mask the error from stopUnit (stopErr) with the subsequent one from cgroups.RemovePaths, as stopErr is most probably the reason why RemovePaths failed.

Note that for v1 we do want to remove the paths even in case of a failure from stopUnit, as some were not created by systemd. There's no need to do that for v2, thanks to unified hierarchy, so no changes there.