libct/cg/sd: return error from stopUnit #2946

kolyshkin · 2021-05-07T23:15:49Z

Fixes the source of the error seen at #2944 (comment)

Historically, we never returned an error from failed startUnit
or stopUnit. The startUnit case was fixed by commit 3844789 (PR #2614).

It is time to fix stopUnit, too. The reasons are:

1. Ignoring an error from stopUnit means unexpected trouble down the
   road, for example a failure to create a container with the same name:

   > time="2021-05-07T19:51:27Z" level=error msg="container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: Unit runc-test_busybox.scope already exists."

2. A somewhat short timeout of 1 second means the cgroup might
   actually be removed a few seconds later, but we might have a
   race between removing the cgroup and creating another one
   with the same name, resulting in the same error as above.

So, return an error if removal failed, and increase the timeout.

Now, modify the systemd cgroup v1 manager to not mask the error from
stopUnit (stopErr) with the subsequent one from cgroups.RemovePath,
as stopErr is most probably the reason why RemovePath failed.

Note that for v1 we do want to remove the paths even in case
of a failure from stopUnit, as some were not created by systemd.
There's no need to do that for v2, thanks to unified hierarchy,
so no changes there.

kolyshkin · 2021-05-08T01:03:13Z

Failure on Fedora:

not ok 80 runc start
# (in test file tests/integration/start.bats, line 15)
#   `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
# 
# runc create --console-socket /tmp/bats-run-23697/runc.8iGh0j/tty/sock test_busybox (status=1):
# time="2021-05-07T23:29:00Z" level=error msg="dial unix /tmp/bats-run-23697/runc.8iGh0j/tty/sock: connect: no such file or directory"

I have no explanation for this (other than recvtty suddenly died?)

kolyshkin · 2021-05-15T02:23:49Z

close/reopen to kick CI

kolyshkin · 2021-05-21T22:28:04Z

Tentatively added this to 1.0 milestone as I think it makes sense to have it (as said earlier, ignoring an error means more trouble later).

libcontainer/cgroups/systemd/common.go

cyphar

LGTM.

cyphar · 2021-05-25T04:41:29Z

libcontainer/cgroups/systemd/v1.go

+	// Both on success and on error, cleanup all the cgroups
+	// we are aware of, as some of them were created directly
+	// by Apply() and are not managed by systemd.
+	if err := cgroups.RemovePaths(m.paths); err != nil && stopErr == nil {


I feel this change could've been a bit neater but it works so w/e.

kolyshkin · 2021-06-01

You mean something like this?

rmErr := cgroups.RemovePaths(m.paths)
// stopErr should prevail as it might be the reason for rmErr.
if stopErr != nil {
	return stopErr
}
return rmErr

kolyshkin mentioned this pull request May 7, 2021

Enable rootless cgroup v2 tests + some fixes #2944

Merged

kolyshkin added the area/systemd label May 7, 2021

kolyshkin force-pushed the systemd-stop-timeout branch from 472cb0c to 33c9f8b Compare May 12, 2021 18:40

kolyshkin closed this May 15, 2021

kolyshkin reopened this May 15, 2021

kolyshkin added this to the 1.0.0 milestone May 21, 2021

cyphar reviewed May 25, 2021

libcontainer/cgroups/systemd/common.go

cyphar approved these changes May 25, 2021

cyphar reviewed May 25, 2021

cyphar requested review from AkihiroSuda and a team May 26, 2021 02:55

AkihiroSuda approved these changes May 26, 2021

AkihiroSuda merged commit e005fee into opencontainers:master May 26, 2021

kolyshkin mentioned this pull request Jun 1, 2021

VERSION: release runc 1.0.0 #2971

Merged

7 tasks

kolyshkin mentioned this pull request Dec 16, 2021

[4.8] fix freeze, add SkipFreezeOnSet openshift/opencontainers-runc#10

Merged
