Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/cg/sd: return error from stopUnit #2946

Merged
merged 1 commit into from
May 26, 2021

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented May 7, 2021

Fixes the source of the error seen at #2944 (comment)

Historically, we never returned an error from failed startUnit
or stopUnit. The startUnit case was fixed by commit 3844789 (PR #2614).

It is time to fix stopUnit, too. The reasons are:

  1. Ignoring an error from stopUnit means an unexpected trouble down the
    road, for example a failure to create a container with the same name:

    time="2021-05-07T19:51:27Z" level=error msg="container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: Unit runc-test_busybox.scope already exists."

  2. A somewhat short timeout of 1 second means the cgroup might
    actually be removed a few seconds later but we might have a
    race between removing the cgroup and creating another one
    with the same name, resulting in the same error as above.

So, return an error if removal failed, and increase the timeout.

Now, modify the systemd cgroup v1 manager to not mask the error from
stopUnit (stopErr) with the subsequent one from cgroups.RemovePath,
as stopErr is most probably the reason why RemovePath failed.

Note that for v1 we do want to remove the paths even in case
of a failure from stopUnit, as some were not created by systemd.
There's no need to do that for v2, thanks to unified hierarchy,
so no changes there.

@kolyshkin
Copy link
Contributor Author

Failure on Fedora:

not ok 80 runc start
# (in test file tests/integration/start.bats, line 15)
#   `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
# 
# runc create --console-socket /tmp/bats-run-23697/runc.8iGh0j/tty/sock test_busybox (status=1):
# time="2021-05-07T23:29:00Z" level=error msg="dial unix /tmp/bats-run-23697/runc.8iGh0j/tty/sock: connect: no such file or directory"

I have no explanation for this (other than recvtty suddenly died?)

Historically, we never returned an error from failed startUnit
or stopUnit. The startUnit case was fixed by commit 3844789.

It is time to fix stopUnit, too. The reasons are:

1. Ignoring an error from stopUnit means an unexpected trouble down the
   road, for example a failure to create a container with the same name:

   > time="2021-05-07T19:51:27Z" level=error msg="container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: Unit runc-test_busybox.scope already exists."

2. A somewhat short timeout of 1 second means the cgroup might
   actually be removed a few seconds later but we might have a
   race between removing the cgroup and creating another one
   with the same name, resulting in the same error as amove.

So, return an error if removal failed, and increase the timeout.

Now, modify the systemd cgroup v1 manager to not mask the error from
stopUnit (stopErr) with the subsequent one from cgroups.RemovePath,
as stopErr is most probably the reason why RemovePath failed.

Note that for v1 we do want to remove the paths even in case of a
failure from stopUnit, as some were not created by systemd.
There's no need to do that for v2, thanks to unified hierarchy,
so no changes there.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin kolyshkin force-pushed the systemd-stop-timeout branch from 472cb0c to 33c9f8b Compare May 12, 2021 18:40
@kolyshkin
Copy link
Contributor Author

close/reopen to kick CI

@kolyshkin kolyshkin closed this May 15, 2021
@kolyshkin kolyshkin reopened this May 15, 2021
@kolyshkin kolyshkin added this to the 1.0.0 milestone May 21, 2021
@kolyshkin
Copy link
Contributor Author

Tentatively added this to 1.0 milestone as I think it makes sense to have it (as said earlier, ignoring an error means more trouble later).

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

// Both on success and on error, cleanup all the cgroups
// we are aware of, as some of them were created directly
// by Apply() and are not managed by systemd.
if err := cgroups.RemovePaths(m.paths); err != nil && stopErr == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this change could've been a bit neater but it works so w/e.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean something like this?

rmErr := cgroups.RemovePaths(m.paths)
// stopErr should prevail as it might be the reason for rmErr.
if stopErr != nil {
     return stopErr
}
return rmErr

@cyphar cyphar requested review from AkihiroSuda and a team May 26, 2021 02:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants