libct/cgroups/systemd: eliminate runc/systemd race #2614

Merged
merged 1 commit into from
Sep 30, 2020

Conversation

Contributor

@kolyshkin kolyshkin commented Sep 30, 2020

If systemd takes more than 1 second to create a unit, startUnit()
times out with a warning and runc proceeds anyway (creating cgroups
using the fs manager, and so on).

Now runc and systemd are racing, and multiple scenarios are possible.

In one such scenario, by the time runc calls the systemd manager's
Apply(), the unit has not yet been created, so the
dbusConnection.SetUnitProperties() call fails with
"unit xxx.scope not found", and the whole container start fails too.

To eliminate the race, we need to return an error in case the timeout is
hit.

To reduce the chance of failure, increase the timeout from 1 to 30
seconds, so as not to error out too early on a busy or slow system
(times like 3-5 seconds are not unrealistic).

While at it, since the timeout is now quite long, make sure not to
leave a stray timer behind.

Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1883640

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

3 participants