Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/cg/sd/v1: add a knob to skip freeze on set #3161

Closed
wants to merge 1 commit into from

Conversation

kolyshkin
Copy link
Contributor

Commit b810da1 added support for setting Device* properties to
systemd cgroup v1 controller (as otherwise systemd applied "allow-all"
device rule).

Unfortunately, systemd re-applies device rules on every resource update,
and does so in a disruptive way, causing spurious device permission
errors.

To work around that, systemd cgroup v1 controller freezes the
cgroup before (and thaws it after) applying systemd properties.
This freeze fixes the issue for runc, but creates a problem for
kubernetes.

Kubernetes uses libcontainer/cgroups/systemd to manage pod slice
units. When kubelet updates a pod slice, all the containers under that
pod are unexpectedly frozen for some time, which leads to other issues.

One way to fix this was tried in commit f2db879 (PR #3082).
By looking into unit properties, we can predict whether systemd is going
to update device rules, and skip the freeze if it won't.

Unfortunately, the above commit contains a few bugs, resulting in
broken device rules check, so it's still not fixing the issue.

Upon fixing those bug (see PR #3143), a benchmark was created to assess
how expensive getting unit properties is. Turns out, it is -- Set()
times went from 200us to 500us (see full benchmark results here).

An alternative approach was tried in PR #3151, in which we check
devices.list cgroup file to find out whether the cgroup has some rules or just
a single "allow-all". Alas, this approach is not working if a cgroup
has a parent with some device rules set.

Considering all this, it seems that the best approach is to add a
"do not freeze on set" flag, which kubernetes can use when it knows
the slice did not and does not have device rules, so freeze can safely
be skipped.

Remove the (expensive and currently broken) unit properties querying,
and add/use a SkipFreezeOnSet flag instead.

Since the logic is trivial, no tests are needed.

This is a stop-gap measure. Long term solution is for k8s to stop
using libcontainer/cgroups (kubernetes/kubernetes#104325).

Commit b810da1 added support for Device* properties into systemd
cgroup v1 controller. Unfortunately, systemd re-applies device rules on
every resource update, and does so disruptively, which results in
spurious device permission errors. To work around that, systemd cgroup
v1 controller freezes the cgroup before (and thaws it after) applying
systemd properties.

This freeze fixes the issue for runc, but creates a problem for
kubernetes, which uses libcontainer/cgroups/systemd to manage pod slice
units. When kubelet updates a pod slice, all the containers under that
pod are unexpectedly frozen for some time, which leads to other issues.

One way to fix this was tried in commit f2db879. By looking
into unit properties, we can predict whether systemd is going to update
device rules, and skip the freeze if it won't.

Unfortunately, the above commit contains a few bugs, resulting in
broken device rules check, so it's still not fixing the issue.

Upon fixing those bug (see PR 3143), a benchmark was created to assess
how expensive getting unit properties is. Turns out, it is -- Set()
times went from 200us to 500us (see full benchmark results in that PR).

An alternative approach was tried in PR 3151, in which we check
devices.list to find out whether they have some rules or just
a single "allow-all". Alas, this approach is not working if a cgroup
there is a parent with some device rules set.

Considering all this, it seems that the best approach is to add a
"do not freeze on set" flag, which kubernetes can use when it knows
the slice did not and does not have device rules, so freeze can safely
be skipped.

Remove the (expensive and currently broken) unit properties querying,
and add/use a SkipFreezeOnSet flag instead.

Since the logic is trivial, no tests are needed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Copy link
Contributor

@odinuge odinuge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 🙌

This will remove (tho it has never worked) support for avoiding the freeze for containers with full access. Should we add this flag and remove it, or should we do both? The overhead of the one/two dbus calls would still be lower than actually freezing containers with a lot of procs.

@kolyshkin
Copy link
Contributor Author

I'm trying to assess what's better:

  1. Penalize privileged (no device restrictions) containers by freezing them on Set.
  2. Penalize all containers by making Set slower with the device* properties check.

I don't like it either way, but the only other choice is

  1. Fix systemd, doing something similar to what @cyphar did in PR cgroup: devices: major cleanups and minimal transition rules #2391.

Taking into account that

  • solution 1 is the simplest
  • I haven't seen anyone complaining about freezing their privileged container during update
  • we're switching to cgroup v2 which do not have this problem

I think I prefer solution 1. Solution 2 is also OK but I like it less mostly because of complexity. If anyone can work on solution 3, I'm all for it :)

@odinuge
Copy link
Contributor

odinuge commented Aug 15, 2021

Well that is all up to you as runc maintianers I guess. The only reason for asking was that when this was implemented instead of a separate flag, the argument was that such privileged containers would also benefit from it. If that decision is changed, I am totally fine with it tho.

As long as we can get a flag for k8s so that we can avoid permanently frozen control groups, that's the most important part.

@kolyshkin
Copy link
Contributor Author

Here's another option: #3166. As of this morning it seems best.

@kolyshkin
Copy link
Contributor Author

Closed in favor of #3166 (which includes this one).

@kolyshkin kolyshkin closed this Aug 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants