-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Implement systemd-specific per-cgroup support (+ proof-of-concept "devices" and "memory") #1991
Conversation
@mrunalp PTAL |
I like the approach of subsystem returning systemd properties 👍 Looking at the rest in more detail. Thanks! |
So I have just broken this down into two separate commits, one for the infrastructure of detecting subsystems that support systemd and the second for handling the devices cgroup. Furthermore, on the second commit handling devices, I did a small fix to the implementation, when we have major/minor, I'm using the syntax I also added a third commit, handling the memory cgroup. The memory cgroup was quite interesting, since a lot changed in cgroupv2 and many of the options are no longer available. (Also see related issue opencontainers/runtime-spec#1005 I opened about the OCI spec for memory and comparing it to how memory controller works in cgroupv2.) In doing the split, my objective was to keep the original behavior whenever the "legacy" or "hybrid" hierarchy was in use, so I split those into a separate I tested this locally and it's working. Granted, it needs a lot more testing, we want to address that, but I'd like to get some feedback on the code itself and the methods I'm introducing. I'm happy to do changes to the API, but I'd like to have some feedback of what would look good to you. @cyphar, @crosbymichael Would you mind taking a look at the code and see if you'd like to have structural changes on where this code goes and whether the function prototypes look OK to you? Very specifically, this part here (which essentially determines how all the rest works): type systemdSubsystem interface {
// Returns a list of systemd properties to manage the underlying cgroups.
ToSystemdProperties(cgroup *configs.Cgroup) ([]systemdDbus.Property, error)
// Set the cgroupv1 attributes only, which can be used when in hybrid or legacy mode
// to keep setting properties that are not (and will not be) managed by systemd.
SetCgroupv1(path string, cgroup *configs.Cgroup) error
} Thanks! |
Define a new interface that will allow subsystems/controllers to implement systemd-based configuration, by using systemd directives rather than writing directly to the cgroup subtree. Signed-off-by: Filipe Brandenburger <filbranden@google.com>
I tested this with Podman using: $ podman --runtime ~/go/src/github.com/opencontainers/runc/runc run -t fedora:29 echo hello And also bringing up a container and checking the contents of "device.list" in the cgroup subtree: $ podman --runtime ~/go/src/github.com/opencontainers/runc/runc run -t fedora:29 sleep 1h $ cat /sys/fs/cgroup/devices/machine.slice/libpod-12fc7bd62fd6*/devices.list c 10:200 rwm c 5:2 rwm c 136:* rwm c 5:1 rwm c 1:9 rwm c 1:5 rwm c 5:0 rwm c 1:7 rwm c 1:8 rwm c 1:3 rwm b *:* m c *:* m This matches the output of devices.list when using the official "runc" binary, only difference being the lines are inverted in order (again, we can fix that on a second step.) Querying systemd for this unit also works as expected: $ systemctl show libpod-12fc7bd62fd66ff62fa1b045c2d717c7b2076c072c20de14f5c1ad86b78865eb.scope -p DevicePolicy -p DeviceAllow DevicePolicy=strict DeviceAllow=/dev/net/tun rwm DeviceAllow=/dev/ptmx rwm DeviceAllow=char-136 rwm DeviceAllow=/dev/console rwm DeviceAllow=/dev/urandom rwm DeviceAllow=/dev/zero rwm DeviceAllow=/dev/tty rwm DeviceAllow=/dev/full rwm DeviceAllow=/dev/random rwm DeviceAllow=/dev/null rwm DeviceAllow=block-* m DeviceAllow=char-* m v2: Now using systemd support for /dev/{char,block}/<major>:<minor> instead, which results into these properties being set instead: DevicePolicy=strict DeviceAllow=/dev/char/10:200 rwm DeviceAllow=/dev/char/5:2 rwm DeviceAllow=char-136 rwm DeviceAllow=/dev/char/5:1 rwm DeviceAllow=/dev/char/1:9 rwm DeviceAllow=/dev/char/1:5 rwm DeviceAllow=/dev/char/5:0 rwm DeviceAllow=/dev/char/1:7 rwm DeviceAllow=/dev/char/1:8 rwm DeviceAllow=/dev/char/1:3 rwm DeviceAllow=block-* m DeviceAllow=char-* m Unclear whether we should really be using this unconditionally, or only in the cases where a device name hasn't been supplied through d.Path. Signed-off-by: Filipe Brandenburger <filbranden@google.com>
This is a preliminary implementation setting systemd properties rather than writing to the cgroup files directly. Given systemd doesn't support most of the cgroupv1 properties, these continue being set using a new SetCgroupv1() method, which keeps compatibility in case the system is using "legacy" or "hybrid" hierarchy (but which can be skipped when using "unified" hierarchy, at the cost of change in behavior.) Tested with podman: $ podman --runtime ~/go/src/github.com/opencontainers/runc/runc run --memory 1g --memory-reservation 100m --memory-swap 2g --memory-swappiness 50 -t fedora:29 sleep 1h Inspecting the systemd unit: $ systemctl show libpod-30a387de72953df5368645405c04e7aa7253fc31be2ef73f334e2753f652e34e.scope -p MemoryLow -p MemoryMax -p MemorySwapMax -p MemoryLimit MemoryLow=104857600 MemoryMax=1073741824 MemorySwapMax=2147483648 MemoryLimit=1073741824 And looking at the files which are set in the cgroup tree directly: $ cat /sys/fs/cgroup/memory/machine.slice/libpod-30a387de72953df5368645405c04e7aa7253fc31be2ef73f334e2753f652e34e.scope/memory.swappiness 50 Signed-off-by: Filipe Brandenburger <filbranden@google.com>
I have now rebased this on top of #2014, so the Travis CI checks will include a run of |
@crosbymichael Can you please do a code review and take a look at the API? I'm happy to rework the code to change the internal API, just looking for some feedback to see what you think would be acceptable here... Thanks! |
Closing this one, since this approach will not work with older systemd versions (Travis-CI running systemd v229 on Ubuntu Xenial 16.04 shows that clearly.) See #2007 for a more lengthy update on the proposed new approach. And also #2047 which starts to refactor the code to allow for that switch. Cheers! |
This PR is trying to accomplish two things:
Define a new interface that will allow subsystems/controllers to
implement systemd-based configuration, by using systemd directives
rather than writing directly to the cgroup subtree.
Add a systemd-based implementation to the "devices" subsystem, to
illustrate how it is meant to be used.
The initial point I'd like to make here is towards discussing (1) as an idea and whether the Go abstractions/interfaces are appropriate here or whether we should move things around.
Consider part (2) to be really a draft and not really finished (even though it actually works to a large extent, the D-Bus messages are correct and that has been tested to do what's expected.)
I tested this with Podman using:
And also bringing up a container and checking the contents of "device.list" in the cgroup subtree:
This matches the output of devices.list when using the official "runc" binary, only difference being the lines are inverted in order (again, we can fix that on a second step.)
Querying systemd for this unit also works as expected:
/cc @cyphar @crosbymichael @mrunalp
Also @rhatdan
Maybe @sargun might be interested, since he was working on #1708 and this here is a step towards supporting cgroupv2 in unified hierarchy in runc/libcontainer.
Also @AkihiroSuda who has been working on rootless/usernetes and has been asking about this.
Cheers!
Filipe