Add proposal for hiding container mountpoints from systemd #593
Conversation
> ### Risks and Mitigations
>
> Any other external or 3rd-party tools would need to change to match.
A field should be added in the open questions to investigate who may be affected by this.
Yeah, the big worry is about anything external that cares about the mounts.
I thought we hit a perf issue with these mounts being inspected in the systemd cadvisor handler. I am not aware of who actually cares about the mounts in question at this point...
Podman/Skopeo/Buildah will also need to run in the same mount namespace, otherwise they will end up setting up the storage in the host mount namespace, possibly causing corruption of the storage.
CSI driver pods / containers typically mount `/var/lib/kubelet` into their pods with `rshared` mount propagation. They must mount volumes there, and kubelet + the container runtime must see them there. Any change in this behavior will likely break storage in OCP clusters.

What will a pod with a `/var/lib/kubelet` HostPath volume actually see? The systemd or the kubelet mount namespace?
@giuseppe : Can you give me a little more context or an example of how Podman/Skopeo/Buildah normally run in conjunction with OpenShift? If they set up storage in the host mount namespace this will be propagated down to the container mount namespace (because it's a slave of the host mount namespace); so I'm unsure how much of a concern this is.
I think a potential issue is with `/var/lib/containers/storage/overlay`, as it is always mounted `MS_PRIVATE` and it is not propagated to different namespaces.
This idea is very similar to rootless `podman unshare`, which executes the specified command in a different mount (and, in this case, user) namespace where all the containers are running. For rootless it is a necessity, because it needs a user namespace to gain capabilities.

I think in the future we could teach other container tools to check whether there are images/containers mounted in a different mount namespace and try to join it, or at least raise a warning.
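For anyone not familiar with it, `podman unshare` behaves roughly like this (a sketch; the storage path below assumes the default rootless location):

```sh
# Run commands inside rootless podman's user + mount namespace: the
# unprivileged user shows up as uid 0 there, and rootless container
# storage mounts are visible there rather than in the host namespace.
podman unshare cat /proc/self/uid_map
podman unshare grep containers/storage /proc/self/mountinfo
```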
@giuseppe: When you mention `/var/lib/containers/storage/overlay`, I'd love to learn more about that filesystem and how it's used and shared between various tools, especially since it's mounted `MS_PRIVATE`.

If you don't mind taking the time to explain, I'd like to understand more about which process mounts the filesystem there, and which other processes expect to be able to act on it.
Any container tool based on containers/storage (so all of CRI-O, Podman, Skopeo, Buildah).

The first access to the storage makes sure that `/var/lib/containers/storage/$DRIVER` is a mount point and that it is made `MS_PRIVATE`. All containers and images are mounted under this directory. There is also some cleanup performed when there are no more containers and images mounted.

The reason for `MS_PRIVATE` is to prevent each mountpoint from being propagated to the mount namespaces of every newly created container, and to avoid the cost of unmounting them.
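A quick way to see that on a node (this assumes the overlay driver and the default storage root):

```sh
# The storage driver directory is bind-mounted onto itself and marked
# private, so per-container overlay mounts created underneath it are not
# copied into the mount namespace of every newly created container.
findmnt -o TARGET,SOURCE,PROPAGATION /var/lib/containers/storage/overlay
```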
+1, we saw in the single-node OpenShift flow that systemd consumes significant CPU resources.
> 1819868](https://bugzilla.redhat.com/show_bug.cgi?id=1819868), but we can work
> around this exclusively within OpenShift. Using a separate mountpoint for
> both CRI-O and kubelet mounts can completely segregate all container-specific
> mounts away from any systemd interaction whatsoever.
Kubelet uses systemd to set up mounts. Any impact to that?
Can you expand a bit more on what you mean by this? What is the current kubelet/systemd interaction like?
In OCP 3.x we allowed kubelet to run in a Docker container. Therefore it called `nsenter` to escape the container's mount namespace (mount propagation was not used at that time) and `systemd-run --scope -- mount -t <type> <what> <where>` to run any potential FUSE mount daemons outside of kubelet's pid namespace / kubelet service cgroup.

`nsenter` was removed, but we still drag along `systemd-run ... mount <what> <where>` to run FUSE daemons outside of kubelet's cgroup; otherwise `systemctl restart kubelet.service` would kill them.
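For illustration, the kind of invocation described above; the filesystem type, server address, and target path here are hypothetical placeholders:

```sh
# The mount helper (and any FUSE daemon it forks) runs in a transient
# systemd scope, i.e. outside kubelet.service's cgroup, so restarting
# kubelet does not kill it.  POD_UID is a placeholder value.
POD_UID=00000000-0000-0000-0000-000000000000
systemd-run --scope -- mount -t nfs 192.0.2.10:/export \
  "/var/lib/kubelet/pods/${POD_UID}/volumes/kubernetes.io~nfs/myvol"
```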
BTW, this means that there is a short-lived systemd service that does the mount and dies. We do not support any FUSE-based storage in 4.x; upstream Kubernetes does, though. I don't think it affects how systemd handles mounts.
@giuseppe Could this be related to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1903553 and how systemd handles mount points?
@zvonkok: Can you talk to me a bit more about the need for the bidirectional mount in `/run/nvidia/driver`? I get the sense that this is mounted by a container and intended to be seen outside that container. Are the contents needed by other containers? By some piece of software running on the host outside of the container runtime?
@zvonkok hiding such mounts from systemd could probably help with the issue you are seeing.
@lack The situation I have is a Pod running privileged + rbind + systemd in the container. Now if I delete the container, systemd stops all services in the container but also on the node, and the node is rendered NotReady; only a reboot can "save" it. If I use the Pod without the rbind (meaning privileged + systemd), it works as expected. It looks like the rbind is introducing something that throws off systemd. I will test your POC on my cluster, see if it helps or not, and update the BZ. /cc @giuseppe
Ah, that's interesting... if the cause is an instance of systemd beneath the container runtime namespace that gets over-eager because it sees state from the top-level systemd, my proposal may not be the help you need; I think the rbind will still see the state that's properly owned by the parent systemd, so the container systemd would still do too much.
If the cause is the top-level systemd seeing state it shouldn't from mountpoints owned by the container runtime or the container itself, then my fix might be helpful, since systemd won't see anything under mountpoints created by the container runtime or containers themselves.
@jsafrane: Coming back to your original comment regarding `systemd-run`: if kubelet uses `systemd-run` to mount a filesystem, that mount will end up happening in the top-level systemd mount namespace, which will then be propagated down and seen by CRI-O/kubelet/etc. because their namespace is a 'slave' of that top-level mount namespace. So this proposal doesn't interfere with that working; in fact, that method of mounting a filesystem circumvents this proposal's mechanism and makes such mounts visible to systemd and the host OS again.
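A minimal sketch of that 'slave' relationship, independent of kubelet/CRI-O (the `/mnt/demo` tmpfs is made up for the demo):

```sh
# Terminal A: enter a mount namespace whose mounts are slaves of the host's.
sudo unshare --mount --propagation slave bash

# Terminal B (host mount namespace): create a new mount, as systemd-run
# effectively does for kubelet's volume mounts.
sudo mkdir -p /mnt/demo
sudo mount -t tmpfs demo /mnt/demo

# Terminal A again: the host's mount has propagated down and is visible here.
findmnt /mnt/demo
```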
@romfreiman: Some ad-hoc testing we did with my workaround shows a noticeable reduction in systemd CPU usage. No hard numbers yet.
cc @jsafrane
> 1. What is the reason for the Kubernetes test? Is it okay to just skip or
>    disable the test in OpenShift?
Is it the mount propagation test? I wrote the test, together with mount propagation support in Kubernetes.

The documentation says:
// MountPropagationBidirectional means that the volume in a container will
// receive new mounts from the host or other containers, and its own mounts
// will be propagated from the container to the host or other containers.
// Note that this mode is recursively applied to all mounts in the volume
// ("rshared" in Linux terminology).
It's not specified there what the "host" is. The test defines it as something that's available under `nsenter --mount=/rootfs/proc/1/ns/mnt ...`, and that's what Kubernetes has guaranteed so far. Changing this behavior may have consequences that we do not know about. I mentioned CSI drivers above, but we never know what people use mount propagation for.
Thank you for that information!
Yes, that is the test I was concerned about, because it implied some kind of relationship between that host mount namespace and the container's. Basically, instead of being under `/rootfs/proc/1/ns/mnt`, kubelet and the container runtime would be under a slightly different namespace, one level below that one.
So yes, this proposal would intentionally change that guarantee, but not all of it.
What would stay the same:
- The volume in a container will receive new mounts from other containers
- Its own mounts will be propagated from the container to other containers
What I'm pretty sure we can keep, though I have to double-check whether my current proof-of-concept does this today:
- The volume in a container will receive new mounts from the host
What would necessarily change:
- Its own mounts would no longer be propagated up to the host
I'll have to do some further experiments with my proof-of-concept to be sure, but I suspect the CSI case you mentioned should be okay, since it's in a container and therefore already under the lower-level umbrella namespace, thus will continue to interoperate fine with any other containers also rooted in that namespace.
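The part that changes can be demonstrated outside of Kubernetes with plain namespaces (a sketch; paths are made up):

```sh
# Terminal A: a slave mount namespace standing in for the kubelet/CRI-O one.
sudo unshare --mount --propagation slave bash
mkdir -p /mnt/inner
mount -t tmpfs inner /mnt/inner
findmnt /mnt/inner            # visible inside this namespace

# Terminal B (host mount namespace): the mount did not propagate up,
# because slave mounts do not send events back to their master.
findmnt /mnt/inner || echo "not visible in the host mount namespace"
```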
> What would necessarily change:
>
> - Its own mounts would no longer be propagated up to the host
As you wrote, CSI drivers might be fine with that - we can test the ones that we ship and we could give 3rd party vendors some time to test theirs.
Still, it's a significant change from upstream. Everyone who uses mount propagation for anything other than `/var/lib/kubelet` will be broken. cc @openshift/openshift-architects
> Its own mounts would no longer be propagated up to the host
Give me some hypothetical examples of things that go into mounts that admins on the host wouldn't be able to see?
- Mounted cloud disks?
- Empty dirs?
- Host-level containers that create a directory in a host dir, such as an agent that creates logs on the host?

The first two, if true, would be frustrating but not the end of the world for admins. The last one would be a blocker if true. If it is, is there a way to a) identify workloads where sharing with the host is important and b) separate those pods from regular workload pods when mounting?
I don't think that's a big problem:

- We can document for admins how to `nsenter` to the right namespace to see the mounts (see the sketch below), and any pod created by CRI-O would be in the same mount namespace, so it will see the mounts (incl. `oc debug`, etc.)
- If in the future we ever want to start each pod in its own mount namespace, we can add a privileged flag to inherit the CRI-O namespace, so privileged pods would still be able to see that
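For example, the admin documentation could contain something along these lines (how the namespace is located, via `crio.service`'s main PID, is just one illustrative assumption):

```sh
# Join the mount namespace that CRI-O (and kubelet) run in, and list the
# kubelet/container mounts that are hidden from the host namespace.
CRIO_PID=$(systemctl show --property MainPID --value crio.service)
sudo nsenter --target "$CRIO_PID" --mount -- findmnt | grep -E 'kubelet|containers'
```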
After some conversation with @zvonkok, in the case of the special-resource-operator, the rshared mount of /run is specifically so that OCI hooks (like prestart) can mount things under /run that are then consumed by other containers. And because OCI hooks run inside the container runtime namespace (aka, a process somewhere within the process tree of CRI-O), within my proposal the mount will end up in the lower-level namespace and still be visible to all containers. It seems based on my understanding so far that the host doesn't actually need access to the mountpoints, as long as the OCI prestart hook and containers have common access to them.
I've updated the proposal to explicitly mention that OCI hooks will run inside this container-mount-namespace.
I have also just posted a summary of this proposal to a few mailing lists to see if there are any specific use-cases or solutions people know about that this proposal would interfere with.
Understood! I am now taking this discussion to sig-node, sig-storage, and the upstream community for comments.
/priority important-soon
I very much like the overall idea, and the implementation seems to be decent, too. It might (and will) break some setups in which people expect the container mounts to be visible on the host (which seems trivial to fix -- use the same nsenter wrapper as for k8s and cri-o). For this reason (of breakage) perhaps it is better to make this change in the upstream k8s and cri-o, rather than by using an operator.
> approach it another way:
> 1. Should these services and overrides be installed by MCO?
> 2. If so, should they be part of the main `00-worker` profile, or in a
>    separate `02-worker-container-mount-namespace` object?
For the MCO part, this would depend on whether this is going to be the default behavior for OpenShift. If it is, then we should use templates for the various service files and it will be part of the `00-worker` MC. If we are giving the user a knob to enable/disable this, then it should be a separate MC like `02-worker-container-mount-namespace`.
I would suggest that it should be enabled by default but can be disabled (in case there are use cases where this mount hiding is undesirable).

In the MCO world, what's the best way to have a feature toggle for something like this? Is it enough for the MCO to create this `02-worker-container-mount-namespace` object by default so that an admin could manually delete it?
Yes, the MCO can create the `02-worker-container-mount-namespace` MC and an admin could manually delete it. But once deleted, this MC will not be recreated unless the admin decides to manually recreate it. If you want the MCO to automatically create and delete this based on a toggle, then we will have to add logic to one of the controllers in the MCO to watch for the toggle and create/delete accordingly.
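In that model the toggle is simply the presence of the object; assuming the name discussed above, roughly:

```sh
# Feature stays "on" while the MachineConfig exists; deleting it turns it
# off, and the MCO will not recreate it automatically.
oc get machineconfig 02-worker-container-mount-namespace
oc delete machineconfig 02-worker-container-mount-namespace
```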
That's great! Do you have any suggestion of which controller in MCO would be best suited to have this toggle? I'd be happy to try my hand at coding it up.
@kolyshkin: Thanks for the comments; I have two questions for you here:
> - Enhance systemd to support unsharing namespaces at the slice level, then put
>   `crio.service` and `kubelet.service` in the same slice
> - Alter both CRI-O and Kubelet executables to take a mount namespace via
I prefer that for CRI-O we support joining the mount namespace as an option. We should be able to make it work with an init in C similar to runc.
The benefit of sticking with the 'nsenter' mechanism is that it would be the same for kubelet, CRI-O, and easily adaptable to any other container runtime, but I am happy to consider natively supporting mount-namespace entering for CRI-O as well. It's just more work.
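To make that trade-off concrete, here is roughly what the nsenter mechanism looks like; the namespace path, setup commands, and drop-in contents below are an illustrative sketch of the approach, not the exact units from this proposal:

```sh
# One-time setup: pin an unshared, slave-propagation mount namespace to a
# file so several independent services can join the same namespace later.
# (The bind/make-private step is needed because a namespace file cannot be
# bind-mounted onto a target that lives on a shared mount.)
sudo mkdir -p /run/container-mount-namespace
sudo mount --bind /run/container-mount-namespace /run/container-mount-namespace
sudo mount --make-private /run/container-mount-namespace
sudo touch /run/container-mount-namespace/mnt
sudo unshare --mount=/run/container-mount-namespace/mnt --propagation slave true

# Each service then gets a drop-in that re-runs its original command inside
# that namespace, e.g.:
#   [Service]
#   ExecStart=
#   ExecStart=/usr/bin/nsenter --mount=/run/container-mount-namespace/mnt /usr/bin/crio
# The same wrapper pattern works unchanged for kubelet or any other runtime.
sudo nsenter --mount=/run/container-mount-namespace/mnt findmnt --kernel | head
```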
I think that if we do proceed with this, once we know that we don't have edge cases we worry about, it should still be treated as something that we back out once the kernel / systemd have the fixes. Could that be captured in the enhancement?
@mrunalp: I am not sure that making this temporary is necessarily the right option. Even once systemd/the kernel are fixed and this isn't needed for CPU-utilization purposes, I think there's a good case to be made for keeping it, both for tidiness/encapsulation reasons (i.e., the mounts owned by OpenShift shouldn't be needed by the host) and for security (if unprivileged users aren't allowed to see the container-specific mountpoint locations, this removes one source of potentially damaging knowledge about container contents). That said, I do think it would make sense for this to be optional within OpenShift, perhaps via some trigger in MCO which would be on by default but could be turned off to restore the older system behavior at the behest of an administrator. Would this be a reasonable compromise to your request?
That sounds good to have an escape hatch if needed 👍
I will be reviving this shortly. I've been working with the upstream sig-node team to make sure this idea doesn't break any present or future assumptions in upstream Kubernetes.
/remove-lifecycle rotten
So it turns out this proposed mechanism likely prevents containers (even privileged containers) from exploiting CVE-2021-33910!
@openshift-bot: Closed this PR.
/reopen

I'm reviving this, given the request from customers for this feature, as well as some time spent on revamping the proposal to be a bit clearer.
@lack: Failed to re-open PR: state cannot be changed. The hide_container_mountpoints branch was force-pushed or recreated.
This is my first attempt at an OpenShift enhancement; all comments and suggestions are welcome!