Add proposal for hiding container mountpoints from systemd #593
Conversation
> ### Risks and Mitigations
>
> Any other external or 3rd-party tools would need to change to match.
A field should be added in the open questions to investigate who may be affected by this.
Yeah, the big worry is about anything external that cares about the mounts.
I thought we hit a perf issue with these mounts being inspected in the systemd cadvisor handler. I am not aware of who actually cares about the mounts in question at this point...
Podman/Skopeo/Buildah will also need to run in the same mount namespace, otherwise they will end up setting up the storage in the host mount namespace, possibly causing corruption of the storage.
CSI driver pods / containers typically mount `/var/lib/kubelet` into their pods with `rshared` mount propagation. They must mount volumes there, and kubelet + the container runtime must see them there. Any change in this behavior will likely break storage in OCP clusters.

What will a pod with a `/var/lib/kubelet` HostPath volume actually see? The systemd or the kubelet mount namespace?
@giuseppe : Can you give me a little more context or an example of how Podman/Skopeo/Buildah normally run in conjunction with OpenShift? If they set up storage in the host mount namespace this will be propagated down to the container mount namespace (because it's a slave of the host mount namespace); so I'm unsure how much of a concern this is.
I think a potential issue is with `/var/lib/containers/storage/overlay`, as it is always mounted `MS_PRIVATE` and it is not propagated to different namespaces.
This idea is very similar to rootless `podman unshare`, which executes the specified command in a different mount (and, in this case, user) namespace where all the containers are running. For rootless it is a necessity, because it needs a user namespace to gain capabilities.

I think in the future we could teach other container tools to check whether there are images/containers mounted in a different mount namespace and try to join it, or at least raise a warning.
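For anyone not familiar with it, `podman unshare` behaves roughly like this (a sketch; the storage path below assumes the default rootless location):

```sh
# Run commands inside rootless podman's user + mount namespace: the
# unprivileged user shows up as uid 0 there, and rootless container
# storage mounts are visible there rather than in the host namespace.
podman unshare cat /proc/self/uid_map
podman unshare grep containers/storage /proc/self/mountinfo
```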
@giuseppe: When you mention `/var/lib/containers/storage/overlay`, I'd love to learn more about that filesystem and how it's used and shared between various tools, especially since it's mounted `MS_PRIVATE`.

If you don't mind taking the time to explain, I'd like to understand more about which process mounts the filesystem there, and which other processes expect to be able to act on it.
Any container tool based on containers/storage (so all of CRI-O, Podman, Skopeo, Buildah).

The first access to the storage makes sure that `/var/lib/containers/storage/$DRIVER` is a mount point and that it is made `MS_PRIVATE`. All containers and images are mounted under this directory. There is also some cleanup performed when there are no more containers and images mounted.

The reason for `MS_PRIVATE` is to prevent each mountpoint from being propagated to the mount namespaces of every newly created container, and to avoid the cost of unmounting them.
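A quick way to see that on a node (this assumes the overlay driver and the default storage root):

```sh
# The storage driver directory is bind-mounted onto itself and marked
# private, so per-container overlay mounts created underneath it are not
# copied into the mount namespace of every newly created container.
findmnt -o TARGET,SOURCE,PROPAGATION /var/lib/containers/storage/overlay
```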
+1, we saw in the single-node OpenShift flow that systemd consumes significant CPU resources.
> 1819868](https://bugzilla.redhat.com/show_bug.cgi?id=1819868), but we can work
> around this exclusively within OpenShift. Using a separate mountpoint for
> both CRI-O and kubelet mounts can completely segregate all container-specific
> mounts away from any systemd interaction whatsoever.
Kubelet uses systemd to set up mounts. Any impact to that?
Can you expand a bit more on what you mean by this? What is the current kubelet/systemd interaction like?
In OCP 3.x we allowed kubelet to run in a Docker container. Therefore it called `nsenter` to escape the container's mount namespace (mount propagation was not used at that time) and `systemd-run --scope -- mount -t <type> <what> <where>` to run any potential FUSE mount daemons outside of kubelet's pid namespace / kubelet service cgroup.

`nsenter` was removed, but we still drag along `systemd-run ... mount <what> <where>` to run FUSE daemons outside of kubelet's cgroup; otherwise `systemctl restart kubelet.service` would kill them.
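For illustration, the kind of invocation described above; the filesystem type, server address, and target path here are hypothetical placeholders:

```sh
# The mount helper (and any FUSE daemon it forks) runs in a transient
# systemd scope, i.e. outside kubelet.service's cgroup, so restarting
# kubelet does not kill it.  POD_UID is a placeholder value.
POD_UID=00000000-0000-0000-0000-000000000000
systemd-run --scope -- mount -t nfs 192.0.2.10:/export \
  "/var/lib/kubelet/pods/${POD_UID}/volumes/kubernetes.io~nfs/myvol"
```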
BTW, this means that there is a short-lived systemd service that does the mount and dies. We do not support any FUSE-based storage in 4.x; upstream Kubernetes does, though. I don't think it affects how systemd handles mounts.
@giuseppe Could this be related to BZ https://bugzilla.redhat.com/show_bug.cgi?id=1903553 and how systemd handles mount points?
@zvonkok: Can you talk to me a bit more about the need for the bidirectional mount in `/run/nvidia/driver`? I get the sense that this is mounted by a container and intended to be seen outside that container. Are the contents needed by other containers? By some piece of software running on the host outside of the container runtime?
@zvonkok hiding such mounts from systemd could probably help with the issue you are seeing.
@lack The situation I have is a Pod running privileged + rbind + systemd in the container. Now if I delete the container, systemd stops all services in the container but also on the node, and the node is rendered NotReady; only a reboot can "save" it. If I use the Pod without the rbind (meaning privileged + systemd), it works as expected. It looks like the rbind is introducing something that throws off systemd. I will test your POC on my cluster, see if it helps or not, and update the BZ. /cc @giuseppe
Ah, that's interesting... if the cause is an instance of systemd beneath the container runtime namespace that gets over-eager because it sees state from the top-level systemd, my proposal may not be the help you need; I think the rbind will still see the state that's properly owned by the parent systemd, so the container systemd would still do too much.
If the cause is the top-level systemd seeing state it shouldn't from mountpoints owned by the container runtime or the container itself, then my fix might be helpful, since systemd won't see anything under mountpoints created by the container runtime or containers themselves.
@jsafrane: Coming back to your original comment regarding `systemd-run`: if kubelet uses `systemd-run` to mount a filesystem, that mount will end up happening in the top-level systemd mount namespace, which will then be propagated down and seen by CRI-O/kubelet/etc. because their namespace is a 'slave' of that top-level mount namespace. So this proposal doesn't interfere with that working; in fact, that method of mounting a filesystem circumvents this proposal's mechanism and makes such mounts visible to systemd and the host OS again.
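A minimal sketch of that 'slave' relationship, independent of kubelet/CRI-O (the `/mnt/demo` tmpfs is made up for the demo):

```sh
# Terminal A: enter a mount namespace whose mounts are slaves of the host's.
sudo unshare --mount --propagation slave bash

# Terminal B (host mount namespace): create a new mount, as systemd-run
# effectively does for kubelet's volume mounts.
sudo mkdir -p /mnt/demo
sudo mount -t tmpfs demo /mnt/demo

# Terminal A again: the host's mount has propagated down and is visible here.
findmnt /mnt/demo
```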
@romfreiman: Some ad-hoc testing we did with my workaround shows a noticeable reduction in systemd CPU usage. No hard numbers yet.
cc @jsafrane
> 1. What is the reason for the Kubernetes test? Is it okay to just skip or
>    disable the test in OpenShift?
Is it the mount propagation test? I wrote the test, together with mount propagation support in Kubernetes.

The documentation says:
// MountPropagationBidirectional means that the volume in a container will
// receive new mounts from the host or other containers, and its own mounts
// will be propagated from the container to the host or other containers.
// Note that this mode is recursively applied to all mounts in the volume
// ("rshared" in Linux terminology).
It's not specified there what the "host" is. The test defines it as something that's available under `nsenter --mount=/rootfs/proc/1/ns/mnt ...`, and that's what Kubernetes has guaranteed so far. Changing this behavior may have consequences that we do not know about. I mentioned CSI drivers above, but we never know what people use mount propagation for.
Thank you for that information!
Yes, that is the test I was concerned about, because it implied some kind of relationship between that host mount namespace and the container's. Basically, instead of being under `/rootfs/proc/1/ns/mnt`, kubelet and the container runtime would be under a slightly different namespace, one level below that one.
So yes, this proposal would intentionally change that guarantee, but not all of it.
What would stay the same:
- The volume in a container will receive new mounts from other containers
- Its own mounts will be propagated from the container to other containers
What I'm pretty sure we can keep, though I have to double-check whether my current proof-of-concept does this today:
- The volume in a container will receive new mounts from the host
What would necessarily change:
- Its own mounts would no longer be propagated up to the host
I'll have to do some further experiments with my proof-of-concept to be sure, but I suspect the CSI case you mentioned should be okay, since it's in a container and therefore already under the lower-level umbrella namespace, thus will continue to interoperate fine with any other containers also rooted in that namespace.
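The part that changes can be demonstrated outside of Kubernetes with plain namespaces (a sketch; paths are made up):

```sh
# Terminal A: a slave mount namespace standing in for the kubelet/CRI-O one.
sudo unshare --mount --propagation slave bash
mkdir -p /mnt/inner
mount -t tmpfs inner /mnt/inner
findmnt /mnt/inner            # visible inside this namespace

# Terminal B (host mount namespace): the mount did not propagate up,
# because slave mounts do not send events back to their master.
findmnt /mnt/inner || echo "not visible in the host mount namespace"
```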
> What would necessarily change:
>
> - Its own mounts would no longer be propagated up to the host
As you wrote, CSI drivers might be fine with that - we can test the ones that we ship and we could give 3rd party vendors some time to test theirs.
Still, it's a significant change from upstream. Everyone who uses mount propagation for anything other than `/var/lib/kubelet` will be broken. cc @openshift/openshift-architects
> Its own mounts would no longer be propagated up to the host
Give me some hypothetical examples of things that go into mounts that admins on the host wouldn't be able to see?
- Mounted cloud disks?
- Empty dirs?
- Host-level containers that create a directory in a host dir, such as an agent that creates logs on the host?

The first two, if true, would be frustrating but not the end of the world for admins. The last one would be a blocker if true. If it is, is there a way to a) identify workloads where sharing with the host is important and b) separate those pods from regular workload pods when mounting?
I don't think that's a big problem:

- We can document for admins how to `nsenter` to the right namespace to see the mounts (see the sketch below), and any pod created by CRI-O would be in the same mount namespace, so it will see the mounts (incl. `oc debug`, etc.)
- If in the future we ever want to start each pod in its own mount namespace, we can add a privileged flag to inherit the CRI-O namespace, so privileged pods would still be able to see that
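For example, the admin documentation could contain something along these lines (how the namespace is located, via `crio.service`'s main PID, is just one illustrative assumption):

```sh
# Join the mount namespace that CRI-O (and kubelet) run in, and list the
# kubelet/container mounts that are hidden from the host namespace.
CRIO_PID=$(systemctl show --property MainPID --value crio.service)
sudo nsenter --target "$CRIO_PID" --mount -- findmnt | grep -E 'kubelet|containers'
```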
After some conversation with @zvonkok, in the case of the special-resource-operator, the rshared mount of /run is specifically so that OCI hooks (like prestart) can mount things under /run that are then consumed by other containers. And because OCI hooks run inside the container runtime namespace (aka, a process somewhere within the process tree of CRI-O), within my proposal the mount will end up in the lower-level namespace and still be visible to all containers. It seems based on my understanding so far that the host doesn't actually need access to the mountpoints, as long as the OCI prestart hook and containers have common access to them.
I've updated the proposal to explicitly mention that OCI hooks will run inside this container-mount-namespace.
I have also just posted a summary of this proposal to a few mailing lists to see if there are any specific use-cases or solutions people know about that this proposal would interfere with.
Understood! I am now taking this discussion to sig-node, sig-storage, and the upstream community for comments.
/priority important-soon
I very much like the overall idea, and the implementation seems to be decent, too. It might (and will) break some setups in which people expect the container mounts to be visible on the host (which seems trivial to fix -- use the same nsenter wrapper as for k8s and cri-o). For this reason (of breakage) perhaps it is better to make this change in the upstream k8s and cri-o, rather than by using an operator.
> approach it another way:
> 1. Should these services and overrides be installed by MCO?
> 2. If so, should they be part of the main `00-worker` profile, or in a
>    separate `02-worker-container-mount-namespace` object?
For the MCO part, this would depend on whether this is going to be the default behavior for OpenShift. If it is, then we should use templates for the various service files and it will be part of the `00-worker` MC. If we are giving the user a knob to enable/disable this, then it should be a separate MC like `02-worker-container-mount-namespace`.
I would suggest that it should be enabled by default but can be disabled (in case there are use cases where this mount hiding is undesirable).

In the MCO world, what's the best way to have a feature toggle for something like this? Is it enough for the MCO to create this `02-worker-container-mount-namespace` object by default so that an admin could manually delete it?
Yes, the MCO can create the `02-worker-container-mount-namespace` MC and an admin could manually delete it. But once deleted, this MC will not be recreated unless the admin decides to manually recreate it. If you want the MCO to automatically create and delete this based on a toggle, then we will have to add logic to one of the controllers in the MCO to watch for the toggle and create/delete accordingly.
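In that model the toggle is simply the presence of the object; assuming the name discussed above, roughly:

```sh
# Feature stays "on" while the MachineConfig exists; deleting it turns it
# off, and the MCO will not recreate it automatically.
oc get machineconfig 02-worker-container-mount-namespace
oc delete machineconfig 02-worker-container-mount-namespace
```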
That's great! Do you have any suggestion of which controller in MCO would be best suited to have this toggle? I'd be happy to try my hand at coding it up.
@kolyshkin: Thanks for the comments; I have two questions for you here:
> - Enhance systemd to support unsharing namespaces at the slice level, then put
>   `crio.service` and `kubelet.service` in the same slice
> - Alter both CRI-O and Kubelet executables to take a mount namespace via
I prefer that for CRI-O we support joining the mount namespace as an option. We should be able to make it work with an init in C similar to runc.
The benefit of sticking with the 'nsenter' mechanism is that it would be the same for kubelet, CRI-O, and easily adaptable to any other container runtime, but I am happy to consider natively supporting mount-namespace entering for CRI-O as well. It's just more work.
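To make that trade-off concrete, here is roughly what the nsenter mechanism looks like; the namespace path, setup commands, and drop-in contents below are an illustrative sketch of the approach, not the exact units from this proposal:

```sh
# One-time setup: pin an unshared, slave-propagation mount namespace to a
# file so several independent services can join the same namespace later.
# (The bind/make-private step is needed because a namespace file cannot be
# bind-mounted onto a target that lives on a shared mount.)
sudo mkdir -p /run/container-mount-namespace
sudo mount --bind /run/container-mount-namespace /run/container-mount-namespace
sudo mount --make-private /run/container-mount-namespace
sudo touch /run/container-mount-namespace/mnt
sudo unshare --mount=/run/container-mount-namespace/mnt --propagation slave true

# Each service then gets a drop-in that re-runs its original command inside
# that namespace, e.g.:
#   [Service]
#   ExecStart=
#   ExecStart=/usr/bin/nsenter --mount=/run/container-mount-namespace/mnt /usr/bin/crio
# The same wrapper pattern works unchanged for kubelet or any other runtime.
sudo nsenter --mount=/run/container-mount-namespace/mnt findmnt --kernel | head
```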
I think that if we do proceed with this, once we know that we don't have edge cases we worry about, it should still be treated as something that we back out once the kernel / systemd have the fixes. Could that be captured in the enhancement?
@mrunalp: I am not sure that making this temporary is necessarily the right option. Even once systemd/the kernel are fixed and this isn't needed for CPU-utilization purposes, I think there's a good case to be made for keeping it, both for tidiness/encapsulation reasons (i.e., the mounts owned by OpenShift shouldn't be needed by the host) and for security (if unprivileged users aren't allowed to see the container-specific mountpoint locations, this removes one source of potentially damaging knowledge about container contents). That said, I do think it would make sense for this to be optional within OpenShift, perhaps via some trigger in MCO which would be on by default but could be turned off to restore the older system behavior at the behest of an administrator. Would this be a reasonable compromise to your request?
That sounds good to have an escape hatch if needed 👍
I will be reviving this shortly. I've been working with the upstream sig-node team to make sure this idea doesn't break any present or future assumptions in upstream Kubernetes.
/remove-lifecycle rotten
So it turns out this proposed mechanism likely prevents containers (even privileged containers) from exploiting CVE-2021-33910!
@openshift-bot: Closed this PR.
/reopen

I'm reviving this, given the request from customers for this feature, as well as some time spent on revamping the proposal to be a bit clearer.
@lack: Failed to re-open PR: state cannot be changed. The hide_container_mountpoints branch was force-pushed or recreated.
This is my first attempt at an OpenShift enhancement; all comments and suggestions are welcome!