-
Notifications
You must be signed in to change notification settings - Fork 349
[1.3 - DO NOT MERGE] User namespace support #1527
[1.3 - DO NOT MERGE] User namespace support #1527
Conversation
Introduce two new configuration options to specify the container to host uid/gid mapping.
…n, User Initial work by vikaschoudhary16 <vichoudh@redhat.com> on Kubernetes PR 64005.
- The sandbox container (aka "pause" container) has a tmpfs mount on /dev/shm. Bind mount it with nosuid, noexec, nodev because the mount would not be allowed in user namespaces otherwise. - Move the creation of the network namespace after the sandbox has been created: Before this patch, containerd created a netns, configured it with CNI, and then creates the sandbox container by giving the netns path previously setup. This means that the netns was owned by the host userns. Mounting sysfs in the container is restricted in this setup. This patch sets up the netns in the other way around instead: it creates the sandbox container, letting runc create a new netns. Then, it picks the new netns from /proc/$pid/ns/net, binds mount it in the usual CNI path and then gives it to CNI to configure. This means that the netns is owned by the userns of the sandbox container. In this way, mounting sysfs is possible. For more information about namespace ownership, see - man ioctl_ns - man user_namespaces, section "Interaction of user namespaces and other types of namespaces" - Linux commit 7dc5dbc879bd ("sysfs: Restrict mounting sysfs") torvalds/linux@7dc5dbc#diff-4839664cd0c8eab716e064323c7cd71fR1164 - net_current_may_mount() used for mounting sysfs: ns_capable(net->user_ns, CAP_SYS_ADMIN); https://github.com/torvalds/linux/blob/v5.7/net/core/net-sysfs.c#L1679
runc needs to bind mount files from /var/lib/kubelet/pods/... (such as etc-hosts) into the container. When using user namespaces, the bind mount didn't work anymore when containerd is started from a systemd unit. This patch fixes that by adding SupplementaryGroups=0 runc needs to have permission on the directory to stat() the source of the bind mount. Without user namespaces, this is not a problem since runc is running as root, so it has 'rwx' permissions over the directory: drwxr-x---. 8 root root 4096 May 28 18:05 /var/lib/kubelet Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount phase happens before giving up the additional permissions. However, when using user namespaces, the runc process is belonging to a different user than root (depending on the mapping). /var/lib/kubelet is seen as belonging to the special unmapped user (65534, nobody). runc does not have 'rwx' permissions anymore but the empty '---' permission for 'other'. CAP_DAC_OVERRIDE is also no effective because the kernel performs the capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE). This checks that the owner of the /var/lib/kubelet is mapped in the current user namespace, which is not the case. Despite that, bind mounting /var/lib/kubelet/pods/...etc-hosts was working when containerd was started manually with 'sudo' but not working when started from a systemd unit. The difference is how supplementary groups are handled between sudo and systemd units: systemd does not set supplementary groups by default. $ sudo grep -E 'Groups:|Uid:|Gid:' /proc/self/status Uid: 0 0 0 0 Gid: 0 0 0 0 Groups: 0 $ sudo systemd-run -t grep -E 'Groups:|Uid:|Gid:' /proc/self/status Running as unit: run-u296886.service Press ^] three times within 1s to disconnect TTY. Uid: 0 0 0 0 Gid: 0 0 0 0 Groups: When runc has the supplementary group 0 configured, it is retained during the bind-mount phase, even though it is an unmapped group (runc temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc effectively has the 'r-x' permissions over /var/lib/kubelet. This makes the bind mount of etc-hosts work. After the mount phase, runc will set the credential correctly (following OCI's config.json specification), so the container will not retain this unmapped supplementary group. I rely on the systemd unit file being configured correctly with SupplementaryGroups=0 and I don't attempt to set it up automatically with syscall.Setgroups() because "at the kernel level, user IDs and group IDs are a per-thread attribute" (man setgroups) and the way Golang uses threads make it difficult to predict which thread is going to be used to execute runc. glibc's setgroup() is a wrapper that changes the credentials for all threads but Golang does not use the glibc implementation.
@mauriciovasquezbernal: GitHub didn't allow me to request PR reviews from the following users: alban, rata. Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hi @mauriciovasquezbernal. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
? we would not merge a PR that is in [wip]/hold against the contributor's wishes... |
I don't mean that we based it on release/1.3 to avoid merging. What I wanted to say is that it's not a problem to be based on release/1.3 because we don't want to merge it... |
It is same as https://github.com/rootless-containers/usernetes? And maybe kubernetes/enhancements#1371 should replace Kubernetes Node-Level User Namespaces Design Proposal |
@zhsj It is not the same: while both use user namespaces as underlying technology, they achieve different things:
Both are useful, depending on the use cases. |
What's current status? |
We're waiting reviews on the Kubernetes KEP in order to define the CRI changes. |
Can we implement the "io.kubernetes.cri-o.userns-mode" convention until the KEP settles, or do we want to wait until KEP settles? |
Will this allow a user to start an unprivileged kubernetes pod, and then use e.g. Singularity inside their pod to create a nested container? |
I'd like to get the KEP merged first and then implement this support in containerd. I fear that we would have a lot of heterogeneous implementations if we start implementing this support in the different runtimes without having a proper specification to do it. I also agree that people needing to have this support quickly could implement such mechanism to avoid waiting on Kubernetes on the meanwhile. |
Yes, that should work. User namespaces used in this context allow to run pods that require elevated privileges in a safer way. If you give the pod the correct capabilities to create containers and set correct k8s settings like https://kubernetes.io/docs/concepts/policy/pod-security-policy/#allowedprocmounttypes that should be fine. However it's something we haven't tried and there could be unknown issues. |
This is the containerd/cri implementation for the Kubernetes Node-Level User Namespaces Design Proposal. The patches are based on the
release/1.3
branch. It is tested on Kubernetes 1.17 with patches adapted from PR 64005 (kinvolk/kubernetes#4).The purpose of this PR is to gather some early feedback from the community on this feature to start the discussion again. This PR is based on the
release/1.3
branch, we don't intend to merge it as it is so it should not be a problem. We are planning to create / update a KEP to have a proper discussion about the design of this feature.The main changes are:
GetRuntimeConfigInfo
returning the configured mappings.WithRemappedSnapshot
snapshotter.WithoutNamespace
,WithUserNamespace
,WithLinuxNamespace
at the right places.At the OCI level (config.json), we have the following changes:
Demo:
Minimal example of the containerd configuration file:
/cc @alban @rata