[1.3 - DO NOT MERGE] User namespace support #1527

mauriciovasquezbernal · 2020-07-07T15:50:33Z

This is the containerd/cri implementation for the Kubernetes Node-Level User Namespaces Design Proposal. The patches are based on the release/1.3 branch. It is tested on Kubernetes 1.17 with patches adapted from PR 64005 (kinvolk/kubernetes#4).

The purpose of this PR is to gather early feedback from the community on this feature and restart the discussion. This PR is based on the release/1.3 branch; since we don't intend to merge it as is, that should not be a problem. We are planning to create or update a KEP to have a proper discussion about the design of this feature.

The main changes are:

Extend the configuration with uid/gid mappings.
Import the new CRI API from Kubernetes PR 64005.
Implement GetRuntimeConfigInfo returning the configured mappings.
Use the WithRemappedSnapshot snapshotter.
Fix sysfs mount with correct ownership of netns (see commitmsg for details).
Additional mount restrictions on /dev/shm (nosuid, noexec, nodev).
Chown /dev/shm appropriately.
Fix etc-hosts mounts with supplementary groups (see commitmsg for details).
Use custom options WithoutNamespace, WithUserNamespace, WithLinuxNamespace at the right places.

At the OCI level (config.json), we have the following changes:

sandbox container with new "user" namespace and UidMappings
normal containers with "user" namespace from a path and UidMappings
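As a rough illustration of the two config.json shapes listed above, here is a minimal Python sketch. The field names (`namespaces`, `uidMappings`, `gidMappings`, `containerID`, `hostID`, `size`) follow the OCI runtime spec; the helper names, the presence of `gidMappings`, and the example path `/proc/1234/ns/user` are assumptions made for illustration and are not taken from this PR's code.

```python
import json
from typing import Dict, List, Optional


def id_mapping(container_id: int, host_id: int, size: int) -> Dict[str, int]:
    """One entry of linux.uidMappings / linux.gidMappings (OCI runtime spec)."""
    return {"containerID": container_id, "hostID": host_id, "size": size}


def linux_section(
    uid_mappings: List[Dict[str, int]],
    gid_mappings: List[Dict[str, int]],
    userns_path: Optional[str] = None,
) -> dict:
    """Build only the user-namespace-related part of the "linux" object.

    userns_path=None asks the runtime to create a new user namespace
    (sandbox case); a path asks it to join an existing one (regular
    container case).
    """
    user_ns = {"type": "user"}
    if userns_path is not None:
        user_ns["path"] = userns_path
    return {
        "namespaces": [user_ns],
        "uidMappings": uid_mappings,
        "gidMappings": gid_mappings,
    }


if __name__ == "__main__":
    mapping = [id_mapping(0, 100000, 65536)]
    # Sandbox: new user namespace.
    print(json.dumps(linux_section(mapping, mapping), indent=2))
    # Regular container: joins the sandbox's user namespace.
    # PID 1234 is a hypothetical sandbox PID.
    print(json.dumps(linux_section(mapping, mapping, "/proc/1234/ns/user"), indent=2))
```

A real config.json contains many more namespaces and fields; only the parts this PR description mentions are modelled here.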

Demo:

Minimal example of the containerd configuration file:

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".node_wide_uid_mapping]
      container_id = 0
      host_id = 100000
      size = 65536
    [plugins."io.containerd.grpc.v1.cri".node_wide_gid_mapping]
      container_id = 0
      host_id = 100000
      size = 65536

$ kubectl apply -f userns-tests/node-standard.yaml
pod/pod-simple created
$ kubectl apply -f userns-tests/pod-standard.yaml
pod/pod-userns created
$ kubectl exec -ti node-standard -- /bin/sh -c 'cat /proc/self/uid_map'
         0          0 4294967295
$ kubectl exec -ti pod-standard -- /bin/sh -c 'cat /proc/self/uid_map'
         0     100000      65535
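
To make the uid_map output above easier to read, the following Python sketch parses lines in the /proc/<pid>/uid_map format (ID inside the namespace, ID outside, range length) and translates a container ID to a host ID. The names `IDMap`, `parse_id_map`, and `to_host_id` are hypothetical helpers for illustration, not part of containerd or Kubernetes.

```python
from typing import List, NamedTuple, Optional


class IDMap(NamedTuple):
    container_id: int  # first ID inside the user namespace
    host_id: int       # first ID outside the user namespace
    size: int          # number of consecutive IDs covered


def parse_id_map(text: str) -> List[IDMap]:
    """Parse the contents of a uid_map or gid_map file."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        entries.append(IDMap(*(int(f) for f in fields)))
    return entries


def to_host_id(container_id: int, mappings: List[IDMap]) -> Optional[int]:
    """Translate an ID inside the namespace to the host ID, or None if unmapped."""
    for m in mappings:
        if m.container_id <= container_id < m.container_id + m.size:
            return m.host_id + (container_id - m.container_id)
    return None


if __name__ == "__main__":
    # The line printed for pod-standard in the demo above.
    maps = parse_id_map("         0     100000      65535\n")
    print(to_host_id(0, maps))     # container root on the host
    print(to_host_id(1000, maps))  # an ordinary container user
```

With the mapping shown for pod-standard, container root (uid 0) corresponds to host uid 100000, so a process that escapes the container would not be root on the host.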

userns-tests/pod-standard.yaml
(The manifest below defines the pod named node-standard.)

# Pod user with userns set to "node" mode.
apiVersion: v1
kind: Pod
metadata:
  name: node-standard
  namespace: default
  annotations:
    alpha.kinvolk.io/userns: "node"
spec:
  restartPolicy: Never
  containers:
  - name: container1
    image: busybox
    command: ["sh"]
    args: ["-c", "sleep infinity"]

userns-tests/pod-standard.yaml:

# Pod user with userns set to "pod" mode. Pod creation would fail if userns
# not supported by runtime.
apiVersion: v1
kind: Pod
metadata:
  name: pod-standard
  namespace: default
  annotations:
    alpha.kinvolk.io/userns: "pod"
spec:
  restartPolicy: Never
  containers:
  - name: container1
    image: busybox
    command: ["sh"]
    args: ["-c", "sleep infinity"]

/cc @alban @rata

- The sandbox container (aka "pause" container) has a tmpfs mount on /dev/shm. Bind mount it with nosuid, noexec, nodev because the mount would not be allowed in user namespaces otherwise.

- Move the creation of the network namespace after the sandbox has been created:

Before this patch, containerd created a netns, configured it with CNI, and then created the sandbox container, passing it the path of the previously set up netns. This means that the netns was owned by the host userns. Mounting sysfs in the container is restricted in this setup.

This patch sets up the netns the other way around: it creates the sandbox container, letting runc create a new netns. Then it picks the new netns from /proc/$pid/ns/net, bind mounts it at the usual CNI path, and gives it to CNI to configure. This means that the netns is owned by the userns of the sandbox container, so mounting sysfs is possible.

For more information about namespace ownership, see:
- man ioctl_ns
- man user_namespaces, section "Interaction of user namespaces and other types of namespaces"
- Linux commit 7dc5dbc879bd ("sysfs: Restrict mounting sysfs") torvalds/linux@7dc5dbc#diff-4839664cd0c8eab716e064323c7cd71fR1164
- net_current_may_mount() used for mounting sysfs: ns_capable(net->user_ns, CAP_SYS_ADMIN); https://github.com/torvalds/linux/blob/v5.7/net/core/net-sysfs.c#L1679
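
The ownership relation described in this commit message can be inspected from user space with the NS_GET_USERNS ioctl documented in ioctl_ns(2). The Python sketch below is only an illustration of that check, not code from this patch: the helper names are hypothetical, it needs Linux with /proc mounted, and the ioctl can fail with EPERM when the owning user namespace is outside the caller's scope.

```python
import fcntl
import os
from typing import Tuple

# _IO(0xb7, 0x1) from <linux/nsfs.h>; see ioctl_ns(2).
NS_GET_USERNS = 0xB701


def ns_path(pid: int, ns_type: str = "net") -> str:
    """Path of a namespace file of a process, e.g. /proc/1234/ns/net."""
    return f"/proc/{pid}/ns/{ns_type}"


def owning_userns_id(ns_file: str) -> Tuple[int, int]:
    """Return (st_dev, st_ino) of the user namespace that owns ns_file."""
    ns_fd = os.open(ns_file, os.O_RDONLY)
    try:
        # With an integer argument, fcntl.ioctl returns the C return value,
        # which for NS_GET_USERNS is a new file descriptor.
        userns_fd = fcntl.ioctl(ns_fd, NS_GET_USERNS)
        try:
            st = os.fstat(userns_fd)
            return (st.st_dev, st.st_ino)
        finally:
            os.close(userns_fd)
    finally:
        os.close(ns_fd)


def netns_owned_by_own_userns(pid: int) -> bool:
    """True if the netns of pid is owned by the userns that pid itself is in."""
    st = os.stat(ns_path(pid, "user"))
    return owning_userns_id(ns_path(pid, "net")) == (st.st_dev, st.st_ino)


if __name__ == "__main__":
    # For a sandbox set up as this patch describes, the expectation is True;
    # here the check is simply run against the current process.
    print(netns_owned_by_own_userns(os.getpid()))
```

Running `netns_owned_by_own_userns` against the sandbox container's PID would be one way to confirm that the netns is owned by the sandbox's userns rather than the host's.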

runc needs to bind mount files from /var/lib/kubelet/pods/... (such as etc-hosts) into the container. When using user namespaces, the bind mount no longer worked when containerd was started from a systemd unit. This patch fixes that by adding SupplementaryGroups=0.

runc needs to have permission on the directory to stat() the source of the bind mount. Without user namespaces, this is not a problem since runc is running as root, so it has 'rwx' permissions over the directory:

drwxr-x---. 8 root root 4096 May 28 18:05 /var/lib/kubelet

Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount phase happens before giving up the additional permissions.

However, when using user namespaces, the runc process belongs to a user other than root (depending on the mapping). /var/lib/kubelet is seen as belonging to the special unmapped user (65534, nobody). runc no longer has 'rwx' permissions but the empty '---' permission for 'other'. CAP_DAC_OVERRIDE is also not effective because the kernel performs the capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE). This checks that the owner of /var/lib/kubelet is mapped in the current user namespace, which is not the case.

Despite that, bind mounting /var/lib/kubelet/pods/...etc-hosts was working when containerd was started manually with 'sudo' but not when started from a systemd unit. The difference is how supplementary groups are handled between sudo and systemd units: systemd does not set supplementary groups by default.

$ sudo grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Uid: 0 0 0 0
Gid: 0 0 0 0
Groups: 0
$ sudo systemd-run -t grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Running as unit: run-u296886.service
Press ^] three times within 1s to disconnect TTY.
Uid: 0 0 0 0
Gid: 0 0 0 0
Groups:

When runc has the supplementary group 0 configured, it is retained during the bind-mount phase, even though it is an unmapped group (runc temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc effectively has the 'r-x' permissions over /var/lib/kubelet. This makes the bind mount of etc-hosts work. After the mount phase, runc sets the credentials correctly (following OCI's config.json specification), so the container does not retain this unmapped supplementary group.

I rely on the systemd unit file being configured correctly with SupplementaryGroups=0 and I don't attempt to set it up automatically with syscall.Setgroups() because "at the kernel level, user IDs and group IDs are a per-thread attribute" (man setgroups) and the way Golang uses threads makes it difficult to predict which thread is going to be used to execute runc. glibc's setgroups() is a wrapper that changes the credentials for all threads, but Golang does not use the glibc implementation.
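
The permission reasoning in this commit message can be reproduced with a small model. The Python sketch below is a simplification under stated assumptions: it uses host-side (kernel) IDs, assumes container root maps to host uid/gid 100000 as in the example configuration, and ignores capabilities, ACLs, and SELinux. The helper names are hypothetical.

```python
from typing import List


def parse_groups(status_text: str) -> List[int]:
    """Extract supplementary group IDs from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return [int(g) for g in line.split()[1:]]
    return []


def rwx(bits: int) -> str:
    """Render a 3-bit permission value, e.g. 5 -> 'r-x'."""
    return "".join(c if bits & m else "-" for c, m in (("r", 4), ("w", 2), ("x", 1)))


def effective_rwx(mode: int, owner_uid: int, owner_gid: int,
                  uid: int, gid: int, groups: List[int]) -> str:
    """Classic owner/group/other selection for a file with the given mode."""
    if uid == owner_uid:
        return rwx((mode >> 6) & 7)
    if gid == owner_gid or owner_gid in groups:
        return rwx((mode >> 3) & 7)
    return rwx(mode & 7)


if __name__ == "__main__":
    # /var/lib/kubelet is drwxr-x--- root:root, i.e. mode 0o750, owner 0:0.
    # Started via systemd without SupplementaryGroups=0: no extra groups.
    print(effective_rwx(0o750, 0, 0, 100000, 100000, parse_groups("Groups:\t\n")))
    # Started with supplementary group 0 retained during the mount phase.
    print(effective_rwx(0o750, 0, 0, 100000, 100000, parse_groups("Groups:\t0 \n")))
```

In this model the first case falls into the 'other' class and the second into the 'group' class, which matches the '---' versus 'r-x' distinction described above.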

k8s-ci-robot · 2020-07-07T15:50:35Z

@mauriciovasquezbernal: GitHub didn't allow me to request PR reviews from the following users: alban, rata.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2020-07-07T15:50:42Z

Hi @mauriciovasquezbernal. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

mikebrow · 2020-07-07T16:33:32Z

We don't intend to merge it as it is (that's the reason we used release/1.3 as the base)

? we would not merge a PR that is in [wip]/hold against the contributor's wishes...

mauriciovasquezbernal · 2020-07-07T18:00:55Z

? we would not merge a PR that is in [wip]/hold against the contributor's wishes...

I don't mean that we based it on release/1.3 to avoid merging. What I wanted to say is that it's not a problem to be based on release/1.3 because we don't want to merge it...

zhsj · 2020-09-16T06:49:01Z

Is it the same as https://github.com/rootless-containers/usernetes?

And maybe kubernetes/enhancements#1371 should replace the Kubernetes Node-Level User Namespaces Design Proposal.

alban · 2020-09-16T12:56:43Z

@zhsj It is not the same: while both use user namespaces as underlying technology, they achieve different things:

usernetes uses unprivileged user namespaces to be able to run container runtimes without being root on the host. All containers run as the same user.
this user namespace support PR allows containers to run in a user namespace (and possibly soon with different id mappings), hardening the isolation between containers. By itself, this does not allow running kubelet/containerd/runc as non-root.

Both are useful, depending on the use cases.

AkihiroSuda · 2020-11-20T16:50:53Z

What's the current status?

mauriciovasquezbernal · 2020-11-23T14:33:00Z

What's the current status?

We're waiting for reviews on the Kubernetes KEP in order to define the CRI changes.

AkihiroSuda · 2020-11-23T14:52:36Z

Can we implement the "io.kubernetes.cri-o.userns-mode" convention until the KEP settles, or do we want to wait until KEP settles?

rptaylor · 2020-11-24T00:41:33Z

Will this allow a user to start an unprivileged kubernetes pod, and then use e.g. Singularity inside their pod to create a nested container?

mauriciovasquezbernal · 2020-11-30T16:12:29Z

Can we implement the "io.kubernetes.cri-o.userns-mode" convention until the KEP settles, or do we want to wait until KEP settles?

I'd like to get the KEP merged first and then implement this support in containerd. I fear that we would end up with a lot of heterogeneous implementations if we start implementing this support in the different runtimes without a proper specification. I also agree that people who need this support quickly could implement such a mechanism to avoid waiting on Kubernetes in the meantime.

mauriciovasquezbernal · 2020-11-30T16:42:03Z

Will this allow a user to start an unprivileged kubernetes pod, and then use e.g. Singularity inside their pod to create a nested container?

Yes, that should work. User namespaces used in this context make it possible to run pods that require elevated privileges in a safer way. If you give the pod the correct capabilities to create containers and set the correct k8s settings like https://kubernetes.io/docs/concepts/policy/pod-security-policy/#allowedprocmounttypes, that should be fine. However, it's something we haven't tried and there could be unknown issues.

mauriciovasquezbernal and others added 4 commits July 2, 2020 15:14

pkg/config: Introduce new option to configure uid/gid remapping

9c8a8da

Introduce two new configuration options to specify the container to host uid/gid mapping.

Add new CRI API, GetRuntimeConfigInfo() and a new LinuxNamepsaceOptio…

edfd9bd

…n, User Initial work by vikaschoudhary16 <vichoudh@redhat.com> on Kubernetes PR 64005.

k8s-ci-robot added needs-ok-to-test size/XXL labels Jul 7, 2020

alban mentioned this pull request Sep 28, 2020

keps/127: Support User Namespaces kinvolk/kubernetes-enhancements#3

Closed

mauriciovasquezbernal mentioned this pull request Nov 13, 2020

Experimental: support idmapped mounts kernel patches containerd/containerd#4734

Closed

mumoshu mentioned this pull request Nov 21, 2021

ERROR controller-runtime.controller - Privileged containers are not allowed spec.containers[1].securityContext.privileged actions/actions-runner-controller#792

Closed

dmcgowan closed this Mar 8, 2022
