
[1.3 - DO NOT MERGE] User namespace support #1527

Conversation


@mauriciovasquezbernal mauriciovasquezbernal commented Jul 7, 2020

This is the containerd/cri implementation for the Kubernetes Node-Level User Namespaces Design Proposal. The patches are based on the release/1.3 branch. It is tested on Kubernetes 1.17 with patches adapted from PR 64005 (kinvolk/kubernetes#4).

The purpose of this PR is to gather some early feedback from the community on this feature and to restart the discussion. Since we don't intend to merge it as is, basing it on the release/1.3 branch should not be a problem. We are planning to create or update a KEP to have a proper discussion about the design of this feature.

The main changes are:

  • Extend the configuration with uid/gid mappings.
  • Import the new CRI API from Kubernetes PR 64005.
  • Implement GetRuntimeConfigInfo returning the configured mappings (a sketch follows this list).
  • Use the WithRemappedSnapshot option to create remapped snapshots.
  • Fix the sysfs mount via correct ownership of the netns (see the commit message for details).
  • Additional mount restrictions on /dev/shm (nosuid, noexec, nodev).
  • Chown /dev/shm appropriately.
  • Fix etc-hosts mounts with supplementary groups (see the commit message for details).
  • Use custom options WithoutNamespace, WithUserNamespace, WithLinuxNamespace at the right places.
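As a rough illustration of the GetRuntimeConfigInfo item above, here is a minimal Go sketch of the handler shape. All type and field names (LinuxIDMapping, GetRuntimeConfigInfoRequest, criService, and so on) are hypothetical stand-ins; the real definitions come from the CRI API imported from Kubernetes PR 64005.

package server

import "context"

// Hypothetical stand-ins for the CRI messages added in Kubernetes PR 64005;
// the real definitions come from the imported CRI API.
type LinuxIDMapping struct {
    ContainerID uint32
    HostID      uint32
    Size        uint32
}

type GetRuntimeConfigInfoRequest struct{}

type GetRuntimeConfigInfoResponse struct {
    UIDMappings []LinuxIDMapping
    GIDMappings []LinuxIDMapping
}

// criService stands in for the CRI plugin's service type; only the
// configured node-wide mappings matter for this sketch.
type criService struct {
    uidMapping LinuxIDMapping
    gidMapping LinuxIDMapping
}

// GetRuntimeConfigInfo reports the node-wide uid/gid mappings the plugin was
// configured with, so the kubelet can tell whether and how the runtime
// supports user namespaces.
func (c *criService) GetRuntimeConfigInfo(ctx context.Context, r *GetRuntimeConfigInfoRequest) (*GetRuntimeConfigInfoResponse, error) {
    return &GetRuntimeConfigInfoResponse{
        UIDMappings: []LinuxIDMapping{c.uidMapping},
        GIDMappings: []LinuxIDMapping{c.gidMapping},
    }, nil
}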

At the OCI level (config.json), we have the following changes (sketched after the list):

  • the sandbox container gets a new "user" namespace and UidMappings
  • normal containers join the sandbox's "user" namespace via a path and get the same UidMappings
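For illustration, here is a trimmed config.json fragment for the sandbox container, using the mapping values from the demo below. The linux.namespaces, linux.uidMappings, and linux.gidMappings fields are standard OCI runtime-spec fields; the exact output generated by this PR may differ.

{
  "linux": {
    "namespaces": [
      { "type": "user" }
    ],
    "uidMappings": [
      { "containerID": 0, "hostID": 100000, "size": 65536 }
    ],
    "gidMappings": [
      { "containerID": 0, "hostID": 100000, "size": 65536 }
    ]
  }
}

A normal container would instead join the sandbox's user namespace with a namespace entry like { "type": "user", "path": "/proc/<sandbox-pid>/ns/user" }.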

Demo:

Minimal example of the containerd configuration file (a sketch of the corresponding Go config struct follows the snippet):

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".node_wide_uid_mapping]
      container_id = 0
      host_id = 100000
      size = 65536
    [plugins."io.containerd.grpc.v1.cri".node_wide_gid_mapping]
      container_id = 0
      host_id = 100000
      size = 65536
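A minimal sketch of how that TOML could map onto a Go config struct; the struct and field names here are illustrative, and the actual names in the patches may differ.

package criconfig

// IDMapping mirrors one node-wide uid/gid mapping entry from the TOML
// configuration above.
type IDMapping struct {
    ContainerID uint32 `toml:"container_id"`
    HostID      uint32 `toml:"host_id"`
    Size        uint32 `toml:"size"`
}

// PluginConfig is the subset of the CRI plugin configuration extended
// by this PR.
type PluginConfig struct {
    NodeWideUIDMapping IDMapping `toml:"node_wide_uid_mapping"`
    NodeWideGIDMapping IDMapping `toml:"node_wide_gid_mapping"`
}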
$ kubectl apply -f userns-tests/node-standard.yaml
pod/node-standard created
$ kubectl apply -f userns-tests/pod-standard.yaml
pod/pod-standard created
$ kubectl exec -ti node-standard -- /bin/sh -c 'cat /proc/self/uid_map'
         0          0 4294967295
$ kubectl exec -ti pod-standard -- /bin/sh -c 'cat /proc/self/uid_map'
         0     100000      65535
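(The three uid_map columns are the starting ID inside the pod's user namespace, the starting ID in the parent/host namespace, and the length of the mapped range; see man user_namespaces.)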
  • userns-tests/node-standard.yaml
# Pod with userns set to "node" mode.
apiVersion: v1
kind: Pod
metadata:
  name: node-standard
  namespace: default
  annotations:
    alpha.kinvolk.io/userns: "node"
spec:
  restartPolicy: Never
  containers:
  - name: container1
    image: busybox
    command: ["sh"]
    args: ["-c", "sleep infinity"]
  • userns-tests/pod-standard.yaml:
# Pod with userns set to "pod" mode. Pod creation will fail if userns is
# not supported by the runtime.
apiVersion: v1
kind: Pod
metadata:
  name: pod-standard
  namespace: default
  annotations:
    alpha.kinvolk.io/userns: "pod"
spec:
  restartPolicy: Never
  containers:
  - name: container1
    image: busybox
    command: ["sh"]
    args: ["-c", "sleep infinity"]

/cc @alban @rata

mauriciovasquezbernal and others added 4 commits July 2, 2020 15:14
Introduce two new configuration options to specify the container-to-host uid/gid
mappings.
…n, User

Initial work by vikaschoudhary16 <vichoudh@redhat.com> on Kubernetes PR
64005.
- The sandbox container (aka "pause" container) has a tmpfs mount on
/dev/shm. Bind-mount it with nosuid, noexec, and nodev because the mount
would otherwise not be allowed in user namespaces.

- Move the creation of the network namespace to after the sandbox has been created:
Before this patch, containerd created a netns, configured it with CNI,
and then created the sandbox container, passing it the netns path previously
set up. This meant that the netns was owned by the host userns, and mounting
sysfs in the container is restricted in this setup.

This patch sets up the netns the other way around instead: it creates
the sandbox container, letting runc create a new netns. Then it picks
up the new netns from /proc/$pid/ns/net, bind-mounts it at the usual CNI
path, and then hands it to CNI to configure. This means that the netns is
owned by the userns of the sandbox container, so mounting sysfs is
possible.

For more information about namespace ownership, see
- man ioctl_ns
- man user_namespaces, section "Interaction of user namespaces and other types of namespaces"
- Linux commit 7dc5dbc879bd ("sysfs: Restrict mounting sysfs")
  torvalds/linux@7dc5dbc#diff-4839664cd0c8eab716e064323c7cd71fR1164
- net_current_may_mount() used for mounting sysfs:
  ns_capable(net->user_ns, CAP_SYS_ADMIN);
  https://github.com/torvalds/linux/blob/v5.7/net/core/net-sysfs.c#L1679
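To make the new ordering concrete, here is a minimal Go sketch of the bind-mount step; the helper name and paths are illustrative, while unix.Mount comes from the real golang.org/x/sys/unix package.

package sandbox

import (
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

// bindNetNS bind-mounts the sandbox's netns (created by runc and therefore
// owned by the sandbox's user namespace) at the usual CNI path, so CNI can
// configure it afterwards.
func bindNetNS(sandboxPid int, netnsPath string) error {
    src := fmt.Sprintf("/proc/%d/ns/net", sandboxPid)

    // The bind-mount target must exist before mounting over it.
    f, err := os.OpenFile(netnsPath, os.O_RDONLY|os.O_CREATE|os.O_EXCL, 0444)
    if err != nil {
        return err
    }
    f.Close()

    return unix.Mount(src, netnsPath, "none", unix.MS_BIND, "")
}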
runc needs to bind-mount files from /var/lib/kubelet/pods/... (such as
etc-hosts) into the container. When using user namespaces, the bind
mount no longer worked when containerd was started from a systemd unit.

This patch fixes that by adding SupplementaryGroups=0 to the systemd unit.

runc needs to have permission on the directory to stat() the source of
the bind mount. Without user namespaces, this is not a problem since
runc is running as root, so it has 'rwx' permissions over the directory:

drwxr-x---. 8 root   root   4096 May 28 18:05 /var/lib/kubelet

Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount
phase happens before it gives up its additional capabilities.

However, when using user namespaces, the runc process belongs to a
user other than root (depending on the mapping). /var/lib/kubelet is
seen as belonging to the special unmapped user (65534, nobody), so runc
no longer has 'rwx' permissions but only the empty '---' permissions
for 'other'.

CAP_DAC_OVERRIDE is also not effective because the kernel performs the
capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE).
This checks that the owner of /var/lib/kubelet is mapped in the
current user namespace, which is not the case here.

Despite that, bind-mounting /var/lib/kubelet/pods/...etc-hosts
worked when containerd was started manually with 'sudo' but not
when started from a systemd unit. The difference lies in how supplementary
groups are handled between sudo and systemd units: systemd does not set
supplementary groups by default.

$ sudo grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Uid:	0	0	0	0
Gid:	0	0	0	0
Groups:	0

$ sudo systemd-run -t grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Running as unit: run-u296886.service
Press ^] three times within 1s to disconnect TTY.
Uid:	0	0	0	0
Gid:	0	0	0	0
Groups:

When runc has the supplementary group 0 configured, it is retained
during the bind-mount phase, even though it is an unmapped group (runc
temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc
effectively has 'r-x' permissions on /var/lib/kubelet. This makes
the bind mount of etc-hosts work.

After the mount phase, runc sets the credentials correctly (following
the OCI config.json specification), so the container will not retain this
unmapped supplementary group.

I rely on the systemd unit file being configured correctly with
SupplementaryGroups=0 and I don't attempt to set it up automatically
with syscall.Setgroups() because "at the kernel level, user IDs and
group IDs are a per-thread attribute" (man setgroups) and the way Golang
uses threads makes it difficult to predict which thread is going to be
used to execute runc. glibc's setgroups() is a wrapper that changes the
credentials for all threads, but Golang does not use the glibc
implementation.
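For reference, that expectation can be met with a systemd drop-in along these lines; the file path is illustrative, while SupplementaryGroups= is a standard [Service] directive.

# /etc/systemd/system/containerd.service.d/10-userns.conf
[Service]
SupplementaryGroups=0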
@k8s-ci-robot

@mauriciovasquezbernal: GitHub didn't allow me to request PR reviews from the following users: alban, rata.

Note that only containerd members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot

Hi @mauriciovasquezbernal. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mikebrow
Member

mikebrow commented Jul 7, 2020

We don't intend to merge it as it is (that's the reason we used release/1.3 as the base)

? we would not merge a PR that is in [wip]/hold against the contributor's wishes...

@mauriciovasquezbernal
Author

? we would not merge a PR that is in [wip]/hold against the contributor's wishes...

I don't mean that we based it on release/1.3 to avoid merging. What I wanted to say is that being based on release/1.3 is not a problem, because we don't intend to merge it as is...

@zhsj
Contributor

zhsj commented Sep 16, 2020

@alban

alban commented Sep 16, 2020

@zhsj It is not the same: while both use user namespaces as the underlying technology, they achieve different things:

  • usernetes uses unprivileged user namespaces to be able to run container runtimes without being root on the host. All containers run as the same user.
  • this user namespace support PR allows containers to run in a user namespace (and possibly soon with different id mappings), hardening the isolation between containers. This does not by itself allow running kubelet/containerd/runc as non-root.

Both are useful, depending on the use cases.

@AkihiroSuda
Member

What's current status?

@mauriciovasquezbernal
Author

What's current status?

We're waiting for reviews on the Kubernetes KEP in order to define the CRI changes.

@AkihiroSuda
Member

AkihiroSuda commented Nov 23, 2020

Can we implement the "io.kubernetes.cri-o.userns-mode" convention until the KEP settles, or do we want to wait until the KEP settles?

@rptaylor

Will this allow a user to start an unprivileged Kubernetes pod, and then use e.g. Singularity inside their pod to create a nested container?

@mauriciovasquezbernal
Author

Can we implement the "io.kubernetes.cri-o.userns-mode" convention until the KEP settles, or do we want to wait until KEP settles?

I'd like to get the KEP merged first and then implement this support in containerd. I fear that we would end up with a lot of heterogeneous implementations if we start adding this support to the different runtimes without a proper specification. That said, I agree that people who need this support quickly could implement such a mechanism in the meantime to avoid waiting on Kubernetes.

@mauriciovasquezbernal
Author

mauriciovasquezbernal commented Nov 30, 2020

Will this allow a user to start an unprivileged kubernetes pod, and then use e.g. Singularity inside their pod to create a nested container?

Yes, that should work. User namespaces in this context allow running pods that require elevated privileges in a safer way. If you give the pod the correct capabilities to create containers and set the right Kubernetes settings, such as allowedProcMountTypes (https://kubernetes.io/docs/concepts/policy/pod-security-policy/#allowedprocmounttypes), it should be fine. However, it's something we haven't tried and there could be unknown issues. A rough sketch follows.
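An untested sketch of what such a pod might look like. The SYS_ADMIN capability is a guess that depends on the nested runtime, and procMount: Unmasked requires the ProcMountType feature gate plus a policy that allows it.

apiVersion: v1
kind: Pod
metadata:
  name: nested-container
  annotations:
    alpha.kinvolk.io/userns: "pod"
spec:
  containers:
  - name: builder
    image: busybox
    command: ["sh", "-c", "sleep infinity"]
    securityContext:
      procMount: Unmasked
      capabilities:
        add: ["SYS_ADMIN"]  # likely needed by the nested runtime; a guess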
