runc with user namespace enabled fails to bind mount host dirs with 750 permission #2484

alban · 2020-06-22T17:10:08Z

When config.json has both of the following:

- user namespace enabled
- a bind mount from a host directory that is a sub-directory of one with drwxr-x--- permission

runc fails with the error message:

# time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
starting container process caused: process_linux.go:459:
container init caused: rootfs_linux.go:58:
mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
to rootfs at \"/tmp/inaccessible\" caused:
stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"

I implemented a reproducer in the integration tests, along with the explanation and a workaround when started from a systemd unit: #2483


cyphar · 2020-06-22T18:43:00Z

This is expected behaviour. Even if we were to blindly mount(2) the path, path lookup would fail because by the time we are mounting filesystems the container is running in the user namespace and thus CAP_DAC_OVERRIDE doesn't work anymore.

There are possible ways to work around this (such as looking up the path while running with privileges on the host, then bind-mounting those file descriptors once the container has started) but that will make our mounting code much more complicated. Is this something that you really need, or is it possible to adjust your system configuration?

EDIT: Ah, I just saw the description of the PR you linked. Yeah, the fact that this works under sudo due to the semantics of supplementary groups is a little bit fruity. And yeah, we configure the privileges of the container a fair bit after the rootfs has been configured, hence why the supplementary groups have an impact on our mount code. I guess if this is needed for user namespace support in Kubernetes we can look at implementing workarounds.

alban · 2020-06-23T10:15:45Z

> Is this something that you really need, or is it possible to adjust your system configuration?

I think that would be nice to have: /var/lib/kubelet contains cryptographic keys, pods directories with ConfigMap and Secrets. Currently, only root can access it. It feels wrong if we give access to all users, even if it is only traversal rights (--x).

I am not sure what the best workaround is. I had not thought of passing file descriptors.

1. Open file descriptors in parent privileged process, mounting in the child process
   - Pros: should work for all corner cases
   - Cons(?): does it require the new Linux mount API? Does it work with old kernels?
   - Cons: more complicated code
2. Do the mounts in the child mntns but in the host userns (so we retain privileges).
   - Cons: not possible in Golang: Linux does not allow setns() on mntns with multithreaded programs. And Linux does not allow going back to the parent userns with setns().
   - Cons: needs complicated interprocess synchronisation
3. runc adding gid=0 as supplemental group
   - Cons: difficult to do with Golang because syscall.Setgroups() is per thread. POSIX setgroups() support is in development but not there yet.
   - Cons: gid=0 might not be the only gid to add in supplemental groups: if a directory belongs to a group other than root, that other group would need to be added too.
   - Cons: if a directory is drwx------, then adding a supplemental group would not help, and there is no concept of supplemental user.
   - Pros: maybe we could use LockOSThread + syscall.Setgroups just before starting the new process.
4. Documenting that processes starting runc (such as containerd) should run with systemd option SupplementaryGroups=0.
   - Pros: easy to implement
   - Pros: works well enough for the /var/lib/kubelet use case
   - Cons: it is implementing a workaround in the wrong place in my opinion
   - Cons: would not work with more complicated use cases such as a directory with drwx------ or belonging to a group other than root (but I don't have a use case that would require it at the moment)

cyphar · 2020-06-23T14:25:02Z

> Open file descriptors in parent privileged process, mounting in the child process
>
> Cons(?): does it require the new Linux mount API? Does it work with old kernels?

Yes, it'll work with basically all old kernels I can think of. While the mount API does make the fact you can bind-mount file descriptors explicit, you've always been able to do it through /proc/self/fd/.... Whether this is a bug or a feature is up to your world-view.

One other con is that this could be used to attack the host in certain situations (a-la CVE-2016-9962) because you can join existing namespaces while still setting up mounts (not that I'd recommend that). But PR_SET_DUMPABLE should block that, especially since this is all only required for user namespaces anyway (which got beefed up protections after CVE-2016-9962 specifically in this area).

Honestly, given all of the other options I think it's the least-ugly and most-correct solution. 🤷

cyphar · 2020-06-24T00:34:04Z

It actually shouldn't be too complicated. We can just rewrite the configuration we pass to the container to use /proc/self/fd/.... The only issue is that the mountpoint fds can't be marked O_CLOEXEC and so we'd need to make sure we manually close them (though we already have code to do this at the very last stage of runc init).

alban · 2020-09-03T15:36:06Z

@cyphar I tried to implement that in this branch: I open the sources of the mounts with O_PATH in the host mntns and then mount from /proc/self/fd/... in the child process after it is in the new mount namespace. But that does not work: I think this is because the kernel checks that the source mount comes from the same mntns as the current mntns.

For debugging purposes, I added a commit with some debug output, and I open the source of the mount both from the host mntns and from the container mntns. As you can see, both /proc/self/fd/7 and /proc/self/fd/8 in the strace log below refer to the same file (same device, same inode) but one can be mounted and the other cannot.

I think this is because of the cross mntns mount check:
https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312

The function __do_loopback() performing the bind mount will run check_mnt(old). This checks whether the source file opened with O_PATH comes from the same mntns: mnt->mnt_ns == current->nsproxy->mnt_ns.
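One way to observe this from userspace, as a debugging heuristic rather than an exact replica of the kernel check, is to compare the mnt_id reported in /proc/self/fdinfo/... with the mount IDs listed in /proc/self/mountinfo. If the ID is absent, the descriptor most likely belongs to a mount that is not part of the current mount namespace (or is not visible under the current root). The helper names below are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// mntIDFromFdinfo extracts the mnt_id field from the text of
// /proc/<pid>/fdinfo/<fd>.
func mntIDFromFdinfo(fdinfo string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(fdinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "mnt_id:" {
			return strconv.Atoi(fields[1])
		}
	}
	return 0, fmt.Errorf("no mnt_id field found")
}

// mountIDs collects the mount IDs (first column) from the text of
// /proc/<pid>/mountinfo.
func mountIDs(mountinfo string) map[int]bool {
	ids := make(map[int]bool)
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue
		}
		if id, err := strconv.Atoi(fields[0]); err == nil {
			ids[id] = true
		}
	}
	return ids
}

// fdMountVisible reports whether the mount backing a descriptor appears in
// the given mountinfo text.
func fdMountVisible(fdinfo, mountinfo string) (bool, error) {
	id, err := mntIDFromFdinfo(fdinfo)
	if err != nil {
		return false, err
	}
	return mountIDs(mountinfo)[id], nil
}

func main() {
	f, err := os.Open("/")
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer f.Close()

	fdinfo, err := os.ReadFile(fmt.Sprintf("/proc/self/fdinfo/%d", f.Fd()))
	if err != nil {
		fmt.Println("reading fdinfo failed:", err)
		return
	}
	mountinfo, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		fmt.Println("reading mountinfo failed:", err)
		return
	}
	visible, err := fdMountVisible(string(fdinfo), string(mountinfo))
	fmt.Println("mount visible in current mntns:", visible, "error:", err)
}
```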

[pid 3654266] mount("/proc/self/fd/7", "/home/alban/go/src/github.com/opencontainers/runc/mycontainer/rootfs/host-tmp", 0xc0000a79a8, MS_RDONLY|MS_BIND|MS_REC, NULL <unfinished ...>
[pid 3654266] <... mount resumed>)      = -1 EINVAL (Invalid argument)
[pid 3654266] fstat(7,  <unfinished ...>
[pid 3654266] <... fstat resumed>{st_mode=S_IFDIR|S_ISVTX|0777, st_size=1000, ...}) = 0
[pid 3654266] write(1, "failed to mount from file &{Dev:46 Ino:229569 Nlink:42 Mode:17407 Uid:0 Gid:0 X__pad0:0 Rdev:0 Size:1000 Blksize:4096 Blocks:0 Atim:{Sec:1599142825 Nsec:342786981} Mtim:{Sec:1599145892 Nsec:486390736} Ctim:{Sec:1599145892 Nsec:486390736} X__unused:[0 0 0]}"..., 257) = 257
[pid 3654266] openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC|O_PATH <unfinished ...>
[pid 3654266] <... openat resumed>)     = 8
[pid 3654266] fstat(8,  <unfinished ...>
[pid 3654266] <... fstat resumed>{st_mode=S_IFDIR|S_ISVTX|0777, st_size=1000, ...}) = 0
[pid 3654266] write(1, "Let's try this other file instead &{Dev:46 Ino:229569 Nlink:42 Mode:17407 Uid:0 Gid:0 X__pad0:0 Rdev:0 Size:1000 Blksize:4096 Blocks:0 Atim:{Sec:1599142825 Nsec:342786981} Mtim:{Sec:1599145892 Nsec:486390736} Ctim:{Sec:1599145892 Nsec:486390736} X__unused:"..., 265 <unfinished ...>
[pid 3654266] <... write resumed>)      = 265
[pid 3654266] mount("/proc/self/fd/8", "/home/alban/go/src/github.com/opencontainers/runc/mycontainer/rootfs/host-tmp", 0xc0000a7ce8, MS_RDONLY|MS_BIND|MS_REC, NULL) = 0

Any suggestions?

alban · 2020-09-07T17:15:38Z

Due to the kernel cross-mntns check, I tried this approach: open the mount sources in the host userns but in the container mntns, and pass the fds by SCM_RIGHTS. It works on my test scenario.
#2576
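For readers unfamiliar with SCM_RIGHTS, here is a minimal sketch of passing a file descriptor over a Unix socket. The actual patch does this in C in nsexec.c between separate processes; this sketch uses Go and a socketpair within a single process purely for illustration, and the helper names (`sendFd`, `recvFd`, `sameFile`, `roundTrip`) are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// sendFd sends fd over the Unix socket sock as SCM_RIGHTS ancillary data,
// accompanied by one byte of regular payload.
func sendFd(sock, fd int) error {
	return syscall.Sendmsg(sock, []byte{0}, syscall.UnixRights(fd), nil, 0)
}

// recvFd receives one file descriptor sent with SCM_RIGHTS on sock.
func recvFd(sock int) (int, error) {
	buf := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 32-bit fd
	_, oobn, _, _, err := syscall.Recvmsg(sock, buf, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	for _, m := range msgs {
		fds, err := syscall.ParseUnixRights(&m)
		if err == nil && len(fds) > 0 {
			return fds[0], nil
		}
	}
	return -1, fmt.Errorf("no file descriptor received")
}

// sameFile reports whether two descriptors refer to the same device and inode.
func sameFile(a, b int) (bool, error) {
	var sa, sb syscall.Stat_t
	if err := syscall.Fstat(a, &sa); err != nil {
		return false, err
	}
	if err := syscall.Fstat(b, &sb); err != nil {
		return false, err
	}
	return sa.Dev == sb.Dev && sa.Ino == sb.Ino, nil
}

// roundTrip opens path, passes the descriptor across a socketpair, and checks
// that the received descriptor refers to the same file.
func roundTrip(path string) (bool, error) {
	pair, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return false, err
	}
	defer syscall.Close(pair[0])
	defer syscall.Close(pair[1])

	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	if err := sendFd(pair[0], int(f.Fd())); err != nil {
		return false, err
	}
	got, err := recvFd(pair[1])
	if err != nil {
		return false, err
	}
	defer syscall.Close(got)
	return sameFile(int(f.Fd()), got)
}

func main() {
	ok, err := roundTrip("/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("received fd refers to the same file:", ok)
}
```

The receiver gets a new descriptor number that refers to the same open file, which is what lets a process that could not look up the path itself still use it as a mount source.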

The source of the bind mount might not be accessible in a different user namespace because a component of the source path might not be traversable under the users and groups mapped inside the user namespace. This caused errors such as the following:

# time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
starting container process caused: process_linux.go:459:
container init caused: rootfs_linux.go:58:
mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
to rootfs at \"/tmp/inaccessible\" caused:
stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"

To solve this problem, this patch performs the following:

1. In nsexec.c, it opens the source path in the host userns (so we have the right permissions to open it) but in the container mntns (so the kernel cross mntns mount check lets us mount it later: https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312).
2. In nsexec.c, it passes the file descriptors of the source to the child process with SCM_RIGHTS.
3. In runc-init in Golang, it finishes the mounts while inside the userns even without access to some components of the source paths.

Passing the fds with SCM_RIGHTS is necessary because once the child process is in the container mntns, it is already in the container userns so it cannot temporarily join the host mntns.

This patch uses the existing mechanism with _LIBCONTAINER_* environment variables to pass the file descriptors from runc to runc init.

This patch uses the existing mechanism with the Netlink-style bootstrap to pass information about the list of source mounts to nsexec.c.

Rootless containers don't use this bind mount sources fdpassing mechanism because we can't setns() to the target mntns in a rootless container (we don't have the privileges when we are in the host userns).

This patch takes care of using O_CLOEXEC on mount fds, and closes them early.

Fixes: opencontainers#2484.

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>


mauriciovasquezbernal mentioned this issue Jul 8, 2020

User ns reworked kinvolk/kubernetes#4

Closed

rata mentioned this issue Jul 8, 2020

New tests for user namespaces and groups issue #2483

Closed

alban mentioned this issue Sep 7, 2020

Open bind mount sources from the host userns #2576

Merged


AkihiroSuda closed this as completed in #2576 Oct 28, 2021

thaJeztah mentioned this issue Jul 4, 2022

Cannot start container with --network container:<name> when using userns-remap and a volume moby/moby#43758

Closed

ygersie mentioned this issue Jul 6, 2022

CSI: set group read permissions on csi paths hashicorp/nomad#13512

Closed

rata mentioned this issue Mar 16, 2023

Failure to run user namespaced container #3770

Open

yawqi mentioned this issue Nov 10, 2023

support user namespace kata-containers/kata-containers#8170

Open
