bind mounting /etc/resolv.conf fails #1523

Open
yuvipanda opened this issue Jul 16, 2017 · 8 comments

yuvipanda commented Jul 16, 2017

I'm trying to run a rootless container on a Linux machine.

How to reproduce

$ skopeo --insecure-policy copy docker://opensuse/amd64:42.2 oci:opensuse:latest
$ umoci unpack --rootless --image singleuser:latest bundle
$ runc  --root /tmp/^Cnc run  test
container_linux.go:265: starting container process caused "process_linux.go:348: container init caused \"rootfs_linux.go:57: mounting \\\"/etc/resolv.conf\\\" to rootfs \\\"/home/yuvipanda/code/ferry-commute/bundle/rootfs\\\" at \\\"/home/yuvipanda/code/ferry-commute/bundle/rootfs/etc/resolv.conf\\\" caused \\\"operation not permitted\\\"\""

I suspect that the problem is related to the fact that my $HOME is mounted with ecryptfs and has nodev & nosuid (maybe related to #1247?)

/home/.ecryptfs/yuvipanda/.Private on /home/yuvipanda type ecryptfs (rw,nosuid,nodev,relatime,ecryptfs_fnek_sig=bf2ce9e5622d0c40,ecryptfs_sig=ae8560bd6ae1015e,ecryptfs_cipher=aes,ecryptfs_key_bytes=16,ecryptfs_unlink_sigs)
@yuvipanda
Author

Hmm, it might not be related to nodev / nosuid after all. I mounted a tmpfs, put the bundle there, and hit the same problem.

Running it under strace yields:

25813 mount("/mnt/ramdisk/bundle/rootfs", "/mnt/ramdisk/bundle/rootfs", 0xc4200beac8, MS_BIND|MS_REC, NULL) = 0
25813 stat("/mnt/ramdisk/bundle/rootfs/proc", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0
25813 mount("proc", "/mnt/ramdisk/bundle/rootfs/proc", "proc", 0, NULL) = 0
25813 stat("/mnt/ramdisk/bundle/rootfs/dev", {st_mode=S_IFDIR|0755, st_size=300, ...}) = 0
25813 mount("tmpfs", "/mnt/ramdisk/bundle/rootfs/dev", "tmpfs", MS_NOSUID|MS_STRICTATIME, "mode=755,size=65536k") = 0
25813 fchmodat(AT_FDCWD, "/mnt/ramdisk/bundle/rootfs/dev", 0755) = 0
25813 lstat("/mnt/ramdisk/bundle/rootfs/dev", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0
25813 lstat("/mnt/ramdisk/bundle/rootfs/dev/pts", 0xc4200d5a38) = -1 ENOENT (No such file or directory)
25813 stat("/mnt/ramdisk/bundle/rootfs/dev/pts", 0xc4200d5b08) = -1 ENOENT (No such file or directory)
25813 stat("/mnt/ramdisk/bundle/rootfs/dev", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0
25813 mkdirat(AT_FDCWD, "/mnt/ramdisk/bundle/rootfs/dev/pts", 0755) = 0
25813 mount("devpts", "/mnt/ramdisk/bundle/rootfs/dev/pts", "devpts", MS_NOSUID|MS_NOEXEC, "newinstance,ptmxmode=0666,mode=0"...) = 0
25813 stat("/mnt/ramdisk/bundle/rootfs/dev/shm", 0xc4200d5ca8) = -1 ENOENT (No such file or directory)
25813 stat("/mnt/ramdisk/bundle/rootfs/dev/shm", 0xc4200d5d78) = -1 ENOENT (No such file or directory)
25813 stat("/mnt/ramdisk/bundle/rootfs/dev", {st_mode=S_IFDIR|0755, st_size=60, ...}) = 0
25813 mkdirat(AT_FDCWD, "/mnt/ramdisk/bundle/rootfs/dev/shm", 0755) = 0
25813 mount("shm", "/mnt/ramdisk/bundle/rootfs/dev/shm", "tmpfs", MS_NOSUID|MS_NODEV|MS_NOEXEC, "mode=1777,size=65536k") = 0
25813 stat("/mnt/ramdisk/bundle/rootfs/dev/mqueue", 0xc4200d5f18) = -1 ENOENT (No such file or directory)
25813 mmap(0xc420100000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc420100000
25813 mmap(0xc41fff0000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc41fff0000
25813 stat("/mnt/ramdisk/bundle/rootfs/dev", {st_mode=S_IFDIR|0755, st_size=80, ...}) = 0
25813 mkdirat(AT_FDCWD, "/mnt/ramdisk/bundle/rootfs/dev/mqueue", 0755) = 0
25813 mount("mqueue", "/mnt/ramdisk/bundle/rootfs/dev/mqueue", "mqueue", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = 0
25813 lstat("/mnt/ramdisk/bundle/rootfs/sys", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0
25813 stat("/mnt/ramdisk/bundle/rootfs/sys", {st_mode=S_IFDIR|0755, st_size=40, ...}) = 0
25813 mount("/sys", "/mnt/ramdisk/bundle/rootfs/sys", 0xc4200bed5b, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_BIND|MS_REC, NULL) = 0
25813 stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=304, ...}) = 0
25813 lstat("/mnt/ramdisk/bundle/rootfs/etc", {st_mode=S_IFDIR|0755, st_size=1860, ...}) = 0
25813 lstat("/mnt/ramdisk/bundle/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
25813 stat("/mnt/ramdisk/bundle/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
25813 mount("/etc/resolv.conf", "/mnt/ramdisk/bundle/rootfs/etc/resolv.conf", 0xc4200bed64, MS_RDONLY|MS_BIND, NULL) = 0
25813 mount("/etc/resolv.conf", "/mnt/ramdisk/bundle/rootfs/etc/resolv.conf", 0xc4200bed69, MS_RDONLY|MS_REMOUNT|MS_BIND, NULL) = -1 EPERM (Operation not permitted)

If I remove this mount from my config.json and manually copy resolv.conf into the rootfs, it works.

@yuvipanda
Author

tmpfs on /mnt/ramdisk type tmpfs (rw,relatime,size=524288k)

is the mount where I put the bundle and tried again, to no avail. I'm on a pretty stock and fairly new Ubuntu Zesty install, running master of all the tools.

@yuvipanda
Author

Aha, if I don't make it mount as 'ro' (the default) and only mount it as 'rw', it works!

moby/moby#22994 might be related. I see that /etc/resolv.conf is first mounted, which is a successful call, and then there's an attempt to remount - which is what fails.

@cyphar
Member

cyphar commented Jul 17, 2017

Ah, this is what @justincormack was trying to fix with #1222, but it looks like it wasn't fully fixed.

/cc @justincormack

@cyphar
Member

cyphar commented Jul 17, 2017

Never mind, his fixes were related but not meant to fix this precise issue. It's quite odd that we're trying to remount a mount without changing any of its flags.

@ccbrown

ccbrown commented Jan 9, 2018

This is a pretty old issue now, but I stumbled upon it looking for something related.

> It's quite odd that we're trying to remount a mount without changing any of its flags.

It does look really odd, but there's a good reason for it. If the initial mount flags include MS_BIND, the other flags (apart from MS_REC) are ignored. So that first syscall:

mount("/etc/resolv.conf", "/mnt/ramdisk/bundle/rootfs/etc/resolv.conf", 0xc4200bed64, MS_RDONLY|MS_BIND, NULL)

results in a bind mount that is not read-only. To make it read-only, you have to remount it.
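The two-step sequence described here can be sketched roughly as follows. This is a minimal illustration, assuming Linux and Go's standard syscall package; readOnlyBind and mountFunc are hypothetical names and not runc's code. The mount call is injected so the sequence can be inspected without privileges.

```go
package main

import (
	"fmt"
	"syscall"
)

// mountFunc has the same shape as syscall.Mount, so a fake can be
// substituted for a dry run. (Hypothetical type, for illustration only.)
type mountFunc func(source, target, fstype string, flags uintptr, data string) error

// readOnlyBind is a hypothetical helper: create the bind mount first,
// then remount it read-only in a second call.
func readOnlyBind(mount mountFunc, source, target string) error {
	// Step 1: plain bind mount. Adding MS_RDONLY here would have no effect.
	if err := mount(source, target, "", syscall.MS_BIND, ""); err != nil {
		return fmt.Errorf("bind %s: %w", target, err)
	}
	// Step 2: MS_REMOUNT|MS_BIND changes the flags of this mount only.
	flags := uintptr(syscall.MS_REMOUNT | syscall.MS_BIND | syscall.MS_RDONLY)
	if err := mount(source, target, "", flags, ""); err != nil {
		return fmt.Errorf("remount %s read-only: %w", target, err)
	}
	return nil
}

func main() {
	// Dry run: print the calls instead of performing them.
	dryRun := func(source, target, fstype string, flags uintptr, data string) error {
		fmt.Printf("mount(%q, %q, flags=%#x)\n", source, target, flags)
		return nil
	}
	if err := readOnlyBind(dryRun, "/etc/resolv.conf", "/tmp/rootfs/etc/resolv.conf"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Passing syscall.Mount instead of dryRun would perform the real mounts. Note that this simple form does not carry over any flags the bind mount inherited, so it can still fail with EPERM in the situation discussed in this issue.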

Now if the parent mount in a more privileged namespace has nosuid, noexec, nodev, relatime, etc. bits set, then they will get inherited by the new bind mount, and you cannot change those bits. So when you do the remount, you need to figure out which of those bits the mount inherited (e.g. by inspecting /proc/self/mountinfo), then pass them along to the remount syscall. Otherwise, you'll get an "operation not permitted" error like you see here because you aren't allowed to unset those bits. I bet that's what's happening.

For debugging, I would recommend comparing that strace to what you see using the mount CLI command in a user+mount namespace (unshare --map-root-user --user --mount). The command will probably create the read-only bind mount with no problem, and you should be able to spot the difference in syscalls pretty easily.
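As a rough sketch of the /proc/self/mountinfo approach mentioned above, the per-mount options of each entry can be turned back into MS_* bits. This assumes Linux and the documented mountinfo layout, where the sixth field holds the per-mount options; perMountFlags and optionBits are hypothetical names, and only a handful of options are handled.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"syscall"
)

// optionBits maps per-mount option names to MS_* bits.
// Only a subset of options is covered in this sketch.
var optionBits = map[string]uintptr{
	"ro":         syscall.MS_RDONLY,
	"nosuid":     syscall.MS_NOSUID,
	"nodev":      syscall.MS_NODEV,
	"noexec":     syscall.MS_NOEXEC,
	"noatime":    syscall.MS_NOATIME,
	"nodiratime": syscall.MS_NODIRATIME,
	"relatime":   syscall.MS_RELATIME,
}

// perMountFlags converts the per-mount options (6th field) of a single
// mountinfo line into MS_* bits. Unknown options such as "rw" are skipped.
func perMountFlags(line string) (uintptr, error) {
	fields := strings.Fields(line)
	if len(fields) < 6 {
		return 0, fmt.Errorf("malformed mountinfo line: %q", line)
	}
	var flags uintptr
	for _, opt := range strings.Split(fields[5], ",") {
		if bit, ok := optionBits[opt]; ok {
			flags |= bit
		}
	}
	return flags, nil
}

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Println("cannot read mountinfo:", err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		flags, err := perMountFlags(sc.Text())
		if err != nil {
			continue // skip lines this sketch cannot parse
		}
		// Field 5 (index 4) is the mount point.
		fmt.Printf("%-40s %#x\n", strings.Fields(sc.Text())[4], flags)
	}
}
```

The resulting bits for the relevant mount point could then be OR-ed into the flags passed to the remount call.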

@cyphar
Member

cyphar commented Jan 9, 2018

> Now if the parent mount in a more privileged namespace has nosuid, noexec, nodev, relatime, etc. bits set, then they will get inherited by the new bind mount, and you cannot change those bits. So when you do the remount, you need to figure out which of those bits the mount inherited (e.g. by inspecting /proc/self/mountinfo), then pass them along to the remount syscall. Otherwise, you'll get an "operation not permitted" error like you see here because you aren't allowed to unset those bits. I bet that's what's happening.

We've had this discussion previously (in #1603), and decided to add the code handling that to Docker (moby/moby#35205) or cri-o, because at the time it was caused by having a (possibly) invalid OCI configuration. Note that you can get the necessary flags from statfs, which removes the need to parse /proc/self/mountinfo.

But yes, you're completely right that we shouldn't be passing MS_RDONLY|MS_BIND since the MS_RDONLY is confusing and ignored. However, note that MS_BIND|MS_REMOUNT is actually a special case for changing mount flags (see #1572).

So ultimately the fix is either to manually change the config.json to have the right set of security flags, or to revive #1603.
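For illustration, the statfs route could look something like the sketch below. It assumes Linux, where statfs reports mount flags as ST_* bits in f_flags; the ST_* values are written out locally from the kernel headers, and statfsToMountFlags and inheritedFlags are hypothetical names rather than anything from runc or Docker.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// ST_* bits as reported in statfs f_flags on Linux, defined locally
// for this sketch.
const (
	stRdonly     = 0x0001
	stNosuid     = 0x0002
	stNodev      = 0x0004
	stNoexec     = 0x0008
	stNoatime    = 0x0400
	stNodiratime = 0x0800
	stRelatime   = 0x1000
)

// statfsToMountFlags translates ST_* bits into the matching MS_* bits.
// Most values coincide, but relatime does not, so the mapping is explicit.
func statfsToMountFlags(st int64) uintptr {
	table := []struct {
		st int64
		ms uintptr
	}{
		{stRdonly, syscall.MS_RDONLY},
		{stNosuid, syscall.MS_NOSUID},
		{stNodev, syscall.MS_NODEV},
		{stNoexec, syscall.MS_NOEXEC},
		{stNoatime, syscall.MS_NOATIME},
		{stNodiratime, syscall.MS_NODIRATIME},
		{stRelatime, syscall.MS_RELATIME},
	}
	var flags uintptr
	for _, e := range table {
		if st&e.st != 0 {
			flags |= e.ms
		}
	}
	return flags
}

// inheritedFlags returns the MS_* flags currently set on the mount
// that contains path.
func inheritedFlags(path string) (uintptr, error) {
	var buf syscall.Statfs_t
	if err := syscall.Statfs(path, &buf); err != nil {
		return 0, err
	}
	return statfsToMountFlags(int64(buf.Flags)), nil
}

func main() {
	path := "/"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	flags, err := inheritedFlags(path)
	if err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	fmt.Printf("%s: mount flags %#x\n", path, flags)
}
```

Calling inheritedFlags on the bind-mount destination after the first mount call would give the bits to add to the remount flags.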

@ctalledo
Contributor

ctalledo commented Mar 16, 2021

Hi @cyphar,

I am hitting this same problem running runc within a rootless container (user-namespace).

In function remount() in libcontainer/rootfs_linux.go, the call to unix.Mount() fails because it does not preserve existing flags on the original mount (e.g., nodev):

func remount(m *configs.Mount, rootfs string) error {
	var (
		dest = m.Destination
	)
	if !strings.HasPrefix(dest, rootfs) {
		dest = filepath.Join(rootfs, dest)
	}
	return unix.Mount(m.Source, dest, m.Device, uintptr(m.Flags|unix.MS_REMOUNT), "")
}

In my specific case, the failure occurs when runc (running inside a rootless container) is setting up a bind mount into the container's rootfs, where the bind-mount has nodev set, but the mount flags received from the OCI spec do not. As a result, the remount is seen by the kernel as clearing the nodev flag and it returns EPERM (not sure if the EPERM is specific to running inside a user-namespace, though I suspect it is).
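To make the failure mode concrete, here is a small sketch of the flag arithmetic involved. It assumes that read-only, nosuid, nodev and noexec are the flags the kernel may refuse to clear inside a user namespace (atime handling is left out for simplicity); clearedByRemount and preserving are hypothetical helpers, not part of runc.

```go
package main

import (
	"fmt"
	"syscall"
)

// lockable is the set of flags assumed here to be protected from being
// cleared by a less privileged remount. Atime modes are omitted.
const lockable uintptr = syscall.MS_RDONLY | syscall.MS_NOSUID |
	syscall.MS_NODEV | syscall.MS_NOEXEC

// clearedByRemount reports which protected flags are set on the existing
// mount but missing from the requested remount flags.
func clearedByRemount(current, requested uintptr) uintptr {
	return current & lockable &^ requested
}

// preserving returns the requested flags with the existing protected
// flags added back in.
func preserving(current, requested uintptr) uintptr {
	return requested | (current & lockable)
}

func main() {
	// The bind mount inherited nodev, but the spec only asked for "ro".
	current := uintptr(syscall.MS_NODEV)
	requested := uintptr(syscall.MS_BIND | syscall.MS_REMOUNT | syscall.MS_RDONLY)

	fmt.Printf("would clear:      %#x\n", clearedByRemount(current, requested))
	fmt.Printf("preserving flags: %#x\n", preserving(current, requested))
}
```

A non-zero result from clearedByRemount corresponds to the case described above, where the kernel sees the remount as an attempt to drop nodev.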

I read the discussion in PR #1603 and it seems the conclusion was that the higher level container manager (in my case Docker + containerd) should be the entity providing the mount flags for the bind mount, and that runc is not allowed to modify those mount flags in any way. Is this correct?

Reading the OCI spec on the meaning of mount options, it simply says "Mount options of the filesystem to be used", implying that the higher level container manager should always pass those options (i.e., in my case, preserve nodev in the bind mount). But a clarification in the OCI spec for bind mounts would surely help, given that they inherit their mount options from the source mount and it's not clear whether omitting a flag from the spec's mount options means clearing that flag or keeping it as is. Currently runc clears any flags that are not present, and that is what causes the problem in this issue.

Any further thoughts on this? We need a fix (either at runc level, the OCI spec, or in the container managers above runc), as this issue will continue to show up as more people start running runc inside the user-ns / rootless containers.

Thanks!
