fix --read-only containers under --userns-remap #1572
The documentation here: https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations says that readonly containers can't be used with user namespaces due to some kernel restriction. In fact, there is a special case in the kernel to be able to do stuff like this, so let's use it.

This takes us from:

ubuntu@docker:~$ docker run -it --read-only ubuntu
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"".

to:

ubuntu@docker:~$ docker-runc --version
runc version 1.0.0-rc4+dev
commit: ae29480-dirty
spec: 1.0.0
ubuntu@docker:~$ docker run -it --read-only ubuntu
root@181e2acb909a:/# touch foo
touch: cannot touch 'foo': Read-only file system

Signed-off-by: Tycho Andersen <tycho@docker.com>
343abf5
to
66eb2a3
Compare
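I actually was hitting a variant of this bug today. I noticed that in `readonlyPath` we also do `MS_REMOUNT | MS_BIND` which has caused me issues with readonly bindmounts (though I can't reproduce it outside of Docker weirdly). Would this change also apply to `readonlyPath`? /cc @justincormack who was working on this stuff last. |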
Is it safe even if the original mount is not bind mount? |
@rhvgoyal PTAL |
On Fri, Aug 25, 2017 at 03:25:38AM +0000, Aleksa Sarai wrote:
I actually was hitting a variant of this bug today. I noticed that in `readonlyPath` we also do `MS_REMOUNT | MS_BIND` which has caused me issues with readonly bindmounts (though I can't reproduce it outside of Docker weirdly). Would this change also apply to `readonlyPath`?
Not that I know of, the implementation looks fine to me. Do you have a
command line that will reproduce it?
|
On Fri, Aug 25, 2017 at 07:14:15AM +0000, Qiang Huang wrote:
Is it safe even if the original mount is not bind mount?
Yes, the MS_REMOUNT and MS_BIND code paths are handled separately by
the kernel, and MS_REMOUNT is considered first.
|
On Fri, Aug 25, 2017 at 06:37:42AM -0600, Tycho Andersen wrote:
Oh, I suppose the initial bind mount will break if you're sharing the
host's mount ns. There's nothing much we can do in that case, though,
things simply won't work.
Tycho
|
I do, unfortunately it requires applying an out-of-tree patch to Docker. You can see the patch here, it's effectively forcefully adding a "secret" to every container and the mounting of `/run/secrets` is what's breaking.
If you apply the above patch, every container run will fail. I'm trying to reproduce the issue outside of Docker (which isn't working, even if I use the configuration generated by `containerd` directly!), or on a stock version of Docker, but I'm having trouble doing that. |
On Fri, Aug 25, 2017 at 05:44:25AM -0700, Aleksa Sarai wrote:
@tych0
> Not that I know of, the implementation looks fine to me. Do you have a
command line that will reproduce it?
I do, unfortunately it requires applying an out-of-tree patch to Docker. You can see the patch [here](https://github.com/suse/docker/tree/suse-v17.04.x-secrets), it's effectively forcefully adding a "secret" to every container and the mounting of `/run/secrets` is what's breaking. The `strace` log looks like this:
```
[pid 26131] mount("/var/lib/docker/100000.100000/containers/904f91a237a6c330a65fd4444635796fb25a67cfefc06213fcaa3df5d73d890c/secrets", "/var/lib/docker/100000.100000/vfs/dir/b252479c3020cdb9bedb93f1717be95714b449e3e0b302fe034149627ee37e08/run/secrets", 0xc4200b5df0, MS_RDONLY|MS_BIND|MS_REC, NULL) = 0
[pid 26131] mount("/var/lib/docker/100000.100000/containers/904f91a237a6c330a65fd4444635796fb25a67cfefc06213fcaa3df5d73d890c/secrets", "/var/lib/docker/100000.100000/vfs/dir/b252479c3020cdb9bedb93f1717be95714b449e3e0b302fe034149627ee37e08/run/secrets", 0xc4200b5df5, MS_RDONLY|MS_REMOUNT|MS_BIND|MS_REC, NULL) = -1 EPERM (Operation not permitted)
```
I'm trying to reproduce the issue outside of Docker (which isn't working, even if I use the configuration generated by `containerd` directly!), or on a stock version of Docker, but I'm having trouble doing that.
Ah, I see. I suspect the problem is that your mount is inheriting some
of the lockable flags (nosuid, nodev, ro, etc.) from its parent when
the bind mount is created, and the subsequent MS_REMOUNT isn't passing
in these flags when required.
A couple possible options here: 1. after creating the bind mount,
parse /proc/self/mountinfo to figure out if it had any of these flags
and add them, or 2. figure out which parent mount is going to be bound by
hand in libcontainer, and pass the additional flags for the parent in
when doing the MS_REMOUNT.
|
On Fri, Aug 25, 2017 at 06:51:22AM -0600, Tycho Andersen wrote:
Oh, derp. Or 3. call statfs() to acquire the mount flags, and don't
bother with any parsing because it's a pain in the ass :)
|
@tych0 So what does "MS_REMOUNT | MS_BIND" actually mean? Are we remounting the existing mount point or creating a new bind mount point? |
@cyphar I just pushed a patch to this branch that I think will fix your issue as well. |
@rhvgoyal it means hitting a special code path in the kernel, so it really has nothing to do with MS_BIND. I have no idea why this exists (presumably to do exactly this), but it works and I've used it elsewhere. See: https://github.com/torvalds/linux/blob/master/fs/namespace.c#L2270 in do_remount(). |
729889b
to
a9169aa
Compare
libcontainer/rootfs_linux.go
Outdated
statfs := syscall.Statfs_t{}
syscall.Statfs(path, &statfs)

flags := statfs.Flags | unix.MS_BIND | unix.MS_REMOUNT | unix.MS_RDONLY | unix.MS_REC
We don't use `syscall` anymore, we use the extended-stdlib `golang.org/x/sys/unix` library. Also you need to handle the error. So this should look more like:
var statfs unix.Statfs_t
if err := unix.Statfs(path, &statfs); err != nil {
	return err
}
statfs.Flags |= unix.MS_BIND | unix.MS_REMOUNT | unix.MS_RDONLY | unix.MS_REC
return unix.Mount(path, path, "", uintptr(statfs.Flags), "")
Also it looks like these changes weren't correctly `gofmt`'d. Run `gofmt -s -w .` to fix this up. Looks like you updated it while I was commenting.
@tych0 I was thinking the same thing wrt dropping inherited flags. Thanks for pushing a patch to fix it, I'll test it now. I'm a bit confused why we are redoing the mount in this case (since we're not actually updating any of the flags) but this shouldn't hurt. |
a9169aa
to
b8136d3
Compare
On Fri, Aug 25, 2017 at 06:19:30AM -0700, Aleksa Sarai wrote:
cyphar commented on this pull request.
> @@ -723,7 +724,13 @@ func readonlyPath(path string) error {
}
return err
}
- return unix.Mount(path, path, "", unix.MS_BIND|unix.MS_REMOUNT|unix.MS_RDONLY|unix.MS_REC, "")
+
+ statfs := syscall.Statfs_t{}
+ syscall.Statfs(path, &statfs)
+
+ flags := statfs.Flags | unix.MS_BIND | unix.MS_REMOUNT | unix.MS_RDONLY | unix.MS_REC
Also it looks like these changes weren't correctly `gofmt`'d. Run `gofmt -s -w .` to fix this up.
Fixed both, thanks. FWIW this line was correctly formatted (even
though it wouldn't be in some cases and the spaces would have to be
dropped, I don't really understand what go's rules are for this, yay
for consistency). I alphabetized the import wrong :)
|
/me is testing to make sure it fixes the issue I was seeing. |
I am seeing 3 places where we are trying to make a mount point read-only: setReadonly(), remountReadonly() and readonlyPath(). Can we unify all of that? Also, if retaining old flags is an issue with readonlyPath(), why is it not an issue with setReadonly() and remountReadonly()? |
remountReadonly() passes in the mount flags from the mount configuration, which should have nosuid or whatever; readonlyPath() makes a path that might not be a mount at all read only; setReadonly is only used for the rootfs, which presumably doesn't have stuff like noexec or nosuid, although it could. Perhaps we should use remountReadonly for that and find the "/" mount in the descriptor list? |
I think we can have a helper function whose job is to just remount an existing mount point read-only. It will be passed the flags and the destination path, and then all the places can make use of that helper. Say remountReadOnly(path, flags). remountReadOnly() { And now this code can be reused. Maybe we can get rid of setReadOnly() and replace that call with the above helper. |
Alright, I've figured out why my error is happening. It's caused by
|
So I was reading the remount code path in the kernel. It does not seem to automatically pick up existing mount flags (except some atime stuff). That means it is the caller's responsibility to provide the existing flags in the remount call if the caller wants to retain them. We probably also need to differentiate between the cases where we want to retain existing flags and the cases where we don't. I feel there could be a PR just to clean up the code, consolidate these mount calls and add proper comments; let's merge that first and make sure things are not broken. |
@rhvgoyal Yeah, that's what @justincormack noted a few months ago while cleaning up this code and I think @tych0 mentioned it above. I agree we really need to simplify the plethora of |
This patch fixes the issue I had, but as @rhvgoyal said, this code really needs a more significant cleanup:
diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index 3c714b0905e7..e58ee76d4d4e 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -791,7 +791,14 @@ func remount(m *configs.Mount, rootfs string) error {
 	if !strings.HasPrefix(dest, rootfs) {
 		dest = filepath.Join(rootfs, dest)
 	}
-	if err := unix.Mount(m.Source, dest, m.Device, uintptr(m.Flags|unix.MS_REMOUNT), ""); err != nil {
+
+	statfs := unix.Statfs_t{}
+	if err := unix.Statfs(path, &statfs); err != nil {
+		return err
+	}
+	flags := int(statfs.Flags) | m.Flags | unix.MS_REMOUNT
+
+	if err := unix.Mount(m.Source, dest, m.Device, uintptr(flags), ""); err != nil {
 		return err
 	}
 	return nil
The reason why this fixes the issue is (as above) it's because the |
I think we need two functions: r.e. @cyphar's bug, that seems to me like whoever is providing the configuration for m should be providing all the right flags (presumably m itself is a bind mount, which needs to be inherited). Whatever is up one level should do that. |
Actually, never mind. This does need a more significant cleanup than even that. I think we should merge this, and I'll put together another branch trying to make it better. |
@tych0 Here's the "higher up" version, but I agree this code all needs to be cleaner.
From 0f0054d22b1b2eadda635eb11089d826ecef2b81 Mon Sep 17 00:00:00 2001
From: Aleksa Sarai <asarai@suse.de>
Date: Sat, 26 Aug 2017 01:53:07 +1000
Subject: [PATCH] rootfs: preserve old mount flags when remounting bindmount

Fixes the case where a bind-mount is being defined in a user namespaced
container which requires a remount. Previously this code would attempt
to do a MS_REMOUNT that potentially dropped mount flags that were
inherited through the MS_BIND. Resolve this by intentionally adding the
mount flags to the mount configuration in that scenario.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
---
 libcontainer/rootfs_linux.go | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index d4f8595f2873..42a1b365e312 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -222,6 +222,15 @@ func mountToRootfs(m *configs.Mount, rootfs, mountLabel string) error {
 		// bind mount won't change mount options, we need remount to make mount options effective.
 		// first check that we have non-default options required before attempting a remount
 		if m.Flags&^(syscall.MS_REC|syscall.MS_REMOUNT|syscall.MS_BIND) != 0 {
+			// Make a copy of the previous mount flags, because in a user
+			// namespace we are not allowed to drop mount options from the host
+			// bindmount.
+			var statfs syscall.Statfs_t
+			if err := syscall.Statfs(dest, &statfs); err != nil {
+				return err
+			}
+			m.Flags |= int(statfs.Flags)
+
 			// only remount if unique mount options are set
 			if err := remount(m, rootfs); err != nil {
 				return err
--
2.14.1 |
I still don't think so: whoever creates the |
@tych0 The problem is that if you want to create a readonly bindmount, you can't just do |
Sorry for derailing this patch discussion. @tych0 How about we revert to just having the first patch (which should be fairly un-contentious), and we can discuss the rest of these issues in a separate issue/PR? If you'd prefer to work on it that's fine for me, alternatively I can work on it next week. |
b8136d3
to
66eb2a3
Compare
IMO, the users should probably be specifying these things (i.e. not really users, but docker/whoever should be doing a statfs if it's going to specify bind mounts), but it would also be reasonable and user friendly to do it as you describe, especially if OCI doesn't support specifying options like nosuid. Anyway, I've dropped the second patch for now. I'm happy to do some refactoring next week too, I won't be doing it today, though. We can coordinate off-list or something. Cheers! |
On the one hand I agree (because it means we can do less guesswork), and OCI does support the full range of In either case, the single patch PR is fine as-is and I'll LGTM it. |