User ns reworked #4

Conversation

mauriciovasquezbernal
Member

It's a reworked version of #3.

Member

@rata rata left a comment

I left some silly comments.

I'll need more time to review the changes for the kuberuntime. Is tomorrow okay? :)

pkg/kubelet/kubelet_usernamespace_linux.go (outdated, resolved)
pkg/kubelet/container/runtime.go (outdated, resolved)
pkg/kubelet/container/runtime.go (resolved)
@mauriciovasquezbernal mauriciovasquezbernal force-pushed the mauricio/userns branch 4 times, most recently from 41c3fe9 to 8992dbc on July 2, 2020 20:38
GetRuntimeConfigInfo returns information about the configuration of the runtime.
For now it only returns the uid/gid mappings configuration.

Implement a new helper that will be used in the following commits.

Extend the CRI to include a new user namespace mode field that lets the kubelet
specify the preferred user namespace mode for a given pod.

This commit implements support for user ns in the kubelet. The kubelet uses the
GetRuntimeConfigInfo function of the runtime to query for the configured uid/gid
mapping.

The kubelet tries to use POD mode for the user namespace when possible. NODE is used
when:
- The feature is not supported or not enabled in the runtime
- The value of the "alpha.kinvolk.io/userns" annotation is "node"
- The pod specification is incompatible with it:
-- Any host namespace is used (IPC, PID, NET)
-- There is any host-path volume
-- There is any non-namespaced capability (MKNOD, SYS_TIME, SYS_MODULE)
-- There is any privileged container
-- The pod has PVC mounts

Files under the pod volumes dir (/var/lib/kubelet/pods/xxxx/volumes) are chowned
to the mapped user in the host if the user namespace is used.
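
A minimal sketch of the POD versus NODE decision listed in the commit message above. The PodInfo type and the usePodUserNamespace helper are hypothetical simplifications for illustration; the real kubelet code works on v1.Pod and is not reproduced here.

```go
package main

import "fmt"

// PodInfo is a hypothetical, simplified stand-in for the parts of a pod spec
// that the commit message mentions. It is not the real v1.Pod type.
type PodInfo struct {
	Annotations     map[string]string
	HostIPC         bool
	HostPID         bool
	HostNetwork     bool
	HasHostPath     bool
	HasPVC          bool
	Privileged      bool
	AddCapabilities []string
}

// usernsAnnotation is the alpha annotation named in the commit message.
const usernsAnnotation = "alpha.kinvolk.io/userns"

// nonNamespacedCaps holds the capabilities the commit message lists.
var nonNamespacedCaps = map[string]bool{"MKNOD": true, "SYS_TIME": true, "SYS_MODULE": true}

// usePodUserNamespace reports whether POD mode can be used. When it returns
// false, the second value says why NODE mode is used instead.
func usePodUserNamespace(runtimeSupported bool, p PodInfo) (bool, string) {
	switch {
	case !runtimeSupported:
		return false, "runtime does not support or enable user namespaces"
	case p.Annotations[usernsAnnotation] == "node":
		return false, "annotation requests node mode"
	case p.HostIPC || p.HostPID || p.HostNetwork:
		return false, "pod uses a host namespace"
	case p.HasHostPath:
		return false, "pod has a host-path volume"
	case p.HasPVC:
		return false, "pod has PVC mounts"
	case p.Privileged:
		return false, "pod has a privileged container"
	}
	for _, c := range p.AddCapabilities {
		if nonNamespacedCaps[c] {
			return false, "pod adds non-namespaced capability " + c
		}
	}
	return true, ""
}

func main() {
	ok, reason := usePodUserNamespace(true, PodInfo{HostNetwork: true})
	fmt.Println(ok, reason)
}
```

Returning the reason alongside the boolean keeps the fallback explainable in logs or events, which is a design choice of this sketch and not something the commit message states.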
@mauriciovasquezbernal mauriciovasquezbernal force-pushed the mauricio/userns branch 2 times, most recently from f223792 to 31640cb on July 6, 2020 15:30
Member

@rata rata left a comment

@mauriciovasquezbernal thanks a lot for the patch, looks very good!

Added some silly questions for some things I didn't follow and wanted to confirm if we want that behavior, but overall LGTM! :)

return false
}
if len(c.UserNamespaceConfig.UidMappings) == 1 &&
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) {
Member

nit picking: is the uint32() needed? I think we shouldn't need it. Or am I missing something?

I mean, Go does force some explicit conversions, but as explained here I don't think one is needed to compare a variable with a constant.

Am I missing something?

Member Author

You're right. I'll remove them.

return false
}
if len(c.UserNamespaceConfig.UidMappings) == 1 &&
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) {
Member

Why c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295)? I mean, why equal to that constant value? Is it set to that when the runtime doesn't support user ns?

4294967295 encapsulates the entire uid/gid range I believe. The initial root namespace has a uid_map of 0 0 4294967295. But I agree that the use of it here is unclear

Member Author

@mauriciovasquezbernal mauriciovasquezbernal Jul 7, 2020

Yes, the idea is that if the configured mapping is 0 0 4294967295 it means that the support is not enabled (all the ids in the container are mapped to the same ids in the host).

Member

Ohh, I see. Thanks! Maybe a comment can make it very clear to everyone :)
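
One way the check discussed in this thread could be made self-explanatory is a named constant plus a small helper. This is only a sketch: the UserNSMapping definition below is a simplified assumption based on the fields visible in the snippets, and isIdentityMapping and fullIDRange are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// UserNSMapping mirrors the fields used in the snippets under review.
// The definition here is a simplified assumption, not the PR's type.
type UserNSMapping struct {
	ContainerID uint32
	HostID      uint32
	Size        uint32
}

// fullIDRange is 4294967295 (2^32 - 1), the size that the initial user
// namespace reports in its uid_map ("0 0 4294967295").
const fullIDRange = math.MaxUint32

// isIdentityMapping reports whether the mappings describe "no remapping":
// a single entry that maps every ID in the container to the same ID on the
// host. Untyped constants compare against uint32 fields without an explicit
// uint32() conversion.
func isIdentityMapping(m []UserNSMapping) bool {
	return len(m) == 1 &&
		m[0].ContainerID == 0 &&
		m[0].HostID == 0 &&
		m[0].Size == fullIDRange
}

func main() {
	identity := []UserNSMapping{{ContainerID: 0, HostID: 0, Size: fullIDRange}}
	remapped := []UserNSMapping{{ContainerID: 0, HostID: 500000, Size: 65536}}
	fmt.Println(isIdentityMapping(identity), isIdentityMapping(remapped))
}
```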

return false
}
if len(c.UserNamespaceConfig.UidMappings) == 1 &&
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(0) {
Member

Idem regarding uint32(). Is it needed?

single := &RuntimeConfigInfo{
UserNamespaceConfig: UserNamespaceConfigInfo{
UidMappings: []*UserNSMapping{
&UserNSMapping{
Member

Maybe add a test with more than one mapping too, given that this is a slice?

Comment on lines +505 to +506
if len(c.UserNamespaceConfig.UidMappings) == 1 &&
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(0) {
Member

Why if len(...) == 1 and hostID == 0 and size == 0? Not sure I follow what you are trying to detect.

Is it just when the size is 0? Why does it matter, in that case, that the hostID is 0 too? Or why wouldn't it be a problem if the specified mappings are more than one?

Member Author

It's trying to detect the case where a mapping like 0 1000 0 is configured. You're totally right, the whole logic to check if the usernamespace is supported or enabled has to be checked again. I think it could be simplified a lot but I avoided touching that at this time.

Member

Makes sense

gidMapping.Size_ = gidMappingSize
} else {
uidMapping.Size_ = uint32(4294967295)
gidMapping.Size_ = uint32(4294967295)
Member

What is 4294967295? The max value for something? Is it constant across platforms?

// container_id is the starting id for the mapping inside the container.
uint32 container_id = 1;
// host_id is the starting id for the mapping on the host.
uint32 host_id = 2;
Member

Why use uint32 for container and host UIDs? Is it because on all platforms they will be 32 bits or less?

I think we talked about this in the past, about being maybe an unsigned int in C, but can't find that thread now. Sorry to re-ask :-/
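
For context, a sketch of how a (container_id, host_id, size) triple translates an ID from inside the namespace to the host. It assumes 32-bit unsigned IDs, which to my knowledge matches uid_t and gid_t on Linux. The IDMapping type and toHostID helper are hypothetical names and are not part of the CRI or the kubelet; the 500000/65536 range is just an example value.

```go
package main

import "fmt"

// IDMapping is a hypothetical Go view of the proto message above.
type IDMapping struct {
	ContainerID uint32 // first ID of the range inside the container
	HostID      uint32 // first ID of the range on the host
	Size        uint32 // number of consecutive IDs covered
}

// toHostID translates an ID inside the namespace to the host ID it maps to.
// The boolean is false when no mapping covers the ID.
func toHostID(mappings []IDMapping, id uint32) (uint32, bool) {
	for _, m := range mappings {
		// Compute the offset first so the range check cannot overflow uint32.
		if id >= m.ContainerID && id-m.ContainerID < m.Size {
			return m.HostID + (id - m.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	mappings := []IDMapping{{ContainerID: 0, HostID: 500000, Size: 65536}}
	hostID, ok := toHostID(mappings, 1000)
	fmt.Println(hostID, ok)
}
```

The sketch does not validate that HostID plus the offset stays within uint32; a real implementation would presumably reject such a mapping up front.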

&kubecontainer.RuntimeConfigInfo{},
false,
},
"Single mapping": {
Member

Idem, it would be nice to test more than one mapping, just in case.

@@ -24,6 +24,8 @@ import (
"time"

cadvisorapi "github.com/google/cadvisor/info/v1"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
Member

status and codes can be too generic and, in the future, clash with other imports. Maybe we should use names that make it clear they are gRPC-specific?

Not now, but in the future. Ah, and only if this code continues to live here :)

@cdesiniotis cdesiniotis left a comment

Hi! Overall this is really good work. I left a few comments/questions.

Comment on lines +493 to +494
if len(c.UserNamespaceConfig.UidMappings) == 1 &&
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) {

I think this would be more clear.

Suggested change
- if len(c.UserNamespaceConfig.UidMappings) == 1 &&
- c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) {
+ if len(c.UserNamespaceConfig.UidMappings) == 1 &&
+ c.UserNamespaceConfig.UidMappings[0].HostID == c.UserNamespaceConfig.UidMappings[0].ContainerID {


// getRemappedNonRootHostID parses docker info to determine ID on the host usernamespace which is mapped to {U/G}ID 0 in the container user-namespace
func getRemappedNonRootHostID(dockerInfo *dockertypes.Info) (uint32, error) {
remappedNonRootHostID64, err := strconv.ParseUint(strings.Split(path.Base(dockerInfo.DockerRootDir), ".")[0], 10, 0)

dockerInfo.DockerRootDir is /var/lib/docker if userns is not enabled, and /var/lib/docker/1000.1000 if userns is enabled (replace 1000 with whatever uid remapping is configured), so path.Base(. . .) returns the 1000.1000 component in the latter case. I agree this is a little hard to review, but it works :)
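
A sketch of how the same parsing could be written in separate steps so it is easier to review. The remappedHostID helper is a hypothetical variant that takes the root directory as a plain string and does not depend on the Docker types; it assumes the directory name has the form <uid>.<gid> only when userns-remap is enabled, as described above.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
	"strings"
)

// remappedHostID extracts the host ID that container ID 0 is mapped to from
// a Docker root directory such as /var/lib/docker/231072.231072.
// It returns ok == false when the last path component is not of the form
// <uid>.<gid>, which is taken here to mean that remapping is not enabled.
func remappedHostID(dockerRootDir string) (id uint32, ok bool, err error) {
	base := path.Base(dockerRootDir)
	parts := strings.Split(base, ".")
	if len(parts) != 2 {
		return 0, false, nil
	}
	// bitSize 32 makes ParseUint reject values that do not fit in a uint32.
	parsed, err := strconv.ParseUint(parts[0], 10, 32)
	if err != nil {
		return 0, false, fmt.Errorf("parsing remapped ID from %q: %w", base, err)
	}
	return uint32(parsed), true, nil
}

func main() {
	for _, dir := range []string{"/var/lib/docker", "/var/lib/docker/231072.231072"} {
		id, ok, err := remappedHostID(dir)
		fmt.Println(dir, id, ok, err)
	}
}
```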

// User namespace for this container/sandbox.
// Note: There is currently no way to set CONTAINER scoped user namespace in the Kubernetes API.
// NODE is the default value. Kubelet will set it to POD if pod spec indicates to use user-namespace remapping
// Namespaces currently set by the kubelet: POD, NODE

Does POD mean that all pods will be remapped to the same namespace (assuming userns are supported/enabled at runtime)? If so, I think the term POD is unclear as it suggests that each pod gets mapped to its own namespace. What happened to NODE_WIDE_REMAPPED?

Member

Pod means each pod will have its own namespace. The mapping, at least in this PoC, is the same for all pods, though.

NODE_WIDE_REMAPPED is not implemented here. We found out it was easier to implement a per-pod user ns, which also seems better. IIUC, NODE_WIDE_REMAPPED was proposed years ago, before most container runtimes added support for user ns, as a way to run containers as root without being root on the host. We can implement that if there is still value, though :)

Thanks for the clarification. I don't think NODE_WIDE_REMAPPED is necessary. But maybe a comment here clarifying what POD means (and what it will mean in the future) would be helpful.

makeIPTablesUtilChains: kubeCfg.MakeIPTablesUtilChains,
iptablesMasqueradeBit: int(kubeCfg.IPTablesMasqueradeBit),
iptablesDropBit: int(kubeCfg.IPTablesDropBit),
experimentalHostUserNamespaceDefaulting: utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalHostUserNamespaceDefaultingGate),

Is there a reason why this entire section shows as a diff? I believe you are just removing the experimentalHostUserNamespaceDefaulting feature.

@@ -70,6 +70,18 @@ import (
const (
managedHostsHeader = "# Kubernetes-managed hosts file.\n"
managedHostsHeaderWithHostNetwork = "# Kubernetes-managed hosts file (host network).\n"

// Kinvolk alpha annotation for user namespaces
kivolkUsernsAnn = "alpha.kinvolk.io/userns"

Is using an annotation a temporary fix? What's the reasoning behind not updating the pod spec?

Member

We are using an annotation to easily test on real clusters without changing the pod.Spec. Upstream PR will change the pod spec.

Got it

@@ -1798,3 +1805,52 @@ func (kl *Kubelet) hasHostMountPVC(pod *v1.Pod) bool {
}
return false
}

nit: add comment for UserNamespaceForPod() for consistency

@rata
Member

rata commented Jul 7, 2020

> Hi! Overall this is really good work. I left a few comments/questions.

thanks for the review and your answers, really helpful! :)

@cdesiniotis

Is this expected to work with dockershim at the moment?

@mauriciovasquezbernal
Member Author

Yes, it should work when using the userns-remap option in docker https://docs.docker.com/engine/security/userns-remap/.

@cdesiniotis

Hmm okay. I get permission issues when kubelet attempts to mount from /var/lib/kubelet/pods/ to /var/lib/docker/231072.231072/. I am using Docker 19.03 (and the latest versions of containerd/runc that ship with it). Is there something I am missing with my setup?

@mauriciovasquezbernal
Member Author

@cdesiniotis I have the following:

$ dockerd --version
Docker version 19.03.9, build 9d988398e7
$ sudo cat /etc/subuid
mvb:500000:65536
$ sudo cat /etc/subgid
mvb:500000:65536
$ sudo dockerd --userns-remap="mvb:mvb"

Can you share the exact error you're getting?

@cdesiniotis

cdesiniotis commented Jul 7, 2020

@mauriciovasquezbernal My setup:

$ dockerd --version
Docker version 19.03.12, build 48a66213fe
$ sudo cat /etc/subuid
dockremap:231072:65536
$ sudo cat /etc/subgid
dockremap:231072:65536
$ sudo dockerd --userns-remap="default"

I run some pods. When I specify to use the pod userns it fails:

$ cluster/kubectl.sh apply -f nodens.yaml
pod/node-userns created
$ cluster/kubectl.sh get pods
NAME          READY   STATUS    RESTARTS   AGE
node-userns   1/1     Running   0          4s
$ cluster/kubectl.sh apply -f podns.yaml
pod/pod-userns created
$ cluster/kubectl.sh get pods
NAME          READY   STATUS               RESTARTS   AGE
node-userns   1/1     Running              0          22s
pod-userns    0/1     ContainerCannotRun   0          4s
$ cluster/kubectl.sh apply -f default.yaml
pod/default created
$ cluster/kubectl.sh get pods
NAME          READY   STATUS               RESTARTS   AGE
node-userns   1/1     Running              0          38s
pod-userns    0/1     ContainerCannotRun   0          20s
default       0/1     ContainerCannotRun   0          2s

Looking in the kubelet logs at /tmp/kubelet.log I see error messages like the following for the pods that fail:

E0707 21:17:01.437863   15695 remote_runtime.go:222] StartContainer "b48e2e5bc448b58b43a6eed9488e418b3d4753b26303688790e790c3ce2fae88" from runtime service failed: rpc error: code = Unknown desc = failed to start container "b48e2e5bc448b58b43a6eed9488e418b3d4753b26303688790e790c3ce2fae88": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/8696e9a0-836c-4539-95c9-3a3ab735dd6c/containers/container1/c1987dfa\\\" to rootfs \\\"/var/lib/docker/231072.231072/overlay2/bb0fe03d4faab00737d5ed23fb0d6d571af66f8d4c978e346ea8f7871a71e7d9/merged\\\" at \\\"/dev/termination-log\\\" caused \\\"stat /var/lib/kubelet/pods/8696e9a0-836c-4539-95c9-3a3ab735dd6c/containers/container1/c1987dfa: permission denied\\\"\"": unknown

@mauriciovasquezbernal
Member Author

@cdesiniotis oh right. I suppose you are running containerd with systemd, if that's the case it needs to have the supplemental group 0 (SupplementaryGroups=0 in the systemd unit). More details in opencontainers/runc#2484.

@cdesiniotis

@mauriciovasquezbernal Ah that was it. Thank you for the help!

@@ -81,6 +85,10 @@ const (
// to kubelet behavior and system settings in addition to any API flags that may be introduced.
)

var (
linuxIDMappingRegexp = regexp.MustCompile("([aA-zZ]+):([0-9]+):([0-9]+)")
Member

@rata rata Jul 8, 2020

It seems that the first field can be numbers too. We probably need to modify this. Or maybe just use Split() with : as separator? Might be simpler, not sure

From https://docs.docker.com/engine/security/userns-remap/#prerequisites:

Each file contains three fields: the username or ID of the user, followed by a beginning UID or GID (which is treated as UID or GID 0 within the namespace) and a maximum number of UIDs or GIDs available to the user

Additionally, a typical accepted regex for usernames in Linux is "^[a-z][-a-z0-9_]*\$", so need to account for numbers in the first field anyways.
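
A sketch of the Split() alternative suggested above. Note also that the character class [aA-zZ] in the current regexp covers the ASCII range from A to z, which includes punctuation such as [ and _, so it is not equivalent to "letters only". The SubIDRange type and parseSubIDLine helper are hypothetical names; the owner field is kept as text so that both user names and numeric IDs are accepted.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// SubIDRange is one entry of /etc/subuid or /etc/subgid.
type SubIDRange struct {
	Owner string // user name or numeric ID, kept as text
	Start uint32 // first subordinate ID
	Count uint32 // number of IDs available to the owner
}

// parseSubIDLine parses a line of the form "owner:start:count" without
// relying on a regular expression for the owner field.
func parseSubIDLine(line string) (SubIDRange, error) {
	fields := strings.Split(strings.TrimSpace(line), ":")
	if len(fields) != 3 || fields[0] == "" {
		return SubIDRange{}, fmt.Errorf("malformed subid line %q", line)
	}
	start, err := strconv.ParseUint(fields[1], 10, 32)
	if err != nil {
		return SubIDRange{}, fmt.Errorf("invalid start ID in %q: %w", line, err)
	}
	count, err := strconv.ParseUint(fields[2], 10, 32)
	if err != nil {
		return SubIDRange{}, fmt.Errorf("invalid count in %q: %w", line, err)
	}
	return SubIDRange{Owner: fields[0], Start: uint32(start), Count: uint32(count)}, nil
}

func main() {
	for _, line := range []string{"dockremap:231072:65536", "1000:100000:65536", "broken"} {
		r, err := parseSubIDLine(line)
		fmt.Println(r, err)
	}
}
```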

@mauriciovasquezbernal
Member Author

@rata @cdesiniotis I'm checking the state of the art on the proposals for this feature first. I'll address your comments once I have a clear view of what are the next steps on this implementation. Thanks a lot for the review.

@cdesiniotis

@mauriciovasquezbernal Thanks for your work! I am willing to help/review, so feel free to include me for review on this (or related work). Also, what are the current plans for upstream work (i.e. KEP and upstream PR)? Are you (@rata @mauriciovasquezbernal) still planning to lead that effort?

Add two examples with SYS_ADMIN and NET_ADMIN capabilities.

- Differentiate when it's a limitation on the Pod spec or in the runtime
- Don't drop the content of the error message shown to the user

There are some features that *could* be incompatible with user namespaces:
- hostpath volumes
- PVC
- sharing other host namespaces
- some capabilities

The current code forbids using 'Pod' mode when one of those features is present
in the PodSpec. This commit relaxes that logic and replaces it with a warning only
for the case where host namespaces are shared, given that it's the only case
that could present issues when creating the container.