-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
User ns reworked #4
User ns reworked #4
Conversation
a0ab60f
to
b18cd2d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some silly comments.
I'll need to review with more time the changes for the kuberuntime. Is tomorrow okay? :)
41c3fe9
to
8992dbc
Compare
GetRuntimeConfigInfo returns information about the configuration of the runtime. For now it only returns the uid/gid mappings configuration.
Implement a new helper that will be used in the following commits
Extend the CRI to include a new user namespace mode field to allow kubelet to specify what is the prefered user namespace mode for a given pod.
This commit implements support for user ns in the kubelet. The kubelet uses the GetRuntimeInfoConfig function of the runtime to query for the uid/gid configured mapping. Kubelet tries to use POD mode for the user namespace when possible, NODE is used when: - Feature is not supported nor enabled in the runtime - The value of the "alpha.kinvolk.io/userns" annotation is "node" - The pod specification is imcompatible with it -- Any host namespace is used (IPC, PID, NET) -- There is any host-path volume -- There is any non namespaced capability (MKNOD, SYS_TIME, SYS_MODULE) -- There is any privileged container -- The pod has PVC mounts Files under the pod volumes dir (/var/lib/kubelet/pods/xxxx/volumes) are chowned to the mapped user in the host if the user namespace is used.
f223792
to
31640cb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mauriciovasquezbernal thanks a lot for the patch, looks very good!
Added some silly questions for some things I didn't follow and wanted to confirm if we want that behavior, but overall LGTM! :)
return false | ||
} | ||
if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit picking: is the uint32()
needed? I think we shouldn't need it. Or am I missing something?
I mean, golang does force to use some explicit conversions, but as explained here I don't think it should be needed to compare a var with a constant.
Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I'll remove them.
return false | ||
} | ||
if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295)
? I mean, why equal to that constant value? It is set to that when the runtime doesn't support user ns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4294967295 encapsulates the entire uid/gid range I believe. The initial root namespace has a uid_map of 0 0 4294967295
. But I agree that the use of it here is unclear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the idea is that if the configured mapping is 0 0 4294967295
it means that the support is not enabled (all the ids in the container are mapped to the same ids in the host).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, I see. Thanks! Maybe a comment can make it very clear to everyone :)
return false | ||
} | ||
if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem regarding uint32()
. Is it needed?
single := &RuntimeConfigInfo{ | ||
UserNamespaceConfig: UserNamespaceConfigInfo{ | ||
UidMappings: []*UserNSMapping{ | ||
&UserNSMapping{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a test with more than one mapping too, given that this is a slice?
if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why if len(...) == 1 and hostID == 0 and size == 0
? Not sure I follow what you are trying to detect.
Is it just when the size is 0? Why does it matter, in that case, that the hostID is 0 too? Or why wouldn't it be a problem if the specified mappings are more than one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's trying to detect the case where a mapping like 0 1000 0
is configured. You're totally right, the whole logic to check if the usernamespace is supported or enabled has to be checked again. I think it could be simplified a lot but I avoided touching that at this time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense
gidMapping.Size_ = gidMappingSize | ||
} else { | ||
uidMapping.Size_ = uint32(4294967295) | ||
gidMapping.Size_ = uint32(4294967295) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is 4294967295? The max value for something? Is it constant across platforms?
// container_id is the starting id for the mapping inside the container. | ||
uint32 container_id = 1; | ||
// host_id is the starting id for the mapping on the host. | ||
uint32 host_id = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use uint32
for container and host UIDs? Is it because in all platforms will be 32 or less?
I think we talked about this in the past, about being maybe an unsigned int in C, but can't find that thread now. Sorry to re-ask :-/
&kubecontainer.RuntimeConfigInfo{}, | ||
false, | ||
}, | ||
"Single mapping": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem, will be nice to test more than once mapping, just in case
@@ -24,6 +24,8 @@ import ( | |||
"time" | |||
|
|||
cadvisorapi "github.com/google/cadvisor/info/v1" | |||
"google.golang.org/grpc/codes" | |||
"google.golang.org/grpc/status" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status and codes can be too generic and, in the future, clash with other imports. Maybe should we use a name that makes it clear that is a grpc specific thing?
Not now, but in the future. Ah, and only if this codes continues to live here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! Overall this is really good work. I left a few comments/questions.
if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be more clear.
if len(c.UserNamespaceConfig.UidMappings) == 1 && | |
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) { | |
if len(c.UserNamespaceConfig.UidMappings) == 1 && | |
c.UserNamespaceConfig.UidMappings[0].HostID == c.UserNamespaceConfig.UidMappings[0].ContainerID { |
return false | ||
} | ||
if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4294967295 encapsulates the entire uid/gid range I believe. The initial root namespace has a uid_map of 0 0 4294967295
. But I agree that the use of it here is unclear
|
||
// getRemappedNonRootHostID parses docker info to determine ID on the host usernamespace which is mapped to {U/G}ID 0 in the container user-namespace | ||
func getRemappedNonRootHostID(dockerInfo *dockertypes.Info) (uint32, error) { | ||
remappedNonRootHostID64, err := strconv.ParseUint(strings.Split(path.Base(dockerInfo.DockerRootDir), ".")[0], 10, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output of path.Base(. . .)
is /var/lib/docker
if userns is not enabled, and /var/lib/docker/1000.1000
if userns is enabled (replace 1000 with whatever uid remapping is configured). I agree this is a little hard to review, but it works :)
// User namespace for this container/sandbox. | ||
// Note: There is currently no way to set CONTAINER scoped user namespace in the Kubernetes API. | ||
// NODE is the default value. Kubelet will set it to POD if pod spec indicates to use user-namespace remapping | ||
// Namespaces currently set by the kubelet: POD, NODE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does POD
mean that all pods will be remapped to the same namespace (assuming userns are supported/enabled at runtime)? If so, I think the term POD
is unclear as it suggests that each pod gets mapped to its own namespace. What happened to NODE_WIDE_REMAPPED
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pod means each pod will have it's own namespace. The mapping, at least in this PoC, is the same for all pods, though.
NODE_WIDE_REMAPPED
is not implemented here. We find out it was easier to implement a per-pod user ns, that also seems better. IIUC, NODE_WIDE_REMAPPED
was proposed years ago before most container runtime added support for user ns and as a way to run containers as root but not be root on the host. We can implement that if there is still value, though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification. I don't think NODE_WIDE_REMAPPED
is necessary. But maybe a comment here clarifying what POD
means (and what it will mean in the future) would be helpful.
makeIPTablesUtilChains: kubeCfg.MakeIPTablesUtilChains, | ||
iptablesMasqueradeBit: int(kubeCfg.IPTablesMasqueradeBit), | ||
iptablesDropBit: int(kubeCfg.IPTablesDropBit), | ||
experimentalHostUserNamespaceDefaulting: utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalHostUserNamespaceDefaultingGate), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why this entire sections shows as a diff? I believe you are just removing the experimentalHostUserNamespaceDefaulting
feature
pkg/kubelet/kubelet_pods.go
Outdated
@@ -70,6 +70,18 @@ import ( | |||
const ( | |||
managedHostsHeader = "# Kubernetes-managed hosts file.\n" | |||
managedHostsHeaderWithHostNetwork = "# Kubernetes-managed hosts file (host network).\n" | |||
|
|||
// Kinvolk alpha annotation for user namespaces | |||
kivolkUsernsAnn = "alpha.kinvolk.io/userns" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is using an annotation a temporary fix? What's the reasoning behind not updating the pod spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using an annotation to easily test on real clusters without changing the pod.Spec. Upstream PR will change the pod spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it
@@ -1798,3 +1805,52 @@ func (kl *Kubelet) hasHostMountPVC(pod *v1.Pod) bool { | |||
} | |||
return false | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add comment for UserNamespaceForPod()
for consistency
thanks for the review and your answers, really helpful! :) |
Is this expected to work with dockershim at the moment? |
Yes, it should work when using the |
Hmm okay. I get permission issues when kubelet attempts to mount from |
@cdesiniotis I have the following:
Can you share the exact error you're getting? |
@mauriciovasquezbernal My setup:
I run some pods. When I specify to use the pod userns it fails:
Looking in the kubelet logs at
|
@cdesiniotis oh right. I suppose you are running containerd with systemd, if that's the case it needs to have the supplemental group 0 ( |
@mauriciovasquezbernal Ah that was it. Thank you for the help! |
@@ -81,6 +85,10 @@ const ( | |||
// to kubelet behavior and system settings in addition to any API flags that may be introduced. | |||
) | |||
|
|||
var ( | |||
linuxIDMappingRegexp = regexp.MustCompile("([aA-zZ]+):([0-9]+):([0-9]+)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the first field can be numbers too. We probably need to modify this. Or maybe just use Split()
with :
as separator? Might be simpler, not sure
From https://docs.docker.com/engine/security/userns-remap/#prerequisites:
Each file contains three fields: the username or ID of the user, followed by a beginning UID or GID (which is treated as UID or GID 0 within the namespace) and a maximum number of UIDs or GIDs available to the user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, a typical accepted regex for usernames in Linux is "^[a-z][-a-z0-9_]*\$"
, so need to account for numbers in the first field anyways.
@rata @cdesiniotis I'm checking the state of the art on the proposals for this feature first. I'll address your comments once I have a clear view of what are the next steps on this implementation. Thanks a lot for the review. |
@mauriciovasquezbernal Thanks for your work! I am willing to help/review, so feel free to include me for review on this (or related work). Also, what are the current plans for upstream work (i.e. KEP and upstream PR)? Are you (@rata @mauriciovasquezbernal) still planning to lead that effort? |
Add two examples with SYS_ADMIN and NET_ADMIN capabilities.
f4ddac6
to
6e902e7
Compare
- Differenciate when it's a limitation on the Pod spec or in the runtime - Don't drop the content of the error message shown to the user
6e902e7
to
b14dc20
Compare
There are some features that *could* be not compatible with user namespaces: - hostpath volumes - PVC - sharing other host namespaces - some capabilities The current code forbids to use 'Pod' mode when one of those features is present on the PodSpec. This commit relaxes that logic and replaces it by a warning only for the case where host namespaces are shared given that it's the only case that could present issues when creating the container.
It's a reworked version of #3.