Add cpuset_cpus to docker driver. #8291
Conversation
👍 I've left a couple requests for docs and logging tweaks but other than that this LGTM. Thanks so much for this @shishir-a412ed!
I've tested this out and can confirm the cpu set configuration is getting set for Docker containers:
$ docker inspect 753 | jq '.[0].HostConfig.CpusetCpus'
"0"
@shishir-a412ed Thank you so much for the contributions. I wonder how useful this is without exclusivity support from the client or scheduler.

One concern is interference. Let's say a node is running two allocations: allocation A with cpuset set to 0-1, while B is unrestricted. Will this mean that alloc B can interfere with A? B can use all CPUs but A can only use two, so A is artificially constrained, which seems not so ideal. In your experience, does the benefit of NUMA locality in this case outweigh the interference from other jobs?

The second concern is usability without exclusivity support in the scheduler. Assume you have two jobs that you want to configure with cpusets; how would you ensure that they don't end up using the same cpusets on a host? I assume operators will need to statically partition jobs onto nodes to avoid conflicts. Is that something you are considering in your HPC setup?

I'd be in support of adding cpuset support - it's very useful indeed, especially with NUMA-aware apps. We plan to add support for specifying the number of CPUs instead of MHz "soon"; when we do, it'll be easier to add cpusets with exclusivity. One possible alternative is to have the docker driver manage cpusets on the client, i.e. the operator specifies [...]
@notnoop Yeah, interference is a legit concern. That's why I presented it as a two-step solution.
After merging (1), the feature is not completely useful, as you mentioned: if allocation A is running with cpuset {0,1}, those CPUs are not available since they are already pinned. So e.g. if allocation B lands on the same node, the docker driver will say CPUs 0-1 are already allocated and will error out the allocation. My understanding is that when that happens, the scheduler will detect the failed allocation and try to place it on another node. This repeats until either the allocation gets placed on another node that fits, or it reaches its maximum number of failed attempts (at which point the job should fail).

b) Ideal solution: Have a global state at the scheduler level, e.g. map[string][]string (key=nodeID, value={0,1,2,4} - CPUs already pinned). With this global state, the scheduler will launch the job only on a node which has CPUs available to be pinned.
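A rough Go sketch of the global state described in (b) - purely illustrative, with hypothetical names rather than anything that exists in Nomad today - keyed by node ID, with the conflict check the scheduler would need during placement:

```go
package main

import "fmt"

// cpusetState tracks which CPUs are already pinned on each node.
// Keyed by node ID; values are the CPU indices that are reserved.
// This mirrors the map[string][]string idea above, using ints for CPUs.
type cpusetState struct {
	pinned map[string]map[int]bool
}

func newCpusetState() *cpusetState {
	return &cpusetState{pinned: make(map[string]map[int]bool)}
}

// Reserve attempts to pin the given CPUs on a node. It fails if any of
// them is already pinned, which is the conflict the scheduler would use
// to rule out the node before placing the allocation there.
func (s *cpusetState) Reserve(nodeID string, cpus []int) error {
	taken := s.pinned[nodeID]
	if taken == nil {
		taken = make(map[int]bool)
		s.pinned[nodeID] = taken
	}
	for _, c := range cpus {
		if taken[c] {
			return fmt.Errorf("cpu %d already pinned on node %s", c, nodeID)
		}
	}
	for _, c := range cpus {
		taken[c] = true
	}
	return nil
}

func main() {
	state := newCpusetState()
	fmt.Println(state.Reserve("node-1", []int{0, 1})) // <nil>
	fmt.Println(state.Reserve("node-1", []int{1, 2})) // error: cpu 1 already pinned
}
```

The same structure would also let the scheduler answer "does this node have enough free CPUs?" before the allocation ever reaches the docker driver.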
Just to clarify what I want here: I've discussed this with @shishir-a412ed internally a bit. From my perspective, there are a few layers of value here.
That would allow 4 large containers per node, which could then be enforced through node pinning. This could also perhaps provide exclusive access per "datacenter" (like having "dc-hpc1," "dc-hpc2," "dc-hpcn"). We run ~20+ "datacenters" today without any trouble, so I think this should fit quite a few specific large use-cases. This does not prevent this case: #2303 (comment)
From a user perspective this would be something like "I want 2 CPUs," and that could be 0,1 or 2,3 or 15,35. This would not be NUMA-aware from my perspective, at least not starting out as NUMA-aware. This covers this case: #2303 (comment), and the first bullet point here: #2303 (comment). Reading back through #2303 though, I think the second point, even ignoring NUMA, gives enough of a community-level benefit that it'd be great. Also, this PR specifically allows users to write some tooling on top of Nomad to cover the very specific use-cases of NUMA-neighbors + NUMA-exclusivity.
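The "I want 2 CPUs" model boils down to picking any N free CPUs on a node. A minimal, deliberately non-NUMA-aware sketch of such a helper (hypothetical, not existing Nomad code):

```go
package main

import (
	"fmt"
	"sort"
)

// pickCpus returns any n CPUs from the node's free set, lowest first.
// Callers only specify a count; which CPUs they get is up to the picker.
func pickCpus(free map[int]bool, n int) ([]int, error) {
	var avail []int
	for c, ok := range free {
		if ok {
			avail = append(avail, c)
		}
	}
	if len(avail) < n {
		return nil, fmt.Errorf("need %d CPUs, only %d free", n, len(avail))
	}
	sort.Ints(avail)
	return avail[:n], nil
}

func main() {
	free := map[int]bool{0: true, 1: true, 2: true, 3: true}
	cpus, _ := pickCpus(free, 2)
	fmt.Println(cpus) // [0 1]
}
```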
@notnoop @tgross So we discussed this a bit more internally, and we felt that while this patch is useful for someone who is trying to use CPU pinning in docker by setting cpuset_cpus, [...]. The ideas are very similar to [...], so I have opened another issue #8473 which has an initial spec on how I would approach this problem. Let me know what you guys think!
Hi! Just wanted to reach out and apologize for the slow response. This is on my plate and I intend to follow up shortly.
@notnoop Thank you for the update!
Very eager for this PR. Thanks for the awesome work @shishir-a412ed
@notnoop Any updates on this one?
Sorry for taking so long. Thank you so much for your patience here. This looks good to me, considering the small objective.
I'm inclined to add a beta marker to the config, just in case we need to modify the semantics when we introduce global CPU tracking and need to make a backward-incompatible change.
Thank you again!
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Fixes #2303
We have an internal HPC customer who could also benefit from the ability to pin CPUs to a docker container.
It would be ideal to have:
(1) --cpuset-cpus, which will allow pinning CPUs to a docker container.
(2) Exclusivity, so that another allocation which also wants CPU 0 will instead schedule on CPU 0 on a different node.
Currently (2) is not supported by docker. This PR only addresses (1).
As a follow-up, we can have the Nomad docker driver do some bookkeeping to achieve (2).
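For reference, the jq output earlier in the thread shows where the value lands: HostConfig.CpusetCpus. A minimal illustration of that mapping using the official Docker Go SDK - not the actual Nomad driver code, just a sketch assuming the cpuset_cpus value arrives as a plain string like "0-1":

```go
package main

import (
	"fmt"

	"github.com/docker/docker/api/types/container"
)

// buildHostConfig shows how a task's cpuset_cpus string could be passed
// through to Docker. CpusetCpus takes the same format as --cpuset-cpus,
// e.g. "0-1" or "0,2".
func buildHostConfig(cpusetCpus string) *container.HostConfig {
	hc := &container.HostConfig{}
	if cpusetCpus != "" {
		hc.CpusetCpus = cpusetCpus
	}
	return hc
}

func main() {
	hc := buildHostConfig("0-1")
	// docker inspect would then report this under .HostConfig.CpusetCpus.
	fmt.Println(hc.CpusetCpus)
}
```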