Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cpuset_cpus to docker driver. #8291

Merged
merged 3 commits into from
Nov 11, 2020
Merged

Conversation

shishir-a412ed
Copy link
Contributor

Fixes #2303

We have an internal HPC customer who could also benefit from the ability to pin CPUs to a docker container.

It would be ideal to have:

  1. --cpuset-cpus which will allow pinning CPUs to a docker container.
  2. If the user tries to pin the same CPU (e.g. CPU 0) to another docker container, it should error out (if on the same node) OR
    schedule on CPU 0 on a different node.

Currently (2) is not supported by docker. This PR only addresses (1).
As a follow-up, we can try nomad docker driver to do some bookkeeping, and achieve (2).

@shishir-a412ed
Copy link
Contributor Author

Website changes

cpuset_cpus

@shishir-a412ed
Copy link
Contributor Author

ping @tgross @notnoop

tgross
tgross previously requested changes Jun 26, 2020
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I've left a couple requests for docs and logging tweaks but other than that this LGTM. Thanks so much for this @shishir-a412ed!

I've tested this out and can confirm the cpu set configuration is getting set for Docker containers:

$ docker inspect 753 | jq '.[0].HostConfig.CpusetC
pus'
"0"

website/pages/docs/drivers/docker.mdx Outdated Show resolved Hide resolved
drivers/docker/driver.go Outdated Show resolved Hide resolved
@notnoop
Copy link
Contributor

notnoop commented Jun 26, 2020

@shishir-a412ed Thank you so much for the contributions. I wonder how this useful without scheduler support for exclusivity by the client or scheduler.

One concern is interference. Let's say a node is running two allocations, allocation A with cpuset set to 0-1, while B is unrestricted. Will this mean that alloc B can interfere with A; B can use all cpus but A only uses two; A here will be artificially constrained and that seem not so ideal. Is the benefit of NUMA locality in this case override the interference of other jobs in your experience?

The second concern concern is usability without exclusivity support by scheduler. Assume you have two jobs that you want to configure with cpusets, how would you ensure that they don't end up using the same cpusets on a host. I assume operators will need to statically partition jobs on nodes to avoid conflicts. Is that something you are considering in your HPC setup?

I'd be in support of adding cpuset support - it's very useful indeed, specially with NUMA-aware app. We plan to add support for specifying number of cpus instead of MHz "soon", when we do, it'll be easier to add cpusets with exclusivity.

One potential possible alternative is to have the docker driver manage cpusets on the client. i.e. operator specifies cpus=2 # number of cores and the docker driver will allocate cpusets exclusively for the task (e.g. 1-2) and pass them to docker. This will address the second concern, but not the first. Would that work better for your usecase?

@shishir-a412ed
Copy link
Contributor Author

shishir-a412ed commented Jun 26, 2020

@notnoop Yeah, interference is a legit concern. That's why I presented it as a two-step solution.

  1. We expose docker --cpuset-cpus option to nomad.

After merging (1) the feature is not completely useful as you mentioned if allocation A is running with cpuset 0-1 while allocation B is unrestricted, there will be interference.

  1. We have to build an added logic of top of (1) to achieve exclusivity. I can think of two ways to achieve that.

    a) Hacky solution (not so ideal): Have a local state (e.g. an in-memory map in docker driver on each nomad client)
    which tracks what CPU's are already pinned.

         map[string]bool{
                 "0": false,
                 "1": false,
                 "2": true,
                 "3": true,
         }

Here {0,1} are not available since they are already pinned.

So e.g. if allocation B lands on the same node, the docker driver will say CPU's 0-1 is already allocated and will error out the allocation. My understanding is when that happens scheduler will detect the failed allocation and try to allocate it on another node. This will happen till, either allocation will get placed on another fit node, or reaches its maximum number of failed attempts (at that point the job should fail).

b) Ideal solution: Have a global state at the scheduler level e.g. map[string][]string (key=nodeID, value={0,1,2,4 CPU's already pinned})

With this global state, the scheduler will launch the job only on a node which has available CPU's to be pinned.

@chuckyz
Copy link
Contributor

chuckyz commented Jun 26, 2020

To get a clarification of what I want, I've discussed this with @shishir-a412ed internally a bit. From my perspective, there are a few layers of value here.

  • Expose cpuset-cpus. This allows total DIY use-cases, e.g.: anyone building tooling on top of Nomad, who has tooling that may be NUMA-aware itself, this allows the use-case of; given a 4 cpu server with 16 cores per CPU, 0-15 for CPU1, 16-32 for CPU2, 33-48 for CPU3, and 49-64 for CPU4.

That would allow 4 large containers per node, which could then be enforced through node pinning. This could also perhaps provide exclusive access per "datacenter" (like having "dc-hpc1," "dc-hpc2," "dc-hpcn"). We run ~20+ "datacenters" today without any trouble, so I think this should fit quite a few specific large use-cases.

This does not prevent this case: #2303 (comment)

  • Expose the concept of "cpus," I'm imagining this is implemented as a 5th resource. So, given mhz, memory, disk, bandwidth, and cpus. The user-experience I'm expecting here is that cpus is synonymous with number of threads (don't care about ht/non-ht cores, enforce that to be controlled by the operator), with each core taking the appropriate slice out of mhz.

From a user perspective this would be something like "I want 2 CPUs," and that could be 0,1 or 2,3 or 15,35. This would not be NUMA-aware from my perspective, at least not starting as NUMA-aware.

This covers this case: #2303 (comment)

This covers the first bullet point here: #2303 (comment)
and not the second one.

Reading back through #2303 though, I think the second point (even ignoring NUMA), gives enough of a community-level benefit that it'd be great.

Also, this PR specifically allows users to write some tooling on top of Nomad to cover the very specific use-cases of NUMA-neighbors + NUMA-exclusivity.

@shishir-a412ed
Copy link
Contributor Author

@notnoop @tgross Any updates on this?

@shishir-a412ed
Copy link
Contributor Author

shishir-a412ed commented Jul 20, 2020

@notnoop @tgross So we discussed this a bit more internally, and we felt while this patch is useful for someone who is trying to use CPU pinning in docker by setting --cpuset-cpus flag (with the caveat that they don't get pinned to those CPU's with exclusive access, which can either be documented or enhanced by doing some bookkeeping and making it exclusive), CPU isolation is a bigger problem and needs to be solved at the orchestration layer.

The ideas are very similar to Kubernetes CPU Manager

So I have opened another issue #8473 which has an initial spec on how I would approach this problem.
I would suggest to take CPU manager related discussions there and keep this PR just in the scope of docker driver --cpuset-cpus flag for pinning CPUs.

lmk what do you guys think?

@shishir-a412ed shishir-a412ed force-pushed the cpusets branch 2 times, most recently from b6d8ed3 to a2d0edc Compare August 4, 2020 18:31
@shishir-a412ed
Copy link
Contributor Author

@notnoop @tgross Any updates on this PR?

@notnoop
Copy link
Contributor

notnoop commented Aug 20, 2020

Hi! Just wanted to reach out and apologize for the slow response. This is on my plate and I intend to follow up shortly.

@shishir-a412ed
Copy link
Contributor Author

@notnoop Thank you for the update!

@Asara
Copy link

Asara commented Sep 8, 2020

Very eager for this PR. Thanks for the awesome work @shishir-a412ed

@shishir-a412ed
Copy link
Contributor Author

@notnoop Any updates on this one?

Copy link
Contributor

@notnoop notnoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking so long. Thank you so much for your patience here. This looks good to me, considering the small objective.

I'm inclined to add a beta marker to config, just in case we need to modify the semantics when we introduce global cpu tracking and need to do a backward incompatible change.

Thanks you again!

@notnoop notnoop dismissed tgross’s stale review November 11, 2020 22:13

the requested changes got addressed

@notnoop notnoop merged commit de5a21f into hashicorp:master Nov 11, 2020
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nomad should pin tasks to CPUs underneath the hood
5 participants