CPU Manager for Nomad #8473
Comments
A couple misc questions/comments:
This would be helpful for running game workloads. It can be an issue if the process switches amongst cores or across NUMA boundaries.
How do you see this interacting with the usual tunings in this space, e.g. isolcpus and systemd CPUAffinity? Generally if you set things like these, you're expecting to allocate your processes in certain CPU ranges - having Nomad choose CPUs outside these ranges is counterproductive. I think there are three possible approaches.
@james-masson We are solving this exact same problem using the cgroups cpuset subsystem. Regarding your three approaches:
If you are interested in (1), I have an open PR #8291 for CPU pinning using the docker driver.
Having said that, this is proposed as an optional parameter, so if you (hypothetically) have some way to isolate your CPUs using isolcpus, you don't have to use this feature.
I think isolcpus also has an effect on kernel thread scheduling - not just user-space. You tend to use it when you want control above and beyond userspace. Commonly used with manual IRQ pinning too. It's a go-to tuning for minimising jitter when you really don't want a context switch.
Yes - systemd's CPUAffinity is all about pulling the rest of the OS - including Nomad itself - away from the cores you want to use for your high-performance/low-jitter workloads. The interaction between isolcpus and systemd should be to leave a large set of cores running nothing - not even kernel threads - ready for your sensitive workloads. My point is - my customers in this space generally already have systems with isolcpus and systemd CPUAffinity (and optimal IRQ affinity, nohz_full and more) - large multi-socket systems tuned to the hilt for performance. While I've used Nomad before for this sort of workload, it's always involved a custom layer to manage the CPU allocations.
@james-masson Apart from NUMA awareness, what is it that this proposal doesn't address for you? What this proposal doesn't guarantee is which CPU you will get allocated, which is similar to the Kubernetes CPU manager, as it also doesn't offer this guarantee: https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/#limitations After reading your comments, it looks like your customers have high-performance/low-jitter workloads which need NUMA-aware CPUs, e.g. running your workload on a CPU which is near the bus connecting to a high-performance NIC so that it can avoid cross-socket traffic. Not saying NUMA is not important, but we are intentionally keeping it out of this proposal to make the initial pass easier to implement and more in line with the k8s CPU manager. Also, there are some internal discussions going on within HashiCorp (I am not fully aware, but maybe someone from HashiCorp can chime in) on how they want to roll out this feature. They might already have NUMA on their roadmap. PS:
This is already covered in this proposal.
Shipped in Nomad 1.1.0-beta
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
CPU Manager for Nomad
Overview
The completely fair scheduler, or CFS (also referred to as the kernel task scheduler), is, as the name suggests, completely fair :). It treats all available CPUs equally and assigns process threads to any available CPU. However, CFS is preemptive: if other process threads have been starving for a long time, CFS will preempt currently running process threads to make room for the waiting threads.
E.g. on a 4-core system, CFS schedules processes A, B, C, and D on the 4 available cores. After some time the waiting processes {E, F, G, and H} start starving, and CFS preempts the currently running processes to schedule {E, F, G, and H}.
This is great for multitasking and for achieving high CPU utilization; however, it's not that great for latency-sensitive workloads. A latency-sensitive workload gets kicked out in favor of a starving workload, and its performance is impacted. We need a way to run these low-latency workloads on a dedicated CPU set that CFS doesn't control.
CPU as a resource
What is a CPU?
In most Linux distributions, the CPU is viewed as a collection of resource controls.
CFS shares: This treats CPU in terms of time. It answers the question: what is my weighted fair share of CPU time on the system?
E.g. say 1 core = 1024 shares on a 4-core system. A container or process requesting 512 shares will get 1/2 core on the system, i.e. if a CPU cycle is 500 microseconds, it gets 250 microseconds of execution time every CPU cycle.
CFS quota: This also treats CPU in terms of time. It answers the question: what is my hard cap of CPU time over a period? To understand CFS quota we need to understand two knobs: cpu.cfs_quota_us and cpu.cfs_period_us.
E.g. if cpu.cfs_quota_us = 250 and cpu.cfs_period_us = 250, the process gets 1 full CPU, i.e. it will be the only process executing during that CPU cycle (period).
Another example: if cpu.cfs_quota_us = 10 and cpu.cfs_period_us = 50, the process gets 20% of the CPU every execution cycle (period). Once the process hits the quota, the application will be throttled.
These are applied at the cgroup level.
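As a concrete (hypothetical) illustration of these two knobs, here is a minimal Go sketch that writes the cpu.shares, cpu.cfs_period_us, and cpu.cfs_quota_us control files. It assumes a cgroup v1 hierarchy mounted at /sys/fs/cgroup/cpu and uses an illustrative cgroup name (example); the values mirror the 512-share and 10/50 examples above and are not part of the proposal itself.

```go
// Minimal sketch of the two CFS knobs above, written as cgroup v1 control
// files. The cgroup name "example" and the values are illustrative only.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// write is a tiny helper that writes a value into a cgroup control file.
func write(path, value string) {
	if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
		panic(err)
	}
}

func main() {
	cg := "/sys/fs/cgroup/cpu/example"
	if err := os.MkdirAll(cg, 0o755); err != nil {
		panic(err)
	}

	// CFS shares: 512 shares where 1 core = 1024 shares, i.e. a weighted
	// fair share of roughly half a core under contention.
	write(filepath.Join(cg, "cpu.shares"), "512")

	// CFS quota: 10ms of CPU time per 50ms period = a hard cap of 20% of one
	// CPU per period (the same 10/50 ratio as the example above).
	write(filepath.Join(cg, "cpu.cfs_period_us"), "50000")
	write(filepath.Join(cg, "cpu.cfs_quota_us"), "10000")

	fmt.Println("configured CFS shares and quota for", cg)
}
```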
How does Kubernetes do it today?
Kubernetes (k8s) uses CFS quota (explained above under “CPU as a resource”) as a resource control to manage CPUs.
The k8s operator first sets --cpu-manager-policy=static as a kubelet option. This isolates a set of CPUs from the CFS view so they can be allocated for dedicated usage. Exclusivity is enforced using the cpuset cgroup controller.
A user can then request CPU units under three classes. The user specifies requests and limits, and based on these values the class is determined.
Guaranteed (requests == limits): You get exclusive access to a set of CPUs in this class. E.g. if requests=4 and limits=4, users will get guaranteed access to 4 CPU units. This is good for latency-sensitive applications that require dedicated CPU access.
Burstable (requests < limits): You get dedicated access up to requests and can burst up to limits if resources are available in the system. E.g. if requests=4 and limits=10, users will get guaranteed access to 4 CPU units and the application can burst up to 10 CPUs if resources are available. The extra 6 CPU units can be preempted by the system if a higher priority job needs them. This is good for jobs that can set a lower requests value, which increases their probability of getting placed in the system quickly, and can then burst later if resources are available.
Best effort (requests == 0): This is the bottom of the barrel, where the system makes no guarantees and will make the best effort to allocate whatever is possible to the application.
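To make the requests/limits classification concrete, here is a small Go sketch of the decision logic. The Resources type and qosClass function are hypothetical, for illustration only, and are not part of the Kubernetes or Nomad APIs.

```go
// Illustrative sketch of how requests and limits map to the three classes
// described above. The Resources type and qosClass function are hypothetical
// and not part of the Kubernetes or Nomad APIs.
package main

import "fmt"

type Resources struct {
	Requests int // requested CPU units
	Limits   int // CPU unit limit
}

func qosClass(r Resources) string {
	switch {
	case r.Requests == 0:
		return "BestEffort" // no guarantees; whatever the system can spare
	case r.Requests == r.Limits:
		return "Guaranteed" // exclusive access to the requested CPUs
	case r.Requests < r.Limits:
		return "Burstable" // guaranteed up to requests, may burst to limits
	default:
		return "Invalid" // requests > limits should be rejected
	}
}

func main() {
	fmt.Println(qosClass(Resources{Requests: 4, Limits: 4}))  // Guaranteed
	fmt.Println(qosClass(Resources{Requests: 4, Limits: 10})) // Burstable
	fmt.Println(qosClass(Resources{Requests: 0}))             // BestEffort
}
```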
Here a CPU unit is a single virtual core: 1 hyperthread on a bare-metal system with hyperthreading enabled, or 1 vCPU on a cloud instance.
Example guaranteed QOS job
How should Nomad do it?
Key takeaway from k8s - Kubernetes primarily uses cgroups (the cpu and cpuset subsystems, or resource controllers) to isolate and control CPUs.
Let's take an example of an 8-core Intel system with hyperthreading enabled. Here 8 physical cores = 16 virtual cores.
Cores = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
Nomad client: When the Nomad client daemon comes up, it should reserve some CPUs for exclusive access and remove them from the CFS view so that workloads assigned to those CPUs do not get preempted.
CPUs for exclusive access = Number of Cores (0-15) - system reserved cores (cores needed for system work) - nomad reserved cores (cores needed for nomad client)
Let’s say both system and nomad need two cores each.
System reserved cores = 14,15
Nomad reserved cores = 12,13
CPUs for exclusive access = {0,1,2,3,4,5,6,7,8,9,10,11} [6 physical cores]
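A small Go sketch of the reservation arithmetic above, using the example core IDs from this section; the helper function is illustrative and not part of Nomad.

```go
// Sketch of the reservation arithmetic described above: exclusive CPUs are
// all cores minus system-reserved minus Nomad-reserved. Core IDs match the
// example in this section and are illustrative.
package main

import "fmt"

// exclusiveCPUs returns the core IDs in [0, total) that are not reserved.
func exclusiveCPUs(total int, reserved []int) []int {
	skip := make(map[int]bool, len(reserved))
	for _, c := range reserved {
		skip[c] = true
	}
	var out []int
	for c := 0; c < total; c++ {
		if !skip[c] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	systemReserved := []int{14, 15} // cores needed for system work
	nomadReserved := []int{12, 13}  // cores needed for the Nomad client

	reserved := append(append([]int{}, systemReserved...), nomadReserved...)
	fmt.Println(exclusiveCPUs(16, reserved)) // [0 1 2 3 4 5 6 7 8 9 10 11]
}
```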
The Nomad client should create a cgroup under the cpuset subsystem (resource controller), assign {0-11} to cpuset.cpus, and set the cpuset.cpu_exclusive flag to 1 for exclusive access.
At this point, the Nomad client has exclusive access to CPU units 0-11.
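A minimal sketch of that parent cgroup setup, assuming a cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset; writing cpuset.mems (NUMA node 0 here) is an assumption added because the cpuset controller requires mems to be set before tasks can be attached.

```go
// Sketch of the parent cpuset cgroup setup described above, assuming a
// cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset.
package main

import (
	"os"
	"path/filepath"
)

// write is a tiny helper that writes a value into a cgroup control file.
func write(path, value string) {
	if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
		panic(err)
	}
}

func main() {
	parent := "/sys/fs/cgroup/cpuset/nomad"
	if err := os.MkdirAll(parent, 0o755); err != nil {
		panic(err)
	}

	write(filepath.Join(parent, "cpuset.cpus"), "0-11")        // CPUs reserved for exclusive access
	write(filepath.Join(parent, "cpuset.mems"), "0")           // memory node(s); assumed NUMA node 0
	write(filepath.Join(parent, "cpuset.cpu_exclusive"), "1")  // exclusive access flag
}
```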
Now, suppose the user launches a Nomad job with a spec that requests two dedicated cores (cpu-cores = 2).
For the above job, the Nomad client should create a cgroup named example under the nomad parent cgroup and assign two cores to it.
When the Nomad client launches the job (example), it should attach the job's PIDs to the example cgroup. You can achieve this by adding the job's PIDs to the /sys/fs/cgroup/cpuset/nomad/example/cgroup.procs file.
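A minimal sketch of this per-job step, reusing the example cgroup name and two-core allocation from above; the specific cores (0-1) and the PID are illustrative.

```go
// Sketch of the per-job step above: create the example child cgroup under the
// nomad parent, give it two of the exclusive cores, and attach the job's PID
// via cgroup.procs. The chosen cores (0-1) and the PID are illustrative.
package main

import (
	"os"
	"path/filepath"
	"strconv"
)

// write is a tiny helper that writes a value into a cgroup control file.
func write(path, value string) {
	if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
		panic(err)
	}
}

func main() {
	job := "/sys/fs/cgroup/cpuset/nomad/example"
	if err := os.MkdirAll(job, 0o755); err != nil {
		panic(err)
	}

	write(filepath.Join(job, "cpuset.cpus"), "0-1") // two cores for this job
	write(filepath.Join(job, "cpuset.mems"), "0")   // must be set before attaching tasks

	// Attach the job's process; every thread of that process is now confined
	// to cores 0 and 1.
	pid := 12345 // illustrative PID of the launched task
	write(filepath.Join(job, "cgroup.procs"), strconv.Itoa(pid))
}
```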
Design consideration
We can keep cpu (e.g. 500 MHz) and cpu-cores (e.g. 2 cores) as mutually exclusive options, i.e. if a user requests cpu (CPU in MHz in a shared setting), they cannot request cpu-cores (for exclusive access) at the same time.
Nomad should return an error to the user if both are set.
Error:
cpu and cpu-cores are mutually exclusive options, and only one of them should be set.
This also maintains backward compatibility for all the jobs that have been using cpu.
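A minimal Go sketch of that validation; the Resources struct and field names are illustrative and do not reflect Nomad's actual job spec schema.

```go
// Sketch of the validation described in this section: reject specs that set
// both cpu (shared, in MHz) and cpu-cores (exclusive). The Resources struct
// and field names are illustrative, not Nomad's actual job spec schema.
package main

import (
	"errors"
	"fmt"
)

type Resources struct {
	CPU      int // shared CPU in MHz (existing option)
	CPUCores int // number of cores for exclusive access (proposed option)
}

func validate(r Resources) error {
	if r.CPU > 0 && r.CPUCores > 0 {
		return errors.New("cpu and cpu-cores are mutually exclusive options, and only one of them should be set")
	}
	return nil
}

func main() {
	fmt.Println(validate(Resources{CPU: 500, CPUCores: 2})) // error
	fmt.Println(validate(Resources{CPUCores: 2}))           // <nil>
}
```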
References
Tim Hockin talk on resource management and why we are asking the wrong questions!?