---
title: management-workload-partitioning
authors:
  - "@dhellmann"
  - "@mrunalp"
  - "@browsell"
  - "@haircommander"
  - "@rphillips"
reviewers:
  - "@deads2k"
  - TBD
approvers:
  - "@smarterclayton"
  - "@derekwaynecarr"
  - "@markmc"
creation-date: 2021-03-18
last-updated: 2021-03-18
status: implementable
see-also:
  - "/enhancements/single-node-production-deployment-approach.md"
replaces:
  - https://github.com/openshift/enhancements/pull/628
---

# Management Workload Partitioning

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement describes an approach that allows us to isolate the
control plane services to run on a restricted set of CPUs. This will
be especially useful for resource-constrained environments, such as
single-node production deployments, where the user wants to reserve
most of the CPU resources for their own workloads and needs to
configure OpenShift to run on a fixed number of CPUs within the host.

One example of this use case is seen in telecommunications service
providers' implementations of a Radio Access Network (RAN). This use
case is discussed in more detail below.

## Motivation

In constrained environments, management workloads, including the
OpenShift control plane, need to be configured to use fewer resources
than they might by default in normal clusters.
After examining
[various approaches for scaling the resource
requests](https://github.com/openshift/enhancements/pull/628), we are
reframing the problem to allow us to solve it in a different way.

Customers who want us to reduce the resource consumption of management
workloads have a fixed budget of CPU cores in mind. We want to use the
normal scheduling capabilities of Kubernetes to manage the number of
pods that can be placed onto those cores, and we want to avoid mixing
management and normal workloads there.

### Goals

* This enhancement describes an approach for configuring OpenShift
  clusters to run with management workloads on a restricted set of
  CPUs.
* Clusters built in this way should pass the same Kubernetes and
  OpenShift conformance and functional end-to-end tests as single-node
  deployments that are not isolating the management workloads.
* We have a goal of running on 4 hyperthreads today, but we do not
  know what constraints we might be given in the future, so we need a
  solution that is not tied to the current known limit.
* We want a general approach that can be applied to all OpenShift
  control plane components.

### Non-Goals

* This enhancement is focused on CPU resources. Other compressible
  resource types may need to be managed in the future, and those are
  likely to need different approaches.
* This enhancement does not address non-compressible resource
  requests, such as for memory.
* This enhancement does not address ways to disable operators or
  operands entirely.
* Although the results of this enhancement may be useful for
  single-node developer deployments, multi-node production
  deployments, and Internet-of-Things devices, those use cases are not
  addressed directly in this document.
* This enhancement does not address reducing actual utilization.
  There is no expectation that a cluster configured to use a small
  number of cores for management services would offer exactly the same
  performance as the default. It must be stable and continue to
  operate reliably, but may respond more slowly.
* This enhancement assumes that the configuration of management
  services is done as part of installing the cluster, and cannot be
  changed later. Future enhancements may address enabling or
  reconfiguring the feature described here on an existing cluster.
* This enhancement describes partitioning concepts that could be
  expanded for other purposes. Use cases for partitioning workloads
  for other purposes may be addressed by future enhancements.

## Proposal

[Previous attempts](https://github.com/openshift/enhancements/pull/628)
to solve this problem focused on reducing or scaling requests so that
the normal scheduling criteria could be used to safely place them on
CPUs in the shared pool. This proposal reframes the problem so that,
instead of considering "scaling" the requests, we think about
"isolating" or "partitioning" them away from the non-management
workloads. This view of the problem is more consistent with how the
requirement was originally presented by the customer.

We want to define "management workloads" in a flexible way. For the
purposes of this document, "management workloads" include all
OpenShift core components necessary to run the cluster, any add-on
operators necessary to make up the "platform" as defined by telco
customers, and operators or other components from third-party vendors
that the customer deems management rather than operational. It is
important to note that not all of those components will be delivered
as part of the OpenShift payload, and some may be written by the
customer or by vendors who are not our partners.

The basic proposal is to provide a way to identify "management
workloads" at runtime and to use CRI-O to run them on a user-selected
set of CPUs, while other workloads are prevented from running
there. This effectively gives 3 pools of CPUs (shared, dedicated, and
management) and means the shared CPU pool will need other CPUs to
support burstable or best-effort workloads.

We want the isolation to be in place from the initial buildout of the
cluster, to ensure predictable behavior. Therefore, the feature must
be enabled during installation. To enable the feature, the user will
specify the set of CPUs on which to run all management workloads. When
the management CPU set is not defined, the feature will be completely
disabled.

We generally want components to opt in to being considered management
workloads. Therefore, for a regular pod to be considered to contain a
management workload it must be labeled with `io.openshift.management:
true`.

We want to treat all OpenShift components as management workloads,
including those that run the control plane. Therefore, kubelet will be
modified to treat all static pods as management workloads when the
feature is enabled. We will update operator manifests and
implementations to label all OpenShift components not running in
static pods with the `io.openshift.management: true` label, and add a
CI job to require that the label be present in all workloads created
from the release payload.

We need kubelet to know when the feature is enabled, but we cannot
change the configuration schema that kubelet gets from
upstream. Therefore, we will have kubelet look for a new configuration
file on startup, and the feature will only be enabled if that file is
found.

We want to give cluster administrators control over which workloads
are run on the management CPUs. Therefore, only pods in namespaces
labeled with `io.openshift.management: true` will be subject to
special handling.
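
As an illustrative sketch of the opt-in mechanism, a namespace and a
pod labeled as management workloads might look like the following. The
label key is the one proposed in this document; the namespace name,
pod name, and image are hypothetical placeholders.

```yaml
# Hypothetical namespace whose pods may receive management handling.
apiVersion: v1
kind: Namespace
metadata:
  name: example-management-operator   # hypothetical name
  labels:
    io.openshift.management: "true"
---
# A pod in that namespace also carries the label so it is treated as a
# management workload.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator-pod          # hypothetical name
  namespace: example-management-operator
  labels:
    io.openshift.management: "true"
spec:
  containers:
    - name: operator
      image: example.com/operator:latest   # hypothetical image
      resources:
        requests:
          cpu: 100m      # rewritten when the feature is enabled
          memory: 128Mi
```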
Normal users cannot add a label to a namespace
without the right RBAC permissions.

We want to continue to use the scheduler for placing management
workloads, but we cannot rely on the CPU requests of those workloads
to accurately reflect the constrained environment in which they are
expected to run. Instead of scaling those CPU request values, we will
change them to request a new [extended
resource](https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/),
"management cores." We will modify kubelet to advertise management
cores as an extended resource when the management workload
partitioning feature is enabled, using a value equivalent to the CPU
resources of the entire host. This large value should allow the
scheduler to always be able to place workloads, while still accurately
accounting for those requests.

We need to ensure fair allocation of CPU time between different
management workloads. Therefore, we will copy the original CPU
requests for management workload pods into an annotation that CRI-O
can use to configure the CPU shares when running the containers.

We need to change pod definitions as the pods are created, so that the
scheduler, kubelet, and CRI-O all see a consistently updated version
of the pod and do not need to make independent decisions about whether
it should be treated as a management workload. We need to intercept
pod creation for *all* pods, without the race conditions that might be
introduced with traditional admission webhooks or controllers.
Therefore, we will build an admission hook into the Kubernetes API
server in OpenShift to intercept API requests that create pods.

The API server requires a configuration resource as input, rather than
a ConfigMap or command line flag. Therefore, we need an API-driven
way to enable management workload partitioning in the admission
hook.
We will extend the Infrastructure configuration resource with a
new status field to indicate whether the feature is on or off. *Name
TBD, see open questions*

Some pods used to run OpenShift control plane components are started
before the API server. Therefore, we will have to manually add extra
metadata to those pod definitions, instead of relying on the admission
hook to do it.

### User Stories

#### Radio Access Network (RAN) Use Case

In the context of telecommunications service providers' 5G Radio
Access Networks, it is increasingly common to see "cloud native"
implementations of the 5G Distributed Unit (DU) component. Due to
latency constraints, this DU component needs to be deployed very close
to the radio antenna for which it is responsible. In practice, this
can mean running the component on anything from a single server at the
base of a remote cell tower to a datacenter-like environment serving
several base stations.

A hypothetical DU example is an unusually resource-intensive workload,
requiring 20 dedicated cores, 24 GiB of RAM consumed as huge pages,
multiple SR-IOV NICs carrying several Gbps of traffic each, and
specialized accelerator devices. The node hosting this workload must
run a realtime kernel, be carefully tuned to ensure low-latency
requirements can be met, and be configured to support features like
Precision Time Protocol (PTP).

The most constrained resource for RAN deployments is CPU, so for now
the focus of this enhancement is managing CPU requirements.
Kubernetes resource requests for CPU time are measured in fractional
"[Kubernetes
CPUs](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu)". For
bare metal, 1 Kubernetes CPU is equivalent to 1 hyperthread.

Due to the resource-intensive workload, the overhead allowed for
platform components such as the OpenShift platform and add-on
operators is severely restricted when compared to other uses.
For example, one
customer has allocated 4 hyperthreads for all components other than
their own workloads. OpenShift currently requires 7.

### Implementation Details/Notes/Constraints

#### High-level End-to-end Workflow

This section outlines an end-to-end workflow for deploying a cluster
with management workload partitioning enabled and how pods are
correctly scheduled to run on the management CPU pool.

1. User sits down at their computer.
2. The user creates their `install-config.yaml`, including extra
   values in the `bootstrapInPlace` section (since that enables
   deploying a single node).
   * `managementCPUIDs` is a CPU set specifier for the CPUs to add to
     the isolated set used for management components. The default is
     empty, and when the value is empty the management workload
     partitioning feature is disabled. The rest of the steps below
     assume it is enabled, unless otherwise stated.
3. The user runs the installer.
4. During bootstrapping, the installer uses `managementCPUIDs` to
   generate an extra machine config manifest that configures CRI-O to
   process management workloads in a special way.
5. During bootstrapping, the installer creates a machine config
   manifest to write a configuration file for kubelet. The file should
   only be readable by the kubelet.
6. During bootstrapping, the installer updates the status fields
   of the Infrastructure config resource to enable the feature for the
   admission hook.
7. The kubelet starts up and finds the configuration file enabling the
   new feature.
8. The kubelet reads static pod definitions. It replaces the CPU
   requests with management CPU requests of the same value and adds
   an annotation for CRI-O with the same value.
9. Something schedules a regular pod with the
   `io.openshift.management: true` label in a namespace with the
   `io.openshift.management: true` label.
10.
    The admission hook modifies the pod, replacing the CPU requests
    with management CPU requests and adding an annotation for CRI-O.
11. The scheduler sees the new pod and finds available management CPU
    resources on the node. The scheduler places the pod on the node.
12. Repeat steps 9-11 until all pods are running.
13. The single-node deployment comes up with management components
    constrained to a subset of the available CPUs.

#### CRI-O Changes

CRI-O will be updated to support a new configuration value that a user
can specify, `mgmt_ctr_cpuset`.

```ini
[crio.runtime]
mgmt_ctr_cpuset = "0-1"
```

This field describes the CPU set that management workloads will be
configured to use.

CRI-O will be configured to support a new annotation on pods,
`io.openshift.management.cores`.

```ini
[crio.runtime.runtimes.mgmtrunc]
  runtime_path = "/usr/bin/runc"
  runtime_type = "oci"
  allowed_annotations = ["io.openshift.management.cores"]
```

Pods that have the `io.openshift.management.cores` annotation will
have their cpuset configured to the value of `mgmt_ctr_cpuset`, as
well as have their CPU shares configured to the value of the
annotation.

Note that this field does not conflict with the `infra_ctr_cpuset`
config option, as the infra container will still be put in that
cpuset. The two options can be set to the same value if the infra
container should also be considered to be managed.

#### API Server Admission Hook

A new admission hook in the Kubernetes API server within OpenShift
will mutate pods when they are created to make 2 changes.

1. It will move CPU requests to "management core" requests, so that
   the scheduler can successfully place the pod on the node, even
   though the sum of the CPU requests for all management workloads may
   exceed the actual CPU capacity of the management CPU pool.
2.
   It will add an annotation, `io.openshift.management.cores`, with a
   value equal to the original CPU requests, so that CRI-O can use the
   value to configure the CPU shares for the container.

These changes will only be made on pods that also have memory
requests, because if we mutated the pod so that it had no CPU or
memory requests, the quality-of-service class of the pod would change
automatically.

#### Kubelet Changes

Kubelet will be changed to look for a configuration file,
`/etc/kubernetes/management-pinning`, to enable the management
workload partitioning feature. The file should contain the cpuset
specifier for the CPUs making up the management CPU pool.

Kubelet will be changed so that when the feature is enabled, [when a
static pod definition is
read](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/config/file_linux.go)
it is mutated in a way similar to how the API server admission hook
mutates regular pods.

Kubelet will be changed so that when the feature is enabled, it
advertises a new extended resource of "management cores", representing
all of the CPU capacity of the host (not just the management CPU
pool).

#### Installer Changes

The installer will be changed to accept the new `managementCPUIDs`
configuration input value.

The installer will be changed to generate an extra machine config
manifest to configure CRI-O so that containers from pods with the
`io.openshift.management.cores` annotation are run on the
`mgmt_ctr_cpuset`.

The installer will be changed to create a machine config manifest to
write the `/etc/kubernetes/management-pinning` configuration file for
kubelet. The file will have SELinux settings configured so that it is
only readable by the kubelet.

The installer will be changed to update the
`status.managementWorkloadPartitioning` field of the Infrastructure
config resource to `enabled` so the admission hook will know when to
mutate pods.
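
To make the pod rewrite concrete, here is a hedged Python sketch of
the mutation performed by the admission hook for regular pods (and, in
similar form, by kubelet for static pods). The extended-resource name
`management.openshift.io/cores` is a hypothetical stand-in, since this
document leaves the real field name open; the annotation key is the
one proposed above. The real implementation would live in the API
server, in Go.

```python
MANAGEMENT_RESOURCE = "management.openshift.io/cores"  # hypothetical name
CORES_ANNOTATION = "io.openshift.management.cores"


def cpu_to_millis(cpu: str) -> int:
    """Convert a Kubernetes CPU quantity such as '2' or '500m' to millicores."""
    return int(cpu[:-1]) if cpu.endswith("m") else int(float(cpu) * 1000)


def mutate_pod(pod: dict) -> dict:
    """Move container CPU requests to the management extended resource and
    record the original total in an annotation for CRI-O to use as CPU
    shares. Pods with a CPU request but no memory request are left alone,
    so that removing the CPU request cannot change their QoS class."""
    containers = pod["spec"]["containers"]
    for c in containers:
        req = c.get("resources", {}).get("requests", {})
        if "cpu" in req and "memory" not in req:
            return pod  # do not mutate; QoS class would change

    total_millis = 0
    for c in containers:
        req = c.get("resources", {}).get("requests", {})
        if "cpu" in req:
            millis = cpu_to_millis(req.pop("cpu"))
            req[MANAGEMENT_RESOURCE] = millis
            total_millis += millis

    if total_millis:
        pod.setdefault("metadata", {}).setdefault("annotations", {})[
            CORES_ANNOTATION
        ] = str(total_millis)
    return pod
```

Because the original request is preserved in the annotation, the
scheduler accounts against the (large) extended resource while CRI-O
can still weight CPU shares proportionally to what each pod asked for.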

### Risks and Mitigations

The first version of this feature must be enabled when a cluster is
deployed in order to work correctly. However, some of the enabling
configurations could be modified by an admin on day 2. We need to
document clearly that this is not supported.

It is possible to build a cluster with the feature enabled and then
deploy an operator or other workload that should take advantage of the
feature without first configuring the namespace properly. We need to
document that the configuration only applies to pods created after it
is set, so if an operator is installed before the namespace setting is
changed, the operator or its operands may need to be re-installed or
restarted.

The schedule for delivering this feature is very aggressive. We have
tried to minimize the number of places that need complex changes to
make it more likely that we can meet the deadline.

## Design Details

### Open Questions [optional]

1. If we think this feature is eventually going to be useful in
   regular clusters, do we want the settings in the `bootstrapInPlace`
   section? Should we add a `managementWorkloadPartitioning` section,
   or something similar, and say for now that it only applies when the
   single-node deployment approach is used?
2. What should the new field in the Infrastructure CRD be named?

### Test Plan

We will add a CI job to ensure that all release payload workloads and
their namespaces are labeled with `io.openshift.management: true`.

We will add a CI job to ensure that single-node deployments configured
with management workload partitioning pass the compliance tests.

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:

- Maturity levels
  - [`alpha`, `beta`, `stable` in upstream Kubernetes][maturity-levels]
  - `Dev Preview`, `Tech Preview`, `GA` in OpenShift
- [Deprecation policy][deprecation-policy]

Clearly define what graduation means by either linking to the [API doc
definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning),
or by redefining what graduation means.

In general, we try to use the same stages (alpha, beta, GA), regardless
of how the functionality is accessed.

[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/

**Examples**: These are generalized examples to consider, in addition
to the aforementioned [maturity levels][maturity-levels].

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

This new behavior will be added in 4.8 for single-node clusters only.

Enabling the feature after installation is not supported, so we do not
need to address what happens if an older cluster upgrades and then the
feature is turned on.

### Version Skew Strategy

N/A

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in
`Implementation History`.

## Drawbacks

Several of the changes described above are patches that we may end up
carrying downstream indefinitely. Some version of a more general "CPU
pool" feature may be acceptable upstream, and we could reimplement
management workload partitioning to use that new implementation.

## Alternatives

### Use PAO to configure this behavior on day 2

We could use the PAO to apply some of the configuration for the
kubelet on day 2. That would require extra reboot(s), which we want to
avoid because the amount of time it takes to install is already too
long for the goals of some customers.

### CRI-O Alternatives

* Use just the annotation in a pod, so we can keep the runtime class
  for a future use case.
* Have the cpuset configured at the `runtime_class` level instead of
  in the top-level config.
* Have the cpuset configured as the value of `io.openshift.management`
  instead of hard-coded.
  * This option is not optimal because it requires multiple locations
    where the cpuset has to be configured (in the admission controller
    that will inject the annotation).
* Have two different annotations, rather than just one.
  * This is only needed if we decide to configure the cpuset in the
    annotation.

## Infrastructure Needed [optional]

Use this section if you need things from the project. Examples include a new
subproject, repos requested, github details, and/or testing infrastructure.

Listing these here allows the community to get the process for these resources
started right away.