From 290b755a0426e8591dc8ba0216622f6aca0c9b7a Mon Sep 17 00:00:00 2001
From: Doug Hellmann
Date: Thu, 4 Feb 2021 11:52:02 -0500
Subject: [PATCH] automated resource request scaling

Signed-off-by: Doug Hellmann
---
 .../automated-resource-request-scaling.md | 517 ++++++++++++++++++
 1 file changed, 517 insertions(+)
 create mode 100644 enhancements/automated-resource-request-scaling.md

diff --git a/enhancements/automated-resource-request-scaling.md b/enhancements/automated-resource-request-scaling.md
new file mode 100644
index 00000000000..85a8c2ac1c0
--- /dev/null
+++ b/enhancements/automated-resource-request-scaling.md
@@ -0,0 +1,517 @@
---
title: automated-resource-request-scaling
authors:
  - "@dhellmann"
  - "@csrwng"
  - "@wking"
  - "@browsell"
reviewers:
  - TBD
approvers:
  - "@smarterclayton"
  - "@derekwaynecarr"
  - "@markmc"
creation-date: 2021-02-04
last-updated: 2021-02-04
status: provisional
see-also:
  - "/enhancements/single-node-production-deployment-approach.md"
---

# Automated Resource Request Scaling for Control Plane

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement describes an approach to allow us to scale the
resource requests for the control plane services to reduce their
consumption in constrained environments. This will be especially
useful for single-node production deployments, where the user wants to
reserve most of the CPU resources for their own workloads and needs to
configure OpenShift to run on a fixed number of CPUs within the host.

One example of this use case is seen in telecommunications service
providers' implementations of a Radio Access Network (RAN). This use
case is discussed in more detail below.

## Motivation

The resource requests for cluster operators and their operands are
based on performance analysis performed on multiple cloud VMs, using
the end-to-end test suite to gather data. While the resulting numbers
work well for similar cloud environments, and even for multi-node bare
metal deployments, they result in over-provisioning resources for
single-node deployments.

### Radio Access Network (RAN) Use Case

In the context of telecommunications service providers' 5G Radio Access
Networks, it is increasingly common to see "cloud native" implementations
of the 5G Distributed Unit (DU) component. Due to latency constraints,
this DU component needs to be deployed very close to the radio antenna
for which it is responsible. In practice, this can mean running the
component on anything from a single server at the base of a remote cell
tower to a datacenter-like environment serving several base stations.

A hypothetical DU example is an unusually resource-intensive workload,
requiring 20 dedicated cores, 24 GiB of RAM consumed as huge pages,
multiple SR-IOV NICs carrying several Gbps of traffic each, and
specialized accelerator devices. The node hosting this workload must
run a realtime kernel, be carefully tuned to ensure low-latency
requirements can be met, and be configured to support features like
Precision Time Protocol (PTP).
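
To make the shape of such a workload concrete, the sketch below shows
roughly how its resource requests might appear in a Pod spec. This is a
hypothetical illustration: the image reference, the SR-IOV resource
name, and the plain-memory figure are invented, and only the CPU and
huge page quantities come from the example above.

```yaml
# Hypothetical DU Pod, trimmed to the resource stanza. Names marked as
# invented are placeholders, not real resources shipped with OpenShift.
apiVersion: v1
kind: Pod
metadata:
  name: example-du
spec:
  containers:
  - name: du
    image: registry.example.com/ran/du:latest   # invented image reference
    resources:
      requests:
        cpu: "20"                    # dedicated cores from the example above
        memory: 4Gi                  # ordinary RAM, invented figure
        hugepages-1Gi: 24Gi          # RAM consumed as huge pages
        example.com/sriov-vf: "4"    # invented SR-IOV device resource name
      limits:
        cpu: "20"                    # limits match requests so the Pod is in
        memory: 4Gi                  # the Guaranteed QoS class and can be
        hugepages-1Gi: 24Gi          # given exclusive, pinned CPUs
        example.com/sriov-vf: "4"
```

Every hyperthread requested this way is unavailable to OpenShift
itself, which is what makes the platform overhead budget described
below so tight.
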
The most constrained resource for RAN deployments is CPU, so for now
the focus of this enhancement is scaling CPU requests. Kubernetes
resource requests for CPU time are measured in fractional
"[Kubernetes CPUs](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu)".
For bare metal, 1 Kubernetes CPU is equivalent to 1 hyperthread.

Because the workload itself is so resource-intensive, the CPU budget
left for platform components such as OpenShift and add-on operators is
severely restricted compared to other uses. For example, one customer
has allocated 4 hyperthreads for all components other than their own
workloads, while OpenShift currently requires 7.

### Goals

* This enhancement describes an approach for configuring OpenShift
  clusters to run with lower CPU resource requests than the default
  configuration.
* Clusters built in this way should pass the same Kubernetes and
  OpenShift conformance and functional end-to-end tests as single-node
  deployments that have not been scaled down.
* We have a goal of 4 hyperthreads today, but we do not know what
  constraints we might be given in the future, so we need a solution
  that is not tied to the current known limit.

### Non-Goals

* This enhancement does not address memory resource requests.
* This enhancement does not address ways to disable operators or
  operands entirely.
* Although the results of this enhancement may be useful for
  single-node developer deployments, multi-node production
  deployments, and Internet-of-Things devices, those use cases are not
  addressed directly in this document.

## Proposal

There are two aspects to managing the scaling of the control plane
resource requests. Most of the cluster operators are deployed by the
cluster-version-operator using static manifests that include the
resource requests. Those operators then deploy other controllers or
workloads dynamically. We need to account for both sets of components.

This is where we get down to the nitty gritty of what the proposal actually is.

### User Stories

Detail the things that people will be able to do if this is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.

Include a story on how this proposal will be operationalized: lifecycled, monitored and remediated at scale.

#### Story 1

#### Story 2

### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation? What are some important details
that didn't come across above? Go into as much detail as necessary here. This
might be a good place to talk about core concepts and how they relate.

### Risks and Mitigations

What are the risks of this proposal, and how do we mitigate them? Think
broadly. For example, consider both security and how this will impact the
larger OKD ecosystem.

How will security be reviewed and by whom? How will UX be reviewed and by whom?

Consider including folks that also work outside your immediate sub-project.

## Design Details

### Open Questions

1. The cluster-version-operator manages cluster operators using static
   manifests. How would we make changes to the settings of Deployments
   defined in those manifests (see the illustrative sketch after this
   list)?
2. How do we express to the cluster operators that they should scale
   their operands?
3. How dynamic do we actually want the setting to be? Should the user
   be able to change it at will, or is it a deployment-time setting?
4. What happens during an upgrade? The new manifests from the new
   release payload would have unscaled resource requests.
5. How do we handle changes in the base request values for a cluster
   operator over an upgrade?
6. Are operators allowed to place a lower bound on the scaling? Is it
   possible to request too few resources, oversubscribe a CPU, and
   cause performance problems for the cluster?
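
As a point of reference for the first question, the snippet below is an
illustrative stand-in for the kind of Deployment manifest the
cluster-version-operator applies from the release payload. The operator
name, namespace, image, and request values are all invented; real
manifests differ for each operator.

```yaml
# Illustrative only: not an actual manifest from the release payload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-cluster-operator
  namespace: openshift-example-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: example-cluster-operator
  template:
    metadata:
      labels:
        name: example-cluster-operator
    spec:
      containers:
      - name: operator
        image: quay.io/openshift/example-operator:latest   # placeholder
        resources:
          requests:
            cpu: 10m      # hard-coded in the static manifest
            memory: 50Mi
```

Because the cluster-version-operator reconciles these manifests
continuously, editing the live Deployment alone is not a durable
answer; something has to change either the manifest content or the way
the CVO applies it, which is what the alternatives later in this
document explore.
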
### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
- What additional testing is necessary to support managed OpenShift service-based offerings?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:

- Maturity levels
  - [`alpha`, `beta`, `stable` in upstream Kubernetes][maturity-levels]
  - `Dev Preview`, `Tech Preview`, `GA` in OpenShift
- [Deprecation policy][deprecation-policy]

Clearly define what graduation means by either linking to the [API doc definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning),
or by redefining what graduation means.

In general, we try to use the same stages (alpha, beta, GA), regardless how the functionality is accessed.

[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/

**Examples**: These are generalized examples to consider, in addition
to the aforementioned [maturity levels][maturity-levels].

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

If applicable, how will the component be upgraded and downgraded? Make sure this
is in the test plan.

Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.)
is an existing + cluster required to make on upgrade in order to keep previous behavior? +- What changes (in invocations, configurations, API use, etc.) is an existing + cluster required to make on upgrade in order to make use of the enhancement? + +Upgrade expectations: +- Each component should remain available for user requests and + workloads during upgrades. Ensure the components leverage best practices in handling [voluntary disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to this should be + identified and discussed here. +- Micro version upgrades - users should be able to skip forward versions within a + minor release stream without being required to pass through intermediate + versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1` + as an intermediate step. +- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade + steps. So, for example, it is acceptable to require a user running 4.3 to + upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step. +- While an upgrade is in progress, new component versions should + continue to operate correctly in concert with older component + versions (aka "version skew"). For example, if a node is down, and + an operator is rolling out a daemonset, the old and new daemonset + pods must continue to work correctly even while the cluster remains + in this partially upgraded state for some time. + +Downgrade expectations: +- If an `N->N+1` upgrade fails mid-way through, or if the `N+1` cluster is + misbehaving, it should be possible for the user to rollback to `N`. It is + acceptable to require some documented manual steps in order to fully restore + the downgraded cluster to its previous state. Examples of acceptable steps + include: + - Deleting any CVO-managed resources added by the new version. The + CVO does not currently delete resources that no longer exist in + the target version. + +### Version Skew Strategy + +How will the component handle version skew with other components? +What are the guarantees? Make sure this is in the test plan. + +Consider the following in developing a version skew strategy for this +enhancement: +- During an upgrade, we will always have skew among components, how will this impact your work? +- Does this enhancement involve coordinating behavior in the control plane and + in the kubelet? How does an n-2 kubelet without this feature available behave + when this feature is used? +- Will any other components on the node change? For example, changes to CSI, CRI + or CNI may require updating that component before the kubelet. + +## Implementation History + +Major milestones in the life cycle of a proposal should be tracked in `Implementation +History`. + +## Drawbacks + +The idea is to find the best form of an argument why this enhancement should _not_ be implemented. + +## Alternatives + +The best approach to achieve the goals described above is not clear, +so the alternatives section is a collection of candidates for +discussion. + +### Expressing a need to scale using an API + +These alternatives discuss what the API would look like, without +considering which CRD it would be on. + +#### A CPU resource request size parameter + +We could define an API to give a general size parameter, using values +like t-shirt sizes (large, medium, small) or flavors. + +Such an API is vague, and it would be up to teams to interpret the +size value appropriately. 
We would also be limited in the number of sizes we could reasonably
describe to users before it would become confusing (see the number of
Amazon EC2 instance types).

#### A CPU resource "request limit" parameter

We could define an API to give an absolute limit on the CPU resource
requests for the cluster services.

Operators would need to scale the resource requests for their operands
based on an understanding of how their defaults were derived. For
example, today most values are a proportion of the resources needed
for etcd on a 3-node control plane where each node has 2 CPUs, for a
total of 6 CPUs. If the API indicates a limit of 2 CPUs, the operator
would need to configure its operands to request 1/3 of the default
resources.

#### A CPU resource scaling parameter

We could define an API to give a scaling parameter that operators
should use for their operands.

The value would be applied globally, based on an understanding of how
the defaults are set for 6 CPUs and how many CPUs are available in the
real system. We could document the 6-CPU starting point to allow users
to make this calculation themselves.

Operators could read the setting and use it for their operands, but
not for themselves.

### Locations for API setting

These alternatives could apply to any of the API forms discussed in
the previous section.

#### Use a cluster-wide API

We could add a field to the Infrastructure configuration
resource. Many of our operators, including OLM-based operators,
already consume the Infrastructure API.
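
As a concrete illustration of this alternative, the sketch below shows
what a cluster-wide setting might look like if we used the
scaling-parameter form on the existing Infrastructure resource. The
field name, its type, and its placement under `status` are all
hypothetical; settling those details is part of this design discussion.

```yaml
# Hypothetical sketch only; no such field exists today.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  # Invented field: a multiplier operators would apply to their default
  # CPU requests, where "1.0" means "use today's defaults".
  cpuRequestScaling: "0.25"
```

If we went this way, the value would probably be carried as a string or
an integer percentage, since Kubernetes API conventions discourage
floating-point fields.
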
#### Use an operator-specific API

If we choose to implement a new operator to manage scaling for the
control plane operators (see below), it could define a new API in a
CRD it owns.

Updating existing operators to read the new field and scale their
operands, and documenting an entirely new API, might be more work than
using the Infrastructure API, and a CRD owned by a new operator would
only be useful if the operator responding to the API was an add-on.

### Dynamic vs. Static Setting

It is not likely that users would need to change the setting after
building a cluster, so we could have the installer set a status field
in whatever CRD we choose for the API. If we do need to support
changing the setting later, that work can be described in another
enhancement.

### Managing settings for operators installed by cluster-version-operator

The alternatives in this subsection cover the need to change the
settings for the static manifests currently managed by the
cluster-version-operator. These approaches would only apply to the
operators themselves, and not their operands.

#### Use a cluster profile to change the resource requests for control plane operators

We could create a separate cluster profile with different resource
requests for our control plane components.

This would be quite a lot of work to implement and test.

We may eventually need different profiles for different customers,
further multiplying the amount of work.

As an organization, we are trying to keep the number of profiles
small as a general rule.

Profile settings only apply to the cluster operators' Deployments, and
changing the resource requests in the static manifests does not signal
to the operators that they need to change the settings for their operands.

#### Lower the resource requests for cluster operators statically in the default configuration

We could apply changes to the default resource requests in the static
manifests used by all deployments, instead of adding one or more new
cluster profiles.

This would affect all clusters, and if we are too aggressive with the
changes we may lower values too much and cause important cluster
operators to be starved for resources, triggering degraded cluster
performance.

The static settings only apply to the cluster operators' Deployments,
and changing them does not signal to the operators that they need to
change the settings for their operands.

#### Have a new operator in the release payload perform the scaling for control plane operators

The new operator would use the new API field to determine how to scale
(see above).

We would need to change the cluster-version-operator to ignore
resource request settings for cluster operators defined in static
manifests.

It would add yet another component running in the cluster, consuming
resources.

It could apply the request scaling to everything, a static subset, or
more dynamically based on label selectors or namespaces.

Having an operator change the settings on manifests installed by the
cluster-version-operator may cause some "thrashing" during install and
upgrade.

#### Have a webhook perform the scaling for control plane operators

The webhook would use the new API field to determine how to scale (see
above).

We would need to change the cluster-version-operator to ignore
resource request settings for cluster operators defined in static
manifests.

It would add yet another component running in the cluster, consuming
resources.

It could apply the request scaling to everything, a static subset, or
more dynamically based on label selectors or namespaces.

We may find race conditions using a webhook to change settings for
cluster operators.
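
For illustration only, a webhook registration along these lines might
look like the sketch below. Every name, label, and namespace here is
hypothetical, and the choices it encodes (which resources to match,
which namespaces to select) are exactly the open questions described
above.

```yaml
# Hypothetical sketch only; none of these names exist today.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: resource-request-scaler
webhooks:
- name: scale.requests.hypothetical.openshift.io
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore      # avoid blocking rollouts if the webhook is down
  clientConfig:
    service:
      name: resource-request-scaler
      namespace: openshift-resource-request-scaler
      path: /mutate
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments"]
  namespaceSelector:
    matchLabels:
      scaling.openshift.io/scale-requests: "true"   # invented opt-in label
```

The `namespaceSelector` shows one way to limit the mutation to an
opt-in subset of namespaces instead of every Deployment in the cluster.
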
#### Have the cluster-version-operator perform the scaling for control plane operators

We could change the cluster-version-operator to scale the Deployments
for control plane operators.

The CVO would use the new API field to determine how to scale (see
above).

This additional logic would complicate the CVO (*How much?*), and there
is a general desire to avoid adding complexity to the main component
responsible for managing everything else.

One benefit of this approach is that installation and upgrade would be
handled in the same way, without thrashing, race conditions, or
requirements for add-on components.

What would apply the scaling to operator-lifecycle-manager (OLM) operators?

#### Have an out-of-cluster actor perform the scaling for control plane operators

An outside actor, such as an ACM policy applied via the klusterlet or
a one-time script, could change the resource settings.

An outside actor would not require an API change in the cluster.

We would need to change the cluster-version-operator to ignore
resource request settings for cluster operators defined in static
manifests.

Allowing an outside actor to change the resource requests turns them
into something a user might change themselves, which may not be ideal
for our testing and support matrix.

### Options for globally affecting resource requests

We could change some of the underlying components to affect scaling
without the knowledge of the rest of the cluster components.

#### Have the scheduler and kubelet perform the scaling based on priority class

*Needs more detail.*

Changing the kubelet by itself, or adding a plugin, would capture
everything as it is launched, but only after a Pod is scheduled. That
would mean we would need to leave enough overhead to schedule a large
component, even if the request is going to be rewritten to ask for a
fraction of that value. So we would have to change the scheduler to
ask for less and the kubelet to grant less, without changing the Pod
definition and triggering the ReplicaSet controller to rewrite the Pod
definition.

#### Have kubelet announce more capacity than is available

*Needs more detail.*

We could add something to the kubelet that announces more CPU capacity
than is actually available, and places system-critical workloads into
the right area by subtracting the limits.

Presenting false data to influence scheduling may be very tricky to
get right and would complicate debugging and support.

## Infrastructure Needed [optional]

Use this section if you need things from the project. Examples include a new
subproject, repos requested, github details, and/or testing infrastructure.

Listing these here allows the community to get the process for these resources
started right away.