From 290b755a0426e8591dc8ba0216622f6aca0c9b7a Mon Sep 17 00:00:00 2001
From: Doug Hellmann
Date: Thu, 4 Feb 2021 11:52:02 -0500
Subject: [PATCH] automated resource request scaling

Signed-off-by: Doug Hellmann
---
 .../automated-resource-request-scaling.md | 517 ++++++++++++++++++
 1 file changed, 517 insertions(+)
 create mode 100644 enhancements/automated-resource-request-scaling.md

diff --git a/enhancements/automated-resource-request-scaling.md b/enhancements/automated-resource-request-scaling.md
new file mode 100644
index 00000000000..85a8c2ac1c0
--- /dev/null
+++ b/enhancements/automated-resource-request-scaling.md
@@ -0,0 +1,517 @@
---
title: automated-resource-request-scaling
authors:
  - "@dhellmann"
  - "@csrwng"
  - "@wking"
  - "@browsell"
reviewers:
  - TBD
approvers:
  - "@smarterclayton"
  - "@derekwaynecarr"
  - "@markmc"
creation-date: 2021-02-04
last-updated: 2021-02-04
status: provisional
see-also:
  - "/enhancements/single-node-production-deployment-approach.md"
---

# Automated Resource Request Scaling for Control Plane

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement describes an approach to allow us to scale the
resource requests for the control plane services to reduce their
consumption in constrained environments. This will be especially
useful for single-node production deployments, where the user wants to
reserve most of the CPU resources for their own workloads and needs to
configure OpenShift to run on a fixed number of CPUs within the host.

One example of this use case is seen in telecommunications service
providers' implementations of a Radio Access Network (RAN). This use
case is discussed in more detail below.

## Motivation

The resource requests for cluster operators and their operands are
based on performance analysis performed on multiple cloud VMs, using
the end-to-end test suite to gather data. While the resulting numbers
work well for similar cloud environments, and even for multi-node bare
metal deployments, they result in over-provisioning resources for
single-node deployments.

### Radio Access Network (RAN) Use Case

In the context of telecommunications service providers' 5G Radio Access
Networks, it is increasingly common to see "cloud native" implementations
of the 5G Distributed Unit (DU) component. Due to latency constraints,
this DU component needs to be deployed very close to the radio antenna
for which it is responsible. In practice, this can mean running the
component on anything from a single server at the base of a remote cell
tower to a datacenter-like environment serving several base stations.

A hypothetical DU example is an unusually resource-intensive workload,
requiring 20 dedicated cores, 24 GiB of RAM consumed as huge pages,
multiple SR-IOV NICs carrying several Gbps of traffic each, and
specialized accelerator devices. The node hosting this workload must
run a realtime kernel, be carefully tuned to ensure low-latency
requirements can be met, and be configured to support features like
Precision Time Protocol (PTP).
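
To make the shape of such a workload concrete, the sketch below shows
roughly how its resource requests might appear in a Pod spec. This is a
hypothetical illustration: the image reference, the SR-IOV resource
name, and the plain-memory figure are invented, and only the CPU and
huge page quantities come from the example above.

```yaml
# Hypothetical DU Pod, trimmed to the resource stanza. Names marked as
# invented are placeholders, not real resources shipped with OpenShift.
apiVersion: v1
kind: Pod
metadata:
  name: example-du
spec:
  containers:
  - name: du
    image: registry.example.com/ran/du:latest   # invented image reference
    resources:
      requests:
        cpu: "20"                    # dedicated cores from the example above
        memory: 4Gi                  # ordinary RAM, invented figure
        hugepages-1Gi: 24Gi          # RAM consumed as huge pages
        example.com/sriov-vf: "4"    # invented SR-IOV device resource name
      limits:
        cpu: "20"                    # limits match requests so the Pod is in
        memory: 4Gi                  # the Guaranteed QoS class and can be
        hugepages-1Gi: 24Gi          # given exclusive, pinned CPUs
        example.com/sriov-vf: "4"
```

Every hyperthread requested this way is unavailable to OpenShift
itself, which is what makes the platform overhead budget described
below so tight.
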
The most constrained resource for RAN deployments is CPU, so for now
the focus of this enhancement is scaling CPU requests. Kubernetes
resource requests for CPU time are measured in fractional
"[Kubernetes CPUs](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu)".
For bare metal, 1 Kubernetes CPU is equivalent to 1 hyperthread.

Because the workload itself is so resource-intensive, the CPU budget
left for platform components such as OpenShift and add-on operators is
severely restricted compared to other uses. For example, one customer
has allocated 4 hyperthreads for all components other than their own
workloads, while OpenShift currently requires 7.

### Goals

* This enhancement describes an approach for configuring OpenShift
  clusters to run with lower CPU resource requests than the default
  configuration.
* Clusters built in this way should pass the same Kubernetes and
  OpenShift conformance and functional end-to-end tests as single-node
  deployments that have not been scaled down.
* We have a goal of 4 hyperthreads today, but we do not know what
  constraints we might be given in the future, so we need a solution
  that is not tied to the current known limit.

### Non-Goals

* This enhancement does not address memory resource requests.
* This enhancement does not address ways to disable operators or
  operands entirely.
* Although the results of this enhancement may be useful for
  single-node developer deployments, multi-node production
  deployments, and Internet-of-Things devices, those use cases are not
  addressed directly in this document.

## Proposal

There are two aspects to managing the scaling of the control plane
resource requests. Most of the cluster operators are deployed by the
cluster-version-operator using static manifests that include the
resource requests. Those operators then deploy other controllers or
workloads dynamically. We need to account for both sets of components.

This is where we get down to the nitty gritty of what the proposal actually is.

### User Stories

Detail the things that people will be able to do if this is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.

Include a story on how this proposal will be operationalized: lifecycled, monitored and remediated at scale.

#### Story 1

#### Story 2

### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation? What are some important details
that didn't come across above? Go into as much detail as necessary here. This
might be a good place to talk about core concepts and how they relate.

### Risks and Mitigations

What are the risks of this proposal, and how do we mitigate them? Think
broadly. For example, consider both security and how this will impact the
larger OKD ecosystem.

How will security be reviewed and by whom? How will UX be reviewed and by whom?

Consider including folks that also work outside your immediate sub-project.

## Design Details

### Open Questions

1. The cluster-version-operator manages cluster operators using static
   manifests. How would we make changes to the settings of Deployments
   defined in those manifests (see the illustrative sketch after this
   list)?
2. How do we express to the cluster operators that they should scale
   their operands?
3. How dynamic do we actually want the setting to be? Should the user
   be able to change it at will, or is it a deployment-time setting?
4. What happens during an upgrade? The new manifests from the new
   release payload would have unscaled resource requests.
5. How do we handle changes in the base request values for a cluster
   operator over an upgrade?
6. Are operators allowed to place a lower bound on the scaling? Is it
   possible to request too few resources, oversubscribe a CPU, and
   cause performance problems for the cluster?
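
As a point of reference for the first question, the snippet below is an
illustrative stand-in for the kind of Deployment manifest the
cluster-version-operator applies from the release payload. The operator
name, namespace, image, and request values are all invented; real
manifests differ for each operator.

```yaml
# Illustrative only: not an actual manifest from the release payload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-cluster-operator
  namespace: openshift-example-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: example-cluster-operator
  template:
    metadata:
      labels:
        name: example-cluster-operator
    spec:
      containers:
      - name: operator
        image: quay.io/openshift/example-operator:latest   # placeholder
        resources:
          requests:
            cpu: 10m      # hard-coded in the static manifest
            memory: 50Mi
```

Because the cluster-version-operator reconciles these manifests
continuously, editing the live Deployment alone is not a durable
answer; something has to change either the manifest content or the way
the CVO applies it, which is what the alternatives later in this
document explore.
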
### Test Plan

**Note:** *Section not required until targeted at a release.*

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
- What additional testing is necessary to support managed OpenShift service-based offerings?

No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

### Graduation Criteria

**Note:** *Section not required until targeted at a release.*

Define graduation milestones.

These may be defined in terms of API maturity, or as something else. Initial proposal
should keep this high-level with a focus on what signals will be looked at to
determine graduation.

Consider the following in developing the graduation criteria for this
enhancement:

- Maturity levels
  - [`alpha`, `beta`, `stable` in upstream Kubernetes][maturity-levels]
  - `Dev Preview`, `Tech Preview`, `GA` in OpenShift
- [Deprecation policy][deprecation-policy]

Clearly define what graduation means by either linking to the [API doc definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning),
or by redefining what graduation means.

In general, we try to use the same stages (alpha, beta, GA), regardless how the functionality is accessed.

[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/

**Examples**: These are generalized examples to consider, in addition
to the aforementioned [maturity levels][maturity-levels].

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

If applicable, how will the component be upgraded and downgraded? Make sure this
is in the test plan.

Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.)
is an existing + cluster required to make on upgrade in order to keep previous behavior? +- What changes (in invocations, configurations, API use, etc.) is an existing + cluster required to make on upgrade in order to make use of the enhancement? + +Upgrade expectations: +- Each component should remain available for user requests and + workloads during upgrades. Ensure the components leverage best practices in handling [voluntary disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to this should be + identified and discussed here. +- Micro version upgrades - users should be able to skip forward versions within a + minor release stream without being required to pass through intermediate + versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1` + as an intermediate step. +- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade + steps. So, for example, it is acceptable to require a user running 4.3 to + upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step. +- While an upgrade is in progress, new component versions should + continue to operate correctly in concert with older component + versions (aka "version skew"). For example, if a node is down, and + an operator is rolling out a daemonset, the old and new daemonset + pods must continue to work correctly even while the cluster remains + in this partially upgraded state for some time. + +Downgrade expectations: +- If an `N->N+1` upgrade fails mid-way through, or if the `N+1` cluster is + misbehaving, it should be possible for the user to rollback to `N`. It is + acceptable to require some documented manual steps in order to fully restore + the downgraded cluster to its previous state. Examples of acceptable steps + include: + - Deleting any CVO-managed resources added by the new version. The + CVO does not currently delete resources that no longer exist in + the target version. + +### Version Skew Strategy + +How will the component handle version skew with other components? +What are the guarantees? Make sure this is in the test plan. + +Consider the following in developing a version skew strategy for this +enhancement: +- During an upgrade, we will always have skew among components, how will this impact your work? +- Does this enhancement involve coordinating behavior in the control plane and + in the kubelet? How does an n-2 kubelet without this feature available behave + when this feature is used? +- Will any other components on the node change? For example, changes to CSI, CRI + or CNI may require updating that component before the kubelet. + +## Implementation History + +Major milestones in the life cycle of a proposal should be tracked in `Implementation +History`. + +## Drawbacks + +The idea is to find the best form of an argument why this enhancement should _not_ be implemented. + +## Alternatives + +The best approach to achieve the goals described above is not clear, +so the alternatives section is a collection of candidates for +discussion. + +### Expressing a need to scale using an API + +These alternatives discuss what the API would look like, without +considering which CRD it would be on. + +#### A CPU resource request size parameter + +We could define an API to give a general size parameter, using values +like t-shirt sizes (large, medium, small) or flavors. + +Such an API is vague, and it would be up to teams to interpret the +size value appropriately. 
We would also be limited in the number of sizes we could reasonably
describe to users before it would become confusing (see the number of
Amazon EC2 instance types).

#### A CPU resource "request limit" parameter

We could define an API to give an absolute limit on the CPU resource
requests for the cluster services.

Operators would need to scale the resource requests for their operands
based on an understanding of how their defaults were derived. For
example, today most values are a proportion of the resources needed
for etcd on a 3-node control plane where each node has 2 CPUs, for a
total of 6 CPUs. If the API indicates a limit of 2 CPUs, the operator
would need to configure its operands to request 1/3 of the default
resources.

#### A CPU resource scaling parameter

We could define an API to give a scaling parameter that operators
should use for their operands.

The value would be applied globally, based on an understanding of how
the defaults are set for 6 CPUs and how many CPUs are available in the
real system. We could document the 6-CPU starting point to allow users
to make this calculation themselves.

Operators could read the setting and use it for their operands, but
not for themselves.

### Locations for API setting

These alternatives could apply to any of the API forms discussed in
the previous section.

#### Use a cluster-wide API

We could add a field to the Infrastructure configuration
resource. Many of our operators, including OLM-based operators,
already consume the Infrastructure API.
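
As a concrete illustration of this alternative, the sketch below shows
what a cluster-wide setting might look like if we used the
scaling-parameter form on the existing Infrastructure resource. The
field name, its type, and its placement under `status` are all
hypothetical; settling those details is part of this design discussion.

```yaml
# Hypothetical sketch only; no such field exists today.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  # Invented field: a multiplier operators would apply to their default
  # CPU requests, where "1.0" means "use today's defaults".
  cpuRequestScaling: "0.25"
```

If we went this way, the value would probably be carried as a string or
an integer percentage, since Kubernetes API conventions discourage
floating-point fields.
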
#### Use an operator-specific API

If we choose to implement a new operator to manage scaling for the
control plane operators (see below), it could define a new API in a
CRD it owns.

Updating existing operators to read the new field and scale their
operands, and documenting an entirely new API, might be more work than
using the Infrastructure API, and a CRD owned by a new operator would
only be useful if the operator responding to the API was an add-on.

### Dynamic vs. Static Setting

It is not likely that users would need to change the setting after
building a cluster, so we could have the installer set a status field
in whatever CRD we choose for the API. If we do need to support
changing the setting later, that work can be described in another
enhancement.

### Managing settings for operators installed by cluster-version-operator

The alternatives in this subsection cover the need to change the
settings for the static manifests currently managed by the
cluster-version-operator. These approaches would only apply to the
operators themselves, and not their operands.

#### Use a cluster profile to change the resource requests for control plane operators

We could create a separate cluster profile with different resource
requests for our control plane components.

This would be quite a lot of work to implement and test.

We may eventually need different profiles for different customers,
further multiplying the amount of work.

As an organization, we are trying to keep the number of profiles
small as a general rule.

Profile settings only apply to the cluster operators' Deployments, and
changing the resource requests in the static manifests does not signal
to the operators that they need to change the settings for their operands.

#### Lower the resource requests for cluster operators statically in the default configuration

We could apply changes to the default resource requests in the static
manifests used by all deployments, instead of adding one or more new
cluster profiles.

This would affect all clusters, and if we are too aggressive with the
changes we may lower values too much and cause important cluster
operators to be starved for resources, triggering degraded cluster
performance.

The static settings only apply to the cluster operators' Deployments,
and changing them does not signal to the operators that they need to
change the settings for their operands.

#### Have a new operator in the release payload perform the scaling for control plane operators

The new operator would use the new API field to determine how to scale
(see above).

We would need to change the cluster-version-operator to ignore
resource request settings for cluster operators defined in static
manifests.

It would add yet another component running in the cluster, consuming
resources.

It could apply the request scaling to everything, a static subset, or
more dynamically based on label selectors or namespaces.

Having an operator change the settings on manifests installed by the
cluster-version-operator may cause some "thrashing" during install and
upgrade.

#### Have a webhook perform the scaling for control plane operators

The webhook would use the new API field to determine how to scale (see
above).

We would need to change the cluster-version-operator to ignore
resource request settings for cluster operators defined in static
manifests.

It would add yet another component running in the cluster, consuming
resources.

It could apply the request scaling to everything, a static subset, or
more dynamically based on label selectors or namespaces.

We may find race conditions using a webhook to change settings for
cluster operators.
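
For illustration only, a webhook registration along these lines might
look like the sketch below. Every name, label, and namespace here is
hypothetical, and the choices it encodes (which resources to match,
which namespaces to select) are exactly the open questions described
above.

```yaml
# Hypothetical sketch only; none of these names exist today.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: resource-request-scaler
webhooks:
- name: scale.requests.hypothetical.openshift.io
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore      # avoid blocking rollouts if the webhook is down
  clientConfig:
    service:
      name: resource-request-scaler
      namespace: openshift-resource-request-scaler
      path: /mutate
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments"]
  namespaceSelector:
    matchLabels:
      scaling.openshift.io/scale-requests: "true"   # invented opt-in label
```

The `namespaceSelector` shows one way to limit the mutation to an
opt-in subset of namespaces instead of every Deployment in the cluster.
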
#### Have the cluster-version-operator perform the scaling for control plane operators

We could change the cluster-version-operator to scale the Deployments
for control plane operators.

The CVO would use the new API field to determine how to scale (see
above).

This additional logic would complicate the CVO (*How much?*), and there
is a general desire to avoid adding complexity to the main component
responsible for managing everything else.

One benefit of this approach is that installation and upgrade would be
handled in the same way, without thrashing, race conditions, or
requirements for add-on components.

What would apply the scaling to operator-lifecycle-manager (OLM) operators?

#### Have an out-of-cluster actor perform the scaling for control plane operators

An outside actor, such as an ACM policy applied via the klusterlet or
a one-time script, could change the resource settings.

An outside actor would not require an API change in the cluster.

We would need to change the cluster-version-operator to ignore
resource request settings for cluster operators defined in static
manifests.

Allowing an outside actor to change the resource requests turns them
into something a user might change themselves, which may not be ideal
for our testing and support matrix.

### Options for globally affecting resource requests

We could change some of the underlying components to affect scaling
without the knowledge of the rest of the cluster components.

#### Have the scheduler and kubelet perform the scaling based on priority class

*Needs more detail.*

Changing the kubelet by itself, or adding a plugin, would capture
everything as it is launched, but only after a Pod is scheduled. That
would mean we would need to leave enough overhead to schedule a large
component, even if the request is going to be rewritten to ask for a
fraction of that value. So we would have to change the scheduler to
ask for less and the kubelet to grant less, without changing the Pod
definition and triggering the ReplicaSet controller to rewrite the Pod
definition.

#### Have kubelet announce more capacity than is available

*Needs more detail.*

We could add something to the kubelet that announces more CPU capacity
than is actually available, and places system-critical workloads into
the right area by subtracting the limits.

Presenting false data to influence scheduling may be very tricky to
get right and would complicate debugging and support.

## Infrastructure Needed [optional]

Use this section if you need things from the project. Examples include a new
subproject, repos requested, github details, and/or testing infrastructure.

Listing these here allows the community to get the process for these resources
started right away.