From a1382ec79022311285fffb21d4c5e51f7e2fcaa2 Mon Sep 17 00:00:00 2001 From: Patrick Ohly Date: Tue, 5 Sep 2023 12:24:00 +0200 Subject: [PATCH 01/13] DRA: promotion to beta This updates the README to reflect what has been done and fills in sections that were left out earlier. The next milestone is 1.29. --- keps/prod-readiness/sig-node/3063.yaml | 2 + .../README.md | 145 ++++++++++++------ .../3063-dynamic-resource-allocation/kep.yaml | 2 +- 3 files changed, 99 insertions(+), 50 deletions(-) diff --git a/keps/prod-readiness/sig-node/3063.yaml b/keps/prod-readiness/sig-node/3063.yaml index 784b9d8f910..08acf5840b3 100644 --- a/keps/prod-readiness/sig-node/3063.yaml +++ b/keps/prod-readiness/sig-node/3063.yaml @@ -4,3 +4,5 @@ kep-number: 3063 alpha: approver: "@johnbelamaric" +beta: + approver: "@johnbelamaric" diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md index 78433285077..71b0aed990c 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/README.md +++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md @@ -721,8 +721,8 @@ For a resource driver the following components are needed: - *Resource kubelet plugin*: a component which cooperates with kubelet to prepare the usage of the resource on a node. -An utility library for resource drivers will be developed outside of Kubernetes -and does not have to be used by drivers, therefore it is not described further +An [utility library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/dynamic-resource-allocation) for resource drivers was developed. +It does not have to be used by drivers, therefore it is not described further in this KEP. ### State and communication @@ -962,14 +962,6 @@ arbitrarily. Some combinations are more useful than others: ### Coordinating resource allocation through the scheduler -<<[UNRESOLVED pohly]>> -The entire scheduling section is tentative. Key opens: -- Support arbitrary combinations of user- vs. Kubernetes-managed ResourceClaims - and immediate vs. late allocation? - https://github.com/kubernetes/enhancements/pull/3064#discussion_r901948474 -<<[/UNRESOLVED]>> - - For immediate allocation, scheduling Pods is simple because the resource is already allocated and determines the nodes on which the Pod may run. The downside is that pod scheduling is less flexible. @@ -1399,16 +1391,6 @@ type AllocationResult struct { // than one consumer at a time. // +optional Shareable bool - - <<[UNRESOLVED pohly]>> - We will have to discuss use cases and real resource drivers that - support sharing before deciding on a) which limit is useful and - b) whether we need a different API that supports an unlimited - number of users. - - Any solution that handles reservations differently will have to - be very careful about race conditions. - <<[/UNRESOLVED]>> } // AllocationResultResourceHandlesMaxSize represents the maximum number of @@ -2426,13 +2408,23 @@ For Beta and GA, add links to added tests together with links to k8s-triage for https://storage.googleapis.com/k8s-triage/index.html --> -The existing integration tests for kube-scheduler and kubelet will get extended -to cover scenarios involving dynamic resources. A new integration test will get -added for the dynamic resource controller. 
+The existing [integration tests for kube-scheduler which measure
+performance](https://github.com/kubernetes/kubernetes/tree/master/test/integration/scheduler_perf#readme)
+were extended to also [cover
+DRA](https://github.com/kubernetes/kubernetes/blob/294bde0079a0d56099cf8b8cf558e3ae7230de12/test/integration/scheduler_perf/config/performance-config.yaml#L717-L779)
+and to run as [correctness
+tests](https://github.com/kubernetes/kubernetes/commit/cecebe8ea2feee856bc7a62f4c16711ee8a5f5d9)
+as part of the normal Kubernetes "integration" jobs. That also covers [the
+dynamic resource
+controller](https://github.com/kubernetes/kubernetes/blob/294bde0079a0d56099cf8b8cf558e3ae7230de12/test/integration/scheduler_perf/util.go#L135-L139).
+
+The existing integration tests for kubelet were extended to cover scenarios involving dynamic resources.

For beta:

-- <test>: <link to test coverage>
+- kube-scheduler, kube-controller-manager: http://perf-dash.k8s.io/#/, [`k8s.io/kubernetes/test/integration/scheduler_perf.scheduler_perf`](https://testgrid.k8s.io/sig-release-master-blocking#integration-master)
+- kubelet: ...
+

##### e2e tests

@@ -2447,12 +2439,12 @@ We expect no non-infra related flakes in the last month as a GA graduation crite

End-to-end testing depends on a working resource driver and a container runtime
-with CDI support. A mock driver will be developed in parallel to developing the
-code in Kubernetes, but as it will depend on the new APIs, we have to get those
-merged first.
+with CDI support. A [test driver](https://github.com/kubernetes/kubernetes/tree/master/test/e2e/dra/test-driver)
+was developed in parallel with the
+code in Kubernetes.

-Such a mock driver could be as simple as taking parameters from ResourceClass
-and ResourceClaim and turning them into environment variables that then get
+That test driver simply takes parameters from ResourceClass
+and ResourceClaim and turns them into environment variables that then get
checked inside containers. Tests for different behavior of a driver in various
scenarios can be simulated by running the control-plane part of it in the E2E
test itself. For interaction with kubelet, proxying of the gRPC interface can
be used.

@@ -2465,14 +2457,11 @@ All tests that don't involve actually running a Pod can become part of
conformance testing. Those tests that run Pods cannot be because CDI support in
runtimes is not required.

-Once we have end-to-end tests, at least two Prow jobs will be defined:
-- A pre-merge job that will be required and run only for the in-tree code of
-  this KEP (`optional: false`, `run_if_changed` set, `always_run: false`).
-- A periodic job that runs the same tests to determine stability and detect
-  unexpected regressions.
-
For beta:
-- <test>: <link to test coverage>
+- pre-merge with kind (optional, triggered for code which has an impact on DRA): https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#pull-kind-dra
+- periodic with kind: https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#ci-kind-dra
+- pre-merge with CRI-O: https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#pull-node-dra
+- periodic with CRI-O: https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#ci-node-e2e-crio-dra

### Graduation Criteria

@@ -2602,7 +2591,7 @@ There will be pods which have a non-empty PodSpec.ResourceClaims field and Resou

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
For kube-controller-manager, metrics similar to the generic ephemeral volume
-controller will be added:
+controller [were added](https://github.com/kubernetes/kubernetes/blob/163553bbe0a6746e7719380e187085cf5441dfde/pkg/controller/resourceclaim/metrics/metrics.go#L32-L47):

- [X] Metrics
  - Metric name: `resource_controller_create_total`

@@ -2729,7 +2718,65 @@ already received all the relevant updates (Pod, ResourceClaim, etc.).

###### What are other known failure modes?

-To be added for beta.
+- DRA driver does not or cannot allocate a resource claim.

  - Detection: The primary mechanism is through vendor-provided monitoring for
    their driver. That monitoring needs to include the health of the driver,
    the availability of the underlying resource, etc. The common helper code
    for DRA drivers posts events for a ResourceClaim when an allocation
    attempt fails.

    When pods fail to get scheduled, kube-scheduler reports that through events
    and pod status. For DRA, that includes "waiting for resource driver to
    provide information" (node not selected yet) and "waiting for resource
    driver to allocate resource" (node has been selected). The
    ["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/9fca4ec44afad4775c877971036b436eef1a1759/pkg/scheduler/metrics/metrics.go#L200-L206)
    metric will have pods counted under the "dynamicresources" plugin label.

    To troubleshoot, "kubectl describe" can be used on (in this order) Pod,
    ResourceClaim, PodSchedulingContext.

  - Mitigations: This depends on the vendor of the DRA driver.

  - Diagnostics: In kube-scheduler, -v=4 enables simple progress reporting
    in the "dynamicresources" plugin. -v=5 provides more information about
    each plugin method. The special status results mentioned above also get
    logged.

  - Testing: E2E testing covers various scenarios that involve waiting
    for a DRA driver. This also simulates partial allocation of node-local
    resources in one driver and then failing to allocate the remaining
    resources in another driver (the "need to deallocate" fallback).

- A Pod gets scheduled without allocating resources.

  - Detection: The Pod either fails to start (when kubelet has DRA
    enabled) or gets started without the resources (when kubelet doesn't
    have DRA enabled), which then will fail in an application-specific
    way.

  - Mitigations: DRA must get enabled properly in kubelet and kube-controller-manager.
    Then kube-controller-manager will try to allocate and reserve resources for
    already scheduled pods. To prevent this from happening for new pods, DRA
    must get enabled in kube-scheduler.

  - Diagnostics: kubelet will log pods without allocated resources as errors
    and emit events for them.

  - Testing: An E2E test covers the expected behavior of kubelet and
    kube-controller-manager by creating a pod with `spec.nodeName` already set.

- A DRA driver kubelet plugin fails to prepare resources.

  - Detection: The Pod fails to start after being scheduled.

  - Mitigations: This depends on the specific DRA driver and has to be documented
    by vendors.

  - Diagnostics: kubelet will log pods with such errors and emit events for them.

  - Testing: An E2E test covers the expected retry mechanism in kubelet when
    `NodePrepareResources` fails intermittently.

## Implementation History

- Kubernetes 1.25: KEP accepted as "implementable".
- Kubernetes 1.26: Code merged as "alpha".
+- Kubernetes 1.27: API breaks (batching of NodePrepareResource in kubelet API, + AllocationResult in ResourceClaim status can provide results for multiple + drivers). +- Kubernetes 1.28: API break (ResourceClaim names for claims created from + a template are generated instead of deterministic), scheduler performance + enhancements (no more backoff delays). ## Drawbacks diff --git a/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml b/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml index 8194470cbd7..3495b030762 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml +++ b/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml @@ -24,7 +24,7 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.28" +latest-milestone: "v1.29" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: From 7e17126a432f3ab5d04b11361c99a89b46b2dd86 Mon Sep 17 00:00:00 2001 From: Patrick Ohly Date: Mon, 2 Oct 2023 21:32:40 +0200 Subject: [PATCH 02/13] dra: update Cluster Autoscaler support section The RPC mechanism is likely to have performance challenges. It is better to focus on an extension mechanism for custom autoscaler binaries first. In practice, this is likely to be what cloud providers are running anyway. --- .../README.md | 235 ++++++++++++++---- 1 file changed, 180 insertions(+), 55 deletions(-) diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md index 71b0aed990c..292cd4f8377 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/README.md +++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md @@ -111,6 +111,9 @@ SIG Architecture for cross-cutting KEPs). - [Reserve](#reserve) - [Unreserve](#unreserve) - [Cluster Autoscaler](#cluster-autoscaler) + - [Generic plugin enhancements](#generic-plugin-enhancements) + - [DRA scheduler plugin extension mechanism](#dra-scheduler-plugin-extension-mechanism) + - [Building a custom Cluster Autoscaler binary](#building-a-custom-cluster-autoscaler-binary) - [kubelet](#kubelet) - [Managing resources](#managing-resources) - [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin) @@ -1930,8 +1933,34 @@ Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autosca encounters a pod that uses a resource claim, the autoscaler needs assistance by the resource driver for that claim to make the right decisions. Without that assistance, the autoscaler might scale up the wrong node group (resource is -provided by nodes in another group) or scale up unnecessarily (resource is -network-attached and adding nodes won't help). +provided by nodes in another group) or not scale up (pod is pending because of +a claim that cannot be allocated, but looks like it should be scheduleable +to the autoscaler). + +With the following changes, vendors can provide Go code in a package that can +be built into a custom autoscaler binary to support correct scale up +simulations for clusters that use their hardware. Extensions for invoking such +vendor code through some RPC mechanism, as WASM plugin, or some generic +code which just needs to be parameterized for specific hardware could be added +later in separate KEPs. + +The in-tree DRA scheduler plugin is still active. 
It handles the generic checks +like "can this allocated claim be reserved for this pod" and only calls out to +vendor code when it comes to decisions that only the vendor can handle, like +"can this claim be allocated" and "what effect does allocating this claim have +for the cluster". + +The underlying assumption is that vendors can determine the capabilities of +nodes based on labels. Those labels get set by the autoscaler for simulated +nodes either by cloning some real node or through configuration during scale up +from zero. Then when some vendor code encounters a node which doesn't exit +in the real cluster, it can determine what resource the vendor driver would +be able to make available if it was created for real. + +#### Generic plugin enhancements + +The changes in this section are independent of DRA. They could also be used to +simulate volume provisioning better. At the start of a scale up or scale down cycle, autoscaler takes a snapshot of the current cluster state. Then autoscaler determines whether a real or @@ -1940,69 +1969,165 @@ of scheduler plugins. If a pod fits a node, the snapshot is updated by calling [NodeInfo.AddPod](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/framework/types.go#L620-L623). This influences further checks for other pending pods. -To support the custom allocation logic that a vendor uses for its resources, -the autoscaler needs an extension mechanism similar to the [scheduler -extender](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/framework/extender.go#L24-L72). The -existing scheduler extender API has to be extended to include methods that -would only get called by the autoscaler, like starting a cycle. Instead of -adding these methods to the scheduler framework, autoscaler can define its own -interface that inherits from the framework: +The DRA scheduler plugin gets integrated into this snapshotting and simulated +pod scheduling through a new scheduler framework interface: ``` -import "k8s.io/pkg/scheduler/framework" - -type Extender interface { - framework.Extender - - // NodeSelected gets called when the autoscaler determined that - // a pod should run on a node. - NodeSelected(pod *v1.Pod, node *v1.Node) error - - // NodeReady gets called by the autoscaler to check whether - // a new node is fully initialized. - NodeReady(nodeName string) (bool, error) +// ClusterAutoScalerPlugin is an interface that is used only by the cluster autoscaler. +// It enables plugins to store state across different scheduling cycles. +// +// The usual call sequence of a plugin when used in the scheduler is: +// - at program startup: +// - instantiate plugin +// - EventsToRegister +// - for each new pod: +// - PreEnqueue +// - for each pod that is ready to be scheduled, one pod at a time: +// - PreFilter, Filter, etc. +// +// Cluster autoscaler works a bit differently. It identifies all pending pods, +// takes a snapshot of the current cluster state, and then simulates the effect +// of scheduling those pods with additional nodes added to the cluster. To +// determine whether a pod fits into one of these simulated nodes, it +// uses the same PreFilter and Filter plugins as the scheduler. Other extension +// points (Reserve, Bind) are not used. Plugins which modify the cluster state +// therefore need a different way of recording the result of scheduling +// a pod onto a node. This is done through ClusterAutoScalerPlugin. 
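//
// Note: a "simulated node" in this description is a NodeInfo which the
// autoscaler fabricates, either by cloning a real node or based on
// scale-up-from-zero configuration. It has labels and allocatable
// resources, but no Node object exists for it in the apiserver.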
+// +// Cluster autoscaler will: +// - at program startup: +// - instantiate plugin, with real informer factory and no Kubernetes client +// - start informers +// - at the start of a simulation: +// - call StartSimulation with a clean cycle state +// - for each pending pod: +// - call PreFilter and Filter with the same cycle state that +// was passed to StartSimulation +// - call SimulateBindPod with the same cycle state that +// was passed to StartSimulation (i.e. *not* the one which was modified +// by PreFilter or Filter) to indicate that a pod is being scheduled onto a node +// as part of the simulation +// +// A plugin may: +// - Take a snapshot of all relevant cluster state as part of StartSimulation +// and store it in the cycle state. This signals to the other extension +// points that the plugin is being used as part of the cluster autoscaler. +// . In PreFilter and Filter use the cluster snapshot to make decisions +// instead of the normal "live" cluster state. +// - In SimulateBindPod update the snapshot in the cycle state. +type ClusterAutoScalerPlugin interface { + Plugin + // StartSimulation is called when the cluster autoscaler begins + // a simulation. + StartSimulation(ctx context.Context, state *CycleState) *Status + // SimulateBindPod is called when the cluster autoscaler decided to schedule + // a pod onto a certain node. + SimulateBindPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status + // NodeIsReady checks whether some real node has been initialized completely. + // Even if it is "ready" as far Kubernetes is concerned, some DaemonSet pod + // might still be missing or not done with its startup yet. + NodeIsReady(ctx context.Context, node *v1.Node) (bool, error) } ``` -The underlying implementation can either be compiled into a custom autoscaler -binary by cloud provider who controls the entire cluster or use HTTP similar to -the [HTTP -extender](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/extender.go#L41-L53). -As an initial step, configuring such HTTP webhooks for different resource -drivers can be added to the configuration file defined by the `--cloud-config` -configuration file with a common field that gets added in all cloud provider -configs or a new `--config` parameter can be added. Later, dynamically -discovering deployed webhooks can be added through an autoscaler CRD. - -In contrast to the in-tree HTTP extender implementation, the one for autoscaler -must be session oriented: when creating the extender for a cycle, a new "start" -verb needs to be invoked. When this is called in a resource driver controller -webhook, it needs to take a snapshot of the relevant state and return a session -ID. This session ID must be included in all following HTTP invocations as a -"session" header. Ensuring that a "stop" verb gets called reliably would -complicate the autoscaler. Instead, the webhook should support a small number -of recent session and garbage-collect older ones. - -The existing `extenderv1.ExtenderArgs` and `extenderv1.ExtenderFilterResult` -API can be used for the "filter" operation. The extender can be added to the -list of active scheduler plugins because it implements the plugin interface. -Because filter operations may involve fictional nodes, the full `Node` objects -instead of just the node names must be passed. For fictional nodes, the -resource driver must determine based on labels which resources it can provide -on such a node. 
New APIs are needed for `NodeSelected` and `NodeReady`. - -`NodeReady` is needed to solve one particular problem: when a new node first +`NodeIsReady` is needed to solve one particular problem: when a new node first starts up, it may be ready to run pods, but the pod from a resource driver's DaemonSet may still be starting up. If the resource driver controller needs information from such a pod, then it will not be able to filter correctly. Similar to how extended resources are handled, the autoscaler then -first needs to wait until the extender also considers the node to be ready. +first needs to wait until the plugin also considers the node to be ready. -Such extenders completely replace the generic scheduler resource plugin. The -generic plugin would be able to filter out nodes based on already allocated -resources. But because it is stateless, it would not handle the use count -restrictions correctly when multiple pods are pending and reference the same -resource. +### DRA scheduler plugin extension mechanism + +The in-tree scheduler plugin gets extended by vendors through the following API +in `k8s.io/dynamic-resource-allocation/simulation`. Vendor code does not depend +on the k/k/pkg/scheduler package nor on autoscaler packages. + +``` +// Registry stores all known plugins which can simulate claim allocation. +// It is thread-safe. +var Registry registry + +// PluginName is a special type that is used to look up plugins for a claim. +// For now it must be the same as the driver name in the resource class of a +// claim. +type PluginName string + +// Add adds or overwrites the plugin for a certain name. +func (r *registry) Add(name PluginName, plugin Plugin) { ... } + +... + +// Plugin is used to register a plugin. +type Plugin interface { + // Activate will get called to prepare the plugin for usage. + Activate(ctx context.Context, client kubernetes.Interface, informerFactory informers.SharedInformerFactory) (ActivePlugin, error) +} + +// ActivePlugin is a plugin which is ready to start a simulation. +type ActivePlugin interface { + // Start will get called at the start of a simulation. The plugin must + // capture the current cluster state. + Start(ctx context.Context) (StartedPlugin, error) + + // NodeIsReady checks whether some real node has been initialized completely. + NodeIsReady(ctx context.Context, node *v1.Node) (bool, error) +} + +// StartedPlugin is a plugin which encapsulates a certain cluster state and +// can make changes to it. +type StartedPlugin interface { + // Clone must create a new, independent copy of the current state. + // This must be fast and cannot fail. If it has to do some long-running + // operation, then it must do that in a new goroutine and check the + // result when some method is called in the returned instance. + Clone() StartedPlugin + + // NodeIsSuitable checks whether a claim could be allocated for + // a pod such that it will be available on the node. + NodeIsSuitable(ctx context.Context, pod *v1.Pod, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (bool, error) + + // Allocate must adapt the cluster state as if the claim + // had been allocated for use on the selected node and return + // the result for the claim. It must not modify the claim, + // that will be done by the caller. + Allocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (*resourcev1alpha2.AllocationResult, error) +} +``` + +When the DRA scheduler plugin gets initialized, it activates all registered +vendor plugins. 
When `StartSimulation` is called, all vendor plugins are +started. When the scheduler plugin's state data is cloned, the plugin's also +get cloned. In addition, `StartSimulation` captures the state of all claims. + +`NodeIsSuitable` is called during the `Filter` check to determine whether a +pending claim could be allocated for a node. `Allocate` is called as part of +the `SimulateBindPod` implementation. The simulated allocation result is stored +in the claim snapshot and then the claim is reserved for the pod. If the claim +cannot be shared between pods, that will prevent other pods from using the +claim while the autoscaler goes through it's binpacking simulation. + +Finally, `NodeIsReady` of each vendor plugin is called to implement the +scheduler plugin's own `NodeIsReady`. + +#### Building a custom Cluster Autoscaler binary + +Vendors are encouraged to include an "init" package together with their driver +simulation implementation. That "init" package registers their plugin. Then to +build a custom autoscaler binary, one additional file alongside `main.go` is +sufficient: + +``` +package main + +import ( + _ "acme.example.com/dra-resource-driver/simulation-plugin/init" +) +``` + +This init package may also register additional command line flags. Care must be +taken to not cause conflicts between different plugins, so all vendor flags +should start with a unique prefix. ### kubelet From d4cb60c09c76155316194ced64a0f998555040ed Mon Sep 17 00:00:00 2001 From: Patrick Ohly Date: Tue, 3 Oct 2023 16:05:51 +0200 Subject: [PATCH 03/13] DRA: include SimulateEvictPod for scale down, language tweaks As discussed on Slack, scale down must determine whether some currently running pods could get moved. This simulation depends on simulating deallocation, otherwise the allocated claim prevents moving pods. --- .../README.md | 27 ++++++++++++++----- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md index 292cd4f8377..ed793e52a5b 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/README.md +++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md @@ -112,7 +112,7 @@ SIG Architecture for cross-cutting KEPs). - [Unreserve](#unreserve) - [Cluster Autoscaler](#cluster-autoscaler) - [Generic plugin enhancements](#generic-plugin-enhancements) - - [DRA scheduler plugin extension mechanism](#dra-scheduler-plugin-extension-mechanism) + - [DRA scheduler plugin extension mechanism](#dra-scheduler-plugin-extension-mechanism) - [Building a custom Cluster Autoscaler binary](#building-a-custom-cluster-autoscaler-binary) - [kubelet](#kubelet) - [Managing resources](#managing-resources) @@ -1967,7 +1967,11 @@ the current cluster state. Then autoscaler determines whether a real or fictional node fits a pod by calling the pre-filter and filter extension points of scheduler plugins. If a pod fits a node, the snapshot is updated by calling [NodeInfo.AddPod](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/framework/types.go#L620-L623). This -influences further checks for other pending pods. +influences further checks for other pending pods. 
During scale down, eviction +is simulated by +[SimulateNodeRemoval](https://github.com/kubernetes/autoscaler/blob/2f7c61e13bd1cbfc0ba4085fb84bd692a1e9ac6e/cluster-autoscaler/simulator/cluster.go#L149) +which [pretends that pods running on a node that is to be removed are not +running](https://github.com/kubernetes/autoscaler/blob/2f7c61e13bd1cbfc0ba4085fb84bd692a1e9ac6e/cluster-autoscaler/simulator/cluster.go#L231-L237). The DRA scheduler plugin gets integrated into this snapshotting and simulated pod scheduling through a new scheduler framework interface: @@ -2023,6 +2027,10 @@ type ClusterAutoScalerPlugin interface { // SimulateBindPod is called when the cluster autoscaler decided to schedule // a pod onto a certain node. SimulateBindPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status + // SimulateEvictPod is called when the cluster autoscaler simulates removal + // of a node. All claims used only by this pod should be considered deallocated, + // to enable starting the same pod elsewhere. + SimulateEvictPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status // NodeIsReady checks whether some real node has been initialized completely. // Even if it is "ready" as far Kubernetes is concerned, some DaemonSet pod // might still be missing or not done with its startup yet. @@ -2037,11 +2045,11 @@ information from such a pod, then it will not be able to filter correctly. Similar to how extended resources are handled, the autoscaler then first needs to wait until the plugin also considers the node to be ready. -### DRA scheduler plugin extension mechanism +#### DRA scheduler plugin extension mechanism The in-tree scheduler plugin gets extended by vendors through the following API -in `k8s.io/dynamic-resource-allocation/simulation`. Vendor code does not depend -on the k/k/pkg/scheduler package nor on autoscaler packages. +in `k8s.io/dynamic-resource-allocation/simulation`. Vendor code depends +neither on the k/k/pkg/scheduler package nor on autoscaler packages. ``` // Registry stores all known plugins which can simulate claim allocation. @@ -2092,12 +2100,17 @@ type StartedPlugin interface { // the result for the claim. It must not modify the claim, // that will be done by the caller. Allocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (*resourcev1alpha2.AllocationResult, error) + + // Deallocate must adapt the cluster state as if the claim + // had been deallocated. It must not modify the claim, + // that will be done by the caller. + Deallocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim) error } ``` When the DRA scheduler plugin gets initialized, it activates all registered vendor plugins. When `StartSimulation` is called, all vendor plugins are -started. When the scheduler plugin's state data is cloned, the plugin's also +started. When the scheduler plugin's state data is cloned, the plugins also get cloned. In addition, `StartSimulation` captures the state of all claims. `NodeIsSuitable` is called during the `Filter` check to determine whether a @@ -2112,7 +2125,7 @@ scheduler plugin's own `NodeIsReady`. #### Building a custom Cluster Autoscaler binary -Vendors are encouraged to include an "init" package together with their driver +Vendors are encouraged to include an "init" package in their driver simulation implementation. That "init" package registers their plugin. 
Then to build a custom autoscaler binary, one additional file alongside `main.go` is sufficient: From b9c55d84d56c2fc5d871a1d77c1dbbc824a881c2 Mon Sep 17 00:00:00 2001 From: Patrick Ohly Date: Wed, 4 Oct 2023 15:19:48 +0200 Subject: [PATCH 04/13] DRA: user story for autoscaling and fallback code --- .../README.md | 36 ++++++++++++++++++- 1 file changed, 35 insertions(+), 1 deletion(-) diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md index ed793e52a5b..bab8a61de54 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/README.md +++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md @@ -77,6 +77,7 @@ SIG Architecture for cross-cutting KEPs). - [User Stories](#user-stories) - [Cluster add-on development](#cluster-add-on-development) - [Cluster configuration](#cluster-configuration) + - [Integration with cluster autoscaling](#integration-with-cluster-autoscaling) - [Partial GPU allocation](#partial-gpu-allocation) - [Network-attached accelerator](#network-attached-accelerator) - [Combined setup of different hardware functions](#combined-setup-of-different-hardware-functions) @@ -113,6 +114,7 @@ SIG Architecture for cross-cutting KEPs). - [Cluster Autoscaler](#cluster-autoscaler) - [Generic plugin enhancements](#generic-plugin-enhancements) - [DRA scheduler plugin extension mechanism](#dra-scheduler-plugin-extension-mechanism) + - [Handling claims without vendor code](#handling-claims-without-vendor-code) - [Building a custom Cluster Autoscaler binary](#building-a-custom-cluster-autoscaler-binary) - [kubelet](#kubelet) - [Managing resources](#managing-resources) @@ -432,6 +434,15 @@ parametersRef: name: acme-gpu-init ``` +#### Integration with cluster autoscaling + +As a cloud provider, I want to support GPUs as part of a hosted Kubernetes +environment, including cluster autoscaling. I ensure that the kernel is +configured as required by the hardware and that the container runtime supports +CDI. I review the Go code provided by the vendor for simulating cluster scaling +and build it into a customized cluster autoscaler binary that supports my cloud +infrastructure. + #### Partial GPU allocation As a user, I want to use a GPU as accelerator, but don't need exclusive access @@ -1930,7 +1941,7 @@ progress. When [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#cluster-autoscaler) -encounters a pod that uses a resource claim, the autoscaler needs assistance by +encounters a pod that uses a resource claim for node-local resources, the autoscaler needs assistance by the resource driver for that claim to make the right decisions. Without that assistance, the autoscaler might scale up the wrong node group (resource is provided by nodes in another group) or not scale up (pod is pending because of @@ -1944,6 +1955,9 @@ vendor code through some RPC mechanism, as WASM plugin, or some generic code which just needs to be parameterized for specific hardware could be added later in separate KEPs. +Such vendor code is *not* needed for network-attached resources. Adding or +removing nodes does not change availability of such resources. + The in-tree DRA scheduler plugin is still active. It handles the generic checks like "can this allocated claim be reserved for this pod" and only calls out to vendor code when it comes to decisions that only the vendor can handle, like @@ -2123,6 +2137,26 @@ claim while the autoscaler goes through it's binpacking simulation. 
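To make these interactions more concrete, here is a minimal sketch of a vendor
simulation plugin written against the interfaces above. Everything
vendor-specific in it is invented for illustration (the `gpu.acme.example.com`
driver name, the `acme.example.com/gpus` capacity label, the
one-device-per-claim policy), and the `simulation` package is the API proposed
in this KEP, not merged code:

```
package acmesim

import (
	"context"
	"strconv"

	v1 "k8s.io/api/core/v1"
	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/dynamic-resource-allocation/simulation"
)

func init() {
	// The plugin name must match the driver name used in ResourceClass objects.
	simulation.Registry.Add(simulation.PluginName("gpu.acme.example.com"), plugin{})
}

type plugin struct{}

func (plugin) Activate(ctx context.Context, client kubernetes.Interface, informerFactory informers.SharedInformerFactory) (simulation.ActivePlugin, error) {
	// A real driver would set up listers here to observe its own state.
	return active{}, nil
}

type active struct{}

func (active) Start(ctx context.Context) (simulation.StartedPlugin, error) {
	// Capture cluster state at the start of a simulation. This toy driver
	// derives capacity from node labels, so there is nothing else to snapshot.
	return &started{used: map[string]int{}, nodeOf: map[types.UID]string{}}, nil
}

func (active) NodeIsReady(ctx context.Context, node *v1.Node) (bool, error) {
	// Treat a node as ready once the driver's DaemonSet pod has labeled it.
	_, ok := node.Labels["acme.example.com/gpus"]
	return ok, nil
}

type started struct {
	used   map[string]int       // simulated GPUs in use, keyed by node name
	nodeOf map[types.UID]string // node chosen for each simulated allocation
}

func (s *started) Clone() simulation.StartedPlugin {
	c := &started{used: map[string]int{}, nodeOf: map[types.UID]string{}}
	for k, v := range s.used {
		c.used[k] = v
	}
	for k, v := range s.nodeOf {
		c.nodeOf[k] = v
	}
	return c
}

func capacity(node *v1.Node) int {
	n, _ := strconv.Atoi(node.Labels["acme.example.com/gpus"])
	return n
}

func (s *started) NodeIsSuitable(ctx context.Context, pod *v1.Pod, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (bool, error) {
	// One device per claim: the node fits if a GPU is still free.
	return s.used[node.Name] < capacity(node), nil
}

func (s *started) Allocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (*resourcev1alpha2.AllocationResult, error) {
	s.used[node.Name]++
	s.nodeOf[claim.UID] = node.Name
	// Node-local resource: restrict the claim to the chosen node.
	return &resourcev1alpha2.AllocationResult{
		AvailableOnNodes: &v1.NodeSelector{
			NodeSelectorTerms: []v1.NodeSelectorTerm{{
				MatchFields: []v1.NodeSelectorRequirement{{
					Key:      "metadata.name",
					Operator: v1.NodeSelectorOpIn,
					Values:   []string{node.Name},
				}},
			}},
		},
	}, nil
}

func (s *started) Deallocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim) error {
	if nodeName, ok := s.nodeOf[claim.UID]; ok {
		s.used[nodeName]--
		delete(s.nodeOf, claim.UID)
	}
	return nil
}
```

Because `Clone` copies the usage maps, each branch of the autoscaler's
binpacking simulation works on independent state: allocations tried for one
candidate node group cannot leak into another.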
Finally, `NodeIsReady` of each vendor plugin is called to implement the scheduler plugin's own `NodeIsReady`. +#### Handling claims without vendor code + +When the DRA scheduler plugin does not have specific vendor code for a certain +resource class, it falls back to the assumption that resources are unlimited, +i.e. allocation will always work. This is how volume provisioning is currently +handled during cluster autoscaling. + +If a pod is not getting scheduled because a resource claim cannot be allocated +by the real DRA driver, to the autoscaler it will look like the pod should be +schedulable and therefore it will not spin up new nodes for it, which is the +right decision. + +If a pod is not getting scheduled because some other resource requirement is +not satisfied, the autoscaler will simulate scale up and can pick some +arbitrary node pool because the DRA scheduler plugin will accept all of those +nodes. + +During scale down, moving a running pod to a different node is assumed to work, +so that scenario also works. + #### Building a custom Cluster Autoscaler binary Vendors are encouraged to include an "init" package in their driver From 4331c89946947fd3b124ef0239319f58627009d6 Mon Sep 17 00:00:00 2001 From: Patrick Ohly Date: Wed, 4 Oct 2023 18:58:08 +0200 Subject: [PATCH 05/13] DRA: review feedback --- .../README.md | 97 ++++++++++++++++++- 1 file changed, 93 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md index bab8a61de54..99726c819ed 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/README.md +++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md @@ -105,6 +105,8 @@ SIG Architecture for cross-cutting KEPs). - [core](#core) - [kube-controller-manager](#kube-controller-manager) - [kube-scheduler](#kube-scheduler) + - [EventsToRegister](#eventstoregister) + - [PreEnqueue](#preenqueue) - [Pre-filter](#pre-filter) - [Filter](#filter) - [Post-filter](#post-filter) @@ -149,6 +151,11 @@ SIG Architecture for cross-cutting KEPs). - [Extend Device Plugins](#extend-device-plugins) - [Webhooks instead of ResourceClaim updates](#webhooks-instead-of-resourceclaim-updates) - [ResourceDriver](#resourcedriver) + - [Complex sharing of ResourceClaim](#complex-sharing-of-resourceclaim) + - [Improving scheduling performance](#improving-scheduling-performance) + - [Optimize for network-attached resources](#optimize-for-network-attached-resources) + - [Moving blocking API calls into goroutines](#moving-blocking-api-calls-into-goroutines) + - [RPC calls instead of PodSchedulingContext](#rpc-calls-instead-of-) - [Infrastructure Needed](#infrastructure-needed) @@ -735,7 +742,7 @@ For a resource driver the following components are needed: - *Resource kubelet plugin*: a component which cooperates with kubelet to prepare the usage of the resource on a node. -An [utility library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/dynamic-resource-allocation) for resource drivers was developed. +A [utility library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/dynamic-resource-allocation) for resource drivers was developed. It does not have to be used by drivers, therefore it is not described further in this KEP. @@ -1825,7 +1832,35 @@ notices this, the current scheduling attempt for the pod must stop and the pod needs to be put back into the work queue. It then gets retried whenever a ResourceClaim gets added or modified. 
The following extension points are implemented in the new claim plugin. Some of
them invoke API calls to create or update objects. This is done to simplify
error handling: a failure during such a call puts the pod into the backoff
queue where it will be retried after a timeout. The downside is that the
latency caused by those blocking calls not only affects pods using claims, but
also all other pending pods because the scheduler only schedules one pod at a
time.

#### EventsToRegister

This registers all cluster events that might make an unschedulable pod
schedulable, like creating a claim that the pod needs or finishing the
allocation of a claim.

[Queuing hints](https://github.com/kubernetes/enhancements/issues/4247) are
supported. These are callbacks that can limit the effect of a cluster event to
specific pods. For example, allocating a claim only makes those pods
schedulable which reference the claim. There is no need to try scheduling a pod
which waits for some other claim. Hints are also used to trigger the next
scheduling cycle for a pod immediately when some expected and required event
like "drivers have provided information" occurs, instead of forcing the pod to
go through the backoff queue and the usual 5-second delay associated
with that.

#### PreEnqueue

This checks whether all claims referenced by a pod exist. If they don't,
scheduling the pod has to wait until the kube-controller-manager or the user
creates the claims.

#### Pre-filter

  - Metric name: `resource_controller_create_failures_total`
  - Metric name: `workqueue` with `name="resource_claim"`

-For kube-scheduler and kubelet, the existing metrics for handling Pods will be
-used.
+For kube-scheduler and kubelet, existing metrics for handling Pods already
+cover most aspects. For example, in the scheduler the
+["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/6f5fa2eb2f4dc731243b00f7e781e95589b5621f/pkg/scheduler/metrics/metrics.go#L200-L206)
+metric will call out pods that are currently unschedulable because of the
+`DynamicResources` plugin.
+
+For the communication between scheduler and controller, the apiserver metrics
+about API calls (e.g. `request_total`, `request_duration_seconds`) for the
+`podschedulingcontexts` and `resourceclaims` resources provide insights into
+the amount of requests and how long they are taking.

###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

@@ -3113,6 +3156,52 @@ type ResourceDriverFeature struct {
	Name string
	Parameters runtime.RawExtension
}
```

### Complex sharing of ResourceClaim

At the moment, the allocation result marks a claim as either "shareable" by
an unlimited number of consumers or "not shareable". More complex scenarios
might be useful, like "may be shared by a certain number of consumers", but so
far such use cases have not come up. If they do, the `AllocationResult` can
be extended with new fields as defined by a follow-up KEP.

### Improving scheduling performance

Some enhancements are possible which haven't been implemented yet because it is
unclear how important they would be in practice. All of the following ideas
could still be added later as they don't conflict with the underlying design,
either as part of this KEP or in follow-up KEPs.
+ +#### Optimize for network-attached resources + +When a network-attached resource is available on all nodes in a cluster, the +driver will never mark any nodes as unsuitable. If all claims for a pod fall +into that category, the scheduler a) does not need to wait for information and +b) does not need to publish "potential nodes". + +The `ResourceClass` could be extended with a `AvailableForNodes +*core.NodeSelector`. This can be a selector that matches all nodes or a +subset. Either way, if a potential node matches this selector, the scheduler +knows that claims using this class can be allocated and can do the optimization +outlined above. + +#### Moving blocking API calls into goroutines + +This [is being +discussed](https://github.com/kubernetes/kubernetes/issues/120502) and has been +[partially +implemented](https://github.com/kubernetes/kubernetes/pull/120963). That +implementation made the scheduler framework more complex, so [the +conclusion](https://kubernetes.slack.com/archives/C09TP78DV/p1696307377064469?thread_ts=1696246271.825109&cid=C09TP78DV) +was that using blocking calls is the lesser evil until user feedback indicates +that improvements are really needed. + +#### RPC calls instead of `PodSchedulingContext` + +The current design is not making it a hard requirement that admins change the +scheduler configuration to enable communication between scheduler and DRA +drivers. For scenarios where admins and vendors are willing to invest more +effort and doing so would provide performance benefits, a communication path +similar to scheduler extenders could be added. ## Infrastructure Needed From 3a5507860bd1e458813e0fddbf4682c717dad24e Mon Sep 17 00:00:00 2001 From: Patrick Ohly Date: Thu, 5 Oct 2023 15:28:21 +0200 Subject: [PATCH 06/13] DRA: continue to target alpha in 1.29 The discussion around autoscaling needs more time. --- keps/prod-readiness/sig-node/3063.yaml | 2 -- .../3063-dynamic-resource-allocation/README.md | 10 ++++++++++ .../sig-node/3063-dynamic-resource-allocation/kep.yaml | 2 -- 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/keps/prod-readiness/sig-node/3063.yaml b/keps/prod-readiness/sig-node/3063.yaml index 08acf5840b3..784b9d8f910 100644 --- a/keps/prod-readiness/sig-node/3063.yaml +++ b/keps/prod-readiness/sig-node/3063.yaml @@ -4,5 +4,3 @@ kep-number: 3063 alpha: approver: "@johnbelamaric" -beta: - approver: "@johnbelamaric" diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md index 99726c819ed..ae68d745dd9 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/README.md +++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md @@ -1974,6 +1974,16 @@ progress. ### Cluster Autoscaler +-<<[UNRESOLVED pohly]>> +The entire autoscaler section is tentative. Key opens: +- Are DRA driver authors able and willing to provide implementations of + the simulation interface if needed for their driver? +- Is the simulation interface generic enough to work across a variety + of autoscaler forks and/or implementations? What about Karpenter? +- Is the suggested deployment approach (rebuild binary) workable? +- Can we really not do something else, ideally RPC-based? 
+-<<[/UNRESOLVED]>> + When [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#cluster-autoscaler) encounters a pod that uses a resource claim for node-local resources, the autoscaler needs assistance by diff --git a/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml b/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml index 3495b030762..203060ef52e 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml +++ b/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml @@ -29,8 +29,6 @@ latest-milestone: "v1.29" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: alpha: "v1.26" - beta: "v1.29" - stable: "v1.31" feature-gates: - name: DynamicResourceAllocation From 2bcdb5c336452d0ae994e0e51988113a562f7f8c Mon Sep 17 00:00:00 2001 From: Patrick Ohly Date: Wed, 24 Jan 2024 10:59:49 +0100 Subject: [PATCH 07/13] DRA: defer to numeric parameters as solution for cluster autoscaling Technically the support for cluster autoscaling can be defined and implemented as an extension of the core DRA, without changing the core feature. By separating out the specification of "numeric parameters" into a separate KEP it might be easier to make progress on the different aspects because they are better separated. --- .../README.md | 273 +----------------- 1 file changed, 14 insertions(+), 259 deletions(-) diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md index ae68d745dd9..74364f0b667 100644 --- a/keps/sig-node/3063-dynamic-resource-allocation/README.md +++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md @@ -77,7 +77,6 @@ SIG Architecture for cross-cutting KEPs). - [User Stories](#user-stories) - [Cluster add-on development](#cluster-add-on-development) - [Cluster configuration](#cluster-configuration) - - [Integration with cluster autoscaling](#integration-with-cluster-autoscaling) - [Partial GPU allocation](#partial-gpu-allocation) - [Network-attached accelerator](#network-attached-accelerator) - [Combined setup of different hardware functions](#combined-setup-of-different-hardware-functions) @@ -114,10 +113,6 @@ SIG Architecture for cross-cutting KEPs). - [Reserve](#reserve) - [Unreserve](#unreserve) - [Cluster Autoscaler](#cluster-autoscaler) - - [Generic plugin enhancements](#generic-plugin-enhancements) - - [DRA scheduler plugin extension mechanism](#dra-scheduler-plugin-extension-mechanism) - - [Handling claims without vendor code](#handling-claims-without-vendor-code) - - [Building a custom Cluster Autoscaler binary](#building-a-custom-cluster-autoscaler-binary) - [kubelet](#kubelet) - [Managing resources](#managing-resources) - [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin) @@ -441,15 +436,6 @@ parametersRef: name: acme-gpu-init ``` -#### Integration with cluster autoscaling - -As a cloud provider, I want to support GPUs as part of a hosted Kubernetes -environment, including cluster autoscaling. I ensure that the kernel is -configured as required by the hardware and that the container runtime supports -CDI. I review the Go code provided by the vendor for simulating cluster scaling -and build it into a customized cluster autoscaler binary that supports my cloud -infrastructure. 
- #### Partial GPU allocation As a user, I want to use a GPU as accelerator, but don't need exclusive access @@ -676,8 +662,8 @@ allocation also may turn out to be insufficient. Some risks are: - Network-attached resources may have additional constraints that are not captured yet (like limited number of nodes that they can be attached to). -- Cluster autoscaling will not work as expected unless the autoscaler and - resource drivers get extended to support it. +- Cluster autoscaling will not work as expected unless the DRA driver + uses [numeric parameters](https://github.com/kubernetes/enhancements/issues/4381). All of these risks will have to be evaluated by gathering feedback from users and resource driver developers. @@ -1974,252 +1960,21 @@ progress. ### Cluster Autoscaler --<<[UNRESOLVED pohly]>> -The entire autoscaler section is tentative. Key opens: -- Are DRA driver authors able and willing to provide implementations of - the simulation interface if needed for their driver? -- Is the simulation interface generic enough to work across a variety - of autoscaler forks and/or implementations? What about Karpenter? -- Is the suggested deployment approach (rebuild binary) workable? -- Can we really not do something else, ideally RPC-based? --<<[/UNRESOLVED]>> - When [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#cluster-autoscaler) -encounters a pod that uses a resource claim for node-local resources, the autoscaler needs assistance by -the resource driver for that claim to make the right decisions. Without that -assistance, the autoscaler might scale up the wrong node group (resource is -provided by nodes in another group) or not scale up (pod is pending because of -a claim that cannot be allocated, but looks like it should be scheduleable -to the autoscaler). - -With the following changes, vendors can provide Go code in a package that can -be built into a custom autoscaler binary to support correct scale up -simulations for clusters that use their hardware. Extensions for invoking such -vendor code through some RPC mechanism, as WASM plugin, or some generic -code which just needs to be parameterized for specific hardware could be added -later in separate KEPs. - -Such vendor code is *not* needed for network-attached resources. Adding or -removing nodes does not change availability of such resources. - -The in-tree DRA scheduler plugin is still active. It handles the generic checks -like "can this allocated claim be reserved for this pod" and only calls out to -vendor code when it comes to decisions that only the vendor can handle, like -"can this claim be allocated" and "what effect does allocating this claim have -for the cluster". - -The underlying assumption is that vendors can determine the capabilities of -nodes based on labels. Those labels get set by the autoscaler for simulated -nodes either by cloning some real node or through configuration during scale up -from zero. Then when some vendor code encounters a node which doesn't exit -in the real cluster, it can determine what resource the vendor driver would -be able to make available if it was created for real. - -#### Generic plugin enhancements - -The changes in this section are independent of DRA. They could also be used to -simulate volume provisioning better. - -At the start of a scale up or scale down cycle, autoscaler takes a snapshot of -the current cluster state. 
Then autoscaler determines whether a real or -fictional node fits a pod by calling the pre-filter and filter extension points -of scheduler plugins. If a pod fits a node, the snapshot is updated by calling -[NodeInfo.AddPod](https://github.com/kubernetes/kubernetes/blob/7e3c98fd303359cb9f79dfc691da733a6ca2a6e3/pkg/scheduler/framework/types.go#L620-L623). This -influences further checks for other pending pods. During scale down, eviction -is simulated by -[SimulateNodeRemoval](https://github.com/kubernetes/autoscaler/blob/2f7c61e13bd1cbfc0ba4085fb84bd692a1e9ac6e/cluster-autoscaler/simulator/cluster.go#L149) -which [pretends that pods running on a node that is to be removed are not -running](https://github.com/kubernetes/autoscaler/blob/2f7c61e13bd1cbfc0ba4085fb84bd692a1e9ac6e/cluster-autoscaler/simulator/cluster.go#L231-L237). - -The DRA scheduler plugin gets integrated into this snapshotting and simulated -pod scheduling through a new scheduler framework interface: - -``` -// ClusterAutoScalerPlugin is an interface that is used only by the cluster autoscaler. -// It enables plugins to store state across different scheduling cycles. -// -// The usual call sequence of a plugin when used in the scheduler is: -// - at program startup: -// - instantiate plugin -// - EventsToRegister -// - for each new pod: -// - PreEnqueue -// - for each pod that is ready to be scheduled, one pod at a time: -// - PreFilter, Filter, etc. -// -// Cluster autoscaler works a bit differently. It identifies all pending pods, -// takes a snapshot of the current cluster state, and then simulates the effect -// of scheduling those pods with additional nodes added to the cluster. To -// determine whether a pod fits into one of these simulated nodes, it -// uses the same PreFilter and Filter plugins as the scheduler. Other extension -// points (Reserve, Bind) are not used. Plugins which modify the cluster state -// therefore need a different way of recording the result of scheduling -// a pod onto a node. This is done through ClusterAutoScalerPlugin. -// -// Cluster autoscaler will: -// - at program startup: -// - instantiate plugin, with real informer factory and no Kubernetes client -// - start informers -// - at the start of a simulation: -// - call StartSimulation with a clean cycle state -// - for each pending pod: -// - call PreFilter and Filter with the same cycle state that -// was passed to StartSimulation -// - call SimulateBindPod with the same cycle state that -// was passed to StartSimulation (i.e. *not* the one which was modified -// by PreFilter or Filter) to indicate that a pod is being scheduled onto a node -// as part of the simulation -// -// A plugin may: -// - Take a snapshot of all relevant cluster state as part of StartSimulation -// and store it in the cycle state. This signals to the other extension -// points that the plugin is being used as part of the cluster autoscaler. -// . In PreFilter and Filter use the cluster snapshot to make decisions -// instead of the normal "live" cluster state. -// - In SimulateBindPod update the snapshot in the cycle state. -type ClusterAutoScalerPlugin interface { - Plugin - // StartSimulation is called when the cluster autoscaler begins - // a simulation. - StartSimulation(ctx context.Context, state *CycleState) *Status - // SimulateBindPod is called when the cluster autoscaler decided to schedule - // a pod onto a certain node. 
- SimulateBindPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status - // SimulateEvictPod is called when the cluster autoscaler simulates removal - // of a node. All claims used only by this pod should be considered deallocated, - // to enable starting the same pod elsewhere. - SimulateEvictPod(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) *Status - // NodeIsReady checks whether some real node has been initialized completely. - // Even if it is "ready" as far Kubernetes is concerned, some DaemonSet pod - // might still be missing or not done with its startup yet. - NodeIsReady(ctx context.Context, node *v1.Node) (bool, error) -} -``` - -`NodeIsReady` is needed to solve one particular problem: when a new node first -starts up, it may be ready to run pods, but the pod from a resource driver's -DaemonSet may still be starting up. If the resource driver controller needs -information from such a pod, then it will not be able to filter -correctly. Similar to how extended resources are handled, the autoscaler then -first needs to wait until the plugin also considers the node to be ready. - -#### DRA scheduler plugin extension mechanism - -The in-tree scheduler plugin gets extended by vendors through the following API -in `k8s.io/dynamic-resource-allocation/simulation`. Vendor code depends -neither on the k/k/pkg/scheduler package nor on autoscaler packages. - -``` -// Registry stores all known plugins which can simulate claim allocation. -// It is thread-safe. -var Registry registry - -// PluginName is a special type that is used to look up plugins for a claim. -// For now it must be the same as the driver name in the resource class of a -// claim. -type PluginName string - -// Add adds or overwrites the plugin for a certain name. -func (r *registry) Add(name PluginName, plugin Plugin) { ... } - -... - -// Plugin is used to register a plugin. -type Plugin interface { - // Activate will get called to prepare the plugin for usage. - Activate(ctx context.Context, client kubernetes.Interface, informerFactory informers.SharedInformerFactory) (ActivePlugin, error) -} - -// ActivePlugin is a plugin which is ready to start a simulation. -type ActivePlugin interface { - // Start will get called at the start of a simulation. The plugin must - // capture the current cluster state. - Start(ctx context.Context) (StartedPlugin, error) - - // NodeIsReady checks whether some real node has been initialized completely. - NodeIsReady(ctx context.Context, node *v1.Node) (bool, error) -} - -// StartedPlugin is a plugin which encapsulates a certain cluster state and -// can make changes to it. -type StartedPlugin interface { - // Clone must create a new, independent copy of the current state. - // This must be fast and cannot fail. If it has to do some long-running - // operation, then it must do that in a new goroutine and check the - // result when some method is called in the returned instance. - Clone() StartedPlugin - - // NodeIsSuitable checks whether a claim could be allocated for - // a pod such that it will be available on the node. - NodeIsSuitable(ctx context.Context, pod *v1.Pod, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (bool, error) - - // Allocate must adapt the cluster state as if the claim - // had been allocated for use on the selected node and return - // the result for the claim. It must not modify the claim, - // that will be done by the caller. 
-	Allocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim, node *v1.Node) (*resourcev1alpha2.AllocationResult, error)
-
-	// Deallocate must adapt the cluster state as if the claim
-	// had been deallocated. It must not modify the claim,
-	// that will be done by the caller.
-	Deallocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim) error
-}
-```
-
-When the DRA scheduler plugin gets initialized, it activates all registered
-vendor plugins. When `StartSimulation` is called, all vendor plugins are
-started. When the scheduler plugin's state data is cloned, the plugins also
-get cloned. In addition, `StartSimulation` captures the state of all claims.
-
-`NodeIsSuitable` is called during the `Filter` check to determine whether a
-pending claim could be allocated for a node. `Allocate` is called as part of
-the `SimulateBindPod` implementation. The simulated allocation result is stored
-in the claim snapshot and then the claim is reserved for the pod. If the claim
-cannot be shared between pods, that will prevent other pods from using the
-claim while the autoscaler goes through it's binpacking simulation.
-
-Finally, `NodeIsReady` of each vendor plugin is called to implement the
-scheduler plugin's own `NodeIsReady`.
-
-#### Handling claims without vendor code
+encounters a pod that uses a resource claim for node-local resources, it needs
+to understand the parameters for the claim and available capacity in order
+to simulate the effect of allocating claims as part of scheduling and of
+creating or removing nodes.
 
-When the DRA scheduler plugin does not have specific vendor code for a certain
-resource class, it falls back to the assumption that resources are unlimited,
-i.e. allocation will always work. This is how volume provisioning is currently
-handled during cluster autoscaling.
-
-If a pod is not getting scheduled because a resource claim cannot be allocated
-by the real DRA driver, to the autoscaler it will look like the pod should be
-schedulable and therefore it will not spin up new nodes for it, which is the
-right decision.
-
-If a pod is not getting scheduled because some other resource requirement is
-not satisfied, the autoscaler will simulate scale up and can pick some
-arbitrary node pool because the DRA scheduler plugin will accept all of those
-nodes.
-
-During scale down, moving a running pod to a different node is assumed to work,
-so that scenario also works.
-
-#### Building a custom Cluster Autoscaler binary
-
-Vendors are encouraged to include an "init" package in their driver
-simulation implementation. That "init" package registers their plugin. Then to
-build a custom autoscaler binary, one additional file alongside `main.go` is
-sufficient:
-
-```
-package main
-
-import (
-	_ "acme.example.com/dra-resource-driver/simulation-plugin/init"
-)
-```
+This is not possible with opaque parameters as described in this KEP. If a DRA
+driver developer wants to support Cluster Autoscaler, they have to use numeric
+parameters. Numeric parameters are an extension of this KEP that is defined in
+[KEP #4381](https://github.com/kubernetes/enhancements/issues/4381).
-This init package may also register additional command line flags. Care must be
-taken to not cause conflicts between different plugins, so all vendor flags
-should start with a unique prefix.
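+
+As a purely illustrative sketch of the problem (driver, resource class, and
+parameter CRD names invented), such a claim references vendor-specific
+parameters that only that vendor's driver controller knows how to interpret:
+
+```
+apiVersion: resource.k8s.io/v1alpha2
+kind: ResourceClaim
+metadata:
+  name: gpu-claim
+spec:
+  resourceClassName: acme-gpu
+  # The content behind this reference is opaque to the autoscaler.
+  parametersRef:
+    apiGroup: gpu.acme.example.com
+    kind: GpuClaimParameters
+    name: large-gpu
+```
+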
+Numeric parameters are not necessary for network-attached resources because
+adding or removing nodes doesn't change their availability and thus Cluster
+Autoscaler does not need to understand their parameters.
 
 ### kubelet
 
@@ -2684,7 +2439,7 @@ For beta:
 
 #### Alpha -> Beta Graduation
 
-- Implement integration with Cluster Autoscaler
+- Implement integration with Cluster Autoscaler through numeric parameters
 - Gather feedback from developers and surveys
 - Positive acknowledgment from 3 would-be implementors of a resource driver,
   from a diversity of companies or projects

From 2db47ba62b7894b540c592b7c1d031f710d8a34b Mon Sep 17 00:00:00 2001
From: Patrick Ohly
Date: Wed, 24 Jan 2024 11:06:47 +0100
Subject: [PATCH 08/13] DRA: update for 1.30

---
 keps/sig-node/3063-dynamic-resource-allocation/kep.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml b/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml
index 203060ef52e..a7d62e682fb 100644
--- a/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml
+++ b/keps/sig-node/3063-dynamic-resource-allocation/kep.yaml
@@ -24,7 +24,7 @@ stage: alpha
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.29"
+latest-milestone: "v1.30"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:

From fa3c5cea2ececb0a212eebed6ccdff61137a2865 Mon Sep 17 00:00:00 2001
From: Patrick Ohly
Date: Wed, 31 Jan 2024 12:36:24 +0100
Subject: [PATCH 09/13] DRA: document PreBind

https://github.com/kubernetes/kubernetes/pull/121876 changed where the
cluster gets updated with blocking API calls.

---
 .../README.md | 44 +++++++++++++++----
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md
index 74364f0b667..372a8870689 100644
--- a/keps/sig-node/3063-dynamic-resource-allocation/README.md
+++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md
@@ -111,6 +111,7 @@ SIG Architecture for cross-cutting KEPs).
     - [Post-filter](#post-filter)
     - [Pre-score](#pre-score)
     - [Reserve](#reserve)
+    - [PreBind](#prebind)
     - [Unreserve](#unreserve)
   - [Cluster Autoscaler](#cluster-autoscaler)
   - [kubelet](#kubelet)
@@ -150,7 +151,7 @@ SIG Architecture for cross-cutting KEPs).
   - [Improving scheduling performance](#improving-scheduling-performance)
     - [Optimize for network-attached resources](#optimize-for-network-attached-resources)
     - [Moving blocking API calls into goroutines](#moving-blocking-api-calls-into-goroutines)
-  - [RPC calls instead of PodSchedulingContext](#rpc-calls-instead-of-)
+  - [RPC calls instead of PodSchedulingContext](#rpc-calls-instead-of-podschedulingcontext)
 - [Infrastructure Needed](#infrastructure-needed)
 
@@ -1818,13 +1819,11 @@ notices this, the current scheduling attempt for the pod must stop and the pod
 needs to be put back into the work queue. It then gets retried whenever a
 ResourceClaim gets added or modified.
 
-The following extension points are implemented in the new claim plugin. Some of
-them invoke API calls to create or update objects. This is done to simplify
-error handling: a failure during such a call puts the pod into the backoff
-queue where it will be retried after a timeout. The downside is that the
-latency caused by those blocking calls not only affects pods using claims, but
-also all other pending pods because the scheduler only schedules one pod at a
-time.
+The following extension points are implemented in the new claim plugin. Except
+for some unlikely edge cases (see below), there are no API calls during the
+main scheduling cycle. Instead, the plugin collects information and updates
+the cluster in a separate goroutine which invokes PreBind.
+
 
 #### EventsToRegister
 
@@ -1906,6 +1905,12 @@ At the moment, the claim plugin has no information that might enable it to
 prioritize which resource to deallocate first. Future extensions of this KEP
 might attempt to improve this.
 
+This is currently using blocking API calls. They are unlikely because this
+situation can only arise when there are multiple claims per pod and allocation
+for one of them fails despite all drivers agreeing that a node should be
+suitable, or when reusing a claim for multiple pods (not a common use case) and
+the original node became unusable for the next pod.
+
 #### Pre-score
 
 This is passed a list of nodes that have passed filtering by the claim
@@ -1936,9 +1941,21 @@ of its ResourceClaims. The driver can and should already have added the Pod
 when specifically allocating the claim for it, so it may be possible to skip
 this update.
 
+All the PodSchedulingContext and ResourceClaim updates are recorded in the
+plugin state. They will be written to the cluster during PreBind.
+
 If some resources are not allocated yet or reserving an allocated resource
 fails, the scheduling attempt needs to be aborted and retried at a later time
-or when the statuses change.
+or when the statuses change. The Reserve call itself never fails. If resources
+are not currently available, that information is recorded in the plugin state
+and will cause the PreBind call to fail instead.
+
+#### PreBind
+
+This is called in a separate goroutine. The plugin now checks all the
+information gathered earlier and updates the cluster accordingly. If some
+claims are not allocated or not reserved, PreBind fails and the pod must be
+retried.
 
 #### Unreserve
 
@@ -1958,6 +1975,13 @@ but eventually one of them will. Not giving up the reservations would lead to
 a permanent deadlock that somehow would have to be detected and resolved to
 make progress.
 
+Unreserve is called in two scenarios:
+- In the main goroutine when scheduling a pod has failed: in that case the plugin's
+  Reserve call hasn't actually changed the claim status yet, so there is nothing
+  that needs to be rolled back.
+- After binding has failed: this runs in a goroutine, so reverting the
+  `claim.status.reservedFor` with a blocking call is acceptable.
+
 ### Cluster Autoscaler
 
 When [Cluster
@@ -2439,6 +2463,8 @@ For beta:
 
 #### Alpha -> Beta Graduation
 
+- In normal scenarios, scheduling pods with claims must not block scheduling of
+  other pods by doing blocking API calls
 - Implement integration with Cluster Autoscaler through numeric parameters
 - Gather feedback from developers and surveys
 - Positive acknowledgment from 3 would-be implementors of a resource driver,
   from a diversity of companies or projects

From 9721e4c8ffe54c07b01266237eb925f5e858ff14 Mon Sep 17 00:00:00 2001
From: Patrick Ohly
Date: Thu, 1 Feb 2024 15:40:22 +0100
Subject: [PATCH 10/13] DRA: clarifications around semantic parameters

"Numeric parameters" are now called "semantic parameters" because they
are not just about numbers.
---
 .../README.md                                 | 49 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 44 insertions(+), 5 deletions(-)

diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md
index 372a8870689..1f5fe3073b2 100644
--- a/keps/sig-node/3063-dynamic-resource-allocation/README.md
+++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md
@@ -664,7 +664,7 @@ allocation also may turn out to be insufficient. Some risks are:
   captured yet (like limited number of nodes that they can be attached to).
 
 - Cluster autoscaling will not work as expected unless the DRA driver
-  uses [numeric parameters](https://github.com/kubernetes/enhancements/issues/4381).
+  uses [semantic parameters](https://github.com/kubernetes/enhancements/issues/4381).
 
 All of these risks will have to be evaluated by gathering feedback from users
 and resource driver developers.
@@ -1841,6 +1841,11 @@ like "drivers have provided information" occurs, instead of forcing the pod
 to go through the backoff queue and the usually 5 second long delay associated
 with that.
 
+Queueing hints are an optional feature of the scheduler, with their own
+`SchedulerQueueingHints` feature gate that (as of Kubernetes 1.29) defaults to
+off. When turned off, scheduling of pods with resource claims is slower
+compared to a cluster configuration where they are turned on.
+
 #### PreEnqueue
 
 This checks whether all claims referenced by a pod exist. If they don't,
@@ -1992,9 +1997,9 @@
 to simulate the effect of allocating claims as part of scheduling and of
 creating or removing nodes.
 
 This is not possible with opaque parameters as described in this KEP. If a DRA
-driver developer wants to support Cluster Autoscaler, they have to use numeric
-parameters. Numeric parameters are an extension of this KEP that is defined in
+driver developer wants to support Cluster Autoscaler, they have to use semantic
+parameters. Semantic parameters are an extension of this KEP that is defined in
 [KEP #4381](https://github.com/kubernetes/enhancements/issues/4381).
 
 As a purely illustrative sketch of the problem (driver, resource class, and
@@ -2016,7 +2021,7 @@
     name: large-gpu
 ```
 
-Numeric parameters are not necessary for network-attached resources because
+Semantic parameters are not necessary for network-attached resources because
 adding or removing nodes doesn't change their availability and thus Cluster
 Autoscaler does not need to understand their parameters.
 
@@ -2465,7 +2470,7 @@ For beta:
 
 - In normal scenarios, scheduling pods with claims must not block scheduling of
   other pods by doing blocking API calls
-- Implement integration with Cluster Autoscaler through numeric parameters
+- Implement integration with Cluster Autoscaler through semantic parameters
 - Gather feedback from developers and surveys
 - Positive acknowledgment from 3 would-be implementors of a resource driver,
   from a diversity of companies or projects
@@ -2822,6 +2827,40 @@ Why should this KEP _not_ be implemented?
 
 ## Alternatives
 
+### Semantic Parameters instead of PodSchedulingContext
+
+When a DRA driver uses semantic parameters, there is no DRA driver controller
+and no need for communication between scheduler and such a controller. The
+PodSchedulingContext object and the associated support in the scheduler then
+aren't needed. Once semantic parameters are mature enough and confirmed to be
+sufficient for DRA drivers, it might become possible to remove the
+PodSchedulingContext API from this KEP.
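+
+To make concrete what would then go away, here is an illustrative
+PodSchedulingContext (pod, claim, and node names invented):
+
+```
+apiVersion: resource.k8s.io/v1alpha2
+kind: PodSchedulingContext
+metadata:
+  # Created by the scheduler on behalf of the pod, with the same name.
+  name: my-pod
+spec:
+  selectedNode: worker-1
+  potentialNodes:
+  - worker-1
+  - worker-2
+status:
+  resourceClaims:
+  # Filled in by the DRA driver controller.
+  - name: my-claim
+    unsuitableNodes:
+    - worker-2
+```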
+
+It might still be needed for other drivers and use cases, which then can be
+discussed in a new KEP which focuses specifically on those use cases.
+
 ### ResourceClaimTemplate
 
 Instead of creating a ResourceClaim from a template, the

From 11f65cc7b0614f6bae6ae498ec1c984aae07f5a6 Mon Sep 17 00:00:00 2001
From: Patrick Ohly
Date: Thu, 1 Feb 2024 15:42:12 +0100
Subject: [PATCH 11/13] DRA: review feedback

---
 keps/sig-node/3063-dynamic-resource-allocation/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md
index 1f5fe3073b2..9328f815f1d 100644
--- a/keps/sig-node/3063-dynamic-resource-allocation/README.md
+++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md
@@ -1910,7 +1910,7 @@ At the moment, the claim plugin has no information that might enable it to
 prioritize which resource to deallocate first. Future extensions of this KEP
 might attempt to improve this.
 
-This is currently using blocking API calls. They are unlikely because this
+This is currently using blocking API calls. It's quite rare because this
 situation can only arise when there are multiple claims per pod and allocation

From 2604296d0d0e320df86b8ede0b4b70d5e7509e46 Mon Sep 17 00:00:00 2001
From: Patrick Ohly
Date: Thu, 1 Feb 2024 17:08:37 +0100
Subject: [PATCH 12/13] DRA: fix TOC

---
 keps/sig-node/3063-dynamic-resource-allocation/README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md
index 9328f815f1d..8704625af77 100644
--- a/keps/sig-node/3063-dynamic-resource-allocation/README.md
+++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md
@@ -141,6 +141,7 @@ SIG Architecture for cross-cutting KEPs).
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
+  - [Semantic Parameters instead of PodSchedulingContext](#semantic-parameters-instead-of-podschedulingcontext)
   - [ResourceClaimTemplate](#resourceclaimtemplate)
   - [Reusing volume support as-is](#reusing-volume-support-as-is)
   - [Extend volume support](#extend-volume-support)

From 7afe6affb977d62e9fd89ee82a0492fddf789852 Mon Sep 17 00:00:00 2001
From: Patrick Ohly
Date: Mon, 5 Feb 2024 10:52:30 +0100
Subject: [PATCH 13/13] DRA: review feedback

---
 keps/sig-node/3063-dynamic-resource-allocation/README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/keps/sig-node/3063-dynamic-resource-allocation/README.md b/keps/sig-node/3063-dynamic-resource-allocation/README.md
index 8704625af77..f49900d7616 100644
--- a/keps/sig-node/3063-dynamic-resource-allocation/README.md
+++ b/keps/sig-node/3063-dynamic-resource-allocation/README.md
@@ -2830,7 +2830,8 @@
 
 ### Semantic Parameters instead of PodSchedulingContext
 
-When a DRA driver uses semantic parameters, there is no DRA driver controller
+When a DRA driver uses semantic parameters, there is no need for a DRA driver controller
+which allocates the claim
 and no need for communication between scheduler and such a controller. The
 PodSchedulingContext object and the associated support in the scheduler then
 aren't needed. Once semantic parameters are mature enough and confirmed to be