KEP-3063: DRA: 1.30 update #4181
Conversation
Some work remains for the 1.29 development cycle:
The KEP needs to be merged, then that work needs to be done and/or finished, and then before the actual beta graduation we need to do another review round to determine whether beta graduation criteria have been met.
/sig scheduling
/hold
As discussed on Slack, scale down must determine whether some currently running pods could get moved. Simulating that depends on also simulating deallocation; otherwise the allocated claim prevents moving the pods.
The discussion around autoscaling needs more time.
Technically, support for cluster autoscaling can be defined and implemented as an extension of the core DRA feature, without changing the core feature itself. Splitting the specification of "numeric parameters" out into a separate KEP might make it easier to make progress on the different aspects because they are better separated.
kubernetes/kubernetes#121876 changed where the cluster gets updated with blocking API calls.
I'll give another pass after @johnbelamaric gives LGTM
> schedulable, like creating a claim that the pod needs or finishing the
> allocation of a claim.
>
> [Queuing hints](https://github.com/kubernetes/enhancements/issues/4247) are …
FYI, queueing hints are disabled by default and are unlikely to reach a stable enough state in 1.30 to be re-enabled, in case this is crucial for the usability of DRA.
+1. For this KEP update, we will have to state the status of DRA without queueing hints.
It's not crucial. Performance is different, but DRA is also usable without that optimization. I know that you disagree, but all the use cases in https://docs.google.com/document/d/1XNkTobkyz-MyXhidhTp5RfbMsM-uRCWDoflUMqNcYTk/edit#heading=h.ljj9kaa144nr confirmed that scheduling performance is not a priority because of long-running pods.
I'll update the text to explain this.
I added something.
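For readers not following the scheduler framework work: queueing hints let a plugin inspect a cluster event and decide whether a pod sitting in the unschedulable queue is worth retrying. The following is a minimal, self-contained sketch of that idea only; the type and function names are illustrative and are not the actual scheduler framework or DRA plugin API.

```go
package main

import "fmt"

// QueueingHint mirrors the idea behind the scheduler's queueing hints:
// a plugin looks at an event and decides whether a waiting pod should be retried.
type QueueingHint int

const (
	QueueSkip QueueingHint = iota // event cannot make the pod schedulable, keep waiting
	Queue                         // event may make the pod schedulable, retry it
)

// Claim is a minimal stand-in for a ResourceClaim as seen by the plugin.
type Claim struct {
	Name      string
	Allocated bool
}

// Pod is a minimal stand-in for a pod that references claims by name.
type Pod struct {
	Name       string
	ClaimNames []string
}

// claimUpdateHint shows what a DRA queueing hint conceptually does: only retry
// the pod if the updated claim is one the pod actually references and the
// update made progress (here: the claim became allocated).
func claimUpdateHint(pod Pod, oldClaim, newClaim Claim) QueueingHint {
	referenced := false
	for _, name := range pod.ClaimNames {
		if name == newClaim.Name {
			referenced = true
			break
		}
	}
	if !referenced {
		return QueueSkip
	}
	if !oldClaim.Allocated && newClaim.Allocated {
		return Queue
	}
	return QueueSkip
}

func main() {
	pod := Pod{Name: "pod-a", ClaimNames: []string{"gpu-claim"}}
	old := Claim{Name: "gpu-claim", Allocated: false}
	updated := Claim{Name: "gpu-claim", Allocated: true}
	fmt.Println(claimUpdateHint(pod, old, updated) == Queue) // true: worth retrying
}
```

Without hints, every such event sends the pod back through a scheduling attempt (and possibly the backoff queue), which is the performance difference discussed above.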
> This is not possible with opaque parameters as described in this KEP. If a DRA
> driver developer wants to support Cluster Autoscaler, they have to use numeric
> parameters. Numeric parameters are an extension of this KEP that is defined in …
I don't think we should phrase this case as only being an enabler for cluster autoscaler. I see that KEP as an opportunity to simplify this KEP before it goes to beta.
I can add that.
It didn't fit into this "Autoscaler" section, so I added a new section for this thought under "Alternatives".
The scheduling part (except for CA support) looks good. A few nits.
> go through the backoff queue and the usually 5 second long delay associated
> with that.
>
> #### PreEnqueue
FWIW, PreEnqueue is likely to reach GA in 1.30.
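To illustrate why PreEnqueue matters for DRA: without such a gate, a pod whose claims have not even been created yet would repeatedly enter scheduling attempts and bounce off into the backoff queue. Below is a hedged sketch of the gate-keeping idea only; the names are illustrative, not the real plugin code.

```go
package main

import "fmt"

// preEnqueueCheck models the gate a DRA plugin can apply before a pod enters
// the active scheduling queue: all ResourceClaims the pod references must at
// least exist, otherwise attempting to schedule the pod is pointless.
func preEnqueueCheck(podClaims []string, existingClaims map[string]bool) (ok bool, reason string) {
	for _, name := range podClaims {
		if !existingClaims[name] {
			return false, fmt.Sprintf("waiting for ResourceClaim %q to be created", name)
		}
	}
	return true, ""
}

func main() {
	existing := map[string]bool{"gpu-claim": true}
	if ok, reason := preEnqueueCheck([]string{"gpu-claim", "nic-claim"}, existing); !ok {
		fmt.Println("pod stays out of the active queue:", reason)
	}
}
```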
> At the moment, the claim plugin has no information that might enable it to
> prioritize which resource to deallocate first. Future extensions of this KEP
> might attempt to improve this.
>
> This is currently using blocking API calls. They are unlikely because this …
Suggested change: "This is currently using blocking API calls. They are unlikely because this" → "This is currently using blocking API calls. It's quite rare because this"
"Numeric parameters" are now called "semantic parameters" because they are not just about numbers.
/approve
from sig scheduling
/hold for @johnbelamaric's LGTM
> When pods fail to get scheduled, kube-scheduler reports that through events
> and pod status. For DRA, that includes "waiting for resource driver to
> provide information" (node not selected yet) and "waiting for resource …
Do these events/status updates include information about the specific claim/driver that is blocking progress?
When waiting for a claim, the claim is mentioned. This allows the user to drill down and check the claim.
When waiting for PodSchedulingContext information, that object is not mentioned explicitly. It doesn't need to be named because the name is the same as for the pod.
In both cases it is assumed that users understand the concepts enough to know what "claim" and "pod scheduling context" are.
> Why should this KEP _not_ be implemented?
>
> ## Alternatives
>
> ### Semantic Parameters instead of PodSchedulingContext
>
> When a DRA driver uses semantic parameters, there is no DRA driver controller …
I believe in the latest cut this has shifted a little? In the semantic parameter version, there is still a driver controller but its responsibility is to evaluate the claim parameters and produce the driver-neutral semantic request. Since we are skipping CEL for this (for now at least...). But the following text on "we might be able to remove PodSchedulingContext" looks right.
I changed the text to "there is no need for a DRA driver controller which allocates the claim and no need for communication between scheduler and such a controller".
That leaves it open whether a controller is still needed for other purposes (will be defined in the "semantic parameters KEP" and may depend on whether users are allowed to create in-tree parameter objects directly) and just focuses on the aspect that is relevant here.
> ### Complex sharing of ResourceClaim
>
> At the moment, the allocation result marks a claim as either "shareable" by …
There's something not quite right in that model. I wonder if the underlying semantic models can express this?
Also, there are different kinds of "shareable" - for example, I know of use cases where you want to mount the GPU devices in a monitoring container - or perhaps to access it via some config tools.
No change needed, just raising some things to think about post 1.30.
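To make that concrete, here is a rough sketch of the difference between a single "shareable" flag and a richer sharing model; the types below are purely illustrative and are not the actual resource.k8s.io API.

```go
package main

import "fmt"

// AllocationResultToday illustrates the current model: sharing is a single
// yes/no property of the allocated claim.
type AllocationResultToday struct {
	Shareable bool
}

// AccessMode illustrates one way a richer model could distinguish how a
// consumer uses the claim, e.g. exclusive compute vs. read-only monitoring.
type AccessMode string

const (
	Exclusive  AccessMode = "Exclusive"
	Shared     AccessMode = "Shared"
	Monitoring AccessMode = "Monitoring" // e.g. mounting GPU devices into a monitoring container
)

// AllocationResultRicher sketches an allocation result that caps concurrent
// consumers per access mode instead of using a single flag.
type AllocationResultRicher struct {
	MaxConsumers map[AccessMode]int
}

func main() {
	today := AllocationResultToday{Shareable: true}
	richer := AllocationResultRicher{MaxConsumers: map[AccessMode]int{
		Exclusive:  1,
		Monitoring: 4,
	}}
	fmt.Println(today.Shareable, richer.MaxConsumers[Monitoring])
}
```

Whether the semantic/structured parameter models can express something like this is exactly the open question raised above; nothing in this sketch is committed to.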
A couple of minor comments but overall LGTM. What we are saying here is: …
/lgtm from SIG Node perspective
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, dchen1107, pohly

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm

I am root in this repo so I can't do approve without it going through (and I am not really a node approver)
/hold cancel
One-line PR description: This updates the README to reflect what has been done and fills in sections that were left out earlier. The next milestone is 1.30 where DRA will remain in alpha.
Issue link: DRA: control plane controller ("classic DRA") #3063