📖 Update spot instances proposal with termination handler design #3528
Conversation
Hi @alexander-demichev. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@alexander-demichev could you please link to the PR/issue/comment where this amendment was requested?
/kind documentation
/hold
for termination.

The code for all providers will be mostly similar; the only difference is the logic for polling the termination endpoint and processing the response.
This means that the cloud provider type can be passed as an argument and the termination handler can be common for all providers and placed in a separate repository, for example: `kubernetes-sigs/cluster-api-termination-handler`.
Given that this doesn't actually exist today, should we remove it?
Do you have a POC of what the code would look like?
Yes, I'll put it all together and demonstrate it, probably next week.
@CecileRobertMichon Hi, I made a prototype that should work https://github.com/alexander-demichev/termination-handler.
To enable graceful termination of workloads running on non-guaranteed instances,
a termination handler pod will need to be deployed on each spot instance.
Where would users get this termination handler?
The plan is to make something similar to what Calico has.
I'm a bit curious as to why a new termination handler rather than trying to leverage/extend existing projects, such as https://github.com/aws/aws-node-termination-handler for this purpose?
One of the things that stands out to me is that the aws-node-termination-handler supports an SQS-based system that would avoid the security concerns mentioned below.
Up for discussion as to whether it is worth it, but the benefit of having our own termination handler is that it reduces the time to get a new Node.
With a normal handler, which needs to be able to drain the node, we then have to wait for the Node to go away and for the MHC to remediate.
With our own, we can trigger machine deletion as soon as the termination notice is served, causing the MachineSet to create a new Machine.
In some cases this won't be all that useful, but there are a couple I can think of where it is.
In GCP, preemptible instances are only allowed to run for 24 hours before being served a termination notice, so this would save time in replacing the lost compute capacity. Secondly, while CAPI doesn't support it yet, if a MachineSet supported multiple AZs (which I believe there's discussion of somewhere), the MachineSet may be able to create a new instance in a different AZ if a single AZ is terminating its instances.
As a secondary benefit, by leveraging the machine controller shutdown logic, if we ever encode any extra logic for an instance (pre-stop hooks), this would also be leveraged before the instance is actually gone (more useful on AWS with the 2-minute warning than on others, I admit).
@JoelSpeed is that something that we could address by working with related upstream projects to add support for Cluster API rather than having to replicate support for the various different ways to detect spot instance termination across various providers?
@detiber One more benefit of having our own termination handler is that we will have only one project for all infra providers that support spot instances. This allows us to deliver only one binary to users instead of three different ones, one per cloud.
It's not a full replication of other projects; it's more like a stripped-down version of them, which should reduce the complexity of integrating termination handler functionality into CAPI.
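As a rough sketch (not part of the proposal; the image name and flags are hypothetical), the single binary could select the provider-specific polling logic via a container argument in the DaemonSet pod spec:

```yaml
# Fragment of a DaemonSet pod spec - names and flags are hypothetical.
containers:
- name: termination-handler
  image: example.org/cluster-api-termination-handler:latest  # hypothetical image
  args:
  - --provider=aws           # or gcp / azure; selects the metadata-polling logic
  - --poll-interval=5s       # hypothetical: how often to check the termination endpoint
  - --node-name=$(NODE_NAME)
  env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```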
```yaml
---
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: cluster-api-termination-handler
  labels:
    api: clusterapi
    k8s-app: termination-handler
spec:
  selector:
    matchLabels:
      cluster.x-k8s.io/interruptible-instance: "" # This label is automatically applied to all spot instances
  maxUnhealthy: 100% # No reason to block the interruption, it's inevitable
  unhealthyConditions:
  - type: Terminating
    status: "True"
    timeout: 0s # Immediately terminate the instance
```
Shouldn't this be something that the MachinePool controller should handle rather than MachineHealthCheck?
Spot instances are implemented for InfraMachines, not just MachinePools (in fact they aren't implemented for MachinePools yet, see Future work below). This solution only works for Machines. For ASGs/VMSS this would need to be part of the MachinePool controller, but we would need both.
In that case, should the Machine controller be responsible for noticing this condition? That way it would be the same in both Machine and MachinePool?
#### Running the termination handler
The Termination Pod will be part of a DaemonSet, which should be deployed manually by users. It will select Nodes which are labelled as spot instances to ensure the Termination Pod only runs on instances that require termination handlers.
"which should be deployed manually by users" -> can ClusterResourceSets be leveraged instead?
Are ClusterResourceSets usable at this point/will they be by the time we want to implement this proposal? I believe they would solve the issue well
They are usable (we use them in e2e) but are still experimental and turned on by a feature flag.
We can definitely use ClusterResourceSets for this
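For reference, a ClusterResourceSet wrapping the termination handler DaemonSet might look roughly like the sketch below; the resource names and the cluster-selecting label are assumptions for illustration, not part of the proposal:

```yaml
apiVersion: addons.cluster.x-k8s.io/v1alpha3
kind: ClusterResourceSet
metadata:
  name: termination-handler               # hypothetical name
  namespace: default
spec:
  clusterSelector:
    matchLabels:
      spot-instances: "enabled"            # hypothetical label on Clusters that use spot instances
  resources:
  - kind: ConfigMap
    name: termination-handler-daemonset    # hypothetical ConfigMap holding the DaemonSet manifest
```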
Once the Node has been marked with the `Terminating` condition, it will be
the MachineHealthCheck controller's responsibility to ensure that the Machine
is deleted, triggering it to be drained and removed from the cluster.
Wouldn't MHC cause a new Machine to be created? Since the idea is to replace an unhealthy machine? Is that what we want when a spot VM is terminated for lack of capacity?
I think yes, we need to delete the machine and leverage node draining here.
Given the Machine is going to be `Failed`, is there any reason to keep it around apart from the fact we know a replacement would fail?
I guess when you're running a setup like this you kind of need the cluster autoscaler to notice that there's no capacity and scale up elsewhere, or maybe have the MachineSet bring up a node in an alternate failure domain (if we go down the route of supporting that? Would probably need to recognise somehow that the spot capacity is out and retry a different failure domain).
I think that for a first implementation letting the machine be recreated is reasonable. We can always do smarter things based on usage feedback once we have something in place.
The termination pod will be developed to poll the metadata service for the Node
that it is running on.
We will implement request/response handlers for each of the three cloud providers
"each of the three cloud providers" seems a bit restrictive, maybe this can be
We will implement request/response handlers for each of the three cloud providers | |
We will implement request/response handlers for each infrastructure provider that supports non-guaranteed instances |
#### Running the termination handler
The Termination Pod will be part of a DaemonSet, which should be deployed manually by users. It will select Nodes which are labelled as spot instances to ensure the Termination Pod only runs on instances that require termination handlers.

The spot label will be added by the cloud providers as they create instances, provided they support spot instances and the instance is a spot instance. This also requires ability of cloud providers to sync labels between machines and nodes on workload cluster.
"This also requires ability of cloud providers to sync labels between machines and nodes on workload cluster." -> this seems like an important limitation, by this do you mean the Cluster API infra provider has to be able to sync labels between CAPI Machines and nodes, Infra Machines and nodes, or something else?
I think somehow we need to mark nodes as spot instances (using a well known label) so that the DaemonSet only deploys to spot instances.
Is there anywhere in CAPI already where we can put a label on a Machine and have that copied to the Node once it joins the cluster? If there is, could the Infra provider add a label to that set when it creates a spot instance?
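For illustration, assuming such a label does get synced to the Node, the DaemonSet could target spot instances with a nodeSelector; everything except the `cluster.x-k8s.io/interruptible-instance` label (taken from the proposal's example) is a hypothetical sketch:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cluster-api-termination-handler        # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: termination-handler
  template:
    metadata:
      labels:
        k8s-app: termination-handler
    spec:
      nodeSelector:
        cluster.x-k8s.io/interruptible-instance: ""   # only schedule onto spot/preemptible Nodes
      tolerations:
      - operator: Exists                               # tolerate any taints on those Nodes
      containers:
      - name: termination-handler
        image: example.org/termination-handler:latest  # hypothetical image
```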
Otherwise the core machine controller could infer from the `InfrastructureRef` whether this is a non-guaranteed instance and add the label to the node as it adds the nodeRef.
I created an issue for tracking this, #3504, and a POC PR for AWS: kubernetes-sigs/cluster-api-provider-aws#1876.
The spot label will be added by the cloud providers as they create instances, provided they support spot instances and the instance is a spot instance. This also requires ability of cloud providers to sync labels between machines and nodes on workload cluster.

#### Termination handler security
To be able to perform the actions required by the termination handler, the pod will need to be relatively privileged.
"the pod will need to be relatively privileged." seems a bit vague, what access does it need exactly? What are the actions required by the termination handler?
Agreed this is a bit vague; I believe it is an introductory sentence for the rest of this section, which explains the privilege level required to run this termination handler. Could probably just drop this whole sentence.
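To make the privilege level concrete: assuming the handler only needs to set a condition on its own Node (as described in this proposal), the required RBAC would be roughly the sketch below; the role name is hypothetical and the exact rules would depend on the final design:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: termination-handler    # hypothetical name
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]               # read its own Node
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["patch"]             # set the Terminating condition
```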
To enable graceful termination of workloads running on non-guaranteed instances,
a termination handler pod will need to be deployed on each spot instance.

The termination pod will be developed to poll the metadata service for the Node
I'm not 100% sure this would work for Azure. At the bottom of this proposal we wrote:
Azure uses their Scheduled Events API to notify Spot VMs that they are due to be preempted.
This is a similar service to the AWS metadata service that each machine can poll to see events for itself.
Azure only gives 30 seconds warning for nodes being preempted though.
A Daemonset solution similar to the AWS termination handlers could be implemented to provide graceful shutdown with Azure Spot VMs.
For example see this [existing solution](https://github.com/awesomenix/drainsafe).
cc @awesomenix
We are planning to integrate https://github.com/awesomenix/drainsafe to automatically cordon and drain any terminating instances for machinepools. We can probably extend it to execute additional termination handlers. Just a reminder that you get at most 30 seconds for a Spot VM on Azure; I am not sure how much you can do in that time.
For termination handlers, it should be sufficient to use node draining logic that we already have implemented in CAPI https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/machine_controller.go#L276
We have a POC for AWS which marks the node with a condition as described in this proposal; the Azure one can be pretty much identical to this, just substituting the logic for how to read the response and adding the required header to the request.
While I agree 30 seconds isn't very long to actually do much in the drain, I think the benefit would come from replacing nodes quicker. Without the contents of this proposal, the instance gets terminated; if an MHC is present, it will remove the Machine after some time of being unhealthy, and then the MachineSet replaces the node.
With this proposal, we get an up-to-30-second warning, which then allows us to mark the Node, have an MHC notice this mark, delete the Machine, and then have the MachineSet controller create a new Machine. This should pretty much all happen "instantly", so we should start the creation of the new instance before the old instance goes away rather than a few minutes after it has already shut down.
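For clarity, the condition the handler would set on the Node might look roughly like the sketch below (the `reason` and `message` values are illustrative, not fixed by the proposal); the MachineHealthCheck shown earlier matches on it:

```yaml
# Fragment of the Node .status set by the termination handler (sketch).
status:
  conditions:
  - type: Terminating
    status: "True"
    reason: TerminationRequested       # hypothetical reason
    message: "Termination notice received from the cloud metadata service"
    lastTransitionTime: "2020-08-25T12:00:00Z"
```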
Still need to agree on #3528 (comment) but overall this looks good to me.
There's a KEP that was discussed on sig-node to hook into systemd's ability to delay shutdown so the kubelet can handle shutdown gracefully: kubernetes/enhancements#2001. I do like the idea of hooking this in as a separate component as done here.
@cpuguy83 Thanks for sharing the link. The logic described in this proposal might not work with spot instances; I am not sure that we can delay the shutdown of a spot instance.
@CecileRobertMichon @vincepri @JoelSpeed This proposal amendment is ready for another round of reviews.
/ok-to-test
/milestone v0.4.0
/milestone Next
Folks, what's the status of this proposal?
I think all concerns and comments were addressed; I'm happy to continue with this proposal.
@alexander-demichev are there any breaking changes in this proposal?
@alexander-demichev please update the `last-updated` date on the proposal
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@CecileRobertMichon sure, it's done
@CecileRobertMichon There are no breaking changes.
@alexander-demichev: The following test failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Couple of things:
- I would be tempted to reduce the proposal here to its essentials - defining the contract - and defer the per-provider implementation to CAEPs (or preferred processes) for those providers.
- We should also at least mention the existence of https://github.com/aws/aws-node-termination-handler in the alternatives section.
#### Running the termination handler
The Termination Pod will be part of a DaemonSet that can be deployed using [ClusterResourceSet](https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20200220-cluster-resource-set.md). The DaemonSet will select Nodes which are labelled as spot instances to ensure the Termination Pod only runs on instances that require termination handlers.
Why? I think this should be enabled for all nodes by default.
Even if the problem is clearly more important for spot instances, every node that is about to be terminated, for whatever reason, should attempt to properly drain.
The termination notices that these termination handlers rely upon are only implemented on the spot/preemptible instances. E.g. on AWS, if the spot instance is going away, we get a 2-minute warning via the metadata endpoint. There is no equivalent for an on-demand instance.
I appreciate that there are times that cloud providers remove instances for maintenance reasons, and there may be some way programmatically to detect this, but as far as I know they use a different mechanism and are also far less frequent. For now at least, I think they are out of scope for these termination handlers
The proposal referenced here, #3528 (comment), describes the situation where the node can be unexpectedly shut down.
This got auto-closed because of the branch rename to main.
What this PR does / why we need it:
This PR updates the spot instances proposal with the termination handler design.
Thanks to @JoelSpeed, because it's mostly a copy-paste of his proposal done for OpenShift.