This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Begin draining a node when it enters Terminating state #5

Merged
merged 1 commit into VirtusLab:master from the asg branch on Aug 13, 2020

Conversation

jhuntwork
Contributor

... and continue with other operations once it reaches the
`Terminating:Wait` state.

As per https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-conn-drain.html

> If your instances are part of an Auto Scaling group and connection draining
> is enabled for your load balancer, Auto Scaling waits for the in-flight
> requests to complete, or for the maximum timeout to expire, before
> terminating instances due to a scaling event or health check replacement

ASGs will begin draining nodes from load balancers while in the
`Terminating` state, and will even remove them from the load balancer
before the node transitions to `Terminating:Wait`. This means that if you
depend on pod evictions to move a critical service to another available
node in the load balancer target group, this will only happen _after_
that node has already been drained and removed from the load balancer.

This effect is amplified by the timeout, or deregistration delay, value set
on the load balancer. The default value is 300 seconds, as documented
here: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#deregistration-delay
(a short sketch of adjusting that value follows below).

By draining sooner, critical pods providing service through that load
balancer can move to other nodes and maintain uptime while the node is
being deregistered from the load balancer.
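
A minimal sketch of adjusting the deregistration delay described above, using the AWS SDK for Go; the target group ARN and the 30-second value here are placeholders, not part of this PR:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := elbv2.New(sess)

	// Lower the deregistration delay from the 300-second default so targets
	// leave the target group sooner once draining starts.
	_, err := svc.ModifyTargetGroupAttributes(&elbv2.ModifyTargetGroupAttributesInput{
		TargetGroupArn: aws.String("arn:aws:elasticloadbalancing:...:targetgroup/example/abc123"), // placeholder ARN
		Attributes: []*elbv2.TargetGroupAttribute{{
			Key:   aws.String("deregistration_delay.timeout_seconds"),
			Value: aws.String("30"), // placeholder value
		}},
	})
	if err != nil {
		log.Fatalf("failed to update deregistration delay: %v", err)
	}
}
```
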
@pawelprazak
Contributor

Yes, according to the ASG lifecycle hooks documentation, this should help.

https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroupLifecycle.html#as-lifecycle-hooks

> When Amazon EC2 Auto Scaling responds to a scale-in event, it terminates one or more instances. These instances are detached from the Auto Scaling group and enter the Terminating state. If you added an `autoscaling:EC2_INSTANCE_TERMINATING` lifecycle hook to your Auto Scaling group, the instances move from the Terminating state to the Terminating:Wait state. After you complete the lifecycle action, the instances enter the Terminating:Proceed state. When the instances are fully terminated, they enter the Terminated state.

I like this idea, thank you.
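
A minimal sketch of how such an `autoscaling:EC2_INSTANCE_TERMINATING` hook can be registered with the AWS SDK for Go; the group name, hook name, and timeout below are placeholders, not taken from this project:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// Register a termination lifecycle hook so instances pause in
	// Terminating:Wait until the hook is completed or the heartbeat expires.
	_, err := svc.PutLifecycleHook(&autoscaling.PutLifecycleHookInput{
		AutoScalingGroupName: aws.String("my-asg"),             // placeholder
		LifecycleHookName:    aws.String("drain-on-terminate"), // placeholder
		LifecycleTransition:  aws.String("autoscaling:EC2_INSTANCE_TERMINATING"),
		HeartbeatTimeout:     aws.Int64(300), // placeholder timeout
		DefaultResult:        aws.String("CONTINUE"),
	})
	if err != nil {
		log.Fatalf("failed to create lifecycle hook: %v", err)
	}
}
```
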

Contributor

@pawelprazak pawelprazak left a comment

LGTM

```diff
@@ -29,17 +30,25 @@ func (h *HookHandler) Loop(nodeName string) {
 		continue
 	}
 	glog.Infof("Status of instance '%v' is '%v', autoscaling group is '%v'", h.AutoScaling.Options.InstanceID, *status, *autoScalingGroupName)
-	if !h.AutoScaling.IsTerminating(status) {
+	if !h.AutoScaling.IsTerminating(status) && !h.AutoScaling.IsTerminatingWait(status) {
```
Contributor

I love the fact that you've used both to make sure it triggers if the docs and impl from AWS diverge.

```
err = h.Drainer.Drain(nodeName)
if err != nil {
	glog.Warningf("Not all pods on this host can be evicted, will try again: %s", err)
if !drained {
```
Contributor

So if I understand correctly, there was a misleading message in the logs that was not always true, and now we fix that, right?

Contributor Author

Well, after moving the draining to happen as soon as it sees the `Terminating` state, it will probably loop several times before it hits the `Terminating:Wait` state, and this just avoids trying to cordon and drain again if it has already done so successfully.
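
A self-contained sketch of that control flow, with hypothetical stand-ins for the real `AutoScaling` and `Drainer` helpers (not the PR's actual code):

```go
package main

import "log"

type autoScaling struct{}

func (autoScaling) IsTerminating(s string) bool     { return s == "Terminating" }
func (autoScaling) IsTerminatingWait(s string) bool { return s == "Terminating:Wait" }

type drainer struct{}

// Drain is a stub for cordoning the node and evicting its pods.
func (drainer) Drain(node string) error { return nil }

func main() {
	var (
		asg     autoScaling
		d       drainer
		drained bool
	)
	// Simulate successive polls of the instance's lifecycle state.
	for _, status := range []string{"InService", "Terminating", "Terminating:Wait"} {
		if !asg.IsTerminating(status) && !asg.IsTerminatingWait(status) {
			continue // not terminating yet, keep polling
		}
		if !drained {
			if err := d.Drain("node-1"); err != nil {
				log.Printf("Not all pods on this host can be evicted, will try again: %s", err)
				continue
			}
			drained = true // avoid cordoning/draining again on later polls
		}
		if asg.IsTerminatingWait(status) && drained {
			log.Println("node drained; the lifecycle hook can now be completed")
		}
	}
}
```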

@pawelprazak pawelprazak merged commit f6aa8f8 into VirtusLab:master Aug 13, 2020
@jhuntwork jhuntwork deleted the asg branch August 13, 2020 12:02