Begin draining a node when it enters Terminating state
#5
Conversation
... and continue with other operations once it reaches `Terminating:wait` state.

As per https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-conn-drain.html

> If your instances are part of an Auto Scaling group and connection draining
> is enabled for your load balancer, Auto Scaling waits for the in-flight
> requests to complete, or for the maximum timeout to expire, before
> terminating instances due to a scaling event or health check replacement

ASGs will begin draining nodes from load balancers while in the `Terminating` state, and will even remove them from the load balancer before the node transitions to `Terminating:wait`. This means that if you depend on pod evictions to move a critical service to another available node in the load balancer target group, this will only happen _after_ that node has already been drained and removed from the load balancer.

This effect is amplified whenever there is a timeout or deregistration delay set on the load balancer. The default value is 300 seconds, as per https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#deregistration-delay

By draining sooner, critical pods providing service through that load balancer can move to other nodes and maintain uptime while the node is being deregistered from the load balancer.
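For reference, a minimal sketch of how the instance's ASG lifecycle state can be observed with the AWS SDK for Go. Only `DescribeAutoScalingInstances` and the documented lifecycle state names (`Terminating`, `Terminating:Wait`) come from AWS; the instance ID is a placeholder:

```go
// Sketch: query the ASG lifecycle state of an instance, so draining can be
// started as soon as the state becomes "Terminating" rather than waiting for
// "Terminating:Wait". Assumes AWS SDK for Go v1; credentials/region come from
// the usual environment configuration.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	instanceID := "i-0123456789abcdef0" // placeholder instance ID

	out, err := svc.DescribeAutoScalingInstances(&autoscaling.DescribeAutoScalingInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		log.Fatalf("describe failed: %v", err)
	}
	for _, inst := range out.AutoScalingInstances {
		// LifecycleState is e.g. "InService", "Terminating", "Terminating:Wait".
		fmt.Printf("instance %s in ASG %s: %s\n",
			aws.StringValue(inst.InstanceId),
			aws.StringValue(inst.AutoScalingGroupName),
			aws.StringValue(inst.LifecycleState))
	}
}
```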
Yes, according to the ASG lifecycle hooks, this should help.
I like this idea, thank you.
LGTM
@@ -29,17 +30,25 @@ func (h *HookHandler) Loop(nodeName string) {
 			continue
 		}
 		glog.Infof("Status of instance '%v' is '%v', autoscaling group is '%v'", h.AutoScaling.Options.InstanceID, *status, *autoScalingGroupName)
-		if !h.AutoScaling.IsTerminating(status) {
+		if !h.AutoScaling.IsTerminating(status) && !h.AutoScaling.IsTerminatingWait(status) {
I love the fact that you've used both to make sure it triggers if the docs and impl from AWS diverge.
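For context, the two checks could plausibly be simple comparisons against the documented lifecycle state names. The actual implementations aren't part of this hunk, so the following is only an illustrative sketch with a stand-in type:

```go
package hook

// AutoScaling is a stand-in for the handler's AWS helper type; the real type
// (and how it fetches the lifecycle state) lives elsewhere in the repo.
type AutoScaling struct{}

// IsTerminating reports whether the instance's lifecycle state is "Terminating".
func (a *AutoScaling) IsTerminating(status *string) bool {
	return status != nil && *status == "Terminating"
}

// IsTerminatingWait reports whether the lifecycle state is "Terminating:Wait".
func (a *AutoScaling) IsTerminatingWait(status *string) bool {
	return status != nil && *status == "Terminating:Wait"
}
```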
 			err = h.Drainer.Drain(nodeName)
 			if err != nil {
 				glog.Warningf("Not all pods on this host can be evicted, will try again: %s", err)
+			if !drained {
So if I understand correctly, there was a misleading message in the logs that was not always true, and now we fix that, right?
Well, after moving the draining to happen as soon as it sees the "Terminating" state, it will probably loop several times before it hits the "Terminating:Wait" state, and this is just to avoid it trying to cordon and drain again if it's already done so successfully.
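A tiny self-contained illustration of what the `drained` guard buys us. The sequence of observed states below is simulated; the point is that the node is cordoned and drained only once even though "Terminating" is seen on several polls before "Terminating:Wait":

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Simulated sequence of lifecycle states the polling loop might observe:
	// several "Terminating" iterations before the hook reaches "Terminating:Wait".
	observed := []string{"InService", "Terminating", "Terminating", "Terminating:Wait"}

	drained := false
	for _, state := range observed {
		if state != "Terminating" && state != "Terminating:Wait" {
			continue // not terminating yet, keep polling
		}
		if !drained {
			fmt.Printf("state %q: cordoning and draining node\n", state)
			drained = true
		} else {
			fmt.Printf("state %q: already drained, skipping\n", state)
		}
		if state == "Terminating:Wait" && drained {
			fmt.Println("node drained and in Terminating:Wait, safe to complete the lifecycle action")
		}
		time.Sleep(10 * time.Millisecond) // stand-in for the real poll interval
	}
}
```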