
When a lot of nodes need to be added, autoscaler seems to misbehave #2980

Closed
nwohlgemuth opened this issue Mar 25, 2020 · 7 comments
Labels
area/provider/azure Issues or PRs related to azure provider

Comments

@nwohlgemuth

Autoscaler Version: 1.15.3

When we have a large increase in the number of pending pods in the cluster and the autoscaler needs to add a lot of nodes (relative to the existing size of the cluster), the autoscaler will start emitting the following errors:

E0325 06:29:29.475686 24 azure_scale_set.go:213] virtualMachineScaleSetsClient.CreateOrUpdate for scale set "k8s-agentpool1-38858572-vmss" failed: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
E0325 06:29:29.475724 24 azure_scale_set.go:184] Failed to update the capacity for vmss k8s-agentpool1-38858572-vmss with error Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded, invalidate the cache so as to get the real size from API
E0325 06:29:40.210447 24 azure_scale_set.go:213] virtualMachineScaleSetsClient.CreateOrUpdate for scale set "k8s-agentpool1-38858572-vmss" failed: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded

The Azure VMSS is successfully adding nodes, but it seems to be too slow for the autoscaler. Sometimes this results in the autoscaler getting "stuck". Deleting the autoscaler pod and letting it get recreated sometimes gets it "unstuck", meaning it goes back to adding nodes to the cluster. I am currently trying out raising --max-total-unready-percentage and --max-node-provision-time. Are these the right settings to change? Are there other settings I should change instead of, or in addition to, these?
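
For reference, both of these are regular cluster-autoscaler command-line flags passed to the autoscaler container. Raising them means passing something like the following; the values here are only illustrative placeholders, not recommendations:

--max-total-unready-percentage=60
--max-node-provision-time=30m

Whatever values actually make sense will depend on how long the VMSS takes to bring new nodes up.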

@marwanad
Member

marwanad commented Mar 25, 2020

The errors you're seeing are most likely failures coming from the VMSS side. If I had to guess, those are either allocation errors or VM extension provisioning errors. I'd advise against changing max-node-provision-time because I've seen some hardcoded references to the 15 minutes in the core autoscaler logic, so changing it might lead to unwanted effects or weird behaviour.

MaxNodeStartupTime = 15 * time.Minute

Do you see those nodes registered in K8s? If it takes too long for them to come up, you'd probably want to set max-node-provision-time to something higher to prevent the autoscaler from deleting them. I think the VMSS failures you're seeing in that case are because of a bug in go-autorest; the virtual machines do come up fine after a while.

/area provider/azure

@k8s-ci-robot added the area/provider/azure label Mar 25, 2020
@marwanad
Member

For the context deadline exceeded errors, I think there's a bug in the version of go-autorest we're using where it doesn't respect contexts without timeouts.

cc @feiskyer since you've looked at something similar in the past.

@feiskyer
Member

@nwohlgemuth what do you mean by "stuck"?

@marwanad it's a known issue, and we should actually add timeouts to our contexts so that requests can't get stuck there forever.
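
A minimal sketch of that idea, with a placeholder updateCapacity function standing in for the real azure_scale_set.go call (this is not the actual autoscaler code):

package main

import (
	"context"
	"fmt"
	"time"
)

// updateCapacity stands in for the VMSS CreateOrUpdate call in azure_scale_set.go;
// here it only simulates a long-running ARM request.
func updateCapacity(ctx context.Context, vmssName string) error {
	select {
	case <-time.After(2 * time.Second): // pretend the ARM call eventually returns
		return nil
	case <-ctx.Done():
		return ctx.Err() // give up once the deadline passes
	}
}

func main() {
	// Give the outgoing request an explicit deadline so it cannot hang the
	// scale-up loop forever if the Future never completes.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	name := "k8s-agentpool1-38858572-vmss"
	if err := updateCapacity(ctx, name); err != nil {
		fmt.Printf("failed to update capacity for vmss %s: %v\n", name, err)
	}
}

The exact timeout value above is a placeholder; the point is that the context passed down to the SDK carries a deadline instead of being unbounded.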

@marwanad
Member

@feiskyer we're hitting this Azure/go-autorest#357, no?

@feiskyer
Member

feiskyer commented May 18, 2020

@marwanad you're right. The go-autorest fixes are included in Kubernetes v1.16.x and autoscaler v1.16.x and later.

@nwohlgemuth are you able to upgrade the cluster together with autoscaler to v1.16 or above?

@marwanad
Member

Addressed the question above. The flag would control the "unregistered" node calculations in CA but you'll still see those context timeout logs since autorest defaults to 15 min.

/close

@k8s-ci-robot
Contributor

@marwanad: Closing this issue.

In response to this:

Addressed the question above. The flag would control the "unregistered" node calculations in CA but you'll still see those context timeout logs since autorest defaults to 15 min.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
