
When a lot of nodes need to be added, autoscaler seems to misbehave #2980

Closed
nwohlgemuth opened this issue Mar 25, 2020 · 7 comments
Labels
area/provider/azure Issues or PRs related to azure provider

Comments

@nwohlgemuth

Autoscaler Version: 1.15.3

When we have a large increase in the number of pending pods in the cluster and the autoscaler needs to add a lot of nodes (relative to the existing size of the cluster), the autoscaler will start emitting the following errors:

E0325 06:29:29.475686 24 azure_scale_set.go:213] virtualMachineScaleSetsClient.CreateOrUpdate for scale set "k8s-agentpool1-38858572-vmss" failed: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
E0325 06:29:29.475724 24 azure_scale_set.go:184] Failed to update the capacity for vmss k8s-agentpool1-38858572-vmss with error Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded, invalidate the cache so as to get the real size from API
E0325 06:29:40.210447 24 azure_scale_set.go:213] virtualMachineScaleSetsClient.CreateOrUpdate for scale set "k8s-agentpool1-38858572-vmss" failed: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded

The Azure VMSS is successfully adding nodes, but it seems to be too slow for the autoscaler. Sometimes this results in the autoscaler getting "stuck". Deleting the autoscaler pod and letting it get recreated sometimes gets it "unstuck", meaning it goes back to adding nodes to the cluster. I am currently trying out raising --max-total-unready-percentage and --max-node-provision-time. Are these the right settings to change? Are there other settings I should change instead of, or in addition to, these?
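
For reference, both of these are regular cluster-autoscaler command-line flags passed to the autoscaler container. Raising them means passing something like the following; the values here are only illustrative placeholders, not recommendations:

--max-total-unready-percentage=60
--max-node-provision-time=30m

Whatever values actually make sense will depend on how long the VMSS takes to bring new nodes up.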

@marwanad
Member

marwanad commented Mar 25, 2020

The errors you're seeing are most likely failures coming from the VMSS side. If I had to guess, those are either allocation errors or VM extension provisioning errors. I'd advise against changing max-node-provision-time because I've seen some hardcoded references to the 15 minutes in the core autoscaler logic, so changing it might lead to unwanted effects or weird behaviour.

MaxNodeStartupTime = 15 * time.Minute

Do you see those nodes registered in K8s? If it takes too long for them to come up, you'd probably want to set max-node-provision-time to something higher to prevent the autoscaler from deleting them. I think the VMSS failures you're seeing in that case are because of a bug in go-autorest; the virtual machines do come up fine after a while.

/area provider/azure

@k8s-ci-robot added the area/provider/azure label Mar 25, 2020
@marwanad
Member

For the context deadline exceeded errors, I think there's a bug in the version of go-autorest we're using where it doesn't respect contexts without timeouts.

cc @feiskyer since you've looked at something similar in the past.

@feiskyer
Member

@nwohlgemuth what do you mean by "stuck"?

@marwanad it's a known issue, and we should actually add timeouts to our contexts so that requests can't get stuck there forever.
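
A minimal sketch of that idea, with a placeholder updateCapacity function standing in for the real azure_scale_set.go call (this is not the actual autoscaler code):

package main

import (
	"context"
	"fmt"
	"time"
)

// updateCapacity stands in for the VMSS CreateOrUpdate call in azure_scale_set.go;
// here it only simulates a long-running ARM request.
func updateCapacity(ctx context.Context, vmssName string) error {
	select {
	case <-time.After(2 * time.Second): // pretend the ARM call eventually returns
		return nil
	case <-ctx.Done():
		return ctx.Err() // give up once the deadline passes
	}
}

func main() {
	// Give the outgoing request an explicit deadline so it cannot hang the
	// scale-up loop forever if the Future never completes.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	name := "k8s-agentpool1-38858572-vmss"
	if err := updateCapacity(ctx, name); err != nil {
		fmt.Printf("failed to update capacity for vmss %s: %v\n", name, err)
	}
}

The exact timeout value above is a placeholder; the point is that the context passed down to the SDK carries a deadline instead of being unbounded.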

@marwanad
Member

@feiskyer we're hitting this Azure/go-autorest#357, no?

@feiskyer
Member

feiskyer commented May 18, 2020

@marwanad you're right. The go-autorest fixes are included in Kubernetes v1.16.x and autoscaler v1.16.x and later.

@nwohlgemuth are you able to upgrade the cluster together with autoscaler to v1.16 or above?

@marwanad
Member

Addressed the question above. The flag would control the "unregistered" node calculations in CA but you'll still see those context timeout logs since autorest defaults to 15 min.

/close

@k8s-ci-robot
Contributor

@marwanad: Closing this issue.

In response to this:

Addressed the question above. The flag would control the "unregistered" node calculations in CA but you'll still see those context timeout logs since autorest defaults to 15 min.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
