When a lot of nodes need to be added, autoscaler seems to misbehave #2980
Comments
The errors you're seeing are most likely failures coming from the VMSS side. If I had to guess, those are either allocation errors or VM extension provisioning errors. I'd advise against changing
Do you see those nodes registered in K8s? If it takes too long for them to come up, then in order to prevent the autoscaler from deleting those, you'd probably want to set --max-node-provision-time.
/area provider/azure
For the context deadline exceeded errors, I think there's a bug with the version of go-autorest being used. cc @feiskyer since you've looked at something similar in the past.
@nwohlgemuth what do you mean by "stuck"?
@marwanad it's a known issue and we should actually add timeouts in our context in case the requests are stuck there forever.
@feiskyer we're hitting this: Azure/go-autorest#357, no?
@marwanad you're right. The fixes from go-autorest are included since Kubernetes v1.16.x and autoscaler v1.16.x. @nwohlgemuth are you able to upgrade the cluster together with autoscaler to v1.16 or above?
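In practice the autoscaler upgrade amounts to bumping the cluster-autoscaler image in its Deployment to a 1.16+ release alongside the cluster upgrade. A minimal sketch, assuming an in-cluster Deployment; the registry path and tag are placeholders, not values from this thread:

```yaml
# Sketch only: move to a cluster-autoscaler 1.16+ image so the vendored
# go-autorest fixes mentioned above are included. Registry path and tag are
# placeholders -- substitute the ones matching your setup.
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          image: <your-registry>/cluster-autoscaler:v1.16.x  # placeholder
```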
Addressed the question above. The flag would control the "unregistered" node calculations in CA but you'll still see those context timeout logs since autorest defaults to 15 min.
/close
@marwanad: Closing this issue. In response to this:
Autoscaler Version: 1.15.3
When we have a large increase in the number of pending pods in the cluster and the autoscaler needs to add a lot of nodes (relative to the existing size of the cluster), the autoscaler will start emitting the following error:
E0325 06:29:29.475686 24 azure_scale_set.go:213] virtualMachineScaleSetsClient.CreateOrUpdate for scale set "k8s-agentpool1-38858572-vmss" failed: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
E0325 06:29:29.475724 24 azure_scale_set.go:184] Failed to update the capacity for vmss k8s-agentpool1-38858572-vmss with error Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded, invalidate the cache so as to get the real size from API
E0325 06:29:40.210447 24 azure_scale_set.go:213] virtualMachineScaleSetsClient.CreateOrUpdate for scale set "k8s-agentpool1-38858572-vmss" failed: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
The Azure VMSS is successfully adding nodes, but it seems to be too slow for the autoscaler. Sometimes this results in the autoscaler getting "stuck". Deleting the autoscaler pod and letting it get recreated sometimes gets it "unstuck", meaning it will go back to adding nodes to the cluster. I am currently trying out raising --max-total-unready-percentage and --max-node-provision-time. Are these the right settings to change? Are there other settings I should be changing instead, or in addition?
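For illustration, a minimal sketch of where these flags would sit in the cluster-autoscaler Deployment, assuming the Azure provider and the VMSS name from the logs above; the node-group bounds and the flag values are placeholders, not recommendations from this thread:

```yaml
# Sketch only -- node-group bounds and flag values are illustrative.
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          command:
            - ./cluster-autoscaler
            - --cloud-provider=azure
            # min:max:VMSS name (bounds are placeholders)
            - --nodes=1:100:k8s-agentpool1-38858572-vmss
            # How long CA waits for a requested node to register before it
            # treats the instance as unregistered (default 15m).
            - --max-node-provision-time=30m
            # Percentage of not-ready nodes CA tolerates before it halts
            # scaling operations (default 45).
            - --max-total-unready-percentage=60
```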