While using Kubeflow 1.0, 1.2, and 1.3 I have noticed that nodes sometimes do not scale down.
AFAIU this happens because of node auto-provisioning: nodes are scaled up, and in some cases kube-system pods start running on them, which then prevents those nodes from scaling down.
One suggestion from kubernetes/autoscaler#2377 (comment) is to put a taint on the node pool that you want to be able to scale to zero. That way system pods will not be able to run on those nodes, so they won't block scale-down. The downside is that you'll need to add a toleration to every pod that you want to run on this node pool (this can be automated with a mutating admission webhook). This is a very useful pattern if you have a node pool with particularly expensive nodes.
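As a rough sketch, the pattern looks like this (the taint key/value and the workload below are placeholders, not names from the blueprint):

```yaml
# Node pool taint, e.g. created with:
#   gcloud container node-pools create expensive-pool \
#     --node-taints=dedicated=expensive-pool:NoSchedule ...
# System pods carry no matching toleration, so they can never land on
# these nodes and never block scale-down.

# A workload pod that is allowed onto the tainted pool:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload            # hypothetical workload
spec:
  tolerations:
  - key: dedicated              # must match the node pool's taint key
    operator: Equal
    value: expensive-pool       # must match the taint value
    effect: NoSchedule
  containers:
  - name: main
    image: my-training-image:latest   # placeholder image
```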
Alternatively, you can create PodDisruptionBudgets (PDBs) for all non-DaemonSet system pods. Note: restarting some system pods can cause various kinds of disruption to your cluster, which is why the Cluster Autoscaler does not evict them by default (e.g. restarting metrics-server will break all HPAs in your cluster for a few minutes). It's up to you to decide which disruptions you're OK with.
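A minimal sketch of such a PDB, assuming the usual `k8s-app` label on the GKE metrics-server pods (verify the real labels with `kubectl get pods -n kube-system --show-labels`):

```yaml
# Lets the Cluster Autoscaler evict this system pod when draining a node.
# On clusters older than 1.21 use apiVersion: policy/v1beta1 instead.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: metrics-server-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: metrics-server   # assumed label; check your cluster
```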
Not sure if relevant, but maybe these lines require an update?
https://github.com/kubeflow/gcp-blueprints/blob/1d41c6ca7fc904d91dfcfb44e61e42435801e72c/kubeflow/common/cluster/upstream/cluster.yaml#L32-L37
Currently I'm considering disabling node auto-provisioning, although it would be nice to have this working as expected.
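If anyone else goes this route, a sketch of what that might look like in the Config Connector ContainerCluster resource the blueprint manages (field names from the `container.cnrm.cloud.google.com/v1beta1` API; the metadata below is placeholder, and the actual blueprint may structure this differently):

```yaml
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: kubeflow-cluster        # placeholder name
spec:
  location: us-central1         # placeholder location
  # enabled: false turns off node auto-provisioning, so the autoscaler
  # only resizes the explicitly defined node pools instead of creating
  # new auto-provisioned ones.
  clusterAutoscaling:
    enabled: false
```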
Any ideas how to fix this?