Make antrea-controller not tolerate Node unreachable #5521

tnqn · 2023-09-22T03:07:17Z

When a Node becomes unreachable, it currently takes 5m45s or more for Kubernetes to move the antrea-controller Pod to another Node. The time is spent as follows:

* 40s (default value of NodeMonitorGracePeriod) to mark the Node's Ready condition as Unknown
* 5s to taint the Node with `node.kubernetes.io/unreachable:NoExecute`
* 5m (default value of defaultUnreachableTolerationSeconds) during which the Pod tolerates the taint

The first duration is largely unavoidable. The second appears to be a bug in kube-controller-manager; I have opened kubernetes/kubernetes#120815 for it, and it may be fixed in a future release. The third arises because Kubernetes automatically adds a default toleration for `node.kubernetes.io/unreachable:NoExecute` with a tolerationSeconds of 300s if the Pod doesn't define one.
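For reference, the default toleration described above typically appears in a Pod spec roughly as follows. This is a sketch that assumes the cluster uses the default admission settings; the values may differ if the defaults have been overridden.

```yaml
# Tolerations Kubernetes adds automatically when the Pod spec defines none
# for these taints (assuming default admission settings).
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```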

This commit explicitly adds a toleration for `node.kubernetes.io/unreachable:NoExecute` with a tolerationSeconds of 0s, which reduces the failover time by 5m.
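A minimal sketch of what such a toleration could look like in the antrea-controller Pod spec. The exact placement and any neighbouring fields in the Antrea manifest or Helm chart are assumptions here, not copied from the diff.

```yaml
# Illustrative fragment only; surrounding Deployment fields are omitted.
tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    # 0 means the Pod is evicted as soon as the taint is applied,
    # instead of waiting for the 300s default.
    tolerationSeconds: 0
```

Because the Pod now carries its own toleration for this taint, the 300s default is not added for it.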

antoninbas

LGTM

antoninbas · 2023-09-22T16:45:52Z

/test-all

When a Node becomes unreachable, currently it takes 5m45s+ for Kubernetes to move antrea-controller Pod to another Node. The time spent in the process includes:

* 40s (default value of NodeMonitorGracePeriod) to mark a Node's Ready condition to Unknown
* 5s to taint the Node with `node.kubernetes.io/unreachable:NoExecute`
* 5m (default value of defaultUnreachableTolerationSeconds) to tolerate the taint

The 1st duration is kind of inevitable. The 2nd duration seems a bug in kube-controller-manager, which I have opened an issue kubernetes/kubernetes#120815 and may be fixed in a future release. The 3rd duration is because Kubernetes automatically adds a default toleration for `node.kubernetes.io/unreachable:NoExecute` with tolerationSeconds of 300s if the Pod doesn't have one.

This commit adds a toleration with tolerationSeconds of 0s for `node.kubernetes.io/unreachable:NoExecute` explicitly, which reduces the failover time by 5m.

Signed-off-by: Quan Tian <qtian@vmware.com>

tnqn · 2023-09-25T02:53:57Z

Fixed helm README

antoninbas · 2023-09-25T17:03:40Z

/test-all

tnqn added the action/release-note label (indicates a PR that should be included in release notes) Sep 22, 2023

tnqn requested review from antoninbas and jianjuns September 22, 2023 03:13

tnqn force-pushed the controller-do-not-tolerate-unreachable branch from 0e4e4c4 to fa5aeb4 Compare September 22, 2023 03:17

antoninbas previously approved these changes Sep 22, 2023


jianjuns previously approved these changes Sep 23, 2023


tnqn dismissed stale reviews from jianjuns and antoninbas via 557a1f0 September 25, 2023 02:53

tnqn force-pushed the controller-do-not-tolerate-unreachable branch from fa5aeb4 to 557a1f0 Compare September 25, 2023 02:53

antoninbas approved these changes Sep 25, 2023


tnqn merged commit 4806be3 into antrea-io:main Sep 26, 2023

tnqn deleted the controller-do-not-tolerate-unreachable branch September 26, 2023 14:32
