Make antrea-controller not tolerate Node unreachable #5521

Merged
merged 1 commit into from
Sep 26, 2023


tnqn
Member

@tnqn tnqn commented Sep 22, 2023

When a Node becomes unreachable, it currently takes at least 5m45s for Kubernetes to move the antrea-controller Pod to another Node. That time breaks down as follows:

  • 40s (default value of NodeMonitorGracePeriod) to mark the Node's Ready condition as Unknown
  • 5s to taint the Node with node.kubernetes.io/unreachable:NoExecute
  • 5m (default value of defaultUnreachableTolerationSeconds) during which the Pod tolerates the taint before it is evicted

The first duration is largely unavoidable. The second appears to be a bug in kube-controller-manager; I have opened kubernetes/kubernetes#120815 for it, and it may be fixed in a future release. The third arises because Kubernetes automatically adds a default toleration for node.kubernetes.io/unreachable:NoExecute with tolerationSeconds of 300s if the Pod doesn't already have one.

This commit explicitly adds a toleration for node.kubernetes.io/unreachable:NoExecute with tolerationSeconds of 0s, which reduces the failover time by 5m.
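
To illustrate why an explicit toleration changes the outcome, here is a minimal Go sketch of the defaulting behavior described above. It is a simplified model, not the Kubernetes implementation: the `Toleration` struct is a local stand-in that mirrors a few fields of the core/v1 type, `withDefaultUnreachable` is a hypothetical helper, and the `Exists` operator is an assumption, since the description only gives the key, effect, and tolerationSeconds.

```go
package main

import "fmt"

// Toleration is a local stand-in mirroring a subset of the fields of the
// Kubernetes core/v1 Toleration type, so the sketch needs no external modules.
type Toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64
}

const unreachableKey = "node.kubernetes.io/unreachable"

// withDefaultUnreachable is a hypothetical, simplified model of the
// defaulting: if no toleration for unreachable:NoExecute is present, one
// with tolerationSeconds of 300 is appended; otherwise the list is unchanged.
func withDefaultUnreachable(ts []Toleration) []Toleration {
	for _, t := range ts {
		if t.Key == unreachableKey && t.Effect == "NoExecute" {
			return ts
		}
	}
	secs := int64(300)
	return append(ts, Toleration{
		Key:               unreachableKey,
		Operator:          "Exists", // assumed operator
		Effect:            "NoExecute",
		TolerationSeconds: &secs,
	})
}

func main() {
	zero := int64(0)
	// The explicit toleration this change describes (operator is assumed).
	explicit := []Toleration{{
		Key:               unreachableKey,
		Operator:          "Exists",
		Effect:            "NoExecute",
		TolerationSeconds: &zero,
	}}

	fmt.Println(*withDefaultUnreachable(nil)[0].TolerationSeconds)      // 300
	fmt.Println(*withDefaultUnreachable(explicit)[0].TolerationSeconds) // 0
}
```

Because the Pod already carries a matching toleration, the 300s default is never added and the Pod becomes eligible for eviction as soon as the taint is applied.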

@tnqn tnqn added the action/release-note Indicates a PR that should be included in release notes. label Sep 22, 2023
@tnqn tnqn requested review from antoninbas and jianjuns September 22, 2023 03:13
@tnqn tnqn force-pushed the controller-do-not-tolerate-unreachable branch from 0e4e4c4 to fa5aeb4 Compare September 22, 2023 03:17
antoninbas
antoninbas previously approved these changes Sep 22, 2023
Contributor

@antoninbas antoninbas left a comment

LGTM

@antoninbas
Contributor

/test-all

jianjuns
jianjuns previously approved these changes Sep 23, 2023
Signed-off-by: Quan Tian <qtian@vmware.com>
@tnqn tnqn dismissed stale reviews from jianjuns and antoninbas via 557a1f0 September 25, 2023 02:53
@tnqn tnqn force-pushed the controller-do-not-tolerate-unreachable branch from fa5aeb4 to 557a1f0 Compare September 25, 2023 02:53
@tnqn
Member Author

tnqn commented Sep 25, 2023

Fixed the Helm README.

@antoninbas
Contributor

/test-all

@tnqn tnqn merged commit 4806be3 into antrea-io:main Sep 26, 2023
@tnqn tnqn deleted the controller-do-not-tolerate-unreachable branch September 26, 2023 14:32
Labels
action/release-note Indicates a PR that should be included in release notes.
3 participants