Make antrea-controller not tolerate Node unreachable
When a Node becomes unreachable, it currently takes at least 5m45s for
Kubernetes to move the antrea-controller Pod to another Node. The delay
breaks down as follows:

* 40s (default value of NodeMonitorGracePeriod) to set the Node's Ready
  condition to Unknown
* 5s to taint the Node with `node.kubernetes.io/unreachable:NoExecute`
* 5m (default value of defaultUnreachableTolerationSeconds) during which
  the Pod tolerates the taint

The first delay is largely unavoidable. The second appears to be a bug in
kube-controller-manager; I have opened kubernetes/kubernetes#120815 for
it, and it may be fixed in a future release. The third occurs because
Kubernetes automatically adds a default toleration for
`node.kubernetes.io/unreachable:NoExecute` with tolerationSeconds of 300s
if the Pod doesn't have one.
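
The defaulting behaviour described above can be modelled roughly as
follows. This is a simplified Python sketch of what the
DefaultTolerationSeconds admission plugin does, not its actual source;
the function names are hypothetical and tolerations are represented as
plain dicts.

```python
UNREACHABLE_KEY = "node.kubernetes.io/unreachable"
DEFAULT_TOLERATION_SECONDS = 300  # default cited in this commit message


def tolerates_unreachable(toleration):
    """Return True if the toleration already covers the unreachable NoExecute taint.

    Assumption: an empty key or an empty effect is treated as a wildcard.
    """
    key_matches = toleration.get("key") in (None, "", UNREACHABLE_KEY)
    effect_matches = toleration.get("effect") in (None, "", "NoExecute")
    return key_matches and effect_matches


def with_default_unreachable_toleration(tolerations):
    """Append the 300s default only when the Pod has no matching toleration."""
    if any(tolerates_unreachable(t) for t in tolerations):
        return list(tolerations)
    default = {
        "key": UNREACHABLE_KEY,
        "operator": "Exists",
        "effect": "NoExecute",
        "tolerationSeconds": DEFAULT_TOLERATION_SECONDS,
    }
    return list(tolerations) + [default]


# Without an explicit toleration, the 300s default is appended.
old = [{"key": "CriticalAddonsOnly", "operator": "Exists"}]
defaulted = with_default_unreachable_toleration(old)

# With an explicit toleration, the list is left unchanged.
explicit = old + [{
    "key": UNREACHABLE_KEY,
    "operator": "Exists",
    "effect": "NoExecute",
    "tolerationSeconds": 0,
}]
unchanged = with_default_unreachable_toleration(explicit)
```

In this model, a Pod that declares its own toleration for the taint is
left alone, which is why the toleration has to be set explicitly rather
than simply omitted.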

This commit adds a toleration with tolerationSeconds of 0s for
`node.kubernetes.io/unreachable:NoExecute` explicitly, which reduces the
failover time by 5m.
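
As a rough check of the numbers above, the following sketch adds up the
three delays with the default toleration and with the explicit one. It
assumes the 5s tainting delay stays as observed and ignores the time
needed to schedule and start the Pod on the new Node; all names are
illustrative.

```python
NODE_MONITOR_GRACE_PERIOD_S = 40   # default NodeMonitorGracePeriod
TAINT_DELAY_S = 5                  # observed delay before the taint is applied
DEFAULT_TOLERATION_S = 300         # default toleration added by Kubernetes
EXPLICIT_TOLERATION_S = 0          # toleration added by this commit


def failover_seconds(toleration_s):
    """Lower bound on the time before the Pod is evicted from the unreachable Node."""
    return NODE_MONITOR_GRACE_PERIOD_S + TAINT_DELAY_S + toleration_s


def fmt(seconds):
    """Format a duration in seconds as XmYs."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs}s"


before = failover_seconds(DEFAULT_TOLERATION_S)
after = failover_seconds(EXPLICIT_TOLERATION_S)

print(fmt(before))          # 5m45s
print(fmt(after))           # 0m45s
print(fmt(before - after))  # 5m0s
```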

Signed-off-by: Quan Tian <qtian@vmware.com>
tnqn committed Sep 25, 2023
1 parent 813372b commit 557a1f0
Showing 7 changed files with 28 additions and 1 deletion.
2 changes: 1 addition & 1 deletion build/charts/antrea/README.md
@@ -69,7 +69,7 @@ Kubernetes: `>= 1.16.0-0`
 | controller.podLabels | object | `{}` | Labels to be added to antrea-controller Pod. |
 | controller.priorityClassName | string | `"system-cluster-critical"` | Prority class to use for the antrea-controller Pod. |
 | controller.selfSignedCert | bool | `true` | Indicates whether to use auto-generated self-signed TLS certificates. If false, a Secret named "antrea-controller-tls" must be provided with the following keys: ca.crt, tls.crt, tls.key. |
-| controller.tolerations | list | `[{"key":"CriticalAddonsOnly","operator":"Exists"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/master"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/control-plane"}]` | Tolerations for the antrea-controller Pod. |
+| controller.tolerations | list | `[{"key":"CriticalAddonsOnly","operator":"Exists"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/master"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/control-plane"},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":0}]` | Tolerations for the antrea-controller Pod. |
 | defaultMTU | int | `0` | Default MTU to use for the host gateway interface and the network interface of each Pod. By default, antrea-agent will discover the MTU of the Node's primary interface and adjust it to accommodate for tunnel encapsulation overhead if applicable. |
 | disableTXChecksumOffload | bool | `false` | Disable TX checksum offloading for container network interfaces. It's supposed to be set to true when the datapath doesn't support TX checksum offloading, which causes packets to be dropped due to bad checksum. It affects Pods running on Linux Nodes only. |
 | dnsServerOverride | string | `""` | Address of DNS server, to override the kube-dns Service. It's used to resolve hostnames in a FQDN policy. |
7 changes: 7 additions & 0 deletions build/charts/antrea/values.yaml
@@ -296,6 +296,13 @@ controller:
     # Control-plane taint for Kubernetes >= 1.24.
     - key: node-role.kubernetes.io/control-plane
       effect: NoSchedule
+    # Evict it immediately once Node is detected unreachable.
+    # Must be set explicitly, otherwise DefaultTolerationSeconds plugin will
+    # add a default toleration with tolerationSeconds of 300s.
+    - key: node.kubernetes.io/unreachable
+      effect: NoExecute
+      operator: Exists
+      tolerationSeconds: 0
   # -- Node selector for the antrea-controller Pod.
   nodeSelector:
     kubernetes.io/os: linux
4 changes: 4 additions & 0 deletions build/yamls/antrea-aks.yml
@@ -7078,6 +7078,10 @@ spec:
           key: node-role.kubernetes.io/master
         - effect: NoSchedule
           key: node-role.kubernetes.io/control-plane
+        - effect: NoExecute
+          key: node.kubernetes.io/unreachable
+          operator: Exists
+          tolerationSeconds: 0
       serviceAccountName: antrea-controller
       containers:
         - name: antrea-controller
4 changes: 4 additions & 0 deletions build/yamls/antrea-eks.yml
@@ -7079,6 +7079,10 @@ spec:
           key: node-role.kubernetes.io/master
         - effect: NoSchedule
           key: node-role.kubernetes.io/control-plane
+        - effect: NoExecute
+          key: node.kubernetes.io/unreachable
+          operator: Exists
+          tolerationSeconds: 0
       serviceAccountName: antrea-controller
       containers:
         - name: antrea-controller
4 changes: 4 additions & 0 deletions build/yamls/antrea-gke.yml
@@ -7076,6 +7076,10 @@ spec:
           key: node-role.kubernetes.io/master
         - effect: NoSchedule
           key: node-role.kubernetes.io/control-plane
+        - effect: NoExecute
+          key: node.kubernetes.io/unreachable
+          operator: Exists
+          tolerationSeconds: 0
       serviceAccountName: antrea-controller
       containers:
         - name: antrea-controller
4 changes: 4 additions & 0 deletions build/yamls/antrea-ipsec.yml
@@ -7135,6 +7135,10 @@ spec:
           key: node-role.kubernetes.io/master
         - effect: NoSchedule
           key: node-role.kubernetes.io/control-plane
+        - effect: NoExecute
+          key: node.kubernetes.io/unreachable
+          operator: Exists
+          tolerationSeconds: 0
       serviceAccountName: antrea-controller
       containers:
         - name: antrea-controller
4 changes: 4 additions & 0 deletions build/yamls/antrea.yml
@@ -7076,6 +7076,10 @@ spec:
           key: node-role.kubernetes.io/master
         - effect: NoSchedule
           key: node-role.kubernetes.io/control-plane
+        - effect: NoExecute
+          key: node.kubernetes.io/unreachable
+          operator: Exists
+          tolerationSeconds: 0
       serviceAccountName: antrea-controller
       containers:
         - name: antrea-controller