Use different CNI conf file when configuring chaining with Antrea #4042

antoninbas · 2022-07-20T23:09:32Z

The current solution which consists of overwriting the existing CNI conf
file (e.g., 10-aws.conflist) suffers from one issue for which I cannot
find a simple workaround:
When a Node restarts, there can be a short window of time during which
the CNI conf file reverts to the old one (without Antrea). If some Pods
are restarted / scheduled on the Node during that time, they will not be
processed by Antrea and NetworkPolicies may not be applied to them.

The solution I have come up with is to create a new CNI conf file with
higher priority (05-antrea.conflist). Because that file will stay the
same during Node restart, the problematic window of time does not exist
anymore. We still watch for changes to the initial CNI conf file (e.g.,
10-aws.conflist), so we can update 05-antrea.conflist as needed.
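
To illustrate why the file name matters: container runtimes typically pick the first valid configuration file in the CNI conf directory in lexicographic order, so a 05- prefix wins over a 10- prefix. The following is a minimal sketch of that selection rule only. `first_cni_conf` is a hypothetical helper written for this illustration, and the directory is a temporary one rather than the real CNI conf directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the conf file that sorts first in a directory.
# Only the extensions usually treated as CNI configuration are considered.
first_cni_conf() {
    local dir="$1"
    # LC_ALL=C forces byte-wise ordering, so "05-..." sorts before "10-...".
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.conf' -o -name '*.conflist' -o -name '*.json' \) \
        | LC_ALL=C sort | head -n 1
}

demo_dir="$(mktemp -d)"

# Only the provider's file exists: it is the one that gets selected.
touch "$demo_dir/10-aws.conflist"
before="$(basename "$(first_cni_conf "$demo_dir")")"

# Once the higher-priority file exists, it is selected instead, no matter
# how often 10-aws.conflist is rewritten afterwards.
touch "$demo_dir/05-antrea.conflist"
after="$(basename "$(first_cni_conf "$demo_dir")")"

echo "before: $before, after: $after"
rm -rf "$demo_dir"
```

Because 10-aws.conflist is never the selected file once 05-antrea.conflist exists, a rewrite of 10-aws.conflist during a Node restart no longer changes which configuration new Pods get.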

We also update antrea-aks-node-init.yml and antrea-gke-node-init.yml to
use the same container image as antrea-eks-node-init.yml. Using v2
ensures that the script is run again if it is modified at runtime.

Signed-off-by: Antonin Bas <abas@vmware.com>

antoninbas · 2022-07-20T23:10:03Z

@tnqn let me know if you have an issue with this approach (of creating a new conflist file)


codecov · 2022-07-20T23:19:07Z

Codecov Report

Merging #4042 (c9f33a5) into main (2a092ab) will increase coverage by 0.79%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #4042      +/-   ##
==========================================
+ Coverage   63.93%   64.72%   +0.79%     
==========================================
  Files         292      293       +1     
  Lines       43671    43669       -2     
==========================================
+ Hits        27922    28266     +344     
+ Misses      13492    13121     -371     
- Partials     2257     2282      +25

Flag	Coverage Δ
e2e-tests	`41.04% <ø> (?)`
kind-e2e-tests	`50.91% <ø> (+0.91%)`	⬆️
unit-tests	`44.17% <ø> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pkg/agent/flowexporter/exporter/certificate.go	`27.77% <0.00%> (-22.23%)`	⬇️
pkg/ipfix/ipfix_process.go	`81.25% <0.00%> (-18.75%)`	⬇️
pkg/agent/util/sysctl/sysctl_linux.go	`25.92% <0.00%> (-14.82%)`	⬇️
.../flowexporter/connections/conntrack_connections.go	`66.66% <0.00%> (-14.77%)`	⬇️
...g/agent/controller/serviceexternalip/controller.go	`69.62% <0.00%> (-11.95%)`	⬇️
pkg/agent/controller/networkpolicy/packetin.go	`66.90% <0.00%> (-7.05%)`	⬇️
pkg/ovs/ovsctl/ofctl.go	`35.61% <0.00%> (-5.48%)`	⬇️
pkg/util/ip/ip.go	`80.48% <0.00%> (-4.88%)`	⬇️
pkg/agent/flowexporter/utils.go	`76.59% <0.00%> (-4.26%)`	⬇️
pkg/ovs/openflow/ofctrl_packetin.go	`69.62% <0.00%> (-3.80%)`	⬇️
... and 26 more

jianjuns

LGTM

jianjuns · 2022-07-21T01:02:05Z

@reachjainrahul : would you take a look?

tnqn

The solution makes sense to me.

antoninbas · 2022-07-21T19:12:39Z

/test-all

antoninbas · 2022-07-22T20:27:09Z

AKS and EKS CI jobs are passing for this PR. I will ping @reachjainrahul to see if he has time to review.

reachjainrahul · 2022-07-25T07:19:06Z

build/yamls/antrea-eks-node-init.yml


              while true; do
-                  curl localhost:61679 && retry=false || retry=true
+                  curl -sS -o /dev/null localhost:61679 && retry=false || retry=true
                  if [ $retry == false ]; then break ; fi
                  sleep 2s
                  echo "Waiting for aws-k8s-agent"
              done

              # Fetch running containers from aws-k8s-agent and kill them


@antoninbas We do restart the Pods that were scheduled with only the AWS CNI when Antrea is installed, so eventually all Pods will be managed by Antrea. If you are worried about the very small window in which a Pod is scheduled with the AWS CNI and then killed when the Antrea CNI is installed, then I am OK with the change.

antoninbas · 2022-07-25

This is fine when initially deploying Antrea on an EKS cluster, but there are several "edge" cases that were not accounted for initially:

- aws-node agent restart: this causes the CNI conf file to be overwritten.
- K8s Node restart, which causes all Pods to restart: aws-node will overwrite the CNI conf file, and antrea-eks-node-init will not run again.
- A new Node is added to the cluster: there is no guaranteed order of execution for aws-node, antrea-agent, and antrea-eks-node-init.

In my last series of patches (this one being the latest), I have tried to implement a solution that covers all of these edge cases.
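
As a rough illustration of how a separate conf file can stay correct across these cases, here is a minimal sketch of an idempotent "sync" step that could be re-run whenever the provider's file changes. It is not the actual Antrea script. `render_chained_conf` and `sync_chained_conf` are hypothetical names, and the rendering step is a deliberately naive stand-in that only handles a compact one-line conflist ending in `]}`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the real rendering step (hypothetical): append an "antrea"
# entry to the plugin list. Assumes a one-line conflist ending in "]}";
# a real script would use a proper JSON tool instead of sed.
render_chained_conf() {
    sed 's/]}$/, {"type": "antrea"}]}/' "$1"
}

# Regenerate dst from src only when the rendered content differs, and
# replace dst with a rename so readers never see a half-written file.
sync_chained_conf() {
    local src="$1" dst="$2" tmp
    tmp="$(mktemp "${dst}.XXXXXX")"
    render_chained_conf "$src" > "$tmp"
    if [ -f "$dst" ] && cmp -s "$tmp" "$dst"; then
        rm -f "$tmp"
        echo "unchanged"
    else
        mv "$tmp" "$dst"
        echo "updated"
    fi
}

dir="$(mktemp -d)"
src="$dir/10-aws.conflist"
dst="$dir/05-antrea.conflist"

echo '{"name": "aws-cni", "plugins": [{"type": "aws-cni"}]}' > "$src"
first="$(sync_chained_conf "$src" "$dst")"

# Running again with an identical source leaves the derived file alone.
second="$(sync_chained_conf "$src" "$dst")"

# Simulate the provider agent rewriting its own file with new content.
echo '{"name": "aws-cni", "plugins": [{"type": "aws-cni"}, {"type": "portmap"}]}' > "$src"
third="$(sync_chained_conf "$src" "$dst")"

result="$(cat "$dst")"
echo "$first $second $third"
rm -rf "$dir"
```

The point of the sketch is that the provider's file is only ever read, never overwritten, so a restart of the provider agent or of the Node cannot remove Antrea from the configuration that the runtime selects.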

antoninbas · 2022-07-26T18:24:47Z

@jsalatiel this should be a big improvement in terms of robustness for the Antrea EKS support. However, because it's a pretty big change, I am not planning on back-porting it to v1.7 at the moment (it will be in v1.8). Let me know if this is a big deal for you.

jsalatiel · 2022-07-26T18:47:45Z

Np, i will try to update to 1.8 after a while.


antoninbas marked this pull request as ready for review July 20, 2022 23:09

antoninbas requested review from jianjuns and tnqn July 20, 2022 23:09

antoninbas added the area/provider/aws, area/provider/azure, and action/release-note labels Jul 20, 2022

antoninbas force-pushed the use-different-cni-conf-file-in-networkPolicyOnly-mode branch from 4acb971 to 0969602 Compare July 20, 2022 23:16

antoninbas force-pushed the use-different-cni-conf-file-in-networkPolicyOnly-mode branch from 0969602 to c9f33a5 Compare July 20, 2022 23:18

jianjuns reviewed Jul 21, 2022


jianjuns approved these changes Jul 21, 2022


tnqn approved these changes Jul 21, 2022


reachjainrahul reviewed Jul 25, 2022


antoninbas merged commit e5a98dc into antrea-io:main Jul 26, 2022

antoninbas deleted the use-different-cni-conf-file-in-networkPolicyOnly-mode branch July 26, 2022 18:23
