Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use different CNI conf file when configuring chaining with Antrea #4042

Conversation

antoninbas
Copy link
Contributor

@antoninbas antoninbas commented Jul 20, 2022

The current solution which consists of overwriting the existing CNI conf
file (e.g., 10-aws.conflist) suffers from one issue for which I cannot
find a simple workaround:
When a Node restarts, there can be a short window of time during which
the CNI conf file reverts to the old one (without Antrea). If some Pods
are restarted / scheduled on the Node during that time, they will not be
processed by Antrea and NetworkPolicies may not be applied to them.

The solution I have come up with is to create a new CNI conf file with
higher priority (05-antrea.conflist). Because that file will stay the
same during Node restart, the problematic window of time does not exist
anymore. We still watch for changes to the intial CNI conf file (e.g.,
10-aws.conflist), so we can update 05-antrea.conflist as needed.

We also update antrea-aks-node-init.yml and antrea-gke-node-init.yml to
use the same container image as antrea-eks-node-init.yml. Using v2
ensures that the script is run again if it is modified at runtime.

Signed-off-by: Antonin Bas abas@vmware.com

@antoninbas antoninbas marked this pull request as ready for review July 20, 2022 23:09
@antoninbas antoninbas requested review from jianjuns and tnqn July 20, 2022 23:09
@antoninbas
Copy link
Contributor Author

@tnqn let me know if you have an issue with this approach (of creating a new conflist file)

@antoninbas antoninbas added area/provider/aws Issues or PRs related to aws provider. area/provider/azure Issues or PRs related to azure provider. action/release-note Indicates a PR that should be included in release notes. labels Jul 20, 2022
@antoninbas antoninbas force-pushed the use-different-cni-conf-file-in-networkPolicyOnly-mode branch from 4acb971 to 0969602 Compare July 20, 2022 23:16
The current solution which consists of overwriting the existing CNI conf
file (e.g., 10-aws.conflist) suffers from one issue for which I cannot
find a simple workaround:
When a Node restarts, there can be a short window of time during which
the CNI conf file reverts to the old one (without Antrea). If some Pods
are restarted / scheduled on the Node during that time, they will not be
processed by Antrea and NetworkPolicies may not be applied to them.

The solution I have come up with is to create a new CNI conf file with
higher priority (05-antrea.conflist). Because that file will stay the
same during Node restart, the problematic window of time does not exist
anymore. We still watch for changes to the intial CNI conf file (e.g.,
10-aws.conflist), so we can update 05-antrea.conflist as needed.

We also update antrea-aks-node-init.yml and antrea-gke-node-init.yml to
use the same container image as antrea-eks-node-init.yml. Using v2
ensures that the script is run again if it is modified at runtime.

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas antoninbas force-pushed the use-different-cni-conf-file-in-networkPolicyOnly-mode branch from 0969602 to c9f33a5 Compare July 20, 2022 23:18
@codecov
Copy link

codecov bot commented Jul 20, 2022

Codecov Report

Merging #4042 (c9f33a5) into main (2a092ab) will increase coverage by 0.79%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4042      +/-   ##
==========================================
+ Coverage   63.93%   64.72%   +0.79%     
==========================================
  Files         292      293       +1     
  Lines       43671    43669       -2     
==========================================
+ Hits        27922    28266     +344     
+ Misses      13492    13121     -371     
- Partials     2257     2282      +25     
Flag Coverage Δ
e2e-tests 41.04% <ø> (?)
kind-e2e-tests 50.91% <ø> (+0.91%) ⬆️
unit-tests 44.17% <ø> (-0.02%) ⬇️
Impacted Files Coverage Δ
pkg/agent/flowexporter/exporter/certificate.go 27.77% <0.00%> (-22.23%) ⬇️
pkg/ipfix/ipfix_process.go 81.25% <0.00%> (-18.75%) ⬇️
pkg/agent/util/sysctl/sysctl_linux.go 25.92% <0.00%> (-14.82%) ⬇️
.../flowexporter/connections/conntrack_connections.go 66.66% <0.00%> (-14.77%) ⬇️
...g/agent/controller/serviceexternalip/controller.go 69.62% <0.00%> (-11.95%) ⬇️
pkg/agent/controller/networkpolicy/packetin.go 66.90% <0.00%> (-7.05%) ⬇️
pkg/ovs/ovsctl/ofctl.go 35.61% <0.00%> (-5.48%) ⬇️
pkg/util/ip/ip.go 80.48% <0.00%> (-4.88%) ⬇️
pkg/agent/flowexporter/utils.go 76.59% <0.00%> (-4.26%) ⬇️
pkg/ovs/openflow/ofctrl_packetin.go 69.62% <0.00%> (-3.80%) ⬇️
... and 26 more

Copy link
Contributor

@jianjuns jianjuns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jianjuns
Copy link
Contributor

@reachjainrahul : would you take a look?

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution makes sense to me.

@antoninbas
Copy link
Contributor Author

/test-all

@antoninbas
Copy link
Contributor Author

AKS and EKS CI jobs are passing for this PR. I will ping @reachjainrahul to see if he has time to review.


while true; do
curl localhost:61679 && retry=false || retry=true
curl -sS -o /dev/null localhost:61679 && retry=false || retry=true
if [ $retry == false ]; then break ; fi
sleep 2s
echo "Waiting for aws-k8s-agent"
done

# Fetch running containers from aws-k8s-agent and kill them
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoninbas We do restart the pods which could be scheduled with just AWS CNI when antrea is installed. So eventually all pods will be managed by antrea. If you are worried about very small window in which pod is scheduled with AWS cni and killed when antrea cni is installed, then I am OK with the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine when initially deploying Antrea on an EKS cluster, but there are several "edge" cases that were not accounted for initially:

  • aws-node agent restart (will cause the CNI conf file to be overwritten)
  • K8s Node restart, which causes all Pods to restart. aws-node will overwrite the CNI conf file, antrea-eks-node-init will not run again.
  • a new Node is added to the cluster: there is no guaranteed order of execution for aws-node, antrea-agent, antrea-eks-node-init

In my last series of patches (this one being the latest), I have tried to implement a solution that covers all of these edge cases.

@antoninbas antoninbas merged commit e5a98dc into antrea-io:main Jul 26, 2022
@antoninbas antoninbas deleted the use-different-cni-conf-file-in-networkPolicyOnly-mode branch July 26, 2022 18:23
@antoninbas
Copy link
Contributor Author

@jsalatiel this should be a big improvement in terms of robustness for the Antrea EKS support. However, because it's a pretty big change, I am not planning on back-porting it to v1.7 at the moment (it will be in v1.8). Let me know if this is a big deal for you.

@jsalatiel
Copy link

Np, i will try to update to 1.8 after a while.

hjiajing pushed a commit to hjiajing/antrea that referenced this pull request Jul 28, 2022
…trea-io#4042)

The current solution which consists of overwriting the existing CNI conf
file (e.g., 10-aws.conflist) suffers from one issue for which I cannot
find a simple workaround:
When a Node restarts, there can be a short window of time during which
the CNI conf file reverts to the old one (without Antrea). If some Pods
are restarted / scheduled on the Node during that time, they will not be
processed by Antrea and NetworkPolicies may not be applied to them.

The solution I have come up with is to create a new CNI conf file with
higher priority (05-antrea.conflist). Because that file will stay the
same during Node restart, the problematic window of time does not exist
anymore. We still watch for changes to the intial CNI conf file (e.g.,
10-aws.conflist), so we can update 05-antrea.conflist as needed.

We also update antrea-aks-node-init.yml and antrea-gke-node-init.yml to
use the same container image as antrea-eks-node-init.yml. Using v2
ensures that the script is run again if it is modified at runtime.

Signed-off-by: Antonin Bas <abas@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/release-note Indicates a PR that should be included in release notes. area/provider/aws Issues or PRs related to aws provider. area/provider/azure Issues or PRs related to azure provider.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants