Use different CNI conf file when configuring chaining with Antrea #4042
@tnqn let me know if you have an issue with this approach (of creating a new conflist file).
4acb971 to 0969602
The current solution, which consists of overwriting the existing CNI conf file (e.g., 10-aws.conflist), suffers from one issue for which I cannot find a simple workaround:

When a Node restarts, there can be a short window of time during which the CNI conf file reverts to the old one (without Antrea). If some Pods are restarted / scheduled on the Node during that time, they will not be processed by Antrea and NetworkPolicies may not be applied to them.

The solution I have come up with is to create a new CNI conf file with higher priority (05-antrea.conflist). Because that file will stay the same during Node restart, the problematic window of time does not exist anymore. We still watch for changes to the initial CNI conf file (e.g., 10-aws.conflist), so we can update 05-antrea.conflist as needed.

We also update antrea-aks-node-init.yml and antrea-gke-node-init.yml to use the same container image as antrea-eks-node-init.yml. Using v2 ensures that the script is run again if it is modified at runtime.

Signed-off-by: Antonin Bas <abas@vmware.com>
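The "higher priority" argument relies on the container runtime picking the lexicographically first config file in the CNI conf directory. Below is a minimal sketch of that selection rule, assuming the runtime considers `*.conf`, `*.conflist` and `*.json` files and keeps only the first one. The helper name `first_cni_conf` is hypothetical, a temporary directory stands in for the real conf directory, and the file contents are placeholders rather than real CNI configurations.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): print the conf file a runtime
# would select, ASSUMING it takes the lexicographically first config file.
first_cni_conf() {
    local dir="$1"
    # LC_ALL=C gives a byte-wise, locale-independent sort order.
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.conf' -o -name '*.conflist' -o -name '*.json' \) \
        -exec basename {} \; | LC_ALL=C sort | head -n 1
}

# Temporary directory used as a stand-in for the CNI conf directory.
demo_dir="$(mktemp -d)"

# Only the cloud provider's file exists: it is the one selected.
echo 'placeholder' > "$demo_dir/10-aws.conflist"
before="$(first_cni_conf "$demo_dir")"

# Once 05-antrea.conflist exists, it sorts first and wins, no matter how
# often 10-aws.conflist is rewritten afterwards.
echo 'placeholder' > "$demo_dir/05-antrea.conflist"
echo 'placeholder (rewritten)' > "$demo_dir/10-aws.conflist"
after="$(first_cni_conf "$demo_dir")"

echo "before: $before"
echo "after:  $after"
rm -rf "$demo_dir"
```

Because the selection depends only on file names, rewriting 10-aws.conflist cannot change which file is picked as long as 05-antrea.conflist is present.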
0969602 to c9f33a5
Codecov Report
@@            Coverage Diff             @@
##             main    #4042      +/-   ##
==========================================
+ Coverage   63.93%   64.72%   +0.79%
==========================================
  Files         292      293       +1
  Lines       43671    43669       -2
==========================================
+ Hits        27922    28266     +344
+ Misses      13492    13121     -371
- Partials     2257     2282      +25
LGTM
@reachjainrahul: would you take a look?
The solution makes sense to me.
/test-all
AKS and EKS CI jobs are passing for this PR. I will ping @reachjainrahul to see if he has time to review.
  while true; do
-   curl localhost:61679 && retry=false || retry=true
+   curl -sS -o /dev/null localhost:61679 && retry=false || retry=true
    if [ $retry == false ]; then break ; fi
    sleep 2s
    echo "Waiting for aws-k8s-agent"
  done

  # Fetch running containers from aws-k8s-agent and kill them
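The loop in this snippet retries forever until the probe on localhost:61679 succeeds. For comparison, here is a minimal sketch of a bounded variant that gives up after a fixed number of attempts. The helper `wait_until` and the `flaky_probe` stand-in are hypothetical and are not part of this PR; the stand-in replaces the `curl` probe so the sketch runs without aws-k8s-agent.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): run a command until it succeeds,
# giving up after max_attempts tries. Returns 0 on success, 1 on timeout.
wait_until() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Stand-in for the curl probe: fails twice, then succeeds on the third call.
counter_file="$(mktemp)"
echo 0 > "$counter_file"
flaky_probe() {
    local n
    n=$(( $(cat "$counter_file") + 1 ))
    echo "$n" > "$counter_file"
    [ "$n" -ge 3 ]
}

# Up to 5 attempts, no delay between them (for the demo only).
wait_until 5 0 flaky_probe && result=ready || result=timeout
attempts="$(cat "$counter_file")"
echo "$result after $attempts attempts"
rm -f "$counter_file"
```

With the real probe, a call could look like `wait_until 30 2 curl -sS -o /dev/null localhost:61679`, where the attempt count and delay are arbitrary illustrative values. Whether a timeout is desirable here is a design choice: the unbounded loop in the script keeps waiting for aws-k8s-agent, whereas a bounded one lets the caller fail and be restarted.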
@antoninbas We do restart the Pods that could have been scheduled with just the AWS CNI when Antrea is installed, so eventually all Pods will be managed by Antrea. If you are worried about the very small window in which a Pod is scheduled with the AWS CNI and killed when the Antrea CNI is installed, then I am OK with the change.
This is fine when initially deploying Antrea on an EKS cluster, but there are several "edge" cases that were not accounted for initially:
- aws-node agent restart (will cause the CNI conf file to be overwritten)
- K8s Node restart, which causes all Pods to restart: aws-node will overwrite the CNI conf file, and antrea-eks-node-init will not run again.
- a new Node is added to the cluster: there is no guaranteed order of execution for aws-node, antrea-agent, antrea-eks-node-init
In my last series of patches (this one being the latest), I have tried to implement a solution that covers all of these edge cases.
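To make the "watch the initial conf file and refresh 05-antrea.conflist" idea concrete, here is a minimal sketch based on comparing checksums. It is not the script from this PR: the helper `sync_if_changed` and the state file are hypothetical, polling by checksum is an assumption about how a change could be detected, and the generation step is reduced to a plain copy where the real script would add Antrea to the plugin chain.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the PR's implementation): regenerate "$dst" from
# "$src" only when "$src" changed since the last run or "$dst" is missing.
# Returns 0 if "$dst" was (re)written, 1 if nothing had to be done.
sync_if_changed() {
    local src="$1" dst="$2" state="$3"
    local current
    current="$(cksum < "$src")"
    if [ -f "$dst" ] && [ -f "$state" ] && [ "$(cat "$state")" = "$current" ]; then
        return 1
    fi
    # PLACEHOLDER generation step: a real script would insert Antrea into the
    # plugin chain here. Write to a temp file and rename so that readers never
    # observe a partially written conf file.
    cp "$src" "$dst.tmp" && mv "$dst.tmp" "$dst"
    printf '%s\n' "$current" > "$state"
}

# Temporary directory used as a stand-in for the CNI conf directory.
work="$(mktemp -d)"
src="$work/10-aws.conflist"
dst="$work/05-antrea.conflist"
state="$work/.last-cksum"

echo 'placeholder v1' > "$src"
sync_if_changed "$src" "$dst" "$state" && first=updated || first=unchanged
sync_if_changed "$src" "$dst" "$state" && second=updated || second=unchanged

# Simulate aws-node rewriting its conf file (e.g. after an agent restart).
echo 'placeholder v2' > "$src"
sync_if_changed "$src" "$dst" "$state" && third=updated || third=unchanged
final="$(cat "$dst")"

echo "$first, $second, $third"
rm -rf "$work"
```

Called in a loop (or from a file-change notification), a helper like this would cover the aws-node restart and Node restart cases in the list above, since 05-antrea.conflist is refreshed whenever the initial file changes and is never replaced by a file without Antrea.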
@jsalatiel this should be a big improvement in terms of robustness for the Antrea EKS support. However, because it's a pretty big change, I am not planning on back-porting it to v1.7 at the moment (it will be in v1.8). Let me know if this is a big deal for you.
Np, I will try to update to 1.8 after a while.