EKS node randomly stops applying netpolicies to pods #3946
Comments
Anything that could have triggered that situation? Maybe the Node rebooted at some point (for an update or something else)? Could you share the contents of /etc/cni/net.d/10-aws.conflist? BTW, the error logs seem to be a different issue (but maybe the same root cause?). These log messages come from AntreaProxy code.
No, the node has not been restarted (same uptime as the others).
cat /etc/cni/net.d/10-aws.conflist
Not sure if it matters, but running in EKS using containerd.
More debugging here:
If I apply a rule that allows the pod to connect to the internet on port 80, I can see that the policy is apparently applied, but there is no connectivity.
I also have ICMP allowed for private IPs. If I SSH to the node, the node itself has full connectivity.
It seems to me that NetworkPolicy enforcement is working well. It would be good to see if the packet leaves the Pod network correctly and is forwarded out of the Node. For this traffic:
Could you do a packet capture? And could you share the following info:
ip addr
ip route
tcpdump
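For reference, a capture along these lines could confirm whether the Pod's traffic actually leaves the Node. This is only a hedged sketch: the interface names (antrea-gw0 for the Antrea gateway, eth0 for the primary interface) and the use of port 80 are assumptions based on this thread, and the Pod IP is a placeholder.
```
POD_IP=<pod-ip>   # placeholder: replace with the affected Pod's IP address
# Capture the Pod's outbound traffic on the Antrea gateway (Pod side)...
sudo tcpdump -nn -i antrea-gw0 host "$POD_IP" and tcp port 80
# ...and on the Node's primary interface, to see if it is forwarded out of the Node
sudo tcpdump -nn -i eth0 host "$POD_IP" and tcp port 80
```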
Maybe also check the traffic there. Did you validate Pod-to-Pod traffic on the same Node?
Pod-to-Pod traffic on the same Node works just fine.
I have more commands for you to run:
Hi @antoninbas, unfortunately I had to get a working node back, but before terminating the node I restarted the aws-node pod and got the connectivity back. So it is working right now.
These are hostNetwork Pods, right? It's interesting that restarting the aws-node Pod on that Node resolves the issue. It must fix the routing configuration, but I don't know what can cause it to be messed up in the first place.
aws-node is hostNetwork: true. After restarting aws-node, connectivity is reestablished, but all netpolicies are being bypassed on that node. I suppose this is not the expected behaviour? Should antrea-node-init also be restarted after aws-node?
I think there is some confusion - at least on my part - about whether this is a connectivity issue or a NetworkPolicy enforcement issue.
Hi @antoninbas, well, there has been so much information here that it really became confusing.
I keep trying to debug. Since this is a production cluster (pretty stable deployments) it takes time to understand what triggers the problem, and even whether I am debugging correctly.

The tests marked in red are expected to fail due to missing the internet=true label. The ones in green should not have failed (it is very unlikely that google is offline).

After this initial test I restarted the aws-node Pods (kubectl delete pods -l k8s-app=aws-node -n kube-system) and, as I mentioned before, all netpolicies are just bypassed; this is what I got:

As I mentioned, old pods are running just fine. I only noticed the problem because I had to deploy a new app and update the image for another one; otherwise I would not have noticed it. I will keep antrea running for a few more days, but because of this odd behaviour I will probably have to uninstall it.

I run a few other clusters on-prem using antrea and they do not face any kind of problem. I only see this on EKS (netpolicy-only mode). I have another cluster on EKS (also pretty stable deployments there). I will redo the tests and check if they trigger the same results.
Thanks @jsalatiel. Yes, it seems that restarting aws-node may cause Antrea networking to be bypassed. I need to take a look at it. I see that one Node is broken in the first screenshot (the one with the IP shown there). Could you run the following on that Node:
ip addr
ip route
ip rule list
ip route show table all
And collecting a supportbundle at the same time would be helpful too:
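As a hedged illustration of collecting that supportbundle (the exact invocation used in the thread was not captured here; using antctl supportbundle with a Node name is an assumption):
```
# Hedged sketch: collect a support bundle for a single Node with antctl.
# <node-name> is a placeholder for the broken Node's name.
antctl supportbundle <node-name>
```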
After restarting aws-node, what can I do to make antrea reapply the policies? Restart the antrea-eks-init? Or is the only solution to add more nodes?
Do you want the supportbundle even if the policies are now being skipped?
Restarting the
No, we need it for a Node in the broken connectivity state (not after aws-node restart). You can pass a Node name to
OK. I will get back when I get a node back in that state. It should not take long. |
@antoninbas here is the info you asked for. This node is in the same state.
ip addr
ip route
ip rule list
ip route show table all
This is the output of get pods for the failing nodes, so you can get the IP assigned to each pod and match it with the info above.
@antoninbas I have done the same test on my second EKS production cluster and I am also facing the same problem.
@jsalatiel I have created an EKS cluster of my own and I am trying to reproduce.
@jsalatiel I found one possible issue, for Pods which receive an IP from a secondary network interface. I am not totally sure that this is the same issue as the one you are experiencing. To check if this is the same issue, you can try this fix and run the following command on all your Nodes:
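(The exact command was not captured in this transcript. Based on the fix described in the commit messages below, it was presumably an iptables insertion along these lines; the specific options and rule placement are assumptions.)
```
# Hedged sketch of the workaround: in the nat PREROUTING chain, send traffic
# coming in from antrea-gw0 through the AWS connmark chain, ahead of the
# existing "AWS, CONNMARK" --restore-mark rule. Rule options mirror the
# commit message quoted further down in this thread.
sudo iptables -t nat -I PREROUTING -i antrea-gw0 \
  -m comment --comment "Antrea: AWS, outbound connections" \
  -m state --state NEW -j AWS-CONNMARK-CHAIN-0
```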
As for me, I will be working on an actual Antrea patch to address this issue.
@antoninbas It worked!!!! I will let the test pods run for more time, but at least so far they are working as expected: failing when they are supposed to fail and completing fast when they are supposed to work. When there is an actual fix, will it be ported to 1.7 and released as 1.7.1?
Yes. We may also be able to handle #3974 in time for 1.7.1. It's a more minor issue.
Great, thank you very much for your help tracking this.
When using Antrea in policyOnly mode on an EKS cluster, an additional iptables rule is needed in the PREROUTING chain of the nat table. The rule ensures that Pod-to-external traffic coming from Pods whose IP address comes from a secondary network interface (secondary ENI) is marked correctly, so that it hits the appropriate routing table. Without this, traffic is SNATed with the source IP address of the primary network interface, while being sent out of the secondary network interface, causing the VPC to drop the traffic.

Relevant rules (before the fix):

```
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -i eni+ -m comment --comment "AWS, outbound connections" -m state --state NEW -j AWS-CONNMARK-CHAIN-0
-A PREROUTING -m comment --comment "AWS, CONNMARK" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -m comment --comment "AWS SNAT CHAIN" -j AWS-SNAT-CHAIN-0
-A POSTROUTING -m comment --comment "Antrea: jump to Antrea postrouting rules" -j ANTREA-POSTROUTING
-A ANTREA-POSTROUTING -o antrea-gw0 -m comment --comment "Antrea: masquerade LOCAL traffic" -m addrtype ! --src-type LOCAL --limit-iface-out -m addrtype --src-type LOCAL -j MASQUERADE --random-fully
-A AWS-CONNMARK-CHAIN-0 ! -d 192.168.0.0/16 -m comment --comment "AWS CONNMARK CHAIN, VPC CIDR" -j AWS-CONNMARK-CHAIN-1
-A AWS-CONNMARK-CHAIN-1 -m comment --comment "AWS, CONNMARK" -j CONNMARK --set-xmark 0x80/0x80
-A AWS-SNAT-CHAIN-0 ! -d 192.168.0.0/16 -m comment --comment "AWS SNAT CHAIN" -j AWS-SNAT-CHAIN-1
-A AWS-SNAT-CHAIN-1 ! -o vlan+ -m comment --comment "AWS, SNAT" -m addrtype ! --dst-type LOCAL -j SNAT --to-source 192.168.18.153 --random-fully

0: from all lookup local
512: from all to 192.168.29.56 lookup main
512: from all to 192.168.24.134 lookup main
512: from all to 192.168.31.135 lookup main
512: from all to 192.168.31.223 lookup main
512: from all to 192.168.29.27 lookup main
512: from all to 192.168.16.158 lookup main
512: from all to 192.168.2.135 lookup main
1024: from all fwmark 0x80/0x80 lookup main
1536: from 192.168.31.223 lookup 2
1536: from 192.168.29.27 lookup 2
1536: from 192.168.16.158 lookup 2
1536: from 192.168.2.135 lookup 2
32766: from all lookup main
32767: from all lookup default

default via 192.168.0.1 dev eth1
192.168.0.1 dev eth1 scope link
```

The fix is simply to add a new PREROUTING rule:

```
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -i eni+ -m comment --comment "AWS, outbound connections" -m state --state NEW -j AWS-CONNMARK-CHAIN-0
-A PREROUTING -i antrea-gw0 -m comment --comment "Antrea: AWS, outbound connections" -m state --state NEW -j AWS-CONNMARK-CHAIN-0
-A PREROUTING -m comment --comment "AWS, CONNMARK" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

Fixes antrea-io#3946

Signed-off-by: Antonin Bas <abas@vmware.com>
When using Antrea in policyOnly mode on an EKS cluster, an additional iptables rule is needed in the PREROUTING chain of the nat table. The rule ensures that Pod-to-external traffic coming from Pods whose IP address comes from a secondary network interface (secondary ENI) is marked correctly, so that it hits the appropriate routing table. Without this, traffic is SNATed with the source IP address of the primary network interface, while being sent out of the secondary network interface, causing the VPC to drop the traffic.

The fix is to add new PREROUTING rules, in the ANTREA-PREROUTING chain:

```
-A ANTREA-PREROUTING -i antrea-gw0 -m comment --comment "Antrea: AWS, outbound connections" -j AWS-CONNMARK-CHAIN-0
-A ANTREA-PREROUTING -m comment --comment "Antrea: AWS, CONNMARK (first packet)" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

Fixes antrea-io#3946

Signed-off-by: Antonin Bas <abas@vmware.com>
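To verify on a given Node that the patched rules are in place, the nat table can be inspected; a hedged example (it assumes the Node is already running an Antrea version that installs the ANTREA-PREROUTING chain described above):
```
# Hedged check: list the Antrea nat PREROUTING rules and confirm the
# antrea-gw0 jump to AWS-CONNMARK-CHAIN-0 and the CONNMARK restore rule are present.
sudo iptables -t nat -S ANTREA-PREROUTING
```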
Describe the bug
Out of nowhere, EKS nodes stop applying netpolicies to new pods.
To Reproduce
This is really hard to reproduce. Eventually, after a few days, when I try to scale some deployment, the new pods won't have the netpols applied.
Expected
Netpolicies should be applied as expected.
Actual behavior
New pods have no connectivity
Versions:
Please provide the following information:
Kubernetes version (kubectl version): 1.22 EKS
Additional context
Right now I have one node in this "no connectivity for new pods" state, so I may be able to provide logs if needed.
I also tried to restart the antrea-agent (for that node) and the antrea-controller, but the problem persists.
The antrea-agent log for that node is full of:
I have a default DROP policy in the baseline tier, but kube-system pods can talk inside the same namespace.
The node itself (SSH'ing to it) has full connectivity.
I am using Antrea in netpolicy-only mode.