EKS node randomly stop applying netpolicies to pods #3946

Closed
jsalatiel opened this issue Jun 28, 2022 · 27 comments · Fixed by #3975
Labels
area/provider/aws Issues or PRs related to aws provider. kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@jsalatiel

Describe the bug
Out of nowhere, EKS nodes stop applying network policies to new pods.

To Reproduce
This is really hard to reproduce. Eventually, after a few days, when I try to scale some deployment, the new pods won't have the network policies applied.

Expected
Network policies should be applied as expected.

Actual behavior
New pods have no connectivity

Versions:
Please provide the following information:

  • Antrea version (Docker image tag): 1.7
  • Kubernetes version (use kubectl version): 1.22 (EKS)

Additional context
Right now I have one node in this "no connectivity for new pods" state, so I may be able to provide logs if needed.
I also tried restarting the antrea-agent (for that node) and the antrea-controller, but the problem remains.

The antrea-agent log for that node is full of:

I0628 01:16:33.680826       1 reconciler.go:293] Reconciling rule 3ebab8a8cc736c8e of NetworkPolicy AntreaClusterNetworkPolicy:102-kube-system-aws
E0628 01:16:34.450279       1 utils.go:164] Skipping invalid IP: 
E0628 01:16:34.450300       1 utils.go:164] Skipping invalid IP: 
E0628 01:16:34.450307       1 utils.go:164] Skipping invalid IP: 

I have a default DROP policy in the baseline tier, but kube-system pods can talk within the same namespace.

The node itself (SSHing into it) has full connectivity.
I am using Antrea in NetworkPolicy-only mode.

@jsalatiel jsalatiel added the kind/bug Categorizes issue or PR as related to a bug. label Jun 28, 2022
@antoninbas antoninbas added the area/provider/aws Issues or PRs related to aws provider. label Jun 28, 2022
@antoninbas
Contributor

antoninbas commented Jun 28, 2022

Anything that could have triggered that situation? Maybe the Node rebooted at some point (for an update or something else)?

Could you share the contents of the /etc/cni/net.d/ directory on the Node that is broken? List of files and content for each file.

BTW, the error logs seem to be a different issue (but maybe the same root cause?). These log messages come from AntreaProxy code.

@jsalatiel
Author

No, the node has not been restarted (same uptime as the others).

ls -la /etc/cni/net.d/ 
total 4
drwxr-xr-x 2 root root  29 jun 23 20:46 .
drwxr-xr-x 3 root root  19 jun 23 20:45 ..
-rw-r--r-- 1 root root 999 jun 23 20:46 10-aws.conflist

cat /etc/cni/net.d/10-aws.conflist

{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "mtu": "9001",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "DEBUG"
    },
    {
      "name": "egress-v4-cni",
      "type": "egress-v4-cni",
      "mtu": 9001,
      "enabled": "false",
      "nodeIP": "10.138.2.114",
      "ipam": {
        "type": "host-local",
        "ranges": [
          [
            {
              "subnet": "169.254.172.0/22"
            }
          ]
        ],
        "routes": [
          {
            "dst": "0.0.0.0/0"
          }
        ],
        "dataDir": "/run/cni/v6pd/egress-v4-ipam"
      },
      "pluginLogFile": "/var/log/aws-routed-eni/egress-v4-plugin.log",
      "pluginLogLevel": "DEBUG"
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      },
      "snat": true
    },
    {
      "type": "antrea"
    }
  ]
}

@jsalatiel
Author

Not sure if it matters, but I am running EKS with containerd.

@jsalatiel
Author

jsalatiel commented Jun 28, 2022

More debugging here:
As I mentioned previously, I have a default deny policy. After launching a new pod and trying to connect to any web server, I get this in np.log (which means the default policy applied):

2022/06/27 22:37:10.248870 EgressDefaultRule AntreaClusterNetworkPolicy:799-default-deny Drop 16 10.138.2.226 42508 132.226.247.73 80 TCP 60

If I apply a rule that allows the pod to connect to the internet on port 80, I can see that the policy is apparently applied, but there is still no connectivity.

2022/06/27 22:42:28.515236 AntreaPolicyEgressRule AntreaClusterNetworkPolicy:internet-by-label Allow 14899 10.138.2.226 53616 193.122.6.168 80 TCP 60

I also have ICMP allowed for private IPs, so:
ping 8.8.8.8 logs the deny, but ping 10.138.2.2 logs nothing (and also gets no reply).

If I SSH to the node, the node itself has full connectivity.
If I start a new pod in hostNetwork mode, there are no connectivity problems.

@antoninbas
Contributor

antoninbas commented Jun 28, 2022

It seems to me that NetworkPolicy enforcement is working well. It would be good to see if the packet leaves the Pod network correctly and is forwarded out of the Node.

For this traffic:

2022/06/27 22:42:28.515236 AntreaPolicyEgressRule AntreaClusterNetworkPolicy:internet-by-label Allow 14899 10.138.2.226 53616 193.122.6.168 80 TCP 60

Could you do a packet capture on antrea-gw0?

And could you share the following info:

  • output of ip addr
  • output of ip route
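
For the antrea-gw0 capture, something like this should be enough (a sketch, reusing the destination IP from the log line above):

```
sudo tcpdump -nn -i antrea-gw0 host 193.122.6.168
```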

@jsalatiel
Author

ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 06:13:ea:3d:2b:66 brd ff:ff:ff:ff:ff:ff
    inet 10.138.2.114/24 brd 10.138.2.255 scope global dynamic eth0
       valid_lft 2690sec preferred_lft 2690sec
    inet6 fe80::413:eaff:fe3d:2b66/64 scope link
       valid_lft forever preferred_lft forever
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether a2:da:ca:55:8f:29 brd ff:ff:ff:ff:ff:ff
5: antrea-gw0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether a6:b6:d0:77:44:ca brd ff:ff:ff:ff:ff:ff
    inet 10.138.2.114/32 scope global antrea-gw0
       valid_lft forever preferred_lft forever
    inet6 fe80::a4b6:d0ff:fe77:44ca/64 scope link
       valid_lft forever preferred_lft forever
6: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 06:bf:a9:e8:1d:36 brd ff:ff:ff:ff:ff:ff
    inet 10.138.2.237/24 brd 10.138.2.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::4bf:a9ff:fee8:1d36/64 scope link
       valid_lft forever preferred_lft forever
8: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 9a:2d:f4:68:c0:e7 brd ff:ff:ff:ff:ff:ff
9: antrea-egress0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether e2:c0:a0:32:fa:91 brd ff:ff:ff:ff:ff:ff
14: eni3477fb7c303@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether d2:29:56:db:8b:a0 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::d029:56ff:fedb:8ba0/64 scope link
       valid_lft forever preferred_lft forever
15: eni9ccef59a7f5@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether f6:d3:5e:33:3f:dd brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::f4d3:5eff:fe33:3fdd/64 scope link
       valid_lft forever preferred_lft forever
19: eni68cac675fbe@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether fe:c4:48:c2:c3:33 brd ff:ff:ff:ff:ff:ff link-netnsid 5
    inet6 fe80::fcc4:48ff:fec2:c333/64 scope link
       valid_lft forever preferred_lft forever
21: eni7b5faf95746@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 6a:01:c4:7b:9d:0e brd ff:ff:ff:ff:ff:ff link-netnsid 7
    inet6 fe80::6801:c4ff:fe7b:9d0e/64 scope link
       valid_lft forever preferred_lft forever
22: eni0e224775784@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 7e:44:96:fe:cc:28 brd ff:ff:ff:ff:ff:ff link-netnsid 8
    inet6 fe80::7c44:96ff:fefe:cc28/64 scope link
       valid_lft forever preferred_lft forever
23: eni88a8313a1aa@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether c2:cb:76:11:e4:f2 brd ff:ff:ff:ff:ff:ff link-netnsid 9
    inet6 fe80::c0cb:76ff:fe11:e4f2/64 scope link
       valid_lft forever preferred_lft forever
24: enic98790450d4@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether b6:3f:01:fe:2e:bd brd ff:ff:ff:ff:ff:ff link-netnsid 10
    inet6 fe80::b43f:1ff:fefe:2ebd/64 scope link
       valid_lft forever preferred_lft forever
25: eni8c3fc32af42@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 5a:9a:14:67:8a:df brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::589a:14ff:fe67:8adf/64 scope link
       valid_lft forever preferred_lft forever
26: eni6612fb0e0a5@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 22:6f:97:b2:73:5a brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::206f:97ff:feb2:735a/64 scope link
       valid_lft forever preferred_lft forever
28: eni5489fe41698@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether ae:38:ec:fd:c8:f9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::ac38:ecff:fefd:c8f9/64 scope link
       valid_lft forever preferred_lft forever
55: eni76908ed0137@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 96:3b:d1:23:fd:8c brd ff:ff:ff:ff:ff:ff link-netnsid 6
    inet6 fe80::943b:d1ff:fe23:fd8c/64 scope link
       valid_lft forever preferred_lft forever
56: eni31be650f793@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether ae:20:43:3b:03:f6 brd ff:ff:ff:ff:ff:ff link-netnsid 11
    inet6 fe80::ac20:43ff:fe3b:3f6/64 scope link
       valid_lft forever preferred_lft forever
57: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 06:9b:29:4d:97:a4 brd ff:ff:ff:ff:ff:ff
    inet 10.138.2.70/24 brd 10.138.2.255 scope global eth2
       valid_lft forever preferred_lft forever
    inet6 fe80::49b:29ff:fe4d:97a4/64 scope link
       valid_lft forever preferred_lft forever
61: enid95a91ee048@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 26:d1:36:2a:08:65 brd ff:ff:ff:ff:ff:ff link-netnsid 13
    inet6 fe80::24d1:36ff:fe2a:865/64 scope link
       valid_lft forever preferred_lft forever
78: eni782ddc49d9e@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether ee:e0:d1:7e:d7:b4 brd ff:ff:ff:ff:ff:ff link-netnsid 12
    inet6 fe80::ece0:d1ff:fe7e:d7b4/64 scope link
       valid_lft forever preferred_lft forever

ip route

default via 10.138.2.1 dev eth0
10.138.2.0/24 dev eth0 proto kernel scope link src 10.138.2.114 
10.138.2.45 dev antrea-gw0 scope link 
10.138.2.47 dev antrea-gw0 scope link
10.138.2.63 dev antrea-gw0 scope link
10.138.2.64 dev antrea-gw0 scope link
10.138.2.65 dev antrea-gw0 scope link
10.138.2.76 dev antrea-gw0 scope link
10.138.2.99 dev antrea-gw0 scope link
10.138.2.101 dev antrea-gw0 scope link
10.138.2.140 dev antrea-gw0 scope link
10.138.2.152 dev antrea-gw0 scope link
10.138.2.197 dev antrea-gw0 scope link
10.138.2.202 dev antrea-gw0 scope link
10.138.2.231 dev antrea-gw0 scope link
10.138.2.242 dev antrea-gw0 scope link
169.254.169.254 dev eth0

tcpdump

1	0.000000	10.138.2.47	193.122.6.168	TCP	74	56200 → 80 [SYN] Seq=0 Win=62727 Len=0 MSS=8961 SACK_PERM=1 TSval=4132982453 TSecr=0 WS=128
2	1.021845	10.138.2.47	193.122.6.168	TCP	74	[TCP Retransmission] [TCP Port numbers reused] 56200 → 80 [SYN] Seq=0 Win=62727 Len=0 MSS=8961 SACK_PERM=1 TSval=4132983475 TSecr=0 WS=128
3	3.037742	10.138.2.47	193.122.6.168	TCP	74	[TCP Retransmission] [TCP Port numbers reused] 56200 → 80 [SYN] Seq=0 Win=62727 Len=0 MSS=8961 SACK_PERM=1 TSval=4132985491 TSecr=0 WS=128

@antoninbas
Contributor

Maybe check the traffic on eth0. Because the routes seem correct (along with the traffic captured on the gateway), I would assume that traffic goes out of eth0 correctly, but that for some reason we don't get any reply traffic (leading to TCP retransmissions).

Did you validate Pod-to-Pod traffic on the same Node?

@jsalatiel
Author

Pod-to-Pod traffic on the same Node works just fine.
Same tcpdump on eth0 (sudo tcpdump -v -i eth0 host 193.122.6.168):
0 packets received by filter
0 packets dropped by kernel

@antoninbas
Contributor

I have more commands for you to run:
ip rule list
ip route show table all

@jsalatiel
Author

Hi @antoninbas, unfortunately I had to get a working node back, but before terminating the node I restarted the aws-node pod and got connectivity back. So it is working right now.
I know the problem will come back in a few days, so I will keep this issue open.
As I mentioned in the first comment, I have a default DROP policy in the baseline tier, but kube-system pods can talk within the same namespace.
I also have the AWS controller pods (aws-node, aws-load-balancer) allowed to talk to 0.0.0.0/0, and I can see them communicating with random IPs on the internet, so I am sure they are not being blocked.

@antoninbas
Contributor

I also have the AWS controller pods (aws-node, aws-load-balancer) allowed to talk to 0.0.0.0/0, and I can see them communicating with random IPs on the internet, so I am sure they are not being blocked.

These are hostNetwork Pods, right?
We'll need to see the full IP rules and route tables; there must be something wrong there.

It's interesting that restarting the aws-node Pod on that Node resolves the issue. It must fix the routing configuration, but I don't know what could cause it to be messed up in the first place.

@jsalatiel
Author

aws-node is hostNetwork: true
aws-load-balancer is not. The traffic I see in the logs is from the aws-load-balancer pod.

After restarting aws-node, connectivity is reestablished, but all netpolicies are being bypassed on that node. I suppose this is not the expected behaviour? Should antrea-node-init also be restarted after aws-node?
All I can see in the antrea-agent logs is a lot of:

E0701 09:01:01.795171       1 utils.go:164] Skipping invalid IP: 
E0701 09:01:01.795178       1 utils.go:164] Skipping invalid IP: 
E0701 09:01:01.795183       1 utils.go:164] Skipping invalid IP: 
E0701 09:01:01.795331       1 utils.go:164] Skipping invalid IP: 
E0701 09:01:01.795344       1 utils.go:164] Skipping invalid IP: 
E0701 09:01:01.795351       1 utils.go:164] Skipping invalid IP: 
E0701 09:01:01.795359       1 utils.go:164] Skipping invalid IP: 
I0701 09:00:05.239618       1 reconciler.go:293] Reconciling rule 6303e551ab19dca7 of NetworkPolicy ....

@antoninbas
Contributor

I think there is some confusion - at least on my part - about whether this is a connectivity issue or a NetworkPolicy enforcement issue.
I thought this was a connectivity issue based on earlier comments, the NP logs, and the tcpdump capture on the gateway (showing TCP retransmissions). But then you wrote that aws-loadbalancer can talk to the internet and that it is a "regular" Pod (hostNetwork set to false).

@jsalatiel
Author

Hi @antoninbas, well, there has been so much information here that it really became confusing.
Let me lay out a timeline:

  1. I opened the ticket because, after some time, new pods were not having netpolicies applied (at least that was what I thought at first). Since I have a default drop policy (in the baseline tier) and applying the label to new pods was not giving them connectivity, I assumed the problem was netpolicies not being applied.
  2. Restarting aws-node fixes the problem (for a while at least).
  3. Since a default deny would break a few things in AWS, I have a netpolicy that exempts kube-system traffic to internal IPs and the AWS controller pods to 0.0.0.0/0. I did not know at first that aws-node is hostNetwork, so the policy does not really apply to it. The logs I mentioned for aws-load-balancer are for old pods; if I restart aws-load-balancer it won't work (the log will show up in np.log, but there is no connectivity).
  4. Restarting aws-node "fixes the problem" but also appears to bypass all network policies on that node, so new pods have full connectivity.

@jsalatiel
Author

jsalatiel commented Jul 1, 2022

I keep trying to debug. Since this is a production cluster (pretty stable deployments), it takes time to understand what triggers the problem and even whether I am debugging correctly.
I decided to do the following test. I created 3 CronJobs that run every minute and start a container that only does curl www.google.com. Two of those 3 containers have the label internet=true, which allows the container to access the internet. The last one does not.
So, for each execution I expect two pods to complete and the last one to keep running for 30 seconds and fail (curl timeout), but that's not what I am getting: pods keep failing randomly.
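
Each of these CronJobs looks roughly like the following (a sketch; the name and curl image are illustrative, and the label is omitted for the variant that is expected to fail):

```
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: test-internet-allowed
spec:
  schedule: "*/1 * * * *"        # run every minute
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            internet: "true"     # label matched by the allow policy
        spec:
          restartPolicy: Never
          containers:
          - name: curl
            image: curlimages/curl
            args: ["-sS", "-m", "30", "http://www.google.com"]
EOF
```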

The ones marked in red are expected to fail because they are missing the internet=true label. The ones in green should not have failed (it is very unlikely Google is offline).
[screenshot: CronJob Pod results]

After this initial test I restarted the aws-node Pods (kubectl delete pods -l k8s-app=aws-node -n kube-system) and, as I mentioned before, all netpolicies are just bypassed. This is what I got:
Not one single failure; even the pods that were supposed to fail are now bypassing all netpolicies.

[screenshot: CronJob Pod results after restarting aws-node]

As I mentioned, old pods are running just fine; I only noticed the problem because I had to deploy a new app and update the image for another one. Otherwise I would not have noticed it. I will keep Antrea running for a few more days, but because of this odd behaviour I will probably have to uninstall it.

I run a few other clusters on-prem using Antrea and they do not have any kind of problem. I only see this on EKS (NetworkPolicy-only mode). I have another cluster on EKS (also with pretty stable deployments). I will redo the tests there and check if they produce the same results.

@antoninbas
Contributor

antoninbas commented Jul 1, 2022

Thanks @jsalatiel. Yes, it seems that restarting aws-node may cause Antrea networking to be bypassed. I need to take a look at it.

I see that one Node is broken in the first screenshot (the one with IP 10.138.3.191). If you get a Node in this state again, please share the information I requested above:

ip addr
ip route
ip rule list
ip route show table all

And collecting a supportbundle at the same time would be helpful too: run antctl supportbundle from any machine with access to the cluster. Warning: supportbundle collects logs from Antrea, which may not be acceptable to you.

@jsalatiel
Author

After restarting aws-node, what can I do to make Antrea reapply the policies? Restart antrea-eks-init? Or is the only solution to add more nodes?

@jsalatiel
Author

Do you want the supportbundle even if the policies are now being skipped?
I will check what kind of information it collects.

@antoninbas
Contributor

After restarting aws-node, what can I do to make Antrea reapply the policies? Restart antrea-eks-init? Or is the only solution to add more nodes?

Restarting the antrea-eks-init Pod on that Node should be enough. But I will experiment early next week on an EKS cluster, so I can circle back to you after I test it.
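For example, something along these lines should do it (a sketch; the exact Pod name prefix depends on the manifest you deployed, so locate it first):

```
# find the init Pod running on the affected Node (name prefix may be antrea-node-init or antrea-eks-node-init)
kubectl -n kube-system get pods -o wide | grep node-init
# delete the one scheduled on the broken Node so the DaemonSet recreates it there
kubectl -n kube-system delete pod <node-init-pod-on-that-node>
```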

Do you want the supportbundle even if the policies are now being skipped?
I will check what kind of information it collects.

No, we need it for a Node in the broken connectivity state (not after aws-node restart). You can pass a Node name to antctl supportbundle so it only collects information from that Node.
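
Something like this (a sketch; check antctl supportbundle -h for the exact syntax in your Antrea version):

```
antctl supportbundle <node-name>
```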

@jsalatiel
Author

OK. I will get back when I get a node back in that state. It should not take long.

@jsalatiel
Author

@antoninbas here is the info you asked for. This node is in the same state.

ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0a:cc:6f:c9:cb:c4 brd ff:ff:ff:ff:ff:ff
    inet 10.138.3.41/24 brd 10.138.3.255 scope global dynamic eth0
       valid_lft 1938sec preferred_lft 1938sec
    inet6 fe80::8cc:6fff:fec9:cbc4/64 scope link
       valid_lft forever preferred_lft forever
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 22:52:a5:2e:dd:f4 brd ff:ff:ff:ff:ff:ff
5: antrea-gw0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 4a:8f:13:25:5e:df brd ff:ff:ff:ff:ff:ff
    inet 10.138.3.41/32 scope global antrea-gw0
       valid_lft forever preferred_lft forever
    inet6 fe80::488f:13ff:fe25:5edf/64 scope link
       valid_lft forever preferred_lft forever
6: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0a:e7:68:2d:64:48 brd ff:ff:ff:ff:ff:ff
    inet 10.138.3.203/24 brd 10.138.3.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::8e7:68ff:fe2d:6448/64 scope link
       valid_lft forever preferred_lft forever
7: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 72:93:fc:6e:67:6f brd ff:ff:ff:ff:ff:ff
8: antrea-egress0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether 0a:b9:1e:94:2c:16 brd ff:ff:ff:ff:ff:ff
14: enie7bffac5f2f@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 0e:17:bd:9c:41:49 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::c17:bdff:fe9c:4149/64 scope link
       valid_lft forever preferred_lft forever
15: eni581be6be098@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 8e:c7:82:88:34:cd brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::8cc7:82ff:fe88:34cd/64 scope link
       valid_lft forever preferred_lft forever
16: eni9bbcadeff40@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 82:dc:b4:f7:4b:4d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::80dc:b4ff:fef7:4b4d/64 scope link
       valid_lft forever preferred_lft forever
17: eni6352dc86622@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 7e:24:26:f2:e1:5f brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::7c24:26ff:fef2:e15f/64 scope link
       valid_lft forever preferred_lft forever
68: eni8280a4036a5@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 0e:40:4d:63:94:8a brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::c40:4dff:fe63:948a/64 scope link
       valid_lft forever preferred_lft forever
70: eni515a1a7b3d0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether ce:30:51:0c:6b:99 brd ff:ff:ff:ff:ff:ff link-netnsid 6
    inet6 fe80::cc30:51ff:fe0c:6b99/64 scope link
       valid_lft forever preferred_lft forever
71: eni06df4a90ef0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 72:0e:2c:6e:a2:56 brd ff:ff:ff:ff:ff:ff link-netnsid 7
    inet6 fe80::700e:2cff:fe6e:a256/64 scope link
       valid_lft forever preferred_lft forever
72: enif29a5ce6168@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether ce:09:d1:8d:d2:f3 brd ff:ff:ff:ff:ff:ff link-netnsid 8
    inet6 fe80::cc09:d1ff:fe8d:d2f3/64 scope link
       valid_lft forever preferred_lft forever
73: eni7089f0973fe@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether e6:52:e5:1c:27:c3 brd ff:ff:ff:ff:ff:ff link-netnsid 9
    inet6 fe80::e452:e5ff:fe1c:27c3/64 scope link
       valid_lft forever preferred_lft forever
75: eni1a57b787475@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 4e:5b:73:7a:a4:2b brd ff:ff:ff:ff:ff:ff link-netnsid 11
    inet6 fe80::4c5b:73ff:fe7a:a42b/64 scope link
       valid_lft forever preferred_lft forever
76: enid9dc6c38c23@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether ea:39:82:4a:4c:44 brd ff:ff:ff:ff:ff:ff link-netnsid 12
    inet6 fe80::e839:82ff:fe4a:4c44/64 scope link
       valid_lft forever preferred_lft forever
77: eni2a4fb7b9261@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 6a:e8:6e:d5:51:f0 brd ff:ff:ff:ff:ff:ff link-netnsid 13
    inet6 fe80::68e8:6eff:fed5:51f0/64 scope link
       valid_lft forever preferred_lft forever
78: enidfed2e3e0fb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 92:74:f5:64:97:18 brd ff:ff:ff:ff:ff:ff link-netnsid 14
    inet6 fe80::9074:f5ff:fe64:9718/64 scope link
       valid_lft forever preferred_lft forever
79: eni7bf7593694f@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether aa:b9:fa:07:f8:85 brd ff:ff:ff:ff:ff:ff link-netnsid 15
    inet6 fe80::a8b9:faff:fe07:f885/64 scope link
       valid_lft forever preferred_lft forever
80: eni1bae4d1a9ed@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether f2:c0:66:90:c6:18 brd ff:ff:ff:ff:ff:ff link-netnsid 5
    inet6 fe80::f0c0:66ff:fe90:c618/64 scope link
       valid_lft forever preferred_lft forever
81: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0a:2e:ff:1f:3d:b4 brd ff:ff:ff:ff:ff:ff
    inet 10.138.3.25/24 brd 10.138.3.255 scope global eth2
       valid_lft forever preferred_lft forever
    inet6 fe80::82e:ffff:fe1f:3db4/64 scope link
       valid_lft forever preferred_lft forever
84: eni5489fe41698@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether e6:f2:63:3e:ae:50 brd ff:ff:ff:ff:ff:ff link-netnsid 10
    inet6 fe80::e4f2:63ff:fe3e:ae50/64 scope link
       valid_lft forever preferred_lft forever
85: enife0a1d68374@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 42:24:e2:86:be:c3 brd ff:ff:ff:ff:ff:ff link-netnsid 16
    inet6 fe80::4024:e2ff:fe86:bec3/64 scope link
       valid_lft forever preferred_lft forever
86: eni4b4864ccadf@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether b6:60:e2:7a:4f:8e brd ff:ff:ff:ff:ff:ff link-netnsid 17
    inet6 fe80::b460:e2ff:fe7a:4f8e/64 scope link
       valid_lft forever preferred_lft forever
88: enia34ff6c5426@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether ca:cc:47:c9:c0:ba brd ff:ff:ff:ff:ff:ff link-netnsid 19
    inet6 fe80::c8cc:47ff:fec9:c0ba/64 scope link
       valid_lft forever preferred_lft forever
91: eni158acaf61fc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue master ovs-system state UP group default
    link/ether 86:f3:33:95:50:d3 brd ff:ff:ff:ff:ff:ff link-netnsid 18
    inet6 fe80::84f3:33ff:fe95:50d3/64 scope link
       valid_lft forever preferred_lft forever

ip route

default via 10.138.3.1 dev eth0
10.138.3.0/24 dev eth0 proto kernel scope link src 10.138.3.41
10.138.3.8 dev antrea-gw0 scope link
10.138.3.33 dev antrea-gw0 scope link
10.138.3.43 dev antrea-gw0 scope link
10.138.3.54 dev antrea-gw0 scope link
10.138.3.64 dev antrea-gw0 scope link
10.138.3.79 dev antrea-gw0 scope link
10.138.3.113 dev antrea-gw0 scope link
10.138.3.115 dev antrea-gw0 scope link
10.138.3.118 dev antrea-gw0 scope link
10.138.3.121 dev antrea-gw0 scope link
10.138.3.135 dev antrea-gw0 scope link
10.138.3.139 dev antrea-gw0 scope link
10.138.3.149 dev antrea-gw0 scope link
10.138.3.152 dev antrea-gw0 scope link
10.138.3.170 dev antrea-gw0 scope link
10.138.3.179 dev antrea-gw0 scope link
10.138.3.202 dev antrea-gw0 scope link
10.138.3.207 dev antrea-gw0 scope link
10.138.3.215 dev antrea-gw0 scope link
10.138.3.220 dev antrea-gw0 scope link
10.138.3.250 dev antrea-gw0 scope link
169.254.169.254 dev eth0

ip rule list

0:	from all lookup local
512:	from all to 10.138.3.110 lookup main
512:	from all to 10.138.3.152 lookup main
512:	from all to 10.138.3.149 lookup main
512:	from all to 10.138.3.170 lookup main
512:	from all to 10.138.3.250 lookup main
512:	from all to 10.138.3.115 lookup main
512:	from all to 10.138.3.8 lookup main
512:	from all to 10.138.3.118 lookup main
512:	from all to 10.138.3.220 lookup main
512:	from all to 10.138.3.215 lookup main
512:	from all to 10.138.3.43 lookup main
512:	from all to 10.138.3.54 lookup main
512:	from all to 10.138.3.135 lookup main
512:	from all to 10.138.3.202 lookup main
512:	from all to 10.138.3.179 lookup main
512:	from all to 10.138.3.207 lookup main
512:	from all to 10.138.3.33 lookup main
512:	from all to 10.138.3.139 lookup main
512:	from all to 10.138.3.64 lookup main
512:	from all to 10.138.3.79 lookup main
512:	from all to 10.138.3.121 lookup main
512:	from all to 10.138.3.35 lookup main
1024:	from all fwmark 0x80/0x80 lookup main
1536:	from 10.138.3.170 lookup 2
1536:	from 10.138.3.220 lookup 2
1536:	from 10.138.3.43 lookup 2
1536:	from 10.138.3.202 lookup 2
1536:	from 10.138.3.33 lookup 2
1536:	from 10.138.3.64 lookup 3
1536:	from 10.138.3.79 lookup 2
1536:	from 10.138.3.35 lookup 2
32766:	from all lookup main
32767:	from all lookup default

ip route show table all

default via 10.138.3.1 dev eth1 table 2
10.138.3.1 dev eth1 table 2 scope link
default via 10.138.3.1 dev eth2 table 3
10.138.3.1 dev eth2 table 3 scope link
default via 10.138.3.1 dev eth0
10.138.3.0/24 dev eth0 proto kernel scope link src 10.138.3.41
10.138.3.8 dev antrea-gw0 scope link
10.138.3.33 dev antrea-gw0 scope link
10.138.3.35 dev antrea-gw0 scope link
10.138.3.43 dev antrea-gw0 scope link
10.138.3.54 dev antrea-gw0 scope link
10.138.3.64 dev antrea-gw0 scope link
10.138.3.79 dev antrea-gw0 scope link
10.138.3.115 dev antrea-gw0 scope link
10.138.3.118 dev antrea-gw0 scope link
10.138.3.121 dev antrea-gw0 scope link
10.138.3.135 dev antrea-gw0 scope link
10.138.3.139 dev antrea-gw0 scope link
10.138.3.149 dev antrea-gw0 scope link
10.138.3.152 dev antrea-gw0 scope link
10.138.3.170 dev antrea-gw0 scope link
10.138.3.179 dev antrea-gw0 scope link
10.138.3.202 dev antrea-gw0 scope link
10.138.3.207 dev antrea-gw0 scope link
10.138.3.215 dev antrea-gw0 scope link
10.138.3.220 dev antrea-gw0 scope link
10.138.3.250 dev antrea-gw0 scope link
169.254.169.254 dev eth0
broadcast 10.138.3.0 dev eth0 table local proto kernel scope link src 10.138.3.41
broadcast 10.138.3.0 dev eth1 table local proto kernel scope link src 10.138.3.203
broadcast 10.138.3.0 dev eth2 table local proto kernel scope link src 10.138.3.25
local 10.138.3.25 dev eth2 table local proto kernel scope host src 10.138.3.25
local 10.138.3.41 dev eth0 table local proto kernel scope host src 10.138.3.41
local 10.138.3.41 dev antrea-gw0 table local proto kernel scope host src 10.138.3.41
local 10.138.3.203 dev eth1 table local proto kernel scope host src 10.138.3.203
broadcast 10.138.3.255 dev eth0 table local proto kernel scope link src 10.138.3.41
broadcast 10.138.3.255 dev eth1 table local proto kernel scope link src 10.138.3.203
broadcast 10.138.3.255 dev eth2 table local proto kernel scope link src 10.138.3.25
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
unreachable ::/96 dev lo metric 1024 pref medium
unreachable ::ffff:0.0.0.0/96 dev lo metric 1024 pref medium
unreachable 2002:a00::/24 dev lo metric 1024 pref medium
unreachable 2002:7f00::/24 dev lo metric 1024 pref medium
unreachable 2002:a9fe::/32 dev lo metric 1024 pref medium
unreachable 2002:ac10::/28 dev lo metric 1024 pref medium
unreachable 2002:c0a8::/32 dev lo metric 1024 pref medium
unreachable 2002:e000::/19 dev lo metric 1024 pref medium
unreachable 3ffe:ffff::/32 dev lo metric 1024 pref medium
fe80::/64 dev antrea-gw0 proto kernel metric 256 pref medium
fe80::/64 dev antrea-gw0 proto kernel metric 256 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev enie7bffac5f2f proto kernel metric 256 pref medium
fe80::/64 dev eni581be6be098 proto kernel metric 256 pref medium
fe80::/64 dev eni9bbcadeff40 proto kernel metric 256 pref medium
fe80::/64 dev eni6352dc86622 proto kernel metric 256 pref medium
fe80::/64 dev eni8280a4036a5 proto kernel metric 256 pref medium
fe80::/64 dev eni515a1a7b3d0 proto kernel metric 256 pref medium
fe80::/64 dev eni06df4a90ef0 proto kernel metric 256 pref medium
fe80::/64 dev enif29a5ce6168 proto kernel metric 256 pref medium
fe80::/64 dev eni7089f0973fe proto kernel metric 256 pref medium
fe80::/64 dev eni1a57b787475 proto kernel metric 256 pref medium
fe80::/64 dev enid9dc6c38c23 proto kernel metric 256 pref medium
fe80::/64 dev eni2a4fb7b9261 proto kernel metric 256 pref medium
fe80::/64 dev enidfed2e3e0fb proto kernel metric 256 pref medium
fe80::/64 dev eni7bf7593694f proto kernel metric 256 pref medium
fe80::/64 dev eni1bae4d1a9ed proto kernel metric 256 pref medium
fe80::/64 dev eth2 proto kernel metric 256 pref medium
fe80::/64 dev eni5489fe41698 proto kernel metric 256 pref medium
fe80::/64 dev enife0a1d68374 proto kernel metric 256 pref medium
fe80::/64 dev eni4b4864ccadf proto kernel metric 256 pref medium
fe80::/64 dev enia34ff6c5426 proto kernel metric 256 pref medium
fe80::/64 dev eni158acaf61fc proto kernel metric 256 pref medium
fe80::/64 dev eni838b7f7e667 proto kernel metric 256 pref medium
local ::1 dev lo table local proto kernel metric 0 pref medium
local fe80::82e:ffff:fe1f:3db4 dev eth2 table local proto kernel metric 0 pref medium
local fe80::8cc:6fff:fec9:cbc4 dev eth0 table local proto kernel metric 0 pref medium
local fe80::8e7:68ff:fe2d:6448 dev eth1 table local proto kernel metric 0 pref medium
local fe80::c17:bdff:fe9c:4149 dev enie7bffac5f2f table local proto kernel metric 0 pref medium
local fe80::c40:4dff:fe63:948a dev eni8280a4036a5 table local proto kernel metric 0 pref medium
local fe80::4024:e2ff:fe86:bec3 dev enife0a1d68374 table local proto kernel metric 0 pref medium
local fe80::488f:13ff:fe25:5edf dev antrea-gw0 table local proto kernel metric 0 pref medium
local fe80::4c5b:73ff:fe7a:a42b dev eni1a57b787475 table local proto kernel metric 0 pref medium
local fe80::68e8:6eff:fed5:51f0 dev eni2a4fb7b9261 table local proto kernel metric 0 pref medium
local fe80::700e:2cff:fe6e:a256 dev eni06df4a90ef0 table local proto kernel metric 0 pref medium
local fe80::7c24:26ff:fef2:e15f dev eni6352dc86622 table local proto kernel metric 0 pref medium
local fe80::7c3e:34ff:fe81:8840 dev eni838b7f7e667 table local proto kernel metric 0 pref medium
local fe80::80dc:b4ff:fef7:4b4d dev eni9bbcadeff40 table local proto kernel metric 0 pref medium
local fe80::84f3:33ff:fe95:50d3 dev eni158acaf61fc table local proto kernel metric 0 pref medium
local fe80::8cc7:82ff:fe88:34cd dev eni581be6be098 table local proto kernel metric 0 pref medium
local fe80::9074:f5ff:fe64:9718 dev enidfed2e3e0fb table local proto kernel metric 0 pref medium
local fe80::a8b9:faff:fe07:f885 dev eni7bf7593694f table local proto kernel metric 0 pref medium
local fe80::b460:e2ff:fe7a:4f8e dev eni4b4864ccadf table local proto kernel metric 0 pref medium
local fe80::c8cc:47ff:fec9:c0ba dev enia34ff6c5426 table local proto kernel metric 0 pref medium
local fe80::cc09:d1ff:fe8d:d2f3 dev enif29a5ce6168 table local proto kernel metric 0 pref medium
local fe80::cc30:51ff:fe0c:6b99 dev eni515a1a7b3d0 table local proto kernel metric 0 pref medium
local fe80::e452:e5ff:fe1c:27c3 dev eni7089f0973fe table local proto kernel metric 0 pref medium
local fe80::e4f2:63ff:fe3e:ae50 dev eni5489fe41698 table local proto kernel metric 0 pref medium
local fe80::e839:82ff:fe4a:4c44 dev enid9dc6c38c23 table local proto kernel metric 0 pref medium
local fe80::f0c0:66ff:fe90:c618 dev eni1bae4d1a9ed table local proto kernel metric 0 pref medium
multicast ff00::/8 dev eth0 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev antrea-gw0 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eth1 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev enie7bffac5f2f table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni581be6be098 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni9bbcadeff40 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni6352dc86622 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni8280a4036a5 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni515a1a7b3d0 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni06df4a90ef0 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev enif29a5ce6168 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni7089f0973fe table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni1a57b787475 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev enid9dc6c38c23 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni2a4fb7b9261 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev enidfed2e3e0fb table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni7bf7593694f table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni1bae4d1a9ed table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eth2 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni5489fe41698 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev enife0a1d68374 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni4b4864ccadf table local proto kernel metric 256 pref medium
multicast ff00::/8 dev enia34ff6c5426 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni158acaf61fc table local proto kernel metric 256 pref medium
multicast ff00::/8 dev eni838b7f7e667 table local proto kernel metric 256 pref medium

This is the kubectl get pods output from the failing node, so you can get the IPs assigned to the pods and match them against the info above.

test-node-1-27611973--1-fht6f   0/1     Error       0          12m     10.138.3.113   ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-1-27611973--1-lhthn   0/1     Error       0          11m     10.138.3.35    ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-1-27611979--1-7dvj4   0/1     Error       0          6m26s   10.138.3.97    ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-1-27611980--1-ht5r7   0/1     Error       0          4m55s   10.138.3.113   ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-1-27611980--1-sqdxk   0/1     Error       0          5m26s   10.138.3.15    ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-1-27611982--1-fjgdl   0/1     Error       0          2m11s   10.138.3.75    ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-1-27611982--1-nhhtg   0/1     Error       0          3m23s   10.138.3.113   ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-1-27611982--1-rlvp4   0/1     Error       0          2m52s   10.138.3.35    ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-2-27611975--1-8bkf7   0/1     Error       0          10m     10.138.3.186   ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-2-27611975--1-ds9b6   0/1     Error       0          9m14s   10.138.3.186   ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-2-27611975--1-tb274   0/1     Error       0          9m55s   10.138.3.113   ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
test-node-2-27611978--1-n2vdt   0/1     Error       0          7m26s   10.138.3.35    ip-10-138-3-41.us-east-2.compute.internal    <none>           <none>            true
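
For reference, a per-Node listing like the one above can be produced with a standard field selector (the Node name is the one from the output above):

```
kubectl get pods -A -o wide --field-selector spec.nodeName=ip-10-138-3-41.us-east-2.compute.internal
```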

@jsalatiel
Author

jsalatiel commented Jul 5, 2022

@antoninbas I have done the same test on my second EKS production cluster and I am also facing the same problem.
So, unfortunately, both of my clusters can randomly lose connectivity.

@antoninbas
Contributor

antoninbas commented Jul 6, 2022

@jsalatiel I have created an EKS cluster of my own and am trying to reproduce the issue.

@antoninbas
Contributor

@jsalatiel I found one possible issue, for Pods which receive an IP from a secondary network interface.
The issue affects connectivity to the internet, as SNAT is not done properly for these Pods.

I am not totally sure that this is the same issue as the one you are experiencing. To check if this is the same issue, you can try this fix and run the following command on all your Nodes:

sudo iptables -t nat -I PREROUTING 3 -i antrea-gw0 -m comment --comment "AWS, outbound connections" -m state --state NEW -j AWS-CONNMARK-CHAIN-0

As for me, I will be working on an actual Antrea patch to address this issue.
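
To confirm the rule is in place after running the command above (standard iptables listing, nothing Antrea-specific):

```
sudo iptables -t nat -S PREROUTING
```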

@jsalatiel
Author

@antoninbas It worked!!!! I will leave the test pods running for a while longer, but at least so far they are working as expected: failing when they are supposed to fail and completing quickly when they are supposed to work.

[screenshot: CronJob Pod results after applying the iptables rule]

When you find an actual fix, will it be ported to 1.7 and released as 1.7.1?

@antoninbas
Contributor

When you find an actual fix, will it be ported to 1.7 and released as 1.7.1?

Yes

We may also be able to handle #3974 in time for 1.7.1. It's a more minor issue.

@jsalatiel
Author

Great, thank you very much for your help tracking this.

antoninbas added a commit to antoninbas/antrea that referenced this issue Jul 7, 2022
When using Antrea in policyOnly mode on an EKS cluster, an additional
iptables rule is needed in the PREROUTING chain of the nat table. The
rule ensures that Pod-to-external traffic coming from Pods whose IP
address comes from a secondary network interface (secondary ENI) is
marked correctly, so that it hits the appropriate routing table. Without
this, traffic is SNATed with the source IP address of the primary
network interface, while being sent out of the secondary network
interface, causing the VPC to drop the traffic.

Relevant rules (before the fix):

```
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -i eni+ -m comment --comment "AWS, outbound connections" -m state --state NEW -j AWS-CONNMARK-CHAIN-0
-A PREROUTING -m comment --comment "AWS, CONNMARK" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -m comment --comment "AWS SNAT CHAIN" -j AWS-SNAT-CHAIN-0
-A POSTROUTING -m comment --comment "Antrea: jump to Antrea postrouting rules" -j ANTREA-POSTROUTING
-A ANTREA-POSTROUTING -o antrea-gw0 -m comment --comment "Antrea: masquerade LOCAL traffic" -m addrtype ! --src-type LOCAL --limit-iface-out -m addrtype --src-type LOCAL -j MASQUERADE --random-fully
-A AWS-CONNMARK-CHAIN-0 ! -d 192.168.0.0/16 -m comment --comment "AWS CONNMARK CHAIN, VPC CIDR" -j AWS-CONNMARK-CHAIN-1
-A AWS-CONNMARK-CHAIN-1 -m comment --comment "AWS, CONNMARK" -j CONNMARK --set-xmark 0x80/0x80
-A AWS-SNAT-CHAIN-0 ! -d 192.168.0.0/16 -m comment --comment "AWS SNAT CHAIN" -j AWS-SNAT-CHAIN-1
-A AWS-SNAT-CHAIN-1 ! -o vlan+ -m comment --comment "AWS, SNAT" -m addrtype ! --dst-type LOCAL -j SNAT --to-source 192.168.18.153 --random-fully

0:	from all lookup local
512:	from all to 192.168.29.56 lookup main
512:	from all to 192.168.24.134 lookup main
512:	from all to 192.168.31.135 lookup main
512:	from all to 192.168.31.223 lookup main
512:	from all to 192.168.29.27 lookup main
512:	from all to 192.168.16.158 lookup main
512:	from all to 192.168.2.135 lookup main
1024:	from all fwmark 0x80/0x80 lookup main
1536:	from 192.168.31.223 lookup 2
1536:	from 192.168.29.27 lookup 2
1536:	from 192.168.16.158 lookup 2
1536:	from 192.168.2.135 lookup 2
32766:	from all lookup main
32767:	from all lookup default

default via 192.168.0.1 dev eth1
192.168.0.1 dev eth1 scope link
```

The fix is simply to add a new PREROUTING rule:

```
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -i eni+ -m comment --comment "AWS, outbound connections" -m state --state NEW -j AWS-CONNMARK-CHAIN-0
-A PREROUTING -i antrea-gw0 -m comment --comment "Antrea: AWS, outbound connections" -m state --state NEW -j AWS-CONNMARK-CHAIN-0
-A PREROUTING -m comment --comment "AWS, CONNMARK" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

Fixes antrea-io#3946

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas antoninbas added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 8, 2022
antoninbas added a commit to antoninbas/antrea that referenced this issue Jul 8, 2022
When using Antrea in policyOnly mode on an EKS cluster, an additional
iptables rule is needed in the PREROUTING chain of the nat table. The
rule ensures that Pod-to-external traffic coming from Pods whose IP
address comes from a secondary network interface (secondary ENI) is
marked correctly, so that it hits the appropriate routing table. Without
this, traffic is SNATed with the source IP address of the primary
network interface, while being sent out of the secondary network
interface, causing the VPC to drop the traffic.

The fix is to add new PREROUTING rules, in the ANTREA-PREROUTING chain:

```
-A ANTREA-PREROUTING -i antrea-gw0 -m comment --comment "Antrea: AWS, outbound connections" -j AWS-CONNMARK-CHAIN-0
-A ANTREA-PREROUTING -m comment --comment "Antrea: AWS, CONNMARK (first packet)" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

Fixes antrea-io#3946

Signed-off-by: Antonin Bas <abas@vmware.com>
antoninbas added a commit to antoninbas/antrea that referenced this issue Jul 8, 2022
tnqn pushed a commit that referenced this issue Jul 13, 2022
tnqn pushed a commit to tnqn/antrea that referenced this issue Jul 13, 2022
tnqn added a commit that referenced this issue Jul 13, 2022