[Windows] Use IP and MAC to find virtual management adapter #3641

wenyingd · 2022-04-14T10:43:06Z

1. After an HNSNetwork is created, the Windows host creates a virtual
   management network adapter that takes over the uplink's IP and MAC.
   Previously, the agent looked up this adapter by a name of the form
   "vEthernet ($uplink_name)", but that lookup can fail when the name
   is already taken by another adapter. This change uses the uplink's
   IP and MAC to find the adapter, with the prefix "vEthernet" as a
   filter.
2. Remove the virtual adapter name from the list of names used to look
   up the Windows Node transport interface's IP configuration when the
   agent restarts. The IP is eventually moved to the OVS bridge
   interface, which is renamed from the virtual network adapter, so
   after a restart no virtual network adapter named
   "vEthernet ($uplink_name)" should exist.

Fixes #3636

Signed-off-by: wenyingd <wenyingd@vmware.com>

wenyingd · 2022-04-14T10:52:37Z

/test-windows-all
/test-all
/skip-ipv6-all
/skip-ipv6-only-all

codecov-commenter · 2022-04-14T12:04:12Z

Codecov Report

Merging #3641 (381fb96) into main (d7b1eed) will decrease coverage by 13.16%.
The diff coverage is n/a.

❗ Current head 381fb96 differs from pull request most recent head 185e0c2. Consider uploading reports for the commit 185e0c2 to get more accurate results

@@             Coverage Diff             @@
##             main    #3641       +/-   ##
===========================================
- Coverage   63.35%   50.18%   -13.17%     
===========================================
  Files         278      248       -30     
  Lines       39367    35664     -3703     
===========================================
- Hits        24941    17899     -7042     
- Misses      12472    15968     +3496     
+ Partials     1954     1797      -157

| Flag | Coverage Δ |
|---|---|
| e2e-tests | `50.18% <ø> (?)` |
| kind-e2e-tests | `?` |
| unit-tests | `?` |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/controller/networkpolicy/endpoint_querier.go | `4.58% <0.00%> (-88.08%)` | ⬇️ |
| pkg/agent/util/iptables/lock.go | `0.00% <0.00%> (-80.00%)` | ⬇️ |
| pkg/cni/client.go | `0.00% <0.00%> (-77.78%)` | ⬇️ |
| pkg/controller/networkpolicy/crd_utils.go | `14.48% <0.00%> (-77.25%)` | ⬇️ |
| ...lowaggregator/clickhouseclient/clickhouseclient.go | `0.00% <0.00%> (-76.62%)` | ⬇️ |
| pkg/controller/externalippool/validate.go | `0.00% <0.00%> (-75.87%)` | ⬇️ |
| pkg/apiserver/handlers/featuregates/handler.go | `1.63% <0.00%> (-73.78%)` | ⬇️ |
| .../registry/networkpolicy/clustergroupmember/rest.go | `11.11% <0.00%> (-73.62%)` | ⬇️ |
| pkg/controller/networkpolicy/clustergroup.go | `3.50% <0.00%> (-73.57%)` | ⬇️ |
| ...kg/agent/flowexporter/connections/conntrack_ovs.go | `0.00% <0.00%> (-70.91%)` | ⬇️ |
... and 146 more

wenyingd · 2022-04-14T14:22:26Z

/test-windows-all
/test-all
/skip-ipv6-all
/skip-ipv6-only-all

antoninbas

Thanks for the quick fix @wenyingd. This LGTM.

I tried it on my existing Windows Node first. I got the following logs from the agent:

ubuntu@ip-10-0-0-25:~$ kubectl  -n kube-system logs antrea-agent-windows-fbx9l -f

    Directory: C:\host\k\antrea

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----           4/14/2022  8:33 PM                bin
I0414 20:34:34.151157    6448 log_file.go:99] Set log file max size to 104857600
I0414 20:34:34.225900    6448 agent.go:84] Starting Antrea agent (version v1.7.0-dev-185e0c2.dirty)
I0414 20:34:34.225900    6448 client.go:81] No kubeconfig file was specified. Falling back to in-cluster config
W0414 20:34:34.233898    6448 env.go:83] Environment variable POD_NAMESPACE not found
W0414 20:34:34.235898    6448 env.go:121] Failed to get Pod Namespace from environment. Using "kube-system" as the Antrea Service Namespace
I0414 20:34:34.235898    6448 prometheus.go:171] Initializing prometheus metrics
I0414 20:34:34.235898    6448 ovs_client.go:68] Connecting to OVSDB at address \\.\pipe\C:openvswitchvarrunopenvswitchdb.sock
I0414 20:34:34.236899    6448 agent.go:331] Setting up node network
I0414 20:34:43.286042    6448 agent.go:837] "Setting Node MTU" MTU=8951
I0414 20:34:48.793827    6448 net_windows.go:386] "Creating HNSNetwork" name="antrea-hnsnetwork" subnet="192.168.3.0/24" nodeIP="10.0.0.189/24" adapter=&{Index:11 MTU:9001 Name:Ethernet HardwareAddr:06:5e:47:7f:7f:93 Flags:up|broadcast|multicast}
I0414 20:34:50.430779    6448 net_windows.go:408] "Moving uplink configuration to the management virtual network adapter" adapter="vEthernet (Ethernet) 3"
I0414 20:35:02.896840    6448 net_windows.go:431] "Moved uplink configuration to the management virtual network adapter" adapter="vEthernet (Ethernet) 3"



^C
ubuntu@ip-10-0-0-25:~$ ping 10.0.0.189
PING 10.0.0.189 (10.0.0.189) 56(84) bytes of data.
^C
--- 10.0.0.189 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4084ms

ubuntu@ip-10-0-0-25:~$

After that, connectivity was lost to the instance and I had to force reboot from the AWS console to recover connectivity. I had the same issue after rebooting the Antrea Agent.

However, I tried on a fresh Windows instance, and I didn't observe the issue:

ubuntu@ip-10-0-0-25:~$ kubectl  -n kube-system logs antrea-agent-windows-g62fc -f

    Directory: C:\host\k\antrea

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----           4/14/2022  8:49 PM                bin
I0414 20:49:37.801997    7584 log_file.go:99] Set log file max size to 104857600
I0414 20:49:37.866647    7584 agent.go:84] Starting Antrea agent (version v1.7.0-dev-185e0c2.dirty)
I0414 20:49:37.867650    7584 client.go:81] No kubeconfig file was specified. Falling back to in-cluster config
W0414 20:49:37.875656    7584 env.go:83] Environment variable POD_NAMESPACE not found
W0414 20:49:37.877655    7584 env.go:121] Failed to get Pod Namespace from environment. Using "kube-system" as the Antrea Service Namespace
I0414 20:49:37.878663    7584 prometheus.go:171] Initializing prometheus metrics
I0414 20:49:37.878663    7584 ovs_client.go:68] Connecting to OVSDB at address \\.\pipe\C:openvswitchvarrunopenvswitchdb.sock
I0414 20:49:37.879668    7584 agent.go:331] Setting up node network
I0414 20:49:37.920852    7584 agent.go:837] "Setting Node MTU" MTU=8951
I0414 20:49:43.122357    7584 net_windows.go:386] "Creating HNSNetwork" name="antrea-hnsnetwork" subnet="192.168.4.0/24" nodeIP="10.0.0.10/24" adapter=&{Index:9 MTU:9001 Name:Ethernet HardwareAddr:06:37:40:9b:4a:09 Flags:up|broadcast|multicast}
I0414 20:49:58.292344    7584 net_windows.go:514] Enabled Receive Segment Coalescing (RSC) for vSwitch antrea-hnsnetwork
I0414 20:49:58.292480    7584 net_windows.go:453] "Created HNSNetwork" name="antrea-hnsnetwork" id="8918EBD5-E86A-4B3F-B6F6-46C485DB0806"
I0414 20:49:58.293621    7584 ovs_client.go:119] Created bridge: 0d1b7b88-5b32-4db1-8d36-1e9e98e73819
...

I don't know if the error in the first instance is something we need to worry about. I know that this corresponds to a different code path in PrepareHNSNetwork, but I don't know enough about it.

wenyingd · 2022-04-15T01:37:47Z

/test-windows-conformance

wenyingd · 2022-04-15T02:56:16Z

> After that, connectivity was lost to the instance and I had to force reboot from the AWS console to recover connectivity. I had the same issue after rebooting the Antrea Agent.

I think we should focus on the issue on the first instance. @antoninbas Could you help dump the IP/route configurations from the console after the network is lost? This logic runs when the HNS network does not move the IP to the virtual management adapter (which is not expected), and the agent then tries to move the configuration itself.

wenyingd · 2022-04-15T03:01:56Z

@antoninbas I have another question: is OVS working correctly on the Windows Node in your first instance? To verify it, maybe you could try with Antrea 1.4?

antoninbas · 2022-04-15T17:27:46Z

@wenyingd Unfortunately I deleted that instance yesterday after I got the new instance working, so I can't collect the information you are asking for :/

> Could you help dump the IP/route configurations from console after the network is lost?

I only have RDP access to the instance, so I don't think I could have done that... Once the network goes down, I don't have any access to the instance anymore.

Let me know if we can merge this PR.

wenyingd · 2022-04-18T02:18:27Z

> @wenyingd Unfortunately I deleted that instance yesterday after I got the new instance working, so I can't collect the information you are asking for :/
>
> > Could you help dump the IP/route configurations from console after the network is lost?
>
> I only have RDP access to the instance, so I don't think I could have done that... Once the network goes down, I don't have any access to the instance anymore.
>
> Let me know if we can merge this PR.

Then maybe we could merge this PR first? In my opinion, the lost network connectivity is likely a different issue, and we can handle it once it is reproduced and enough information has been collected. What do you think, @antoninbas?

antoninbas · 2022-04-18T18:28:01Z

@wenyingd sounds good to me

antoninbas · 2022-04-18T18:28:47Z

@wenyingd could you backport this as needed?

wenyingd requested review from antoninbas, XinShuYang and tnqn April 14, 2022 10:43

wenyingd force-pushed the issue_3636 branch from 7a30af2 to 2754311 Compare April 14, 2022 10:52

wenyingd force-pushed the issue_3636 branch from 2754311 to 1890ae0 Compare April 14, 2022 11:46

wenyingd force-pushed the issue_3636 branch from 1890ae0 to 9312941 Compare April 14, 2022 13:51

wenyingd force-pushed the issue_3636 branch from 9312941 to 185e0c2 Compare April 14, 2022 14:21

antoninbas approved these changes Apr 14, 2022

antoninbas added area/OS/windows Issues or PRs related to the Windows operating system. kind/bug Categorizes issue or PR as related to a bug. action/release-note Indicates a PR that should be included in release notes. labels Apr 18, 2022

antoninbas merged commit ec06feb into antrea-io:main Apr 18, 2022

wenyingd deleted the issue_3636 branch August 15, 2022 03:31
