[Windows] Antrea Agent may fail to program OpenFlow entries if OVS allocates OF port with a time longer than 5s #6721

Closed
wenyingd opened this issue Oct 8, 2024 · 0 comments · Fixed by #6763

wenyingd commented Oct 8, 2024

Describe the bug

We recently hit Windows Pod networking issues in a scale setup: antrea-agent failed to program the Pod's OpenFlow entries. The antrea-agent logs show failures such as:

E0921 04:38:39.289057 4364 interface_configuration_windows.go:488] "Failed to execute postInterfaceCreateHook" err="timed out: \"wait\" timed out after 5012 ms" interface="vEthernet (vsphere--ab7002)"

After some investigation, we found that the failure occurs because OVS does not allocate the OpenFlow port within 5s. For example, OVS may take several minutes to allocate it because of limited system resources (CPU/memory) or an internal OVS bug.

On Windows, OVS port creation and OpenFlow entry programming are asynchronous, because the Pod's vNIC is created by Windows only after containerd starts the container, whereas antrea-agent uses OVSDB to manage the Pod's IPAM and HNSEndpoint configurations based on the CNI request. antrea-agent uses a PostInterfaceCreateHook to change the OVS interface type from "system" to "internal" and to program the corresponding OpenFlow entries. In the current implementation, antrea-agent blocks for up to 5s waiting for the OpenFlow port to become ready. When the OVS issue occurs, antrea-agent may fail to program the OpenFlow entries because no OpenFlow port has been allocated.
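The blocking wait described above can be sketched roughly as follows. This is only an illustration of the failure mode, not Antrea's actual code: `waitForOFPort`, `ofPortLookup`, and `slowOVS` are hypothetical names, the interface names are made up, and the timings are scaled down so that 50 ms stands in for the 5s wait.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout plays the role of the "timed out" failure seen in the agent log.
var errTimeout = errors.New("timed out waiting for OpenFlow port")

// ofPortLookup is a hypothetical stand-in for asking OVSDB for the ofport of
// an interface. It returns ok=false until OVS has allocated a port.
type ofPortLookup func(iface string) (ofPort int32, ok bool)

// waitForOFPort polls lookup every interval until it reports a port or the
// timeout elapses, whichever comes first.
func waitForOFPort(lookup ofPortLookup, iface string, timeout, interval time.Duration) (int32, error) {
	deadline := time.Now().Add(timeout)
	for {
		if port, ok := lookup(iface); ok {
			return port, nil
		}
		if time.Now().After(deadline) {
			return 0, fmt.Errorf("%w: interface %q after %v", errTimeout, iface, timeout)
		}
		time.Sleep(interval)
	}
}

// slowOVS simulates an OVS instance that allocates the port only after delay.
func slowOVS(delay time.Duration, port int32) ofPortLookup {
	readyAt := time.Now().Add(delay)
	return func(string) (int32, bool) {
		if time.Now().Before(readyAt) {
			return 0, false
		}
		return port, true
	}
}

func main() {
	// Scaled-down timings: 50 ms stands in for the 5s wait.
	const timeout, interval = 50 * time.Millisecond, 5 * time.Millisecond

	// OVS answers in time: the caller can go on to program flows.
	port, err := waitForOFPort(slowOVS(10*time.Millisecond, 7), "vEthernet (pod-a)", timeout, interval)
	fmt.Println(port, err)

	// OVS is slower than the timeout: the caller gives up, and the Pod's
	// flows are never programmed even though the port appears later.
	port, err = waitForOFPort(slowOVS(200*time.Millisecond, 8), "vEthernet (pod-b)", timeout, interval)
	fmt.Println(port, err)
}
```

In the second call the port does become available eventually, but nothing is left waiting for it, which matches the symptom reported here.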

Simply increasing the wait time is not a good solution, because there is no upper bound on how long OVS may take when this failure happens.

A valid fix is to rely on the OpenFlow PortStatus message instead, which OVS (the OF switch) sends after a new OpenFlow port is added. Points to note for this solution:

  1. OVS requires the OpenFlow port status to be "LIVE" before the port can be used in OpenFlow entries.
  2. PortStatus messages are sent only for changes to ports that happen after the OF switch has connected to the OF controller; no messages are generated for existing ports.
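A minimal sketch of the event-driven approach, under stated assumptions: `portStatus`, `portWatcher`, `flowInstaller`, and the interface name are hypothetical and do not reflect Antrea's or any OpenFlow library's API. Only the reason and state values are taken from the OpenFlow 1.3 definitions (`ofp_port_reason`, `ofp_port_state`). The sketch also assumes the port may first be announced without the LIVE bit and reach LIVE in a later message.

```go
package main

import (
	"fmt"
	"sync"
)

// Reason and state values as defined by OpenFlow 1.3.
const (
	reasonAdd    uint8 = 0 // OFPPR_ADD
	reasonDelete uint8 = 1 // OFPPR_DELETE
	reasonModify uint8 = 2 // OFPPR_MODIFY

	stateLive uint32 = 1 << 2 // OFPPS_LIVE
)

// portStatus is a simplified, hypothetical view of a PortStatus message.
type portStatus struct {
	Reason uint8
	Name   string
	OFPort uint32
	State  uint32
}

// flowInstaller is a hypothetical callback that programs a Pod's flows.
type flowInstaller func(iface string, ofPort uint32) error

// portWatcher installs flows for expected interfaces once their port is LIVE.
type portWatcher struct {
	mu      sync.Mutex
	pending map[string]struct{} // interfaces still waiting for a usable port
	install flowInstaller
}

func newPortWatcher(install flowInstaller) *portWatcher {
	return &portWatcher{pending: make(map[string]struct{}), install: install}
}

// expect records that iface needs flows as soon as its port is usable.
func (w *portWatcher) expect(iface string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[iface] = struct{}{}
}

// handle processes one PortStatus message and reports whether flows were
// installed as a result. There is no deadline: it reacts whenever OVS reports.
func (w *portWatcher) handle(msg portStatus) (bool, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if msg.Reason == reasonDelete || msg.State&stateLive == 0 {
		return false, nil // port removed, or not yet usable in flows (note 1)
	}
	if _, waiting := w.pending[msg.Name]; !waiting {
		return false, nil
	}
	if err := w.install(msg.Name, msg.OFPort); err != nil {
		return false, err // stays pending so a later message can retry
	}
	delete(w.pending, msg.Name)
	return true, nil
}

// reconcile feeds in ports that existed before the controller connected,
// because OVS sends no PortStatus for them (note 2). How the caller lists
// those ports is left open here.
func (w *portWatcher) reconcile(existing []portStatus) error {
	for _, p := range existing {
		if _, err := w.handle(p); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	installed := map[string]uint32{}
	w := newPortWatcher(func(iface string, ofPort uint32) error {
		installed[iface] = ofPort
		return nil
	})
	w.expect("vEthernet (pod-a)")

	// Port announced but not LIVE yet: nothing is installed.
	w.handle(portStatus{Reason: reasonAdd, Name: "vEthernet (pod-a)", OFPort: 12})
	// A later message reports LIVE, however long OVS took to get there.
	w.handle(portStatus{Reason: reasonModify, Name: "vEthernet (pod-a)", OFPort: 12, State: stateLive})
	fmt.Println(installed)
}
```

The design choice illustrated is that the agent records what it is waiting for and reacts to the switch's notification, so a slow OVS only delays flow programming instead of causing it to be skipped.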

Versions:

  • Antrea version: main

@wenyingd wenyingd added the kind/bug Categorizes issue or PR as related to a bug. label Oct 8, 2024
@antoninbas antoninbas added this to the Antrea v2.2 release milestone Oct 8, 2024