[Windows] Antrea Agent may fail to program OpenFlow entries if OVS takes longer than 5s to allocate the OF port #6721
Labels
kind/bug
Categorizes issue or PR as related to a bug.
Describe the bug
We recently hit Windows Pod networking issues in a scale setup where antrea-agent failed to program the Pod's OpenFlow entries. The failures were visible in the antrea-agent logs.
After some investigation, we found that the failure occurs because OVS does not allocate the OpenFlow port within 5s. For example, OVS may take several minutes to allocate the port because of limited system resources (CPU/memory) or an internal OVS bug.
On Windows, OVS port creation and OpenFlow entry programming are asynchronous, because the Pod's vNIC is actually created by Windows after containerd starts the container, while antrea-agent uses OVSDB to manage the Pod's IPAM and HNSEndpoint configurations based on the CNI request. antrea-agent uses a `PostInterfaceCreateHook` to change the OVS interface type from "system" to "internal" and to program the corresponding OpenFlow entries. In the original solution, antrea-agent waits synchronously, for up to 5s, for the OpenFlow port to become ready. When the OVS issue occurs, antrea-agent may fail to program the OpenFlow entries because no OpenFlow port has been allocated. Simply increasing the wait time is not a good solution, because there is no upper bound on how long the allocation can take when OVS misbehaves.
A valid fix is to use the OpenFlow PortStatus message instead, which OVS (the OF switch) sends after a new OpenFlow port is added. Notes on this solution include:
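As a rough illustration of the PortStatus-driven approach (a sketch under assumptions, not the actual Antrea change), the agent could register a callback per interface and run it when the matching "port added" notification arrives, however late that is. The `portStatus` struct is a simplified stand-in for the OpenFlow message, and `portWatcher`, `Expect`, and `Handle` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// portStatus is a simplified, hypothetical stand-in for an OpenFlow
// PortStatus message.
type portStatus struct {
	Name   string
	OFPort uint32
	Added  bool // true when the message reports a newly added port
}

// portWatcher keeps callbacks for interfaces still waiting for a port.
type portWatcher struct {
	mu      sync.Mutex
	pending map[string]func(ofPort uint32)
}

func newPortWatcher() *portWatcher {
	return &portWatcher{pending: make(map[string]func(uint32))}
}

// Expect registers a callback to run once the named port is reported.
func (w *portWatcher) Expect(name string, cb func(ofPort uint32)) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[name] = cb
}

// Handle processes one message and reports whether a callback ran.
// Each callback runs at most once.
func (w *portWatcher) Handle(ps portStatus) bool {
	if !ps.Added {
		return false
	}
	w.mu.Lock()
	cb, ok := w.pending[ps.Name]
	if ok {
		delete(w.pending, ps.Name)
	}
	w.mu.Unlock()
	if !ok {
		return false
	}
	cb(ps.OFPort)
	return true
}

func main() {
	w := newPortWatcher()
	w.Expect("pod-a", func(p uint32) {
		fmt.Printf("install flows for pod-a on OF port %d\n", p)
	})

	// Simulated message stream from the switch.
	events := make(chan portStatus, 2)
	events <- portStatus{Name: "other", OFPort: 3, Added: true}
	events <- portStatus{Name: "pod-a", OFPort: 9, Added: true}
	close(events)
	for ps := range events {
		w.Handle(ps)
	}
}
```

The design choice here is that nothing blocks on a deadline: flow installation is triggered by the notification itself, so a slow allocation delays the Pod's flows instead of failing them.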
Versions:
Additional context