Fix LAG going down after warm reboot with SONiC neighbors #17040

saiarcot895 · 2023-10-30T16:41:18Z

Why I did it

On physical devices, when doing a warm reboot on an image that supports the teamd retry count feature, with neighbor devices that are also running SONiC and support the teamd retry count feature, the LAG will become operationally down. Specifically, while the LAG won't go into expired or defaulted state, the DUT undergoing warm reboot will send a LACP PDU indicating it's not ready to send or receive traffic:

17:49:21.645115 1c:34:da:eb:bc:80 (oui Unknown) > 01:80:c2:00:00:02 (oui Unknown), ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
        Actor Information TLV (0x01), length 20
          System 1c:34:da:eb:bc:80 (oui Unknown), System Priority 65535, Key 1103, Port 121, Port Priority 255
          State Flags [none]
        Partner Information TLV (0x02), length 20
          System 52:54:00:05:bd:e5 (oui Unknown), System Priority 65535, Key 11, Port 2, Port Priority 255
          State Flags [Activity, Aggregation, Synchronization, Collecting, Distributing]
        Collector Information TLV (0x03), length 16
          Max Delay 0
        Terminator TLV (0x00), length 0

This is indicated by the state flags in the actor information TLV being none (i.e. no flags are set).

This LACP PDU was generated and sent from the DUT when teamd was initializating in warm-reboot mode. The purpose of this mode is to keep the LAG alive across warm reboots. To do this, the last PDU packet received from the partner is saved to disk before warm reboot. This packet is then read after the warm reboot happens and teamd starts up, and processed as if it was just received from the partner. This allows teamd to send a PDU packet as soon as possible, to avoid hitting the 90-second limit and risking the LAG going down. This also prevents teamd from sending a PDU packet to the partner without any partner information filled in and without any of the actor's state flags being set, like what would happen with a normal teamd startup after cold reboot; this would tell the partner that the LAG went down, and there would be a traffic disruption.

Now, with the teamd retry count feature, a new LACP packet version (version 0xf1) is used for LACP packets. As part of the HLD/implementation for this, if there's a change in LACP packet version received from the peer (i.e. the peer changed from sending 0x1 version PDU packets to 0xf1 version PDU packets, or vice-versa), then a PDU packet is immediately sent, as an acknowledgment that the version has changed. The "initial" version that is used by teamd for this purpose is 0x1, meaning it defaults to assuming 0x1 packets are being used. Because the last saved packet will almost certainly be version 0xf1 when this feature is enabled, this would be seen as a version change, and an acknowledgement packet would be sent.

That alone isn't an issue. The issue comes from the fact that at the point this version check is done and the acknowledgement packet is generated and sent, the actor's own state flags haven't been fully set/initialized yet. This happens after the state of the LAG (whether it's in active, expired, or defaulted state) is set within teamd, and this happens in the lacp_port_set_state function. The actor state flags update happens in the lacp_port_actor_update function, which gets called both within lacp_port_set_state and additionally after lacp_port_set_state gets called (this function is idempotent, so it can be called multiple times without any negative impact). The lacp_port_set_state function gets called a couple lines after the version check is done, which is the problem. As a result, the partner thinks that the DUT is not ready to use the LAG, and so brings the LAG down, disrupting traffic.

This issue doesn't happen on KVM likely because datapath in KVM goes down during warm/fast-reboot, and the kernel interfaces aren't shown as being oper up until well after teamd starts. Teamd checks to see if the interface is oper up before reading and processing the saved PDU packet.

Work item tracking

Microsoft ADO (number only): 25513519

How I did it

Make sure that if an ack packet needs to be sent becuase the retry count has changed, it is sent after lacp_port_actor_update has been called.

How to verify it

Tested on warm reboot on Mellanox DUT with SONiC neighbors, and verified the SONiC neighbors are not disabling the port channel interface.

Which release branch to backport (provide reason below if selected)

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

…17040) * Fix LAG going down after warm reboot with SONiC neighbors Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

mssonicbld · 2023-11-08T10:12:21Z

Cherry-pick PR to 202305: #17117

…17117)

Fix LAG going down after warm reboot with SONiC neighbors

7df4769

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

saiarcot895 marked this pull request as ready for review October 31, 2023 23:25

saiarcot895 requested a review from lguohan as a code owner October 31, 2023 23:25

saiarcot895 requested review from yxieca and vaibhavhd October 31, 2023 23:25

saiarcot895 added the Request for 202305 Branch label Nov 1, 2023

stepanblyschak approved these changes Nov 2, 2023

View reviewed changes

Remove extra else-clause that is basically a no-op

4eb8fe5

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

yxieca approved these changes Nov 3, 2023

View reviewed changes

yxieca merged commit 686678a into sonic-net:master Nov 3, 2023

saiarcot895 deleted the fix-teamd-warm-reboot-pdu branch November 3, 2023 06:48

dgsudharsan added the Bug 🐛 label Nov 7, 2023

StormLiangMS added the Approved for 202305 Branch label Nov 8, 2023

mssonicbld added the Created PR to 202305 Branch label Nov 8, 2023

mssonicbld mentioned this pull request Nov 8, 2023

[action] [PR:17040] Fix LAG going down after warm reboot with SONiC neighbors #17117

Merged

11 tasks

mssonicbld added a commit that referenced this pull request Nov 8, 2023

Fix LAG going down after warm reboot with SONiC neighbors (#17040) (#…

f64f96e

…17117)

mssonicbld added Included in 202305 Branch and removed Created PR to 202305 Branch labels Nov 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix LAG going down after warm reboot with SONiC neighbors #17040

Fix LAG going down after warm reboot with SONiC neighbors #17040

saiarcot895 commented Oct 30, 2023 •

edited

Loading

mssonicbld commented Nov 8, 2023

Fix LAG going down after warm reboot with SONiC neighbors #17040

Fix LAG going down after warm reboot with SONiC neighbors #17040

Conversation

saiarcot895 commented Oct 30, 2023 • edited Loading

Why I did it

Work item tracking

How I did it

How to verify it

Which release branch to backport (provide reason below if selected)

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

mssonicbld commented Nov 8, 2023

saiarcot895 commented Oct 30, 2023 •

edited

Loading