Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix LAG going down after warm reboot with SONiC neighbors #17040

Merged
merged 2 commits into from
Nov 3, 2023

Conversation

saiarcot895
Copy link
Contributor

@saiarcot895 saiarcot895 commented Oct 30, 2023

Fixes #16875.

Why I did it

On physical devices, when doing a warm reboot on an image that supports the teamd retry count feature, with neighbor devices that are also running SONiC and support the teamd retry count feature, the LAG will become operationally down. Specifically, while the LAG won't go into expired or defaulted state, the DUT undergoing warm reboot will send a LACP PDU indicating it's not ready to send or receive traffic:

17:49:21.645115 1c:34:da:eb:bc:80 (oui Unknown) > 01:80:c2:00:00:02 (oui Unknown), ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
        Actor Information TLV (0x01), length 20
          System 1c:34:da:eb:bc:80 (oui Unknown), System Priority 65535, Key 1103, Port 121, Port Priority 255
          State Flags [none]
        Partner Information TLV (0x02), length 20
          System 52:54:00:05:bd:e5 (oui Unknown), System Priority 65535, Key 11, Port 2, Port Priority 255
          State Flags [Activity, Aggregation, Synchronization, Collecting, Distributing]
        Collector Information TLV (0x03), length 16
          Max Delay 0
        Terminator TLV (0x00), length 0

This is indicated by the state flags in the actor information TLV being none (i.e. no flags are set).

This LACP PDU was generated and sent from the DUT when teamd was initializating in warm-reboot mode. The purpose of this mode is to keep the LAG alive across warm reboots. To do this, the last PDU packet received from the partner is saved to disk before warm reboot. This packet is then read after the warm reboot happens and teamd starts up, and processed as if it was just received from the partner. This allows teamd to send a PDU packet as soon as possible, to avoid hitting the 90-second limit and risking the LAG going down. This also prevents teamd from sending a PDU packet to the partner without any partner information filled in and without any of the actor's state flags being set, like what would happen with a normal teamd startup after cold reboot; this would tell the partner that the LAG went down, and there would be a traffic disruption.

Now, with the teamd retry count feature, a new LACP packet version (version 0xf1) is used for LACP packets. As part of the HLD/implementation for this, if there's a change in LACP packet version received from the peer (i.e. the peer changed from sending 0x1 version PDU packets to 0xf1 version PDU packets, or vice-versa), then a PDU packet is immediately sent, as an acknowledgment that the version has changed. The "initial" version that is used by teamd for this purpose is 0x1, meaning it defaults to assuming 0x1 packets are being used. Because the last saved packet will almost certainly be version 0xf1 when this feature is enabled, this would be seen as a version change, and an acknowledgement packet would be sent.

That alone isn't an issue. The issue comes from the fact that at the point this version check is done and the acknowledgement packet is generated and sent, the actor's own state flags haven't been fully set/initialized yet. This happens after the state of the LAG (whether it's in active, expired, or defaulted state) is set within teamd, and this happens in the lacp_port_set_state function. The actor state flags update happens in the lacp_port_actor_update function, which gets called both within lacp_port_set_state and additionally after lacp_port_set_state gets called (this function is idempotent, so it can be called multiple times without any negative impact). The lacp_port_set_state function gets called a couple lines after the version check is done, which is the problem. As a result, the partner thinks that the DUT is not ready to use the LAG, and so brings the LAG down, disrupting traffic.

This issue doesn't happen on KVM likely because datapath in KVM goes down during warm/fast-reboot, and the kernel interfaces aren't shown as being oper up until well after teamd starts. Teamd checks to see if the interface is oper up before reading and processing the saved PDU packet.

Work item tracking
  • Microsoft ADO (number only): 25513519

How I did it

Make sure that if an ack packet needs to be sent becuase the retry count has changed, it is sent after lacp_port_actor_update has been called.

How to verify it

Tested on warm reboot on Mellanox DUT with SONiC neighbors, and verified the SONiC neighbors are not disabling the port channel interface.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@saiarcot895 saiarcot895 marked this pull request as ready for review October 31, 2023 23:25
@saiarcot895 saiarcot895 requested a review from lguohan as a code owner October 31, 2023 23:25
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@yxieca yxieca merged commit 686678a into sonic-net:master Nov 3, 2023
@saiarcot895 saiarcot895 deleted the fix-teamd-warm-reboot-pdu branch November 3, 2023 06:48
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Nov 8, 2023
…17040)

* Fix LAG going down after warm reboot with SONiC neighbors

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202305: #17117

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LAG flap seen in warm-boot due to recent teamd retry count feature
6 participants