LAG flap seen in warm-boot due to recent teamd retry count feature #16875

Closed
stepanblyschak opened this issue Oct 13, 2023 · 0 comments · Fixed by #17040
Labels: Issue for 202305, MSFT Triaged

Description

The recently added teamd retry count feature breaks warm-reboot. The new warm-reboot flow for LAG is:

  • Probe the peer’s teamd retry count capability and increase the teamd retry count to 5
  • The device and its peer now exchange the new PDUs, with the version set to 0xf1
  • Send SIGUSR1 to teamd; teamd saves the last received PDU, which has version 0xf1:
root@qa-eth-vt02-3-2700a0:/home/admin# hexdump /host/warmboot/teamd/Ethernet40
0000000 8001 00c2 0200 341c bfda 00b8 0988 f101
0000010 1401 ffff 341c bfda 00b8 1227 ff00 7100
0000020 003d 0000 1402 ffff 8a24 0b07 0027 1227
0000030 ff00 2900 003d 0000 1003 0000 0000 0000
0000040 0000 0000 0000 0000 0480 0003 0481 0005
0000050 0000 0000 0000 0000 0000 0000 0000 0000
*
0000070 0000 0000 0000 0000 0000 0000
000007c
root@qa-eth-vt02-3-2700a0:/home/admin#
root@qa-eth-vt02-3-2700a0:/home/admin#
root@qa-eth-vt02-3-2700a0:/home/admin# cat /host/warmboot/teamd/PortChannel0002
1
1
Ethernet40
1
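
The saved PDU above is shown in the default `hexdump` format, which prints 16-bit words in host (little-endian) byte order, so each pair of bytes appears swapped. As a minimal sketch (the helper names `hexdump_words_to_bytes` and `mac` are hypothetical, and only the first two dump lines are decoded), the following restores wire order and reads the source MAC and the LACP version byte:

```python
# First two lines of the hexdump output for /host/warmboot/teamd/Ethernet40.
SAVED_PDU_HEAD = """
0000000 8001 00c2 0200 341c bfda 00b8 0988 f101
0000010 1401 ffff 341c bfda 00b8 1227 ff00 7100
"""


def hexdump_words_to_bytes(dump: str) -> bytes:
    """Convert default hexdump output (16-bit little-endian words) to bytes.

    This sketch does not expand the "*" repeat marker, so it is only
    suitable for the leading lines of a dump.
    """
    out = bytearray()
    for line in dump.strip().splitlines():
        fields = line.split()
        if len(fields) < 2:
            # "*" marker or the trailing offset-only line: nothing to decode.
            continue
        for word in fields[1:]:
            out += int(word, 16).to_bytes(2, "little")
    return bytes(out)


def mac(raw: bytes) -> str:
    """Format six raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


frame = hexdump_words_to_bytes(SAVED_PDU_HEAD)
# Ethernet header: dst (6) + src (6) + ethertype (2), then subtype and version.
print(f"src={mac(frame[6:12])} ethertype=0x{frame[12:14].hex()} "
      f"subtype={frame[14]} version=0x{frame[15]:02x}")
```

Decoded this way, the saved frame comes from 1c:34:da:bf:b8:00 and carries version 0xf1, which matches the third step of the flow.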

Here's a trace of the PDU exchange during warm-reboot between DUT (24:8a:07:0b:27:00) and AUX (1c:34:da:bf:b8:00):

Last PDU sent by DUT before kexec:

17:17:20.422850 24:8a:07:0b:27:00 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACP version 241 packet not supported
        0x0000:  0180 c200 0002 248a 070b 2700 8809 01f1  ......$...'.....
        0x0010:  0114 ffff 248a 070b 2700 2712 00ff 0029  ....$...'.'....)
        0x0020:  3d00 0000 0214 ffff 1c34 dabf b800 2712  =........4....'.
        0x0030:  00ff 0071 3d00 0000 0310 0000 0000 0000  ...q=...........
        0x0040:  0000 0000 0000 0000 8004 0500 8104 0300  ................
        0x0050:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0060:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0070:  0000 0000 0000 0000 0000 0000            ............

First PDU sent by DUT after kexec:

17:18:44.246352 24:8a:07:0b:27:00 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
        Actor Information TLV (0x01), length 20
          System 24:8a:07:0b:27:00, System Priority 65535, Key 10002, Port 41, Port Priority 255
          State Flags [none]
          0x0000:  ffff 248a 070b 2700 2712 00ff 0029 0000
          0x0010:  0000
        Partner Information TLV (0x02), length 20
          System 1c:34:da:bf:b8:00, System Priority 65535, Key 10002, Port 113, Port Priority 255
          State Flags [Activity, Aggregation, Synchronization, Collecting, Distributing]
          0x0000:  ffff 1c34 dabf b800 2712 00ff 0071 3d00
          0x0010:  0000
        Collector Information TLV (0x03), length 16
          Max Delay 0
          0x0000:  0000 0000 0000 0000 0000 0000 0000
        Terminator TLV (0x00), length 0
        0x0000:  0180 c200 0002 248a 070b 2700 8809 0101  ......$...'.....
        0x0010:  0114 ffff 248a 070b 2700 2712 00ff 0029  ....$...'.'....)
        0x0020:  0000 0000 0214 ffff 1c34 dabf b800 2712  .........4....'.
        0x0030:  00ff 0071 3d00 0000 0310 0000 0000 0000  ...q=...........
        0x0040:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0050:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0060:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0070:  0000 0000 0000 0000 0000 0000            ............
17:18:44.247304 1c:34:da:bf:b8:00 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
        Actor Information TLV (0x01), length 20
          System 1c:34:da:bf:b8:00, System Priority 65535, Key 10002, Port 113, Port Priority 255
          State Flags [Activity, Aggregation, Synchronization, Collecting, Distributing]
          0x0000:  ffff 1c34 dabf b800 2712 00ff 0071 3d00
          0x0010:  0000
        Partner Information TLV (0x02), length 20
          System 24:8a:07:0b:27:00, System Priority 65535, Key 10002, Port 41, Port Priority 255
          State Flags [none]
          0x0000:  ffff 248a 070b 2700 2712 00ff 0029 0000
          0x0010:  0000
        Collector Information TLV (0x03), length 16
          Max Delay 0
          0x0000:  0000 0000 0000 0000 0000 0000 0000
        Terminator TLV (0x00), length 0
        0x0000:  0180 c200 0002 1c34 dabf b800 8809 0101  .......4........
        0x0010:  0114 ffff 1c34 dabf b800 2712 00ff 0071  .....4....'....q
        0x0020:  3d00 0000 0214 ffff 248a 070b 2700 2712  =.......$...'.'.
        0x0030:  00ff 0029 0000 0000 0310 0000 0000 0000  ...)............
        0x0040:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0050:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0060:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0070:  0000 0000 0000 0000 0000 0000            ............

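The State Flags field in the actor TLV of the first PDU sent by DUT after kexec is empty (0x00, compared with 0x3d in the last PDU sent before kexec), which causes a short flap of a few ms on AUX.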
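
To make the difference between the two DUT PDUs explicit, here is a minimal sketch that reads the LACP version and the actor state byte from the first 36 bytes of each traced frame. The function name `parse_lacpdu_head` is hypothetical, and the offsets assume an untagged Ethernet frame with the actor TLV directly after the version byte, as in the traces above:

```python
import struct

# LACP actor/partner state bits, least significant bit first.
LACP_STATE_BITS = [
    (0x01, "Activity"), (0x02, "Timeout"), (0x04, "Aggregation"),
    (0x08, "Synchronization"), (0x10, "Collecting"), (0x20, "Distributing"),
    (0x40, "Defaulted"), (0x80, "Expired"),
]

# Leading bytes of the DUT frames, copied from the tcpdump hex above.
PRE_KEXEC = bytes.fromhex(
    "0180 c200 0002 248a 070b 2700 8809 01f1"
    "0114 ffff 248a 070b 2700 2712 00ff 0029"
    "3d00 0000"
)
POST_KEXEC = bytes.fromhex(
    "0180 c200 0002 248a 070b 2700 8809 0101"
    "0114 ffff 248a 070b 2700 2712 00ff 0029"
    "0000 0000"
)


def parse_lacpdu_head(frame: bytes):
    """Return (version, actor_state_flag_names) for an untagged LACP frame."""
    ethertype, subtype, version = struct.unpack_from("!HBB", frame, 12)
    if ethertype != 0x8809 or subtype != 0x01:
        raise ValueError("not a LACP frame")
    tlv_type, tlv_len = frame[16], frame[17]
    if tlv_type != 0x01 or tlv_len != 20:
        raise ValueError("unexpected actor TLV header")
    # Within the TLV: type(1) len(1) sys-prio(2) system(6) key(2)
    # port-prio(2) port(2), then the state byte at TLV offset 16.
    state = frame[16 + 16]
    names = [name for bit, name in LACP_STATE_BITS if state & bit]
    return version, names


for label, frame in (("pre-kexec", PRE_KEXEC), ("post-kexec", POST_KEXEC)):
    version, flags = parse_lacpdu_head(frame)
    print(f"{label}: version=0x{version:02x} actor_state={flags or 'none'}")
```

On these two frames the sketch reports version 0xf1 with Activity, Aggregation, Synchronization, Collecting and Distributing set before kexec, and version 0x01 with no actor flags set afterwards.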

Steps to reproduce the issue:

The setup consists of two SONiC switches running 202305:

IXIA - DUT - AUX - IXIA
  1. Configure portchannels between DUT and AUX
  2. Configure BGP between DUT and IXIA, AUX and IXIA
  3. Run traffic and perform warm-reboot on DUT

The issue always reproduces on a LAG that has a BGP session configured on it.

Describe the results you received:

Observed PortChannel0002 LAG flap and traffic drop on AUX when DUT switch is warm booting.

Describe the results you expected:

No LAG flap, no traffic drop.

Output of show version:

SONiC Software Version: SONiC.202305_RC.7-c8447efe1_Internal
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: 7eabbdab4
Build date: Thu Oct  5 00:53:57 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-241

Platform: x86_64-mlnx_msn2700-r0
HwSKU: ACS-MSN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1828K21616
Model Number: MSN2700-CS2F
Hardware Revision: A1
Uptime: 01:47:32 up 4 min,  1 user,  load average: 3.36, 2.93, 1.41
Date: Sat 07 Oct 2023 01:47:32

Docker images:
REPOSITORY                                         TAG                              IMAGE ID       SIZE
docker-platform-monitor                            202305_RC.7-c8447efe1_Internal   a2811c8c99f9   840MB
docker-platform-monitor                            latest                           a2811c8c99f9   840MB
docker-syncd-mlnx                                  202305_RC.7-c8447efe1_Internal   34cc706cca46   848MB
docker-syncd-mlnx                                  latest                           34cc706cca46   848MB
docker-orchagent                                   202305_RC.7-c8447efe1_Internal   3c66e546bc6e   342MB
docker-orchagent                                   latest                           3c66e546bc6e   342MB
docker-macsec                                      latest                           b61a4d1793fa   333MB
docker-snmp                                        202305_RC.7-c8447efe1_Internal   4b240bc9ff6a   352MB
docker-snmp                                        latest                           4b240bc9ff6a   352MB
docker-dhcp-relay                                  latest                           078bb073db40   321MB
docker-eventd                                      202305_RC.7-c8447efe1_Internal   2649ba7cfdb4   313MB
docker-eventd                                      latest                           2649ba7cfdb4   313MB
docker-fpm-frr                                     202305_RC.7-c8447efe1_Internal   89b9e7b48f14   362MB
docker-fpm-frr                                     latest                           89b9e7b48f14   362MB
docker-nat                                         202305_RC.7-c8447efe1_Internal   df7cb2b20a83   334MB
docker-nat                                         latest                           df7cb2b20a83   334MB
docker-sonic-telemetry                             202305_RC.7-c8447efe1_Internal   a02a0a952897   400MB
docker-sonic-telemetry                             latest                           a02a0a952897   400MB
docker-sflow                                       202305_RC.7-c8447efe1_Internal   661acead2152   332MB
docker-sflow                                       latest                           661acead2152   332MB
docker-teamd                                       202305_RC.7-c8447efe1_Internal   5d738ec37def   331MB
docker-teamd                                       latest                           5d738ec37def   331MB
docker-lldp                                        202305_RC.7-c8447efe1_Internal   00868d6a849b   355MB
docker-lldp                                        latest                           00868d6a849b   355MB
docker-router-advertiser                           202305_RC.7-c8447efe1_Internal   1a0c48f6d2c1   313MB
docker-router-advertiser                           latest                           1a0c48f6d2c1   313MB
docker-mux                                         202305_RC.7-c8447efe1_Internal   127ea46b5298   362MB
docker-mux                                         latest                           127ea46b5298   362MB
docker-database                                    202305_RC.7-c8447efe1_Internal   ec8c2a896c93   313MB
docker-database                                    latest                           ec8c2a896c93   313MB
docker-sonic-mgmt-framework                        202305_RC.7-c8447efe1_Internal   f72ebd228421   416MB
docker-sonic-mgmt-framework                        latest                           f72ebd228421   416MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

sonic_dump_qa-eth-vt02-5-2700a1_2023-10-07T015133.tar.gz
