Applying ACL rule causes BGP neighbor to go down #21183

Closed
Javier-Tan opened this issue Dec 16, 2024 · 10 comments · Fixed by sonic-net/sonic-mgmt#16727
Labels: BRCM, Chassis 🤖 (Modular chassis support), Issue for 202405, Triaged (this issue has been triaged)

Comments

@Javier-Tan

Javier-Tan commented Dec 16, 2024

Description

We noticed that applying one specific ACL rule causes one specific BGP neighbor (fc00::a) to go down during ACL tests (specifically those with "IPV6" and "INGRESS" parameters). Removing the rule brings the neighbor back up.

admin@sonic:~$ show acl rule
...
DATA_INGRESS_IPV6_TEST  RULE_15       9985        DROP      DST_IPV6: 20c0:a800::9/128      {'asic0': 'Active', 'asic1': 'Active'}
                                                           IP_TYPE: IPV6ANY
...

admin@sonic:~$ show ipv6 bgp sum
...

Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
...
fc00::a        4  65200        278         52         0      0       0  00:01:18   Connect         ARISTA03T3

admin@sonic:~$ show ipv6 interface
Interface       Master    IPv6 address/mask                            Admin/Oper    BGP Neighbor    Neighbor IP
--------------  --------  -------------------------------------------  ------------  --------------  -------------
...
Ethernet64                fc00::9/126                                  up/up         ARISTA03T3      fc00::a

Steps to reproduce the issue:

  1. Run any ACL tests with ipv6+ingress parameters e.g. acl/test_acl.py::TestBasicAcl::test_ingress_unmatched_blocked[ipv6-ingress-downlink->uplink-default-no_vlan] with breakpoint after ACL rules are applied
  2. After rule 15 is added, once BGP updates (~3mins), neighbor fc00::a will go down
  3. Removing the rule will bring it immediately back up
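
When waiting at the breakpoint for step 2, it can help to filter the summary output for sessions that are not established. The following is a minimal sketch, assuming the 11-column layout of the `show ipv6 bgp sum` output shown above; the function name `non_established_neighbors` is hypothetical and is not part of sonic-mgmt.

```python
import ipaddress


def non_established_neighbors(summary_text):
    """Return (neighbor, state, name) for rows whose State/PfxRcd is not a prefix count."""
    down = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Data rows have at least 11 columns; shorter lines are ignored.
        if len(fields) < 11:
            continue
        # Header and separator rows are skipped because their first field
        # is not a valid IP address.
        try:
            ipaddress.ip_address(fields[0])
        except ValueError:
            continue
        # A numeric State/PfxRcd is a received-prefix count (session up).
        # Anything else (e.g. "Connect") is treated as a session state,
        # which may span more than one token.
        state = " ".join(fields[9:-1])
        if not state.isdigit():
            down.append((fields[0], state, fields[-1]))
    return down


# Rows copied from the output in this issue.
sample = """\
Neighbhor       V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
------------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
fc00::2         4  65200        724       1046         0      0       0  00:22:38   34050           ARISTA01T3
fc00::a         4  65200        699        757         0      0       0  00:01:11   Connect         ARISTA03T3
fc00::e         4  65200        725        795         0      0       0  00:22:39   34050           ARISTA04T3
"""

print(non_established_neighbors(sample))
# → [('fc00::a', 'Connect', 'ARISTA03T3')]
```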

NOTE: BGP neighbor fc00::a always goes down when the rule is applied during ipv6+ingress test runs. However, the only tests that fail are acl/test_acl.py::TestAclWithReboot...[ipv6-ingress...], because they explicitly check that BGP neighbors are up.

Describe the results you received:

ACL rule 15 causes BGP neighbor fc00::a to go down even though the two are seemingly unrelated.

Describe the results you expected:

BGP neighbor fc00::a should stay up.

Output of show version:

SONiC Software Version: SONiC.20240510.16
BRCM SAI ver: [11.2.13.1], OCP SAI ver: [1.14.0], SDK ver: [sdk-6.5.30-SP4]

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

Rules applied can be found at sonic-mgmt-int/tests/acl/templates/acltb_v6_test_rules.j2

{
    "acl": {
        "acl-sets": {
            "acl-set": {
                "{{ acl_table_name }}": {
                    "acl-entries": {
                        "acl-entry": {
                            ...
                            "15": {
                                "actions": {
                                    "config": {
                                        "forwarding-action": "DROP"
                                    }
                                },
                                "config": {
                                    "sequence-id": 15
                                },
                                "ip": {
                                    "config": {
                                        "destination-ip-address": "20c0:a800::9/128"
                                    }
                                }
                            },
                            ...
                        }
                    }
                }
            }
        }
    }
}
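
To see which destinations the rendered rules drop, the structure above can be walked once the template has been rendered to JSON. This is a sketch under assumptions: the helper name `drop_destinations` is hypothetical, the table name `DATA_INGRESS_IPV6_TEST` is taken from the `show acl rule` output above, and only entry 15 is reproduced because the other entries are elided in this issue.

```python
def drop_destinations(rules, table_name):
    """Map sequence id -> destination prefix for every DROP entry in one ACL table."""
    entries = rules["acl"]["acl-sets"]["acl-set"][table_name]["acl-entries"]["acl-entry"]
    result = {}
    for seq, entry in entries.items():
        action = entry.get("actions", {}).get("config", {}).get("forwarding-action")
        dst = entry.get("ip", {}).get("config", {}).get("destination-ip-address")
        # Keep only DROP rules that match on a destination address.
        if action == "DROP" and dst is not None:
            result[int(seq)] = dst
    return result


# Rendered form of the snippet above, limited to the one entry shown in the issue.
rendered = {
    "acl": {"acl-sets": {"acl-set": {"DATA_INGRESS_IPV6_TEST": {"acl-entries": {"acl-entry": {
        "15": {
            "actions": {"config": {"forwarding-action": "DROP"}},
            "config": {"sequence-id": 15},
            "ip": {"config": {"destination-ip-address": "20c0:a800::9/128"}},
        },
    }}}}}}
}

print(drop_destinations(rendered, "DATA_INGRESS_IPV6_TEST"))
# → {15: '20c0:a800::9/128'}
```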

@Javier-Tan
Author

@arlakshm for vis

@arlakshm
Contributor

After the change in sonic-net/sonic-mgmt#15921, the test is marked as failed if any of the BGP sessions are down.

@arlakshm
Contributor

@arista-nwolfe, @kenneth-arista, @saksarav-nokia, @sanjair-git do you see these failures as well?

@arista-nwolfe
Contributor

> @arista-nwolfe, @kenneth-arista, @saksarav-nokia, @sanjair-git do you see these failures as well?

I'll try out the manual steps @Javier-Tan outlined with the pdb and wait to see if the bgp neighbors go down, but our ACL pass rate has been pretty consistently at 100% so we aren't seeing the failures caused by this.

@arlakshm
Contributor

Thanks @arista-nwolfe. Are you using the latest sonic-mgmt code for 202405? As I mentioned earlier, after the change in sonic-net/sonic-mgmt#15921 we check that all BGP sessions are up after applying the ACLs.

@arista-nwolfe
Contributor

> Thanks @arista-nwolfe. Are you using the latest sonic-mgmt code for 202405? As I mentioned earlier, after the change in sonic-net/sonic-mgmt#15921 we check that all BGP sessions are up after applying the ACLs.

Yeah, this last weekend's run included this change, and we didn't see any "All BGP sessions are not up after reboot, no point in continuing the test" failures on any of our 3 testbeds.

@sanjair-git

Hi @arlakshm, we have the latest code change from #15921, and all the ACL tests are passing on our testbeds too.

@arista-nwolfe
Contributor

> @arista-nwolfe, @kenneth-arista, @saksarav-nokia, @sanjair-git do you see these failures as well?
>
> I'll try out the manual steps @Javier-Tan outlined with the pdb and wait to see if the bgp neighbors go down, but our ACL pass rate has been pretty consistently at 100% so we aren't seeing the failures caused by this.

I see the same behavior @Javier-Tan sees when I put a pdb after setup_rules:

DATA_INGRESS_IPV6_TEST  RULE_15       9985        DROP      DST_IPV6: 20c0:a800::9/128      {'asic0': 'Active', 'asic1': 'Active'}
                                                            IP_TYPE: IPV6ANY
Neighbhor       V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
------------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
fc00:3000::1    4  65100       1226       1220         0      0       0  00:22:33   6               ASIC0
fc00:3000::3    4  65100       1219       1227         0      0       0  00:22:33   7               ASIC1
fc00:3000::5    4  65100        719       1221         0      0       0  00:22:36   519             cmp214-6-ASIC0
fc00:3000::5    4  65100        719       1228         0      0       0  00:22:36   519             cmp214-6-ASIC0
fc00:3000::7    4  65100        719       1228         0      0       0  00:22:37   519             cmp214-6-ASIC1
fc00:3000::7    4  65100        722       1224         0      0       0  00:22:42   519             cmp214-6-ASIC1
fc00::2         4  65200        724       1046         0      0       0  00:22:38   34050           ARISTA01T3
fc00::16        4  65200        726        796         0      0       0  00:22:43   34050           ARISTA06T3
fc00::a         4  65200        699        757         0      0       0  00:01:11   Connect         ARISTA03T3
fc00::e         4  65200        725        795         0      0       0  00:22:39   34050           ARISTA04T3

Strangely, only that one neighbor goes down.

@Javier-Tan
Author

Javier-Tan commented Dec 16, 2024

Sorry, I wasn't clear enough in the description, @arista-nwolfe: only that one BGP neighbor, "fc00::a", goes down. So this is the same bug we see.

@rlhui rlhui added the BRCM label Dec 18, 2024
@arlakshm arlakshm self-assigned this Jan 14, 2025
@rlhui rlhui added the Triaged (this issue has been triaged) label Jan 22, 2025
@rlhui
Contributor

rlhui commented Jan 22, 2025

CS00012383871

mssonicbld added a commit to mssonicbld/sonic-mgmt.msft that referenced this issue Jan 31, 2025
### Description of PR

Summary:
Fixes sonic-net/sonic-buildimage#21183

### Type of change


- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202012
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411

### Approach
#### What is the motivation for this PR?
Prevent T2 BGP neighbors going down during ACL tests
#### How did you do it?
Prevent last 64 bits of a DROP rule IP from being the same as a BGP neighbor
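
The check described here can be illustrated with a short sketch. It is not the code from the fix; the helper names `low64` and `low64_collisions` are hypothetical. With the addresses reported in this issue, the low 64 bits of the rule destination `20c0:a800::9` equal those of `fc00::9`, the address configured on Ethernet64 for the session with neighbor `fc00::a`. The sketch only performs that comparison and says nothing about why the hardware behaves this way.

```python
import ipaddress

LOW64_MASK = (1 << 64) - 1


def low64(addr):
    """Return the low 64 bits of an IPv6 address given with or without a prefix length."""
    return int(ipaddress.IPv6Interface(addr).ip) & LOW64_MASK


def low64_collisions(rule_dst, addresses):
    """Return the addresses whose low 64 bits equal those of the rule destination."""
    rule_low = low64(rule_dst)
    return [a for a in addresses if low64(a) == rule_low]


# Addresses taken from the issue: the Ethernet64 address and its BGP neighbor.
print(low64_collisions("20c0:a800::9/128", ["fc00::9/126", "fc00::a"]))
# → ['fc00::9/126']
```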
#### How did you verify/test it?
Run on T2 devices
T1 regression test: https://elastictest.org/scheduler/testplan/679c3507f5a74203a8e1b10b
#### Any platform specific information?
N/A
#### Supported testbed topology if it's a new test case?
N/A
### Documentation
arlakshm added a commit to Azure/sonic-mgmt.msft that referenced this issue Feb 1, 2025