[T2-Chassis][Route-Convergence]: route-convergence takes upto 10secs in Process crash(swss/syncd) scenarios #21586

Closed
deepak-singhal0408 opened this issue Jan 31, 2025 · 1 comment

@deepak-singhal0408
Contributor

On a T2 chassis running the 202405 image, traffic takes up to 10 seconds to converge after a process crash (swss/syncd). This needs to be optimized; ideally, we should achieve sub-second convergence in this scenario.

Testplan:
https://github.com/sonic-net/sonic-mgmt/blob/master/docs/testplan/Convergence%20measurement%20in%20data%20center%20networks.md#test-case--26

Number of prefixes:
60k (30k IPv4 + 30k IPv6) from each upstream neighbor

Number of Upstream Neighbors: 16

Testcase:
https://github.com/sonic-net/sonic-mgmt/blob/master/tests/snappi_tests/multidut/bgp/test_bgp_outbound_uplink_process_crash.py

@deepak-singhal0408 deepak-singhal0408 self-assigned this Jan 31, 2025
@deepak-singhal0408 deepak-singhal0408 added the P0 Priority of the issue label Jan 31, 2025
@deepak-singhal0408
Contributor Author

Traffic loss is observed while these processes are coming back up. As they restart, routes learned from upstream neighbors are immediately advertised to downstream neighbors, so the device attracts traffic before those routes are programmed in its ASICs.
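To make the loss window concrete, here is a minimal sketch that models it as the gap between advertising a prefix downstream and programming it in the ASIC. The function name and the timestamps are hypothetical illustrations, not measurements from the test.

```python
def blackhole_window(advertised_at, programmed_at):
    """Seconds during which a prefix attracts traffic but is not yet in the ASIC.

    Both arguments are timestamps in seconds on the same clock (hypothetical values).
    """
    # If the route is programmed before it is advertised, there is no loss window.
    return max(0.0, programmed_at - advertised_at)


# Illustrative timeline only: advertised 1 s after restart, programmed at 9 s.
window = blackhole_window(advertised_at=1.0, programmed_at=9.0)
```

Under this model, loss disappears once advertisement is delayed until after programming, which is what holding the device out of the traffic path during startup aims for.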

@deepak-singhal0408 deepak-singhal0408 changed the title [T2-Chassis][Route-Convergence]: route-convergence takes upto 10secs in Process crash swss/syncd scenarios [T2-Chassis][Route-Convergence]: route-convergence takes upto 10secs in Process crash(swss/syncd) scenarios Jan 31, 2025
rlhui pushed a commit that referenced this issue Feb 11, 2025
mssonicbld added a commit to mssonicbld/sonic-buildimage-msft that referenced this issue Feb 11, 2025

#### Why I did it
Fixes issue: sonic-net/sonic-buildimage#21586

##### Work item tracking
- Microsoft ADO **31196012**:

#### How I did it
Run the TSA-TSB service whenever swss (swss/swss0/swss1/..) starts up. If the service is already running, reset the TSA-TSB timer.
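
The start-or-reset behavior described above can be sketched as follows. This is a minimal illustration, not the code from the fix: the class name `TsaTsbWindow`, its method names, and the 60-second hold time are all assumptions, and the real service is not implemented this way as far as this thread shows.

```python
import time


class TsaTsbWindow:
    """Hypothetical model of the TSA-TSB hold window (not the actual service)."""

    def __init__(self, hold_seconds, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self._clock = clock      # injectable so the window can be tested without sleeping
        self._deadline = None    # None means no window has been opened yet

    def is_isolated(self):
        """True while the device should stay in TSA (traffic shifted away)."""
        return self._deadline is not None and self._clock() < self._deadline

    def on_swss_start(self):
        """Called for every swss instance that starts (swss, swss0, swss1, ..)."""
        was_active = self.is_isolated()
        # Starting and resetting both move the deadline a full hold period from now.
        self._deadline = self._clock() + self.hold_seconds
        return "reset" if was_active else "started"


# Assumed hold time of 60 s, chosen only for illustration.
window = TsaTsbWindow(hold_seconds=60)
```

Because a later swss instance pushes the deadline out again, the device stays in TSA until a full hold period has passed since the last instance came up.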

#### How to verify it
Ran the T2 process crash sonic-mgmt snappi test to verify the convergence.
Before fix: ~10 seconds
After fix: <10 ms

#### Which release branch to backport (provide reason below if selected)

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

#### Tested branch (Please provide the tested image version)
SONiC.20240532.04
mssonicbld added a commit to Azure/sonic-buildimage-msft that referenced this issue Feb 12, 2025
@rlhui rlhui closed this as completed Feb 12, 2025
mssonicbld added a commit to mssonicbld/sonic-mgmt.msft that referenced this issue Feb 14, 2025
### Description of PR

Summary:
TSA-TSB service test cases: adjust the test cases to match the new behavior of config_reload.

### Type of change

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [x] Test case improvement

### Back port request
- [ ] 202012
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [x] 202405
- [x] 202411

### Approach
#### What is the motivation for this PR?
As a fix for issue sonic-net/sonic-buildimage#21586, the TSA-TSB service is now invoked when swss comes up (sonic-net/sonic-buildimage#21587).
This changes config_reload behavior: after a config reload, the TSA-TSB service is restarted and the device stays in the TSA state until the timer expires. The test cases are adjusted to explicitly execute TSB so that the DUT is ready for the next test case.

#### How did you do it?
Enhanced the config_reload API to take an optional exec_tsb parameter. The startup TSA-TSB and reliable TSA-TSB test cases set this flag to True to explicitly execute TSB on the device after config reload.
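
A minimal sketch of what such an optional flag could look like is shown below. It is not the sonic-mgmt implementation: the `run` callable, the simplified signature, and the exact command strings are assumptions made for illustration.

```python
def config_reload(run, exec_tsb=False):
    """Reload the configuration, then optionally leave the TSA state explicitly.

    `run` is a hypothetical callable that executes one shell command on the DUT.
    """
    # Assumed reload command; after it, swss restarts and the TSA-TSB timer starts.
    run("sudo config reload -y")
    if exec_tsb:
        # Assumed TSB command; without it the DUT stays in TSA until the timer expires.
        run("sudo TSB")


# Usage with a stand-in runner that only records the commands it is given.
issued = []
config_reload(issued.append, exec_tsb=True)
```

Defaulting `exec_tsb` to False keeps existing callers unchanged, and only the test cases that need the DUT out of TSA right away opt in.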

#### How did you verify/test it?
Ran the tests on t2
#### Any platform specific information?
NA
#### Supported testbed topology if it's a new test case?
NA
### Documentation