[Multi-asic] On LC init one ASIC ends up in TSA while the other ends up in TSB #21816
This was referenced Feb 22, 2025
mssonicbld added a commit to mssonicbld/sonic-buildimage-msft that referenced this issue Feb 26, 2025
#### Why I did it
Fixes sonic-net/sonic-buildimage#21816

##### Work item tracking
- Microsoft ADO **31499777**:

#### How I did it
Set the STATE_DB `ALL_SERVICE_STATUS|tsa_tsb_service` flag first as part of the startup_tsa_tsb service, then configure TSA. In addition, when tsa_ena is False (genuinely or due to the race condition), explicitly call TSA again to ensure all asics go to the TSA state.

#### How to verify it
Reboot the multi-asic linecard, and validate that all asics are in TSA state and the TSA-TSB timer is running.
config_reload

Tested the following scenarios:
1. reboot multi-asic linecard
2. config reload
3. execute TSA while the service is running
4. TSA, config save and then config_reload
5. execute TSB while the service is running

#### Which release branch to backport (provide reason below if selected)
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

#### Tested branch (Please provide the tested image version)
20240532.08
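The ordering described under "How I did it" can be illustrated with a minimal Python sketch. This is not the code from the referenced commit: the dictionaries stand in for the per-asic CONFIG_DB and for STATE_DB, and `apply_tsa`, `all_in_tsa`, and `start_service` are hypothetical names chosen for illustration. It only shows the idea of writing the service flag first and then re-applying TSA whenever some asic is not yet in TSA.

```python
FLAG_KEY = "ALL_SERVICE_STATUS|tsa_tsb_service"


def apply_tsa(config_dbs):
    """Hypothetical stand-in for TSA: write tsa_enabled=true on every asic."""
    for asic in sorted(config_dbs):
        config_dbs[asic]["tsa_enabled"] = "true"


def all_in_tsa(config_dbs):
    """True only when every asic has tsa_enabled=true."""
    return all(db.get("tsa_enabled") == "true" for db in config_dbs.values())


def start_service(config_dbs, state_db):
    """Sketch of the fixed ordering (an assumption, not the real service)."""
    # Step 1: record the service flag before touching any asic, so a later
    # restart can no longer leave TSA applied without the flag.
    state_db[FLAG_KEY] = {"running": "OK"}
    # Step 2: if any asic is not in TSA (genuinely, or because an earlier
    # run was killed part-way), apply TSA to all asics again.
    if not all_in_tsa(config_dbs):
        apply_tsa(config_dbs)
    return all_in_tsa(config_dbs)


if __name__ == "__main__":
    # State left behind by an interrupted run: asic0 written, asic1 not.
    dbs = {"asic0": {"tsa_enabled": "true"}, "asic1": {"tsa_enabled": "false"}}
    state = {}
    print(start_service(dbs, state), dbs, state)
```

Because the writes are idempotent in this model, calling `start_service` again after another restart leaves the state unchanged.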
mssonicbld added a commit to Azure/sonic-buildimage-msft that referenced this issue Feb 26, 2025
There is a race condition introduced by #21587 where each swss instance restarts `startup_tsa_tsb.service` when it comes up. This results in `startup_tsa_tsb.service` stopping 2 times and starting 3 times. E.g.
The problem with this stopping and starting is that the process can be killed at any point, which can leave the state out of sync between asic instances.
See the last restart logs:
`tsa_enabled` is `True` for `asic0` but `False` for `asic1`. This is because `TSA` iterates over the asics one at a time and writes `tsa_enabled=true`, so if `swss1` kills this service after it has written to `asic0` but not `asic1`, we end up in this state.

The bigger issue here is that `tsa_enabled=true` is set on either asic's config_db but the following entry doesn't exist in `STATE_DB`:
`'sonic-db-cli', 'STATE_DB', 'HSET', 'ALL_SERVICE_STATUS|tsa_tsb_service', 'running', 'OK'`
This is because this write occurs after the `TSA` is complete:

And note we don't see `Setting TSA-TSB service field in STATE_DB` in any of the logs above.

Without the `ALL_SERVICE_STATUS|tsa_tsb_service` entry, we won't return True in `config_tsa`:

Which is why we see the log on line 90 instead.
With `config_tsa` returning `False`, we won't start the timer and therefore won't take asic0 out of TSA.
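The failure sequence described in this issue can be reproduced in miniature with a small Python model. This is only a sketch under stated assumptions: plain dictionaries stand in for the per-asic CONFIG_DB and for STATE_DB, and every function name here (`apply_tsa`, `startup_original_order`, `service_flag_present`, `simulate_interrupted_startup`) is hypothetical rather than taken from the SONiC scripts. The `kill_after` parameter models an swss restart stopping the service after a given number of per-asic writes.

```python
FLAG_KEY = "ALL_SERVICE_STATUS|tsa_tsb_service"


class ServiceKilled(Exception):
    """Models startup_tsa_tsb.service being stopped part-way through a run."""


def apply_tsa(config_dbs, kill_after=None):
    """Write tsa_enabled=true one asic at a time; optionally die after N writes."""
    for written, asic in enumerate(sorted(config_dbs)):
        if kill_after is not None and written == kill_after:
            raise ServiceKilled(f"stopped before writing {asic}")
        config_dbs[asic]["tsa_enabled"] = "true"


def startup_original_order(config_dbs, state_db, kill_after=None):
    """Ordering described in the issue: TSA first, STATE_DB flag afterwards."""
    apply_tsa(config_dbs, kill_after)
    state_db[FLAG_KEY] = {"running": "OK"}


def service_flag_present(state_db):
    """Stand-in for the check that gates the timer: is the STATE_DB flag set?"""
    return state_db.get(FLAG_KEY, {}).get("running") == "OK"


def simulate_interrupted_startup():
    """Run the original ordering and kill it after the first asic is written."""
    config_dbs = {"asic0": {"tsa_enabled": "false"},
                  "asic1": {"tsa_enabled": "false"}}
    state_db = {}
    try:
        startup_original_order(config_dbs, state_db, kill_after=1)
    except ServiceKilled:
        pass  # the restart simply abandons the run
    return config_dbs, state_db, service_flag_present(state_db)


if __name__ == "__main__":
    dbs, state, flag_ok = simulate_interrupted_startup()
    print(dbs, state, flag_ok)
```

In this model the interrupted run leaves `asic0` with `tsa_enabled=true`, `asic1` unchanged, and no flag in the stand-in STATE_DB, so the gating check fails and nothing would start the timer that later clears TSA on `asic0`.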