[Multi-asic] On LC init one ASIC ends up in TSA while the other ends up in TSB #21816

Closed
arista-nwolfe opened this issue Feb 20, 2025 · 0 comments · Fixed by #21830
arista-nwolfe commented Feb 20, 2025

There is a race condition introduced by #21587: each swss instance restarts startup_tsa_tsb.service when it comes up.
With two swss instances, this results in startup_tsa_tsb.service being stopped 2 times and started 3 times.

For example:

2025 Feb 20 19:28:39.844540 cmp227-4 NOTICE kernel: [    0.000000] Linux version 6.1.0-22-2-amd64
  1. Started by systemd
2025 Feb 20 19:28:51.626593 cmp227-4 INFO systemd[1]: Started startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE.
2025 Feb 20 19:28:52.631791 cmp227-4 INFO startup_tsa_tsb: asic0 - CONFIG_DB.BGP_DEVICE_GLOBAL.STATE.tsa_enabled : false
2025 Feb 20 19:28:53.290011 cmp227-4 INFO startup_tsa_tsb: asic1 - CONFIG_DB.BGP_DEVICE_GLOBAL.STATE.tsa_enabled : false
2025 Feb 20 19:28:53.290118 cmp227-4 INFO startup_tsa_tsb: Configuring TSA
  2. Restarted by swss0 (stopped and started)
2025 Feb 20 19:28:53.929423 cmp227-4 INFO swss.sh[3355]: swss0: starting TSA-TSB service
2025 Feb 20 19:28:53.939779 cmp227-4 INFO systemd[1]: Stopping startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE...
2025 Feb 20 19:28:54.070178 cmp227-4 INFO startup_tsa_tsb: Resetting environment variable
2025 Feb 20 19:28:54.094711 cmp227-4 INFO systemd[1]: startup_tsa_tsb.service: Deactivated successfully.
2025 Feb 20 19:28:54.094981 cmp227-4 INFO systemd[1]: Stopped startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE.
2025 Feb 20 19:28:54.375061 cmp227-4 INFO systemd[1]: Started startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE.
2025 Feb 20 19:28:55.215203 cmp227-4 INFO startup_tsa_tsb: asic0 - CONFIG_DB.BGP_DEVICE_GLOBAL.STATE.tsa_enabled : false
2025 Feb 20 19:28:55.788054 cmp227-4 INFO startup_tsa_tsb: asic1 - CONFIG_DB.BGP_DEVICE_GLOBAL.STATE.tsa_enabled : false
2025 Feb 20 19:28:55.788102 cmp227-4 INFO startup_tsa_tsb: Configuring TSA
2025 Feb 20 19:28:58.479678 cmp227-4 INFO TSA: System Mode: Normal -> Maintenance
  3. Restarted by swss1 (stopped and started)
2025 Feb 20 19:28:58.888987 cmp227-4 INFO swss.sh[4588]: swss1: starting TSA-TSB service
2025 Feb 20 19:28:58.926216 cmp227-4 INFO systemd[1]: Stopping startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE...
2025 Feb 20 19:28:59.087019 cmp227-4 INFO startup_tsa_tsb: Resetting environment variable
2025 Feb 20 19:28:59.124181 cmp227-4 INFO systemd[1]: startup_tsa_tsb.service: Deactivated successfully.
2025 Feb 20 19:28:59.124519 cmp227-4 INFO systemd[1]: Stopped startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE.
2025 Feb 20 19:28:59.162638 cmp227-4 INFO systemd[1]: Started startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE.
2025 Feb 20 19:29:00.030231 cmp227-4 INFO startup_tsa_tsb: asic0 - CONFIG_DB.BGP_DEVICE_GLOBAL.STATE.tsa_enabled : true
2025 Feb 20 19:29:00.726294 cmp227-4 INFO startup_tsa_tsb: asic1 - CONFIG_DB.BGP_DEVICE_GLOBAL.STATE.tsa_enabled : false
2025 Feb 20 19:29:00.736585 cmp227-4 INFO startup_tsa_tsb: Either TSA is already configured or switch sub_role is not Frontend - not configuring TSA
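
The start/stop tally can be checked mechanically. The following is a minimal sketch that counts the systemd transitions in the excerpt above; the function name `count_service_events` is hypothetical, and the log lines are copied from the excerpt with the date and hostname trimmed for brevity.

```python
import re
from collections import Counter

# systemd lines for startup_tsa_tsb.service, copied from the log excerpt
# (date and hostname removed; everything else is unchanged).
LOG = """\
19:28:51.626593 INFO systemd[1]: Started startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE.
19:28:53.939779 INFO systemd[1]: Stopping startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE...
19:28:54.094981 INFO systemd[1]: Stopped startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE.
19:28:54.375061 INFO systemd[1]: Started startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE.
19:28:58.926216 INFO systemd[1]: Stopping startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE...
19:28:59.124519 INFO systemd[1]: Stopped startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE.
19:28:59.162638 INFO systemd[1]: Started startup_tsa_tsb.service - STARTUP TSA-TSB SERVICE.
"""

# Match only completed transitions; "Stopping ..." lines are deliberately ignored.
EVENT = re.compile(r"systemd\[\d+\]: (Started|Stopped) startup_tsa_tsb\.service")


def count_service_events(text):
    """Count 'Started' and 'Stopped' transitions for startup_tsa_tsb.service."""
    return Counter(match.group(1) for match in EVENT.finditer(text))


counts = count_service_events(LOG)
print("started:", counts["Started"], "stopped:", counts["Stopped"])
```

The same pattern could be run against a full syslog file to spot extra restarts on other linecards.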

The problem with this stopping and starting is that the service process can be killed at any point, which can leave the state of the asic instances out of sync.

See the last restart logs:

2025 Feb 20 19:29:00.030231 cmp227-4 INFO startup_tsa_tsb: asic0 - CONFIG_DB.BGP_DEVICE_GLOBAL.STATE.tsa_enabled : true
2025 Feb 20 19:29:00.726294 cmp227-4 INFO startup_tsa_tsb: asic1 - CONFIG_DB.BGP_DEVICE_GLOBAL.STATE.tsa_enabled : false

tsa_enabled is True for asic0 but False for asic1.

This is because TSA iterates over the asics one at a time and writes tsa_enabled=true, so if swss1 kills this service when it's written to asic0 but not asic1 we end up in this state.
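
To illustrate how an interrupted per-ASIC loop leaves the two databases out of sync, here is a small simulation. It does not call the real TSA script or any database; `apply_tsa`, `ServiceKilled`, and the in-memory `config_dbs` dictionary are hypothetical stand-ins used only to model the sequence described above.

```python
class ServiceKilled(Exception):
    """Stands in for systemd stopping the service in the middle of a run."""


def apply_tsa(config_dbs, kill_after=None):
    """Write tsa_enabled=true to each ASIC in order.

    kill_after: number of ASICs written before the simulated kill,
    or None to let the loop finish.
    """
    for written, asic in enumerate(sorted(config_dbs)):
        if kill_after is not None and written == kill_after:
            raise ServiceKilled(f"stopped before writing {asic}")
        config_dbs[asic]["tsa_enabled"] = "true"


# Both ASICs start out with tsa_enabled=false, as in the first log lines.
config_dbs = {
    "asic0": {"tsa_enabled": "false"},
    "asic1": {"tsa_enabled": "false"},
}

try:
    # Simulate swss1 restarting the service after asic0 has been written.
    apply_tsa(config_dbs, kill_after=1)
except ServiceKilled as exc:
    print("service killed:", exc)

print(config_dbs)
```

After the simulated kill, asic0 holds `true` and asic1 still holds `false`, which matches the values logged on the last restart.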

The bigger issue here is that tsa_enabled=true is set in at least one asic's config_db, but the following entry doesn't exist in STATE_DB:
'sonic-db-cli', 'STATE_DB', 'HSET', 'ALL_SERVICE_STATUS|tsa_tsb_service', 'running', 'OK'

This is because the write happens only after the TSA command completes:

    if tsa_ena == True:
        logger.log_info("Configuring TSA")
        subprocess.check_output(['TSA']).strip()
        logger.log_info("Setting TSA-TSB service field in STATE_DB")
        subprocess.check_output([
            'sonic-db-cli', 'STATE_DB', 'HSET', 'ALL_SERVICE_STATUS|tsa_tsb_service', 'running', 'OK'
        ]).strip()

Note that the message "Setting TSA-TSB service field in STATE_DB" does not appear in any of the logs above.

Without the ALL_SERVICE_STATUS|tsa_tsb_service entry, config_tsa does not return True:

77         #check if tsa_tsb service is already running, restart the timer
78         try:
79             startup_tsa_tsb_service_status = subprocess.check_output([
80                 'sonic-db-cli', 'STATE_DB', 'HGET', 'ALL_SERVICE_STATUS|tsa_tsb_service', 'running'
81             ]).strip().decode('utf-8')  # Convert bytes to string
82         except subprocess.CalledProcessError:
83             startup_tsa_tsb_service_status = None  # Default if the field is missing
84
85         if startup_tsa_tsb_service_status == 'OK':
86             logger.log_info("TSA-TSB service is already running, just restart the timer")
87             return True
88         else:
89             if num_asics > 1:
90                 logger.log_info("Either TSA is already configured or switch sub_role is not Frontend - not configuring TSA")
91             else:
92                 logger.log_info("Either TSA is already configured - not configuring TSA")

Which is why we see the log on line 90 instead.
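
The decision path can be modelled without a switch. The sketch below is a simplified model, not the real `config_tsa`: the function name `config_tsa_model` is hypothetical, and the rule used to derive `tsa_ena` (true only when no ASIC already has `tsa_enabled=true`) is an assumption inferred from the log message, since that part of the script is not shown here.

```python
def config_tsa_model(tsa_enabled_by_asic, state_db_flag):
    """Return (start_timer, log_message) for a given DB state.

    tsa_enabled_by_asic: e.g. {"asic0": "true", "asic1": "false"}
    state_db_flag: value of ALL_SERVICE_STATUS|tsa_tsb_service 'running',
                   or None when the field is missing.
    """
    # ASSUMPTION: TSA is configured only when no ASIC already has it enabled.
    tsa_ena = all(value == "false" for value in tsa_enabled_by_asic.values())

    if tsa_ena:
        return True, "Configuring TSA"
    if state_db_flag == "OK":
        return True, "TSA-TSB service is already running, just restart the timer"
    return False, "not configuring TSA"


# State left behind by the interrupted run: asic0 written, flag never set.
stuck = config_tsa_model({"asic0": "true", "asic1": "false"}, None)
# Same CONFIG_DB state, but with the STATE_DB flag present.
recovered = config_tsa_model({"asic0": "true", "asic1": "false"}, "OK")

print(stuck)
print(recovered)
```

With the flag missing the model returns False, so the timer is never started; with the flag present it returns True even though the ASICs disagree.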

With config_tsa returning False we won't start the timer and therefore won't take asic0 out of TSA:

def start_tsa_tsb(timer):
    #Configure TSA if it was not configured already in CONFIG_DB
    tsa_enabled = config_tsa()
    if tsa_enabled == True:
        #Start the timer to configure TSB
        start_tsb_timer(timer)
    return

root@cmp227-4# TSC
BGP0 : System Mode: Maintenance
BGP1 : System Mode: Normal
The rates are calculated within 5 seconds period
deepak-singhal0408 self-assigned this Feb 20, 2025
mssonicbld added a commit to mssonicbld/sonic-buildimage-msft that referenced this issue Feb 26, 2025
#### Why I did it
Fixes sonic-net/sonic-buildimage#21816

##### Work item tracking
- Microsoft ADO **31499777**:

#### How I did it
Set the STATE_DB ALL_SERVICE_STATUS|tsa_tsb_service flag first as part of the startup_tsa_tsb service, then configure TSA.
When tsa_ena is False (genuinely or due to the race condition), explicitly call TSA again to ensure all asics go to TSA state.
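
The reordering described above could look roughly like the sketch below. This is not the merged patch: `config_tsa_fixed` and the injected `run` callable are hypothetical, and placing the second TSA call in the "service already running" branch is an assumption about where the re-apply happens.

```python
TSA_CMD = ["TSA"]
SET_FLAG_CMD = [
    "sonic-db-cli", "STATE_DB", "HSET",
    "ALL_SERVICE_STATUS|tsa_tsb_service", "running", "OK",
]


def config_tsa_fixed(tsa_ena, service_flag, run):
    """Return True when the TSB timer should be (re)started.

    run: callable that executes a command list; on a switch this could be
         subprocess.check_output, here it is injected so the sketch is testable.
    """
    if tsa_ena:
        # Flag first, so a kill during TSA still leaves the flag set.
        run(SET_FLAG_CMD)
        run(TSA_CMD)
        return True
    if service_flag == "OK":
        # ASSUMPTION: re-apply TSA here so every ASIC converges to TSA.
        run(TSA_CMD)
        return True
    return False


# Restart after an interrupted run: flag is set, asics may disagree.
calls = []
result = config_tsa_fixed(tsa_ena=False, service_flag="OK", run=calls.append)
print(result, calls)

# First run on a clean boot: flag is written before TSA is invoked.
first_run_calls = []
config_tsa_fixed(tsa_ena=True, service_flag=None, run=first_run_calls.append)
print(first_run_calls)
```

Injecting `run` keeps the ordering visible in the recorded call list and avoids invoking real commands outside a SONiC device.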
#### How to verify it
Reboot the multi-asic linecard, and validate that all asics are in TSA state and TSA-TSB timer is running
config_reload

Tested following scenarios:
1. reboot multi-asic linecard
2. config reload
3. execute TSA while the service is running
4. TSA, config save and then config_reload
5. execute TSB while the service is running

#### Which release branch to backport (provide reason below if selected)

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

#### Tested branch (Please provide the tested image version)
20240532.08
mssonicbld added a commit to Azure/sonic-buildimage-msft that referenced this issue Feb 26, 2025