Update Auto_ts doc to include orchagent abort case #1128
Changes from all commits
dacb7ee
7ea5f13
0a3f114
ee50b5f
9d04ddf
108fbb6
7ca5d2b
487cef0
225ae8f
4edb37d
927b93b
cabbac0
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,10 +19,11 @@ | |
* [7.6 MultiAsic consideration](#76-MultiAsic-consideration) | ||
* [7.7 Design choices for max-core-limit & max-techsupport-limit arguments](#77-Design-choices-for-max-core-limit-&-max-techsupport-limit-arguments) | ||
* [7.8 Techsupport Locking](#78-Techsupport-Locking) | ||
* [7.9 Orchagent abort consideration](#79-Orchagent-abort-consideration) | ||
* [8. Test Plan](#8-Test-Plan) | ||
* [9. SONiC-to-SONiC Upgrade Considerations](#9-SONiC-to-SONiC-Upgrade-Considerations) | ||
* [10. App Extension Consideration](#9-App-Extension-Considerations) | ||
* [11. Open questions](#10-Open-questions) | ||
* [10. App Extension Consideration](#10-App-Extension-Considerations) | ||
* [11. Open questions](#11-Open-questions) | ||
|
||
|
||
### Revision | ||
|
@@ -32,6 +33,7 @@ | |
| 1.1 | 04/08/2022 | Vivek Reddy Karri | Add the capability to Register/Deregister app extension to AUTO_TECHSUPPORT_FEATURE table | | ||
| 2.0 | TBD | Vivek Reddy Karri | Extending Support for Kernel Dumps | | ||
| 3.0 | 02/2022 | Stepan Blyshchak | Extending Support for memory usage threshold crossed | | ||
| 4.0 | 10/2022 | Vivek Reddy Karri | Handle abort by orchagent to collect saisdkdump before syncd restart | | ||
|
||
## About this Manual | ||
This document describes the details of the system which facilitates the auto techsupport invocation support in SONiC. The auto invocation is triggered when any process inside the docker crashes and a core dump is generated. | ||
|
@@ -458,12 +460,45 @@ Although if the admin feels otherwise, these values are configurable. | |
|
||
### 7.8 Techsupport Locking | ||
|
||
Recently, an enhancement was made to techsupport script to only run one instance at a time by using a locking mechanism. When other script instance of techsupport tries to run, it'll exit with a relevent code. This would apply nevertheless of how a techsupport was invoked i.e. manual or through auto-techsupport. | ||
Recently, the techsupport script was enhanced to run only one instance at a time by using a locking mechanism. When another instance of techsupport tries to run, it exits with a relevant code. This applies regardless of how techsupport was invoked, i.e. manually or through auto-techsupport. | ||
|
||
With this change, a rate-limit-interval of zero would not make any difference. The locking mechanism implicitly imposes a minimum rate-limit-interval equal to the techsupport execution time. Since the techsupport execution time can't be determined in advance and varies with the underlying machine and system state, the range of values configurable for the rate-limit-interval is left unchanged. | ||
|
||
A relevant message will be logged to syslog when the invocation fails because of LOCKFAIL exit code. | ||
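The single-instance behaviour described above can be illustrated with a non-blocking file lock. This is only a sketch of the general technique, not the techsupport script's actual implementation: the helper names `try_acquire_lock` and `release_lock` are hypothetical, and the lock file path is supplied by the caller.

```python
import fcntl
import os


def try_acquire_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns an open file descriptor holding the lock, or None when
    another instance already holds it (the LOCKFAIL case).
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Lock is held elsewhere: give up immediately instead of waiting.
        os.close(fd)
        return None
    return fd


def release_lock(fd):
    """Release the lock taken by try_acquire_lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A caller that receives `None` would log the failure and exit with its LOCKFAIL code, which is why a rate-limit-interval shorter than one techsupport run has no practical effect.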
|
||
### 7.9 Orchagent abort consideration | ||
|
||
Note: This solution currently applies only to Mellanox platforms. | ||
|
||
When the orchagent deems a SAI CRUD call as a failure based on the return status, it aborts itself. | ||
|
||
This results in: | ||
1. A core dump being generated, which triggers auto-techsupport if enabled | ||
2. All services, including syncd, being restarted | ||
|
||
So, it is highly likely that by the time auto-techsupport collects saisdkdump, syncd has already restarted or is in the process of restarting. In either case, the saisdkdump information from before the restart, which contains useful information for triaging, would be lost. Thus, special handling is needed for core dumps generated from the swss container. | ||
|
||
This requires enhancements not just in auto-techsupport but also in the orchagent process and in the /usr/local/bin/syncd.sh script. | ||
|
||
[Figure: Orch Abort Flow] | ||
|
||
Orchagent can terminate for various reasons, e.g. a failure within orchagent such as a SEGFAULT, a config reload, etc., but only the SAI programming failure case is of interest here. On a SAI programming failure, orchagent calls abort, which raises SIGABRT and terminates the process. To differentiate this case from the others, orchagent will write to a table in STATE_DB named "ORCH_ABRT_STATUS". | ||
|
||
#### Schema: | ||
``` | ||
ORCH_ABRT_STATUS | ||
<1|empty> | ||
|
||
Eg: | ||
root@sonic:/home/admin# sonic-db-cli STATE_DB GET ORCH_ABRT_STATUS | ||
1 | ||
``` | ||
|
||
During a SAI programming failure, orchagent will set the ORCH_ABRT_STATUS flag in STATE_DB. The syncd.sh script checks whether the ORCH_ABRT_STATUS flag is set in STATE_DB before stopping the syncd container. If it is, the script collects saisdkdump to `/var/log/orch_abrt_saisdkdump/` on the host and creates a file under /tmp named 'saidump_collection_notify_flag'. This file is used to synchronize between auto-techsupport and syncd. | ||
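The syncd side of this handshake can be sketched as follows. The real logic lives in the syncd.sh shell script; this is a Python model for illustration only. The function name `on_syncd_stop` and both callables are hypothetical stand-ins for the STATE_DB lookup and the saisdkdump invocation, and creating the notify file even when collection fails is an assumption made here so that a waiting consumer is never left blocked.

```python
import os

# Paths taken from the text above.
DUMP_DIR = "/var/log/orch_abrt_saisdkdump/"
NOTIFY_FLAG = "/tmp/saidump_collection_notify_flag"


def on_syncd_stop(orch_abort_flag_set, collect_saisdkdump,
                  dump_dir=DUMP_DIR, notify_flag=NOTIFY_FLAG):
    """Model of the check performed before the syncd container is stopped.

    orch_abort_flag_set: hypothetical callable, True when ORCH_ABRT_STATUS is set.
    collect_saisdkdump:  hypothetical callable that writes the dump into a directory.
    Returns True when the orchagent-abort path was taken.
    """
    if not orch_abort_flag_set():
        return False  # ordinary stop: nothing extra to collect
    os.makedirs(dump_dir, exist_ok=True)
    try:
        collect_saisdkdump(dump_dir)
    finally:
        # Assumption: signal auto-techsupport even if the collection failed.
        with open(notify_flag, "w"):
            pass
    return True
```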
Review comment: Assuming it will be generic and some ASIC vendors will not refer to the new STATE_DB additions, what will be the system behaviour?

Reply: Orchagent will write to STATE_DB irrespective of the vendor. syncd.sh script will look like this.
So, auto-techsupport will function the same for all vendors. The only difference is that the dump is not collected for other platforms.
||
|
||
coredump_gen_handler.py checks the ORCH_ABRT_STATUS flag in STATE_DB and, if it is set, waits until /tmp/saidump_collection_notify_flag is created. Once the file is created, it proceeds with the standard logic of invoking techsupport if the configuration permits. The generate_dump script will also be updated to collect dumps from `/var/log/orch_abrt_saisdkdump/` and to not invoke the command in the syncd container if it has not restarted. Before coredump_gen_handler.py finishes execution, it deletes the saidump_collection_notify_flag file. | ||
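The wait-then-clean-up behaviour described above might look roughly like the sketch below. It is not the code in coredump_gen_handler.py: the function names are hypothetical, `orch_abort_flag_set` stands in for the STATE_DB lookup, and the timeout is an assumption added here so the handler cannot block forever if the notify file never appears.

```python
import os
import time


def wait_for_saidump(orch_abort_flag_set, notify_flag,
                     timeout_s=60.0, poll_s=0.5):
    """Block until the notify file exists, but only for orchagent aborts.

    Returns True when it is safe to continue with the normal techsupport
    logic, or False if the (assumed) timeout expired first.
    """
    if not orch_abort_flag_set():
        return True  # not an orchagent abort: nothing to wait for
    deadline = time.monotonic() + timeout_s
    while not os.path.exists(notify_flag):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True


def cleanup_notify_flag(notify_flag):
    """Delete the notify file before the handler exits; a missing file is fine."""
    try:
        os.remove(notify_flag)
    except FileNotFoundError:
        pass
```

Polling a file keeps the two sides decoupled: syncd.sh only has to create a file, and the handler needs no knowledge of how or when the dump was collected.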
|
||
|
||
## 8. Test Plan | ||
|
||
Enhance the existing techsupport sonic-mgmt test with the following cases. | ||
|
@@ -475,6 +510,7 @@ Enhance the existing techsupport sonic-mgmt test with the following cases. | |
| 3 | Check if the global rate-limit-interval & per-process rate-limit-interval are working as expected | ||
| 4 | Check if the core-dump cleanup is working as expected | | ||
| 5 | Check if the core-dump generated when reaching memory threshold | | ||
| 6 | Add a test to check the orch abort scenario | | ||
|
||
## 9. SONiC-to-SONiC Upgrade Considerations | ||
|
||
|
@vivekrnv it is OK to have the implementation for Nvidia/Mellanox syncd only. The question is whether the flow can be invoked on any ASIC vendor if they add support for it. If so, I think it should be considered generic based on code availability, like any other feature in SAI. What do you think?
The code will be available to all members. Every SAI vendor is expected to implement the sai_dbg_generate_dump call, which is used in saisdkdump. But it's not possible to determine whether the dump is important for a particular vendor. As we already know, only Nvidia is using saisdkdump according to techsupport.
So, I think we should keep it specific to Nvidia for now. If and when other vendors decide it's important, they can enable this for their platform.