[Mellanox] Stop pmon ahead of syncd #3505
Merged
…onal branch to fix the issue that pmon not started after warm reboot
During system start, pmon must not start before syncd, in order to avoid a race condition between syncd and pmon. Currently this is done by killing pmon, if it is alive, when syncd starts. However, that implementation is still risky. Consider the following flow:
1. pmon is inactive when syncd.sh checks, but syncd.sh is scheduled out just before "chipdown" is called.
2. systemd is switched in and starts the pmon service.
3. At this point pmon and syncd are running simultaneously; the critical section is broken and a race condition is formed.
To prevent this, the only solution is to add syncd as an "After" in pmon.service, which ensures that whenever pmon starts, syncd has already been started. However, doing so requires deferring the start of pmon.service until syncd.service has fully started; otherwise a deadlock forms as follows:
1. syncd.sh starts pmon before syncd itself has fully started, while
2. pmon cannot start because syncd, one of its "After" units, has not fully started.
3. As a result, syncd and pmon wait for each other forever.
To solve that, move the start of pmon.service to "wait()" so that pmon is started after syncd has fully started, breaking the deadlock.
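The ordering described in this commit message can be sketched roughly as follows. This is an illustration only, not the actual syncd.sh from this PR: the `run` helper, the `DRY_RUN` switch, the `wait_syncd` name (standing in for "wait()") and the `docker` commands are assumptions made so the sketch is self-contained.

```shell
#!/bin/sh
# Illustrative sketch only; names and commands are assumptions, not the real syncd.sh.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Runs while syncd is still starting. pmon is deliberately NOT started here:
# with pmon.service ordered After=syncd.service, a blocking "start pmon" at
# this point would wait on syncd, which in turn is waiting on this function.
start() {
    run docker start syncd      # hypothetical way of bringing syncd up
}

# Assumed to run once syncd counts as started, so starting pmon here
# no longer blocks on syncd and the deadlock cannot form.
wait_syncd() {
    run systemctl start pmon
    run docker wait syncd       # hypothetical: block until the container exits
}
```

With the default `DRY_RUN=1`, calling `start` prints only the syncd command and `wait_syncd` prints the pmon start first, which makes the intended ordering easy to inspect without touching a real system.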
…-start [syncd.sh,pmon.service] Prevent pmon from starting ahead of syncd
stepanblyschak approved these changes Sep 25, 2019
nazariig approved these changes Sep 25, 2019
qiluo-msft approved these changes Sep 25, 2019
jleveque suggested changes Sep 25, 2019
jleveque approved these changes Sep 25, 2019
mssonicbld added a commit that referenced this pull request Aug 29, 2024
…atically (#20069)
#### Why I did it
src/sonic-utilities
```
* d7788d4d - (HEAD -> 202311, origin/202311) Add lock to config reload/load_minigraph (#3475) (#3505) (10 hours ago) [Longxiang Lyu]
* b36b9b50 - [config] no op if Golden Config is invalid (#3367) (18 hours ago) [jingwenxie]
* f72699f7 - [config] Check golden config exist early if flag is set (#3169) (#3504) (22 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
- What I did
Issue Overview
shutdown flow
In any shutdown flow, where all dockers are stopped in order, the pmon docker stops after the syncd docker has stopped. As a result, the pmon docker fails to release sx_core resources and leaves sx_core in a bad state. The related logs look like the following:
config reload & service swss.restart
In flows like "config reload" and "service swss restart", the failure causes further consequences:
reboot, warm-reboot & fast-reboot
In the reboot flows, including "reboot", "fast-reboot" and "warm-reboot", this failure has no further negative effects because the system reboots anyway. In addition, "warm-reboot" requires the system to shut down as quickly as possible to meet the GR time restrictions of both BGP and LACP. "fast-reboot" must also meet the GR time restriction of BGP, which is longer than that of LACP. Given this, any unnecessary steps should be avoided, and it is better to leave those flows untouched.
summary
To summarize, we have to come up with a way to ensure:
Solution
To solve the issue, pmon should be stopped before syncd is stopped in all flows except warm-reboot.
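A minimal sketch of that stop ordering, under stated assumptions: `stop_sequence`, the `run` helper, the `DRY_RUN` switch and the way each service is stopped are all hypothetical and are not taken from the actual change.

```shell
#!/bin/sh
# Illustrative sketch only; not the code from this PR.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# stop_sequence <reboot_type>
# Stops pmon before syncd so pmon can still release its resources,
# except for warm-reboot, which is left untouched to save time.
stop_sequence() {
    reboot_type="$1"
    if [ "$reboot_type" != "warm" ]; then
        run systemctl stop pmon
    fi
    run docker stop syncd       # hypothetical way of stopping syncd
}
```

For example, `stop_sequence cold` prints the pmon stop followed by the syncd stop, while `stop_sequence warm` prints only the syncd stop.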
- How I did it
This is done by adding "syncd.service" as an "After" dependency in pmon.service and starting pmon in /usr/local/bin/syncd.sh::wait(), so that pmon starts automatically after syncd has started.
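One way to confirm the ordering dependency on a running system is to inspect the unit's `After` property. The helper below is a hypothetical sketch (the name `has_after` is not part of the PR); it only parses the text that `systemctl show -p After <unit>` prints.

```shell
#!/bin/sh
# Illustrative helper; not part of the PR.
# The dependency it checks for corresponds to a unit-file line such as:
#   [Unit]
#   After=syncd.service

# has_after "<output of: systemctl show -p After UNIT>" "<dependency>"
# Returns 0 if <dependency> appears in the space-separated After= list.
has_after() {
    case " ${1#After=} " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage on a device could look like `has_after "$(systemctl show -p After pmon.service)" syncd.service && echo "pmon is ordered after syncd"`.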
- How to verify it
Test the following flows and ensure pmon and syncd start and stop in the correct sequence:
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)