
[Buffer orch] Fix maximum headroom check failure in cold reboot #2948

Merged
merged 5 commits into sonic-net:master from stephenxs:fix-max-headroom on Nov 16, 2023

Conversation

stephenxs
Collaborator

@stephenxs stephenxs commented Oct 30, 2023

What I did

Fix the maximum headroom check failure during a cold reboot.
In the dynamic buffer model, the accumulative headroom on a port should be compared with the maximum headroom supported on the port whenever a buffer priority group is created or updated. This check depends on the maximum headroom being exposed to the STATE_DB during orchagent initialization.
However, during a cold reboot orchagent starts slowly, which prevents the threshold from being exposed in time. In this case the buffer manager is not able to perform the headroom check, so the buffer orchagent should handle the possible failure from SAI when the accumulative headroom exceeds the threshold.
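For illustration only, a minimal sketch of the kind of per-port check described above (all names here are hypothetical, not the actual sonic-swss buffer manager code), showing how the check degrades when the maximum headroom has not yet been published to the STATE_DB:

```cpp
// Hypothetical sketch only -- not the actual buffer manager implementation.
// Assumes the per-port maximum headroom is normally read from STATE_DB.
#include <iostream>
#include <string>

// Returns true if the accumulated headroom of all priority groups on the
// port fits within the maximum headroom the platform supports for it.
// When the maximum is not yet available (e.g. orchagent has not populated
// STATE_DB right after a cold reboot), the check is skipped and any
// violation has to be caught later by the buffer orchagent / SAI.
bool headroomFitsPort(long accumulativeHeadroomBytes,
                      long maxHeadroomBytes,   // < 0 means "not yet in STATE_DB"
                      const std::string &port)
{
    if (maxHeadroomBytes < 0)
    {
        std::cout << "Maximum headroom of " << port
                  << " not yet available, deferring the check" << std::endl;
        return true;
    }
    return accumulativeHeadroomBytes <= maxHeadroomBytes;
}
```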

Signed-off-by: Stephen Sun stephens@nvidia.com

Why I did it

To fix the issue.

How I verified it

Manually/regression test.

Details if related

The accumulative headroom on a port is compared with the maximum headroom supported on the port
whenever a buffer priority group is created or updated.
This depends on the maximum headroom being exposed to the STATE_DB during orchagent initialization.
However, during a cold reboot orchagent starts slowly, which prevents the threshold from being exposed in time.
In this case the buffer manager is not able to perform the headroom check,
so the buffer orchagent should handle the possible failure from SAI when the accumulative headroom exceeds the threshold.
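As a rough sketch of the orchagent-side handling this PR argues for (the types and helper names below are assumptions, not the real BufferOrch code): if SAI rejects the priority-group update because the accumulated headroom exceeds the port limit, the error is logged and the task is dropped instead of aborting orchagent.

```cpp
// Hypothetical sketch only -- not the actual sonic-swss BufferOrch code.
// Illustrates tolerating a SAI failure caused by exceeding the per-port
// maximum headroom instead of letting orchagent abort and restart.
#include <cstdint>
#include <iostream>

enum class SaiStatus  { Success, InsufficientResources, Failure };
enum class TaskStatus { Success, Failed, NeedRetry };

// Stub standing in for the real SAI set-attribute call.
SaiStatus applyPgToSai(uint64_t /*pgOid*/, uint64_t /*profileOid*/)
{
    return SaiStatus::InsufficientResources;  // simulate the failure path
}

TaskStatus handlePgUpdate(uint64_t pgOid, uint64_t profileOid)
{
    SaiStatus status = applyPgToSai(pgOid, profileOid);
    if (status == SaiStatus::InsufficientResources)
    {
        // The accumulative headroom exceeded the port's maximum; the buffer
        // manager could not pre-check this because STATE_DB was not yet
        // populated after the cold reboot. Log and drop the task so that
        // orchagent keeps running; the user is expected to fix the config.
        std::cerr << "Headroom exceeds the port limit, ignoring the PG update"
                  << std::endl;
        return TaskStatus::Failed;
    }
    return (status == SaiStatus::Success) ? TaskStatus::Success
                                          : TaskStatus::NeedRetry;
}
```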

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs requested a review from prsunny as a code owner October 30, 2023 02:18
@stephenxs stephenxs marked this pull request as draft October 30, 2023 02:18
@stephenxs stephenxs marked this pull request as ready for review October 31, 2023 05:43
@prsunny
Collaborator

prsunny commented Nov 8, 2023

@stephenxs, @dgsudharsan, if it failed, how does it recover?

@stephenxs
Collaborator Author

stephenxs commented Nov 8, 2023 via email

@prsunny
Collaborator

prsunny commented Nov 8, 2023

Hi, this is caused by a wrong user configuration. The user should correct the configuration if it fails

It doesn't align with this description "The accumulative headroom on a port is compared with the maximum headroom supported on the port whenever a buffer priority group is created/updated in the dynamic buffer model which depends on the maximum headroom being exposed to the STATE_DB during orchagent initialization."

@stephenxs
Collaborator Author

stephenxs commented Nov 8, 2023 via email

@stephenxs
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik
Collaborator

@prsunny any further comments?

@prsunny
Collaborator

prsunny commented Nov 10, 2023

@prsunny any further comments?

From what I understand, it is not a common scenario (like manually editing a config_db.json) to simulate this failure. So this handling, imo is unnecessary.

@stephenxs
Collaborator Author

stephenxs commented Nov 11, 2023

@prsunny any further comments?

From what I understand, it is not a common scenario (like manually editing a config_db.json) to simulate this failure. So this handling, imo is unnecessary.

Hi @prsunny
One scenario is that a customer manually modifies config_db.json and reboots the switch.
There is another scenario:
a customer manually configures a very long cable length. The buffer manager denies the configuration but cannot remove it from the config db. Once the user saves the configuration and reboots the switch, the failure will happen and OA needs to be restarted once more.
By ignoring this error, OA doesn't need to restart after a cold reboot, and it doesn't incur any risk.
So, I suggest keeping it.
What do you think?

@prsunny prsunny merged commit 4a0bb1c into sonic-net:master Nov 16, 2023
@dgsudharsan
Collaborator

Verified on top of 202305 commit sonic-net/sonic-buildimage@a3f8153

@stephenxs stephenxs deleted the fix-max-headroom branch November 17, 2023 01:06
StormLiangMS pushed a commit that referenced this pull request Nov 19, 2023
mssonicbld pushed a commit to mssonicbld/sonic-swss that referenced this pull request Jan 2, 2024
@mssonicbld
Collaborator

Cherry-pick PR to 202205: #3001

@mssonicbld
Collaborator

@stephenxs cherry pick PR didn't pass PR checker. Please check!!!
#3001


@stephenxs
Collaborator Author

@stephenxs cherry pick PR didn't pass PR checker. Please check!!! #3001

It no longer needs to be cherry-picked to 202205. I just removed the flag.
Thanks
