[Mellanox] Fix the issue with ASIC detection on the SN4280 platform #20397

Merged

oleksandrivantsiv
Collaborator

Why I did it

Fix the issue with ASIC detection on the SN4280 platform.

The root cause is a race condition in the PCI subsystem. When Dark Mode is enabled, two actions run in parallel at system start:

  1. The dpuctl service starts and powers down the DPUs, which removes the DPU PCI devices.
  2. The syncd service starts and launches the mlnx-fw-upgrade.sh script, which queries the PCI subsystem for available ASIC devices using the lspci command.

For a short period after a DPU PCI device is removed, the Linux PCI subsystem remains in an inconsistent state, and lspci may return an error. Such an error can cause mlnx-fw-upgrade.sh to fail, which interrupts the start of the syncd container.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Add a retry mechanism for the lspci command. Cache lspci output to reduce the number of command executions.
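
For readers without the diff at hand, the retry-and-cache idea can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names `run_with_retry` and `load_pci_devices`, the `PCI_DEVICES_CACHE` variable, and the retry count and delay are all assumptions made for the example.

```shell
#!/bin/sh
# Illustrative sketch only. All names and values below are hypothetical
# and are not taken from mlnx-fw-upgrade.sh.

PCI_QUERY_RETRIES="${PCI_QUERY_RETRIES:-5}"   # assumed attempt limit
PCI_QUERY_DELAY="${PCI_QUERY_DELAY:-1}"       # assumed delay between attempts, in seconds
PCI_DEVICES_CACHE=""                          # filled by the first successful query

# Run the given command until it succeeds or the attempt limit is reached.
# Prints stdout of the successful attempt only; returns 1 if every attempt fails.
run_with_retry() {
    attempt=1
    while [ "$attempt" -le "$PCI_QUERY_RETRIES" ]; do
        if output="$("$@" 2>/dev/null)"; then
            printf '%s\n' "$output"
            return 0
        fi
        # Wait before the next attempt, but not after the final one.
        if [ "$attempt" -lt "$PCI_QUERY_RETRIES" ]; then
            sleep "$PCI_QUERY_DELAY"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Fill PCI_DEVICES_CACHE once. Call this directly rather than inside $(...),
# because a command substitution runs in a subshell and would lose the cache.
load_pci_devices() {
    if [ -n "$PCI_DEVICES_CACHE" ]; then
        return 0
    fi
    PCI_DEVICES_CACHE="$(run_with_retry lspci -n)"
}
```

With this shape, a caller would run `load_pci_devices` once near the start of the script and then filter `"$PCI_DEVICES_CACHE"` wherever the device list is needed, so a transient lspci failure costs a short wait instead of aborting the syncd start.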

How to verify it

Run the regression tests.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@oleksandrivantsiv
Collaborator Author

/azpw ms_conflict

@liat-grozovik liat-grozovik merged commit 651ce0d into sonic-net:master Oct 22, 2024
12 checks passed
@liat-grozovik
Collaborator

@oleksandrivantsiv it seems there is a conflict when taking this into 202405. Please open a new PR for 202405 and reference this PR in it for tracking.

oleksandrivantsiv added a commit to oleksandrivantsiv/sonic-buildimage that referenced this pull request Oct 25, 2024
…onic-net#20397)

bingwang-ms pushed a commit that referenced this pull request Oct 30, 2024
…20397) (#20621)

rkavitha-hcl pushed a commit to rkavitha-hcl/sonic-buildimage that referenced this pull request Nov 15, 2024
…onic-net#20397)

aidan-gallagher pushed a commit to aidan-gallagher/sonic-buildimage that referenced this pull request Nov 16, 2024
…onic-net#20397)

Junchao-Mellanox pushed a commit to Junchao-Mellanox/sonic-buildimage that referenced this pull request Nov 25, 2024