[Mellanox] Fix the issue with ASIC detection on the SN4280 platform (sonic-net#20397)

- Why I did it
Fix the issue with ASIC detection on the SN4280 platform.

The root cause is a race condition in the PCI subsystem. When Dark Mode is enabled, the following actions run in parallel at system start:

The dpuctl service starts and powers down the DPUs, which removes the DPU PCI devices.
At the same time, the syncd service starts. It launches the mlnx-fw-upgrade.sh script, which queries the available ASIC devices from the PCI subsystem using the lspci command.
For a short period after a DPU PCI device is removed, the Linux PCI subsystem remains in an inconsistent state and the lspci command may return an error. Such an error can cause mlnx-fw-upgrade.sh to fail, which interrupts the start of the syncd container.

- How I did it
Add a retry mechanism for the lspci command. Cache lspci output to reduce the number of command executions.
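
The retry-and-cache idea can be sketched in isolation as follows. This is a minimal illustration, not the code from this commit: `retry_capture` and `flaky_query` are hypothetical names, `flaky_query` stands in for `lspci -n` (it fails twice before printing one sample device line), and the retry limit and delay are arbitrary.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command up to max_attempts times, waiting
# delay seconds between attempts, and print its output once it succeeds.
retry_capture() {
    local -i max_attempts="$1"
    local delay="$2"
    shift 2
    local output
    local -i attempt
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        # Plain assignment (no `local` on this line), so the `if` tests
        # the exit status of the command itself.
        if output="$("$@" 2>/dev/null)"; then
            printf '%s\n' "${output}"
            return 0
        fi
        if ((attempt < max_attempts)); then
            sleep "${delay}"
        fi
    done
    return 1
}

# Stand-in for `lspci -n`: fails on the first two calls, then prints a
# sample line. The call count lives in a file because each attempt runs
# inside a command substitution (a subshell).
state_file="$(mktemp)"
echo 0 > "${state_file}"
flaky_query() {
    local -i calls
    calls="$(cat "${state_file}")"
    calls+=1
    echo "${calls}" > "${state_file}"
    if ((calls < 3)); then
        return 1
    fi
    echo "00:00.0 0200: 15b3:cf80"
}

# Query once (with retries) and keep the output for all later lookups.
pcitree="$(retry_capture 5 0 flaky_query)"
query_status="$?"
attempts_used="$(cat "${state_file}")"
rm -f "${state_file}"

# Every subsequent check reads the cached text instead of re-running the query.
if grep -q "cf80" <<< "${pcitree}"; then
    match="found"
else
    match="missing"
fi

echo "attempts: ${attempts_used}, status: ${query_status}, match: ${match}"
# prints "attempts: 3, status: 0, match: found"
```

Caching the output means a single successful query serves every product-ID lookup, so a transient failure can only affect the one place where the retry is applied.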

- How to verify it
Run regression.
oleksandrivantsiv authored and Aidan Gallagher committed Nov 16, 2024
1 parent 4f576d1 commit 10edb66
Showing 1 changed file with 17 additions and 5 deletions.
22 changes: 17 additions & 5 deletions platform/mellanox/mlnx-fw-upgrade.j2
@@ -246,19 +246,31 @@ function GetAsicType() {
     local -r SPC4_PRODUCT_ID="cf80"
     local -r BF3_PRODUCT_ID="a2dc"

-    if lspci -n | grep "${VENDOR_ID}:${SPC1_PRODUCT_ID}" &>/dev/null; then
+    local -i QUERY_RETRY_COUNT="0"
+    local -i QUERY_RETRY_COUNT_MAX="10"
+    local pcitree=$(lspci -n 2>/dev/null)
+    ERROR_CODE="$?"
+
+    while [[ ("${QUERY_RETRY_COUNT}" -lt "QUERY_RETRY_COUNT_MAX") && ("${ERROR_CODE}" != "${EXIT_SUCCESS}") ]]; do
+        sleep 1s
+        ((QUERY_RETRY_COUNT++))
+        pcitree=$(lspci -n 2>/dev/null)
+        ERROR_CODE="$?"
+    done
+
+    if echo $pcitree | grep "${VENDOR_ID}:${SPC1_PRODUCT_ID}" &>/dev/null; then
         echo "${SPC1_ASIC}"
         exit "${EXIT_SUCCESS}"
-    elif lspci -n | grep "${VENDOR_ID}:${SPC2_PRODUCT_ID}" &>/dev/null; then
+    elif echo $pcitree | grep "${VENDOR_ID}:${SPC2_PRODUCT_ID}" &>/dev/null; then
         echo "${SPC2_ASIC}"
         exit "${EXIT_SUCCESS}"
-    elif lspci -n | grep "${VENDOR_ID}:${SPC3_PRODUCT_ID}" &>/dev/null; then
+    elif echo $pcitree | grep "${VENDOR_ID}:${SPC3_PRODUCT_ID}" &>/dev/null; then
         echo "${SPC3_ASIC}"
         exit "${EXIT_SUCCESS}"
-    elif lspci -n | grep "${VENDOR_ID}:${SPC4_PRODUCT_ID}" &>/dev/null; then
+    elif echo $pcitree | grep "${VENDOR_ID}:${SPC4_PRODUCT_ID}" &>/dev/null; then
         echo "${SPC4_ASIC}"
         exit "${EXIT_SUCCESS}"
-    elif lspci -n | grep "${VENDOR_ID}:${BF3_PRODUCT_ID}" &>/dev/null; then
+    elif echo $pcitree | grep "${VENDOR_ID}:${BF3_PRODUCT_ID}" &>/dev/null; then
         echo "${BF3_NIC}"
         exit "${EXIT_SUCCESS}"
     fi
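
A note on the first query in the hunk above: in Bash, when `local` and a command substitution share one line, `$?` afterwards holds the exit status of `local` (normally 0) rather than that of the substituted command. If so, the initial `ERROR_CODE` would be 0 even when the first `lspci` call fails, and the retry loop would be skipped; the assignments inside the loop do not use `local` and are unaffected. The standalone sketch below only demonstrates that shell behavior; the function names are illustrative and `false` stands in for a failing `lspci`.

```shell
#!/usr/bin/env bash

# Declaration and assignment on one line: $? reports the status of
# `local`, which succeeds even though the substituted command failed.
combined_form() {
    local captured=$(false)
    echo "$?"
}

# Declaration first, assignment second: $? reports the status of the
# command substitution.
split_form() {
    local captured
    captured=$(false)
    echo "$?"
}

combined_status="$(combined_form)"
split_status="$(split_form)"
echo "combined: ${combined_status}, split: ${split_status}"
# prints "combined: 0, split: 1"
```

Declaring `pcitree` on its own line before the first assignment would let the initial `ERROR_CODE` reflect the `lspci` result.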