Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more debug information when PFC WD is triggered #2858

Merged
merged 12 commits into from
Oct 30, 2023

Conversation

stephenxs
Copy link
Collaborator

What I did

Add more debug information when PFC WD is triggered

Why I did it

When PFC WD occurs, we need the relevant counters before and after. But PFC WD is checked at a very small period of time and it's very hard to collect the counters at that period, otherwise the data will be very large.
It will be helpful if the counters can be logged when the PFC WD is triggered.

How I verified it

Manually and regression test.

Details if related

  1. The lua plugin to decide what counters to be added to the log. For vendors that do not provide it, nothing will be added.
  2. PFC WD orchagent to handle the additional counters info, printing to syslog and publishing via event mechanism

@stephenxs
Copy link
Collaborator Author

Failed by a common issue which fails every PR


    def wait_for_result(
        polling_function: Callable[[], Tuple[bool, Any]],
        polling_config: PollingConfig = PollingConfig(),
        failure_message: str = None,
    ) -> Tuple[bool, Any]:
        """Run `polling_function` periodically using the specified `polling_config`.
    
        Args:
            polling_function: The function being polled. The function cannot take any arguments and
                must return a status which indicates if the function was succesful or not, as well as
                some return value.
            polling_config: The parameters to use to poll the polling function.
            failure_message: The message to print if the call times out. This will only take effect
                if the PollingConfig is set to strict.
    
        Returns:
            If the polling function succeeds, then this method will return True and the output of the
            polling function.
    
            If it does not succeed within the provided timeout, it will return False and whatever the
            output of the polling function was on the final attempt.
        """
        for _ in range(polling_config.iterations()):
            status, result = polling_function()
    
            if status:
                return (True, result)
    
            time.sleep(polling_config.polling_interval)
    
        if polling_config.strict:
            message = failure_message or f"Operation timed out after {polling_config.timeout} seconds with result {result}"
>           assert False, message
E           AssertionError: Operation timed out after 60 seconds with result {'arp_update': 'STOPPED', 'bgpd': 'STOPPED', 'buffermgrd': 'STOPPED', 'containercfgd': 'STOPPED', 'coppmgrd': 'STOPPED', 'fdbsyncd': 'STOPPED', 'fpmsyncd': 'STOPPED', 'gbsyncd': 'STOPPED', 'gearsyncd': 'STOPPED', 'intfmgrd': 'STOPPED', 'natmgrd': 'STOPPED', 'natsyncd': 'STOPPED', 'nbrmgrd': 'STOPPED', 'neighsyncd': 'STOPPED', 'orchagent': 'STOPPED', 'portmgrd': 'STOPPED', 'portsyncd': 'STOPPED', 'redis-chassis': 'STOPPED', 'redis-server': 'RUNNING', 'restore_neighbors': 'STOPPED', 'rsyslogd': 'RUNNING', 'sflowmgrd': 'STOPPED', 'start.sh': 'EXITED', 'staticd': 'STOPPED', 'syncd': 'FATAL', 'teammgrd': 'STOPPED', 'teamsyncd': 'STOPPED', 'tunnelmgrd': 'STOPPED', 'vlanmgrd': 'STOPPED', 'vrfmgrd': 'STOPPED', 'vxlanmgrd': 'STOPPED', 'zebra': 'STOPPED'}

@stephenxs
Copy link
Collaborator Author

/easycla

@stephenxs
Copy link
Collaborator Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

No commit pushedDate could be found for PR 2858 in repo sonic-net/sonic-swss

@stephenxs stephenxs force-pushed the enhance-pfcwd-info branch from acb1196 to bfb3cfd Compare July 20, 2023 09:18
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs force-pushed the enhance-pfcwd-info branch from bfb3cfd to 8ba2c65 Compare July 27, 2023 01:08
@stephenxs
Copy link
Collaborator Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs
Copy link
Collaborator Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

stephenxs and others added 2 commits August 12, 2023 08:54
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs marked this pull request as draft August 20, 2023 10:40
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@liat-grozovik
Copy link
Collaborator

@yxieca @neethajohn kindly reminder to review or assign someone

@neethajohn
Copy link
Contributor

@stephenxs, has this change been tested against sonic-mgmt testcases? Do any of the existing testcase need modification?

@liat-grozovik
Copy link
Collaborator

@neethajohn @stephenxs can you follow up on this PR? i suggest we will backport it to 202305 as well. This is for debug purposes but it is the release maintainer call if needed or not.

@stephenxs
Copy link
Collaborator Author

stephenxs commented Oct 10, 2023 via email

@stephenxs stephenxs requested a review from neethajohn October 12, 2023 10:14
@stephenxs
Copy link
Collaborator Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dgsudharsan dgsudharsan merged commit 917c21e into sonic-net:master Oct 30, 2023
@stephenxs stephenxs deleted the enhance-pfcwd-info branch October 30, 2023 23:10
@StormLiangMS
Copy link
Contributor

enhancement

@stephenxs
Copy link
Collaborator Author

enhancement

Hi @StormLiangMS
This is an enhancement but it is important. Currently, without this PR, there is no diagnosis information when a PFC watchdog is triggered, which makes it very difficult for customers/developers to understand what happened. In this sense, we would like to have it in 202305 as well.
If we decided not to take it into 202305, we may want to revert sonic-net/sonic-mgmt#10305 from 202305 as well.

@StormLiangMS
Copy link
Contributor

@stephenxs could you update the test result with 202305?

@stephenxs
Copy link
Collaborator Author

stephenxs commented Nov 8, 2023 via email

@StormLiangMS
Copy link
Contributor

Hi @stephenxs thanks, is this a general change or only for Mellanox?

@stephenxs
Copy link
Collaborator Author

Hi @stephenxs thanks, is this a general change or only for Mellanox?

Hi @StormLiangMS
Currently, it's for Mellanox only.
Other vendors can add their own diagnosis info in their lua plug if they want to in the future.

@stephenxs
Copy link
Collaborator Author

A test log: pfc test.txt

StormLiangMS pushed a commit that referenced this pull request Nov 9, 2023
Add more debug information when PFC WD is triggered
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants