[pytest] Ignore specific syslog error messages when testing autorestart #2430

Merged Oct 31, 2020 · 3 commits · Changes from 2 commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
95 changes: 86 additions & 9 deletions tests/autorestart/test_container_autorestart.py
@@ -21,6 +21,55 @@
CONTAINER_RESTART_THRESHOLD_SECS = 180


@pytest.fixture(autouse=True)
def ignore_expected_loganalyzer_exception(duthost, loganalyzer):
"""
Ignore expected failure/error messages while testing the autorestart feature.

First, since we kill a critical process in a specific container to test the autorestart
feature, we expect to see error messages fired by Monit such as
"ERR monit[563]: 'lldp|lldpd' status failed (1) -- 'lldpd' is not running."

Second, if the teammgrd process is killed for testing, the orchagent process in the
swss container will write error messages into syslog such as
"ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel10008."
When teamd is restarted, an error message appears in the syslog: "ERR teamd#teamsyncd:
:- readData: netlink reports an error=-33 on reading a netlink socket."

Third, while the pmon container is being restarted because the ledd process was killed for
testing, the xcvrd process will write an error message into syslog such as "ERR pmon#xcvrd[29]: :-initialize
GlobalConfig: Sonic database config global file doesn't exist at /var/run/redis/sonic-db/database_global.json."
The thermalctld process will write an error message into syslog such as "ERR pmon#thermalctld[33]:
Caught exception while initializing thermal manager."

Fourth, if the orchagent process is killed and the swss container is restarted, the syncd process
will write error messages such as "ERR syncd#syncd: [none] driverEgressMemoryUpdate:1395
Error getting cosq for port 1." At the same time, the syncd process also writes WARNING messages
into syslog such as "WARNING syncd#syncd:- saiDiscover: skipping since it causes crash:
SAI_STP_ATTR_BRIDGE_ID". Since these warning messages contain the keyword "crash",
LogAnalyzer would fail on them.

Fifth, systemd will fire an error message: "ERR systemd[1]: Failed to start SNMP/TEAMD container."
once the SNMP/TEAMD container hits its restart limit. route_check.py also writes an error message into syslog.

"""
ignoreRegex = [
".*ERR monit.*",
".*ERR swss#orchagent.*removeLag.*",
".*ERR pmon#xcvrd.*initializeGlobalConfig.*",
".*ERR pmon#thermalctld.*Caught exception while initializing thermal manager.*",
".*ERR syncd#syncd.*driverEgressMemoryUpdate.*",
".*ERR syncd#syncd.*brcm_sai*",
".*ERR teamd#teamsyncd.*readData.*netlink reports an error=-33 on reading a netlink socket.*",
".*WARNING syncd#syncd.*saiDiscover: skipping since it cause crash.*",
".*ERR systemd.*Failed to start .* container*",
".*ERR kernel.*PortChannel.*",
".*ERR route_check.*",
]
Collaborator:

Please split this regex array into one array for each service.

Contributor Author:

Great suggestion! Fixed.

Contributor:

The idea of splitting the regex into groups was so that we can ignore only a subset of messages for each container being tested. You split them into groups, but then you simply add all the groups in unconditionally, which results in the same behavior as before.

Contributor Author:

@jleveque The idea is to split the ignore regex into subgroups, with each subgroup representing a container. When a container (for example, syncd) is tested, this function will be called such that the messages in the syncd subgroup are ignored, right?

Contributor:

Yes.

Contributor Author:

Joe, thanks for the clarification!
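The per-container grouping the reviewers settled on could be sketched as follows. This is an illustrative sketch only, not the code from the final commit: the `OMIT_ERR_MSG_*` names and the `build_ignore_regex` helper are hypothetical, and the subgroups below contain a subset of the regexes from the diff above.

```python
# Keep one common regex list plus one list per service, and extend the
# analyzer only with the subgroup for the container under test.
OMIT_ERR_MSG_COMMON = [
    ".*ERR monit.*",
    ".*ERR systemd.*Failed to start .* container*",
]

OMIT_ERR_MSG_PER_CONTAINER = {
    "swss": [
        ".*ERR swss#orchagent.*removeLag.*",
    ],
    "syncd": [
        ".*ERR syncd#syncd.*driverEgressMemoryUpdate.*",
        ".*WARNING syncd#syncd.*saiDiscover: skipping since it cause crash.*",
    ],
    "pmon": [
        ".*ERR pmon#xcvrd.*initializeGlobalConfig.*",
        ".*ERR pmon#thermalctld.*Caught exception while initializing thermal manager.*",
    ],
    "teamd": [
        ".*ERR teamd#teamsyncd.*readData.*netlink reports an error=-33.*",
    ],
}


def build_ignore_regex(container_name):
    """Return the common regexes plus the subgroup for one container."""
    return OMIT_ERR_MSG_COMMON + OMIT_ERR_MSG_PER_CONTAINER.get(container_name, [])
```

With this shape, the fixture would call `loganalyzer.ignore_regex.extend(build_ignore_regex(container_name))` for just the container being tested, instead of extending with every group unconditionally.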

if loganalyzer:
loganalyzer.ignore_regex.extend(ignoreRegex)


def get_group_program_info(duthost, container_name, group_name):
"""
@summary: Get program names, running status and their pids by analyzing the command
@@ -34,7 +83,7 @@ def get_group_program_info(duthost, container_name, group_name):
program_status = None
program_pid = -1

-    program_list = duthost.shell("docker exec {} supervisorctl status".format(container_name))
+    program_list = duthost.shell("docker exec {} supervisorctl status".format(container_name), module_ignore_errors=True)
for program_info in program_list["stdout_lines"]:
if program_info.find(group_name) != -1:
program_name = program_info.split()[0].split(':')[1].strip()
Expand All @@ -59,7 +108,7 @@ def get_program_info(duthost, container_name, program_name):
program_status = None
program_pid = -1

-    program_list = duthost.shell("docker exec {} supervisorctl status".format(container_name))
+    program_list = duthost.shell("docker exec {} supervisorctl status".format(container_name), module_ignore_errors=True)
for program_info in program_list["stdout_lines"]:
if program_info.find(program_name) != -1:
program_status = program_info.split()[1].strip()
@@ -137,6 +186,20 @@ def is_hiting_start_limit(duthost, container_name):
return False


def clear_failed_flag_and_restart(duthost, container_name):
"""
@summary: If a container hits the restart limitation, then we clear the failed flag and
restart it.
"""
logger.info("{} hits start limit and clear reset-failed flag".format(container_name))
duthost.shell("sudo systemctl reset-failed {}.service".format(container_name))
duthost.shell("sudo systemctl start {}.service".format(container_name))
restarted = wait_until(CONTAINER_RESTART_THRESHOLD_SECS,
CONTAINER_CHECK_INTERVAL_SECS,
check_container_state, duthost, container_name, True)
pytest_assert(restarted, "Failed to restart container '{}' after reset-failed was cleared".format(container_name))


def verify_autorestart_with_critical_process(duthost, container_name, program_name,
program_status, program_pid):
"""
@@ -165,13 +228,7 @@ def verify_autorestart_with_critical_process(duthost, container_name, program_name,
check_container_state, duthost, container_name, True)
if not restarted:
if is_hiting_start_limit(duthost, container_name):
-            logger.info("{} hits start limit and clear reset-failed flag".format(container_name))
-            duthost.shell("sudo systemctl reset-failed {}.service".format(container_name))
-            duthost.shell("sudo systemctl start {}.service".format(container_name))
-            restarted = wait_until(CONTAINER_RESTART_THRESHOLD_SECS,
-                                   CONTAINER_CHECK_INTERVAL_SECS,
-                                   check_container_state, duthost, container_name, True)
-            pytest_assert(restarted, "Failed to restart container '{}' after reset-failed was cleared".format(container_name))
+            clear_failed_flag_and_restart(duthost, container_name)
else:
pytest.fail("Failed to restart container '{}'".format(container_name))

@@ -203,6 +260,24 @@ def verify_no_autorestart_with_non_critical_process(duthost, container_name, pro
duthost.shell("docker exec {} supervisorctl start {}".format(container_name, program_name))


def postcheck_critical_processes_status(duthost, container_autorestart_states):
"""
@summary: Do the post check to see whether all the critical processes are alive after testing
the autorestart feature.
First we restart the containers which hit the restart limitation and then do the post check
"""
for container_name in container_autorestart_states.keys():
if is_hiting_start_limit(duthost, container_name):
clear_failed_flag_and_restart(duthost, container_name)

# Sleep 20 seconds such that containers have a chance to start.
time.sleep(20)
processes_status = duthost.all_critical_process_status()
for container_name, processes in processes_status.items():
if processes["status"] is False or len(processes["exited_critical_process"]) > 0:
pytest.fail("Critical process(es) was not running after container '{}' was restarted.".format(container_name))
Collaborator:

Please convert this block into pytest_assert(wait_until(...), failure message)

Contributor Author (@yozhao101, Oct 30, 2020):

Great suggestion! Fixed.
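The reviewer's suggestion replaces the fixed `time.sleep(20)` plus manual fail with a polled check wrapped in a single assertion. A minimal sketch of that pattern, with a self-contained `wait_until` mirroring the sonic-mgmt helper's shape (timeout, poll interval, predicate, args) — `check_all_critical_processes` is a hypothetical stand-in for a predicate built on `duthost.all_critical_process_status()`:

```python
import time


def wait_until(timeout, interval, condition, *args):
    """Poll condition(*args) every `interval` seconds for up to `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition(*args):
            return True
        time.sleep(interval)
    return False


def check_all_critical_processes(processes_status):
    """True only if every container reports healthy critical processes."""
    return all(
        info["status"] and not info["exited_critical_process"]
        for info in processes_status.values()
    )
```

The post-check then collapses to something like `pytest_assert(wait_until(timeout, interval, lambda: check_all_critical_processes(duthost.all_critical_process_status())), "Not all critical processes are running")`, retrying until the containers come up instead of guessing a fixed delay.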



def test_containers_autorestart(duthost, tbinfo):
"""
@summary: Test the auto-restart feature of each container against two scenarios: killing
@@ -275,3 +350,5 @@ def test_containers_autorestart(duthost, tbinfo):
duthost.shell("sudo config feature autorestart {} disabled".format(container_name))

logger.info("End of testing the container '{}'".format(container_name))

postcheck_critical_processes_status(duthost, container_autorestart_states)