[pytest] Ignore specific syslog error messages when testing autorestart #2430
Merged: lguohan merged 3 commits into sonic-net:master from yozhao101:fix_autorestart_loganalyzer on Oct 31, 2020
@@ -21,6 +21,55 @@
 CONTAINER_RESTART_THRESHOLD_SECS = 180


+@pytest.fixture(autouse=True)
+def ignore_expected_loganalyzer_exception(duthost, loganalyzer):
+    """
+    Ignore expected failure/error messages during testing of the autorestart feature.
+
+    First, since we kill a critical process in a specific container to test
+    autorestart, we expect to see error messages fired by Monit such as
+    "ERR monit[563]: 'lldp|lldpd' status failed (1) -- 'lldpd' is not running."
+
+    Second, if the teammgrd process is killed for testing, the orchagent process in the
+    swss container writes error messages into syslog such as
+    "ERR swss#orchagent: :- removeLag: Failed to remove ref count 1 LAG PortChannel10008."
+    When teamd is restarted, an error message appears in syslog: "ERR teamd#teamsyncd:
+    :- readData: netlink reports an error=-33 on reading a netlink socket."
+
+    Third, while the pmon container is restarting because the ledd process was killed
+    for testing, the xcvrd process writes an error message into syslog such as
+    "ERR pmon#xcvrd[29]: :- initializeGlobalConfig: Sonic database config global file
+    doesn't exist at /var/run/redis/sonic-db/database_global.json." and the thermalctld
+    process writes an error message such as "ERR pmon#thermalctld[33]: Caught exception
+    while initializing thermal manager."
+
+    Fourth, if the orchagent process is killed and the swss container is restarted, the
+    syncd process writes error messages such as "ERR syncd#syncd: [none]
+    driverEgressMemoryUpdate:1395 Error getting cosq for port 1." At the same time,
+    syncd also writes two WARNING messages into syslog such as "WARNING syncd#syncd:
+    :- saiDiscover: skipping since it causes crash: SAI_STP_ATTR_BRIDGE_ID". Since
+    these warning messages contain the keyword "crash", LogAnalyzer would otherwise
+    fail on them.
+
+    Fifth, systemd fires an error message "ERR systemd[1]: Failed to start SNMP/TEAMD
+    container." when the SNMP/TEAMD container hits its restart limit. route_check.py
+    also writes an error message into syslog.
+    """
+    ignoreRegex = [
+        ".*ERR monit.*",
+        ".*ERR swss#orchagent.*removeLag.*",
+        ".*ERR pmon#xcvrd.*initializeGlobalConfig.*",
+        ".*ERR pmon#thermalctld.*Caught exception while initializing thermal manager.*",
+        ".*ERR syncd#syncd.*driverEgressMemoryUpdate.*",
+        ".*ERR syncd#syncd.*brcm_sai.*",
+        ".*ERR teamd#teamsyncd.*readData.*netlink reports an error=-33 on reading a netlink socket.*",
+        ".*WARNING syncd#syncd.*saiDiscover: skipping since it causes crash.*",
+        ".*ERR systemd.*Failed to start .* container.*",
+        ".*ERR kernel.*PortChannel.*",
+        ".*ERR route_check.*",
+    ]
+    if loganalyzer:
+        loganalyzer.ignore_regex.extend(ignoreRegex)
+
+
 def get_group_program_info(duthost, container_name, group_name):
     """
     @summary: Get program names, running status and their pids by analyzing the command
@@ -34,7 +83,7 @@ def get_group_program_info(duthost, container_name, group_name):
     program_status = None
     program_pid = -1

-    program_list = duthost.shell("docker exec {} supervisorctl status".format(container_name))
+    program_list = duthost.shell("docker exec {} supervisorctl status".format(container_name), module_ignore_errors=True)
     for program_info in program_list["stdout_lines"]:
         if program_info.find(group_name) != -1:
             program_name = program_info.split()[0].split(':')[1].strip()
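The string handling in `get_group_program_info` can be illustrated on a single `supervisorctl status` line. This is a sketch: the sample output line and the position of the pid field are assumptions about supervisord's output format, not taken from the diff.

```python
# Hypothetical output line from "supervisorctl status" for a process group member.
program_info = "lldp:lldpd                       RUNNING   pid 67, uptime 1:29:42"

fields = program_info.split()
# "group:program" -> take the program name after the colon, as the diff does.
program_name = fields[0].split(':')[1].strip()
# The second whitespace-separated field is the supervisord state.
program_status = fields[1].strip()
# For RUNNING processes, the pid follows the "pid" token (assumed format).
program_pid = int(fields[3].rstrip(',')) if program_status == "RUNNING" else -1

print(program_name, program_status, program_pid)
```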
@@ -59,7 +108,7 @@ def get_program_info(duthost, container_name, program_name):
     program_status = None
     program_pid = -1

-    program_list = duthost.shell("docker exec {} supervisorctl status".format(container_name))
+    program_list = duthost.shell("docker exec {} supervisorctl status".format(container_name), module_ignore_errors=True)
     for program_info in program_list["stdout_lines"]:
         if program_info.find(program_name) != -1:
             program_status = program_info.split()[1].strip()
@@ -137,6 +186,20 @@ def is_hiting_start_limit(duthost, container_name):
     return False


+def clear_failed_flag_and_restart(duthost, container_name):
+    """
+    @summary: If a container hits the restart limitation, then we clear the failed flag and
+              restart it.
+    """
+    logger.info("{} hits start limit and clear reset-failed flag".format(container_name))
+    duthost.shell("sudo systemctl reset-failed {}.service".format(container_name))
+    duthost.shell("sudo systemctl start {}.service".format(container_name))
+    restarted = wait_until(CONTAINER_RESTART_THRESHOLD_SECS,
+                           CONTAINER_CHECK_INTERVAL_SECS,
+                           check_container_state, duthost, container_name, True)
+    pytest_assert(restarted, "Failed to restart container '{}' after reset-failed was cleared".format(container_name))
+
+
 def verify_autorestart_with_critical_process(duthost, container_name, program_name,
                                              program_status, program_pid):
     """
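`clear_failed_flag_and_restart` relies on the `wait_until` polling helper. Its semantics can be sketched as follows; this is a simplified stand-in with an assumed signature, not the actual sonic-mgmt implementation.

```python
import time

def wait_until(timeout, interval, condition, *args):
    """Poll condition(*args) every `interval` seconds until it returns True
    or `timeout` seconds elapse. Returns True on success, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition(*args):
            return True
        time.sleep(interval)
    return False

# Usage sketch: a condition that only becomes true on the third poll,
# mimicking a container that takes a few checks to come back up.
state = {"polls": 0}

def container_is_up(state):
    state["polls"] += 1
    return state["polls"] >= 3

result = wait_until(2, 0.01, container_is_up, state)
print(result)
```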
@@ -165,13 +228,7 @@ def verify_autorestart_with_critical_process(duthost, container_name, program_name,
                            check_container_state, duthost, container_name, True)
     if not restarted:
         if is_hiting_start_limit(duthost, container_name):
-            logger.info("{} hits start limit and clear reset-failed flag".format(container_name))
-            duthost.shell("sudo systemctl reset-failed {}.service".format(container_name))
-            duthost.shell("sudo systemctl start {}.service".format(container_name))
-            restarted = wait_until(CONTAINER_RESTART_THRESHOLD_SECS,
-                                   CONTAINER_CHECK_INTERVAL_SECS,
-                                   check_container_state, duthost, container_name, True)
-            pytest_assert(restarted, "Failed to restart container '{}' after reset-failed was cleared".format(container_name))
+            clear_failed_flag_and_restart(duthost, container_name)
         else:
             pytest.fail("Failed to restart container '{}'".format(container_name))
@@ -203,6 +260,24 @@ def verify_no_autorestart_with_non_critical_process(duthost, container_name, program_name,
     duthost.shell("docker exec {} supervisorctl start {}".format(container_name, program_name))


+def postcheck_critical_processes_status(duthost, container_autorestart_states):
+    """
+    @summary: Do the post check to see whether all the critical processes are alive after
+              testing the autorestart feature. First we restart the containers which hit
+              the restart limitation and then do the post check.
+    """
+    for container_name in container_autorestart_states.keys():
+        if is_hiting_start_limit(duthost, container_name):
+            clear_failed_flag_and_restart(duthost, container_name)
+
+    # Sleep 20 seconds such that containers have a chance to start.
+    time.sleep(20)
+    processes_status = duthost.all_critical_process_status()
+    for container_name, processes in processes_status.items():
+        if processes["status"] is False or len(processes["exited_critical_process"]) > 0:
+            pytest.fail("Critical process(es) was not running after container '{}' was restarted.".format(container_name))

Review comment on this block:
Reviewer: Please convert this block into pytest_assert(wait_until(...), failure message).
Author: Great suggestion! Fixed.
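The reviewer's suggestion, replacing the fixed sleep and one-shot check with `pytest_assert(wait_until(...), message)`, could look roughly like the sketch below. All helpers here are stand-ins with assumed signatures (including the `FakeDUT` class, which exists only to make the sketch runnable); this is not the code that was actually merged.

```python
import time

# Stand-ins for the sonic-mgmt helpers (assumed signatures, for illustration only).
def pytest_assert(condition, message):
    assert condition, message

def wait_until(timeout, interval, condition, *args):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition(*args):
            return True
        time.sleep(interval)
    return False

def all_critical_processes_healthy(duthost):
    """Return True when every container reports healthy critical processes."""
    processes_status = duthost.all_critical_process_status()
    return all(processes["status"] and not processes["exited_critical_process"]
               for processes in processes_status.values())

class FakeDUT(object):
    """Hypothetical stand-in for duthost, used only to make the sketch runnable."""
    def all_critical_process_status(self):
        return {"swss": {"status": True, "exited_critical_process": []},
                "lldp": {"status": True, "exited_critical_process": []}}

# Poll instead of sleeping a fixed 20 seconds, then assert once with a message.
pytest_assert(wait_until(1, 0.1, all_critical_processes_healthy, FakeDUT()),
              "Not all critical processes are running after autorestart testing")
print("postcheck passed")
```

The advantage over a fixed sleep is that the check succeeds as soon as the containers are healthy and produces a single descriptive failure message on timeout.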
+
+
 def test_containers_autorestart(duthost, tbinfo):
     """
     @summary: Test the auto-restart feature of each container against two scenarios: killing

@@ -275,3 +350,5 @@ def test_containers_autorestart(duthost, tbinfo):
             duthost.shell("sudo config feature autorestart {} disabled".format(container_name))

         logger.info("End of testing the container '{}'".format(container_name))
+
+    postcheck_critical_processes_status(duthost, container_autorestart_states)
Review conversation:

Reviewer: Please split this regex array into one array for each service.

Author: Great suggestion! Fixed.

jleveque: The idea of splitting the regex into groups was so that we can ignore only a subset of messages for each container being tested. You split them into groups, but then you simply add all the groups in unconditionally, which results in the same behavior as before.

yozhao101: @jleveque The idea is to split the ignore regex into subgroups, where each subgroup represents a container. When a container is tested, for example syncd, this function will be called such that the messages in the syncd subgroup are ignored, right?

jleveque: Yes.

yozhao101: Joe, thanks for the clarification!
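The per-container grouping discussed above could be structured as sketched below. This is a hypothetical layout (the mapping name, groupings, and helper are illustrative); the grouping in the merged PR may differ.

```python
# Hypothetical mapping from container name to the syslog patterns that are
# expected (and therefore ignorable) only while that container is under test.
ignore_regex_per_container = {
    "swss": [
        ".*ERR swss#orchagent.*removeLag.*",
        ".*ERR syncd#syncd.*driverEgressMemoryUpdate.*",
    ],
    "teamd": [
        ".*ERR teamd#teamsyncd.*readData.*netlink reports an error=-33 on reading a netlink socket.*",
    ],
    "pmon": [
        ".*ERR pmon#xcvrd.*initializeGlobalConfig.*",
        ".*ERR pmon#thermalctld.*Caught exception while initializing thermal manager.*",
    ],
}

def extend_ignore_regex(loganalyzer_ignore, container_name):
    """Add only the patterns for the container currently being tested."""
    loganalyzer_ignore.extend(ignore_regex_per_container.get(container_name, []))
    return loganalyzer_ignore

print(extend_ignore_regex([], "teamd"))
```

With this structure, the fixture can extend `loganalyzer.ignore_regex` with just the subgroup for the container under test, rather than adding every pattern unconditionally.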