-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[memory_checker] Change log severity to WARNING in a case when the docker service is not running #16018
[memory_checker] Change log severity to WARNING in a case when the docker service is not running #16018
Conversation
@yozhao101 could you please take a look at this PR? |
@yozhao101 kind reminder |
@qiluo-msft @yozhao101 could you please help to review? this is a fix following a PR merged/reviewed by you so are the optimal reviewer for this one |
@@ -140,10 +140,10 @@ def get_running_container_names(): | |||
running_container_list = docker_client.containers.list(filters={"status": "running"}) | |||
running_container_names = [ container.name for container in running_container_list ] | |||
except (docker.errors.APIError, docker.errors.DockerException) as err: | |||
syslog.syslog(syslog.LOG_ERR, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the idea of get_running_container_names()
, if the docker service
is not running it just gives us a warning message to the syslog and returns the empty list of docker containers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify:
- check docker service is running, if not running, no ERR, no WARNING. Low frequent INFO may be good.
- if docker service is running, retrieve the running container list. If failed to retrieve, treat it still as ERR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have the check if the docker service
is running, but in some cases (which I described in the PR description) is not enough.
This check is right before the get_running_container_names()
function call (161 line number).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! Let me update my proposal:
- when you catch an exception, if you are sure docker service is not running, log INFO in low frequency
- if not a docker service issue, something really bad happens, let's still use ERR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation is aligned according to your last comment, please take a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft please take a look
bdbe756
to
8e70bf0
Compare
… service is not running. Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
8e70bf0
to
e0df3f4
Compare
… service is not running. (sonic-net#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created sonic-net#11129](sonic-net#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
Cherry-pick PR to 202305: #16405 |
… service is not running. (#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
… service is not running. (sonic-net#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created sonic-net#11129](sonic-net#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
… service is not running. (sonic-net#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created sonic-net#11129](sonic-net#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
Cherry-pick PR to 202211: #16821 |
@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!! |
7 similar comments
@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!! |
@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!! |
@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!! |
@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!! |
@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!! |
@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!! |
@vadymhlushko-mlnx cherry pick PR didn't pass PR checker. Please check!!! |
… service is not running. (#16018) #### Why I did it To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](#11129). There could be a scenario before the reboot, where 1. The `docker service` has stopped 2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry` In such scenario, the `memory_checker` script will throw an error to the syslog: ``` ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))' ``` But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog. #### How I did it Change the log severity to the warning and changed the return value. #### How to verify it It is really hard to catch the exact moment described in the `Why I did it` section. In order to check the logic: 1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](https://github.com/sonic-net/sonic-buildimage/blob/47742dfc2c0d1fa27198d69c9183ddc044e11b22/files/image_config/monit/memory_checker#L139) file on the switch. 2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry` 3. Check the syslog for such messages: ``` WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte d.', FileNotFoundError(2, 'No such file or directory'))' INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running! ```
Why I did it
To fix the logic introduced by [memory_checker] Do not check memory usage of containers which are not created #11129.
There could be a scenario before the reboot, where
docker service
has stoppedroot@sonic:/home/admin# monit status container_memory_telemetry
In such scenario, the
memory_checker
script will throw an error to the syslog:But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the
FileNotFoundError(2, 'No such file or directory'
exception in the syslog.Work item tracking
How I did it
Change the log severity to the warning and changed the return value.
How to verify it
It is really hard to catch the exact moment described in the
Why I did it
section.In order to check the logic:
root@sonic:/home/admin# monit restart container_memory_telemetry
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)