SONiC docker containers show as "Exited" after config-reload #7180

Closed
alexrallen opened this issue Mar 29, 2021 · 4 comments · Fixed by #7228
Labels: Issue for 202012, Triaged

Comments

@alexrallen
Contributor

Description

Occasionally, after a config reload, we have observed that the docker containers fail to restart properly and show as "Exited" in the docker ps -a output.

I have debugged this: the containers that fail to start all depend on the systemd service interfaces-config.service, whose start job hangs when this bug appears. That start job has no timeout, so everything that depends on it waits indefinitely.

The start job hangs at the point where it runs an ifupdown command, which in turn runs systemctl try-restart ntp.service. systemctl list-jobs shows that a "nop" job has been inserted into the queue because the ntp service is not running. However, that job appears to get stuck in the queue and never executes.

Execution trace of the locked up interfaces-config.service start job.

           ├─11300 /bin/bash /usr/bin/interfaces-config.sh
           ├─11301 /usr/bin/python /usr/sbin/ifdown --force eth0
           ├─11547 /sbin/dhclient -r -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases eth0
           ├─11670 /bin/sh /sbin/dhclient-script
           ├─11682 /bin/sh /usr/sbin/invoke-rc.d ntp try-restart
           └─11704 systemctl try-restart ntp.service

Systemd jobs showing the held up ntp nop job.

root@r-lionfish-07:/var/log# sudo systemctl list-jobs
 JOB UNIT                       TYPE  STATE
1875 ntp.service                nop   waiting
2147 snmp.service               start waiting
1997 telemetry.service          start waiting
1753 syncd.service              start waiting
1749 dhcp_relay.service         start waiting
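
The check above can be scripted when triaging a hung reload. The following is a minimal sketch, not part of SONiC: the names `Job`, `parse_list_jobs` and `waiting_nop_jobs` are hypothetical, and the sample text is the `systemctl list-jobs` output quoted above. A waiting nop job is only a hint, since a healthy system can briefly show one too.

```python
from typing import List, NamedTuple


class Job(NamedTuple):
    """One row of `systemctl list-jobs` (hypothetical helper type)."""
    job_id: int
    unit: str
    job_type: str
    state: str


def parse_list_jobs(text: str) -> List[Job]:
    """Parse the table printed by `systemctl list-jobs` into Job tuples."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly four columns and start with a numeric job ID.
        # This skips the header, blank lines and any summary line.
        if len(fields) == 4 and fields[0].isdigit():
            jobs.append(Job(int(fields[0]), fields[1], fields[2], fields[3]))
    return jobs


def waiting_nop_jobs(jobs: List[Job]) -> List[Job]:
    """Return nop jobs that are still waiting in the queue."""
    return [j for j in jobs if j.job_type == "nop" and j.state == "waiting"]


# Output copied from the report above.
SAMPLE = """\
 JOB UNIT                       TYPE  STATE
1875 ntp.service                nop   waiting
2147 snmp.service               start waiting
1997 telemetry.service          start waiting
1753 syncd.service              start waiting
1749 dhcp_relay.service         start waiting
"""

for job in waiting_nop_jobs(parse_list_jobs(SAMPLE)):
    print(f"waiting nop job {job.job_id} on {job.unit}")
```

On a live switch the same functions could be fed the captured output of `systemctl list-jobs` instead of `SAMPLE`.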

I found the following systemd issue which I believe this to be an instance of:
systemd/systemd#13124

I confirmed that the systemd version we are running, 241, was released before this bug was reported and fixed. The infrequent nature of the systemd bug also explains why we rarely see this occur in SONiC in practice.

I recommend upgrading the version of systemd used by SONiC to resolve this issue.
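
To check which images are still exposed, the running systemd version can be compared against a release known to carry the fix. This is a rough sketch under stated assumptions: the function names are hypothetical, the sample string only illustrates the first line of `systemctl --version`, and 247 is used as the threshold solely because it is the version this thread later upgrades to. The fix may already be present in earlier releases.

```python
import re
from typing import Optional

# Assumption: 247 is the release this thread later moves to; it is not
# necessarily the first release that contains the fix.
KNOWN_GOOD_VERSION = 247


def parse_systemd_version(version_output: str) -> Optional[int]:
    """Extract the major version from the output of `systemctl --version`."""
    match = re.match(r"systemd\s+(\d+)", version_output.strip())
    return int(match.group(1)) if match else None


def at_least_known_good(version_output: str,
                        known_good: int = KNOWN_GOOD_VERSION) -> bool:
    """True if the parsed version is at or above the known-good release."""
    version = parse_systemd_version(version_output)
    return version is not None and version >= known_good


# Illustrative first line of `systemctl --version` on the affected image.
SAMPLE_OUTPUT = "systemd 241 (241)"
print(at_least_known_good(SAMPLE_OUTPUT))  # prints False
```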

Attached below is the show inventory output
sonic_dump_r-lionfish-07_20210329_195525.tar.gz

Steps to reproduce the issue:

  1. Run the sonic-mgmt test platform_tests/test_reload_config.py

Describe the results you received:

The test fails because many of the docker containers are shown as "exited" with only bgp and database containers running.

Describe the results you expected:

All containers back up and running, test passes successfully.

Output of show version:

SONiC Software Version: SONiC.SONIC.202012.53-aee4892c_Internal
Distribution: Debian 10.8
Kernel: 4.19.0-12-2-amd64
Build commit: aee4892c
Build date: Thu Mar 25 07:17:58 UTC 2021
Built by: sw-r2d2-bot@r-build-sonic-ci03

Platform: x86_64-mlnx_msn3420-r0
HwSKU: ACS-MSN3420
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2012X01822            
Uptime: 19:55:35 up 1 day, 19:40,  1 user,  load average: 2.31, 1.45, 1.05

Docker images:
REPOSITORY                    TAG                                 IMAGE ID            SIZE
docker-teamd                  SONIC.202012.53-aee4892c_Internal   1d73bce4fcd8        412MB
docker-teamd                  latest                              1d73bce4fcd8        412MB
docker-nat                    SONIC.202012.53-aee4892c_Internal   4b70ff56ef58        415MB
docker-nat                    latest                              4b70ff56ef58        415MB
docker-orchagent              SONIC.202012.53-aee4892c_Internal   b3ab4b23959b        431MB
docker-orchagent              latest                              b3ab4b23959b        431MB
docker-fpm-frr                SONIC.202012.53-aee4892c_Internal   8eedb2cefa34        431MB
docker-fpm-frr                latest                              8eedb2cefa34        431MB
docker-sflow                  SONIC.202012.53-aee4892c_Internal   f73678a4a5e4        413MB
docker-sflow                  latest                              f73678a4a5e4        413MB
docker-wjh                    202012.202012.0-e663b63             bbf69da66de2        441MB
docker-wjh                    latest                              bbf69da66de2        441MB
docker-platform-monitor       SONIC.202012.53-aee4892c_Internal   9431a24cce94        693MB
docker-platform-monitor       latest                              9431a24cce94        693MB
docker-syncd-mlnx             SONIC.202012.53-aee4892c_Internal   b435d0fb66dc        666MB
docker-syncd-mlnx             latest                              b435d0fb66dc        666MB
docker-snmp                   SONIC.202012.53-aee4892c_Internal   1186f8a7345a        442MB
docker-snmp                   latest                              1186f8a7345a        442MB
docker-sonic-mgmt-framework   SONIC.202012.53-aee4892c_Internal   f82c902ebc99        620MB
docker-sonic-mgmt-framework   latest                              f82c902ebc99        620MB
docker-router-advertiser      SONIC.202012.53-aee4892c_Internal   58926fcdf5f6        401MB
docker-router-advertiser      latest                              58926fcdf5f6        401MB
docker-lldp                   SONIC.202012.53-aee4892c_Internal   175e6dcfe231        441MB
docker-lldp                   latest                              175e6dcfe231        441MB
docker-database               SONIC.202012.53-aee4892c_Internal   586be7012094        401MB
docker-database               latest                              586be7012094        401MB
docker-sonic-telemetry        SONIC.202012.53-aee4892c_Internal   4232634fe32d        491MB
docker-sonic-telemetry        latest                              4232634fe32d        491MB
docker-dhcp-relay             SONIC.202012.53-aee4892c_Internal   06444d9dd53a        405MB
docker-dhcp-relay             latest                              06444d9dd53a        405MB
@stepanblyschak
Collaborator

Observed the same issue and also investigated down to systemd/systemd#13124.

In my scenario the switch got stuck at "systemctl try-restart systemd-timesyncd.service".
The assumption that this is related to systemd/systemd#13124 makes sense: I see hostcfgd running at that time and executing a few "systemctl daemon-reload" commands, which is exactly how to reproduce the issue that systemd/systemd#13124 fixes.
A workaround is to run "systemctl try-restart systemd-timesyncd.service" again, which inserts another nop job and unblocks the service chain.
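
The workaround could be wrapped in a small helper for use during recovery. This is only a sketch: `workaround_command` and `apply_workaround` are hypothetical names, and the 30-second timeout is an arbitrary safety margin, not a value taken from this thread. The only command issued is the `systemctl try-restart <unit>` call described above.

```python
import subprocess
from typing import Callable, Iterable, List


def workaround_command(unit: str) -> List[str]:
    """Build the try-restart command that queues another nop job for `unit`."""
    return ["systemctl", "try-restart", unit]


def apply_workaround(units: Iterable[str],
                     run: Callable = subprocess.run,
                     timeout: int = 30) -> List[List[str]]:
    """Issue try-restart for each unit and return the commands that were run.

    `run` is injectable so the helper can be exercised without systemd.
    With the default runner, subprocess.TimeoutExpired is raised if a
    command does not return within `timeout` seconds.
    """
    issued = []
    for unit in units:
        cmd = workaround_command(unit)
        run(cmd, check=False, timeout=timeout)
        issued.append(cmd)
    return issued


# Dry run: print the commands instead of executing them.
apply_workaround(["systemd-timesyncd.service"],
                 run=lambda cmd, **kwargs: print(" ".join(cmd)))
```

Calling `apply_workaround(["ntp.service"])` without the `run` argument would execute the command for real, so it needs root on the affected switch.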

@lguohan How can we bring the fix from systemd/systemd#13124 into SONiC? We currently do not build systemd from source; should we use a newer version from buster-backports?

@anshuv-mfst

anshuv-mfst commented Mar 31, 2021

@stepanblyschak - could you please look into this issue, thanks.

@anshuv-mfst added the Triaged label Mar 31, 2021
@stepanblyschak
Collaborator

PR - #7228

@liat-grozovik changed the title from 'sonic docker containers show as "Exited" after config-reload' to 'SONiC docker containers show as "Exited" after config-reload' Apr 5, 2021
@lguohan linked a pull request Apr 8, 2021 that will close this issue
4 tasks
lguohan pushed a commit that referenced this issue Apr 8, 2021
Fix #7180 

Update systemd to v247 in order to pick the fix for "core: coldplug possible nop_job" systemd/systemd#13124

Install systemd and systemd-sysv from buster-backports. Pass "systemd.unified_cgroup_hierarchy=0" as a kernel argument to force systemd not to use the unified cgroup hierarchy; otherwise dockerd won't start (see moby/moby#16238).
Also, chown $FILESYSTEM_ROOT to root; otherwise the apt installation of systemd complains. See the similar report at https://unix.stackexchange.com/questions/593529/can-not-configure-systemd-inside-a-chrooted-environment

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
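
After an image with this change boots, the kernel argument mentioned in the commit message can be confirmed from the kernel command line. This is a minimal sketch: the function names and the sample command line are hypothetical, the parser does not handle quoted values containing spaces, and it only recognises the literal `=0` form used in this commit.

```python
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {parameter: value or None}."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Bare flags such as "rw" have no "=" and map to None.
        params[key] = value if sep else None
    return params


def legacy_cgroups_forced(cmdline: str) -> bool:
    """True if systemd.unified_cgroup_hierarchy=0 is on the command line."""
    return parse_cmdline(cmdline).get("systemd.unified_cgroup_hierarchy") == "0"


# Hypothetical command line, for illustration only.
SAMPLE_CMDLINE = "root=/dev/sda3 rw console=ttyS0,115200 systemd.unified_cgroup_hierarchy=0"
print(legacy_cgroups_forced(SAMPLE_CMDLINE))  # prints True
```

On a running switch the input would be the contents of `/proc/cmdline`.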
yxieca pushed a commit that referenced this issue Apr 8, 2021
@stepanblyschak
Collaborator

Relevant for 202012 as systemd 247 was reverted.

raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-buildimage that referenced this issue May 23, 2021
carl-nokia pushed a commit to carl-nokia/sonic-buildimage that referenced this issue Aug 7, 2021