Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Stopping Synapse server hangs when ran in Element e2e test CI job #11471

Closed
germain-gg opened this issue Dec 1, 2021 · 7 comments · Fixed by #11490
Closed

Stopping Synapse server hangs when ran in Element e2e test CI job #11471

germain-gg opened this issue Dec 1, 2021 · 7 comments · Fixed by #11490
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@germain-gg
Copy link
Contributor

Description

matrix-react-sdk runs end to end tests in CI which spawn a synapse instance. Starting from 30st of November we have had all our CI jobs getting stuck when trying to stop Synapse. View the stop.sh script that gets executed

Steps to reproduce

Version information

v1.4.7 works ✅, v1.4.8 doesn't ❎

You can find all the logs in the CI job artifact here https://github.com/matrix-org/matrix-react-sdk/actions/runs/1522816410

But here is what the log file finishes with

2021-11-30 20:45:53,506 - twisted - 275 - INFO - sentinel - Received SIGTERM, shutting down.
2021-11-30 20:45:53,510 - twisted - 275 - INFO - sentinel - (TCP Port 5005 Closed)
2021-11-30 20:45:53,513 - twisted - 275 - INFO - sentinel - Main loop terminated.
2021-11-30 20:45:53,507 - synapse.storage.databases.main.lock - 82 - INFO - LockStore._on_shutdown-0 - Dropping held locks due to shutdown
2021-11-30 20:45:53,507 - synapse.storage.databases.main.lock - 91 - INFO - LockStore._on_shutdown-0 - Dropped locks due to shutdown
2021-11-30 20:45:53,508 - synapse.handlers.presence - 751 - INFO - presence.on_shutdown-0 - Performing _on_shutdown. Persisting 12 unpersisted changes
2021-11-30 20:45:53,508 - synapse.app._base - 495 - INFO - sentinel - Shutting down...
2021-11-30 20:45:53,510 - synapse.handlers.presence - 762 - INFO - presence.on_shutdown-0 - Finished _on_shutdown
2021-11-30 20:45:53,514 - synapse.app.homeserver - 134 - WARNING - sentinel - Stopping daemon.
@richvdh
Copy link
Member

richvdh commented Dec 1, 2021

I'm guessing this is due to #11459; however the only difference there is that we wait for synapse to actually stop.

@gsouquet: can we see the output from synctl somewhere? I don't see it in the main output, nor the artifact.

@richvdh
Copy link
Member

richvdh commented Dec 1, 2021

v1.4.7 works white_check_mark, v1.4.8 doesn't negative_squared_cross_mark

I think you mean 1.48, but have you actually tried with that? #11459 is in develop but not 1.48.

@germain-gg
Copy link
Contributor Author

I have actually not tried 1.48. Was kind of an assumption given the failure more or less matched with the release time.

I don't think we have the synctl logs anywhere unfortunately, but we could try to tweak the job to output them if that helps you in your investigation

@richvdh
Copy link
Member

richvdh commented Dec 1, 2021

but we could try to tweak the job to output them if that helps you in your investigation

I think it would, yes. thanks!

@squahtx squahtx added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label Dec 1, 2021
@squahtx
Copy link
Contributor

squahtx commented Dec 2, 2021

I tried (and failed) to reproduce the issue by running the CI scripts manually.

I suspect what's happening is that Synapse becomes a zombie process after terminating inside the GitHub Actions container [1].
synctl checks whether Synapse is still running using kill, which succeeds for zombies, and thus waits indefinitely.

[1] https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/

@richvdh
Copy link
Member

richvdh commented Dec 2, 2021

Interesting, that makes sense.

@gsouquet it might be helpful if you could make stop.sh run ps auxf before running synctl, so we can get a better idea of the process heirarchy within the container.

@squahtx
Copy link
Contributor

squahtx commented Dec 2, 2021

I tested a workaround (7009f45, 425d9f5) in synctl and it makes the E2E CI terminate on time.

ps auxf shows a bunch of zombie processes:
https://github.com/matrix-org/matrix-react-sdk/runs/4395592326?check_suite_focus=true#step:5:114

all tests finished successfully
Stopping Element server ...
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         730  0.0  0.0   2392   692 ?        Ss   13:33   0:00 sh -e /__w/_temp/bad7a49b-42cf-4f19-ac70-9c2afc9165fd.sh
root         737  0.0  0.0   3740  2792 ?        S    13:33   0:00  \_ /bin/bash ./scripts/ci/run-end-to-end-tests.sh
root         740  0.0  0.0   3740  2936 ?        S    13:33   0:00      \_ /bin/bash ./run.sh --no-sandbox --log-directory logs/
root        1214  0.0  0.0   3740  2808 ?        S    13:35   0:00          \_ /bin/bash ./synapse/stop.sh
root        1218  0.0  0.0   7644  2728 ?        R    13:35   0:00              \_ ps auxf
root           1  0.0  0.0   2332   736 ?        Ss   13:29   0:00 tail -f /dev/null
root         120  0.0  0.0      0     0 ?        Z    13:30   0:00 [git-remote-http] <defunct>
root         292  0.0  0.0      0     0 ?        Z    13:30   0:00 [git-remote-http] <defunct>
root         764 10.8  1.4 1382860 100588 ?      Ssl  13:33   0:13 /__w/matrix-react-sdk/matrix-react-sdk/test/end-to-end-tests/synapse/installations/consent/env/bin/python -m synapse.app.homeserver -c homeserver.yaml --daemonize
root         776  0.0  0.0      0     0 ?        Z    13:33   0:00 [bash] <defunct>
root         793  0.0  0.0      0     0 ?        Z    13:33   0:00 [chrome] <defunct>
[...]
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         730  0.0  0.0   2392   692 ?        Ss   13:33   0:00 sh -e /__w/_temp/bad7a49b-42cf-4f19-ac70-9c2afc9165fd.sh
root         737  0.0  0.0   3740  2792 ?        S    13:33   0:00  \_ /bin/bash ./scripts/ci/run-end-to-end-tests.sh
root         740  0.0  0.0   3740  2936 ?        S    13:33   0:00      \_ /bin/bash ./run.sh --no-sandbox --log-directory logs/
root        1214  0.0  0.0   3740  2812 ?        S    13:35   0:00          \_ /bin/bash ./synapse/stop.sh
root        1219 36.0  0.6  74020 45044 ?        S    13:35   0:00              \_ /__w/matrix-react-sdk/matrix-react-sdk/test/end-to-end-tests/synapse/installations/consent/env/bin/python ./synctl stop
root        1220  0.0  0.0   7872  2768 ?        R    13:35   0:00                  \_ ps auxf
root           1  0.0  0.0   2332   736 ?        Ss   13:29   0:00 tail -f /dev/null
root         120  0.0  0.0      0     0 ?        Z    13:30   0:00 [git-remote-http] <defunct>
root         292  0.0  0.0      0     0 ?        Z    13:30   0:00 [git-remote-http] <defunct>
root         764 10.7  0.0      0     0 ?        Zs   13:33   0:13 [python] <defunct>
root         776  0.0  0.0      0     0 ?        Z    13:33   0:00 [bash] <defunct>
root         793  0.0  0.0      0     0 ?        Z    13:33   0:00 [chrome] <defunct>
[...]
stopped synapse.app.homeserver

The first invocation is just before synctl and the 2nd is during.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants