flaky test: test_neon_superuser / pg15-release cannot properly exit when replication is enabled #6969
Took a quick look, and this seems to be caused by the server not being able to shut down when logical replication is enabled. Mostly failing in PG15. cc @save-buffer
Or maybe not related to logical replication... sorry for tagging
Normal shutdown log:
Abnormal shutdown logs:
The shutdown process takes a minute. It seems that Postgres did not clean up all of its processes, and therefore compute_ctl does not consider it shut down.
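For illustration only (compute_ctl is a separate Rust binary and this is not its actual code): a minimal C sketch of why a postmaster that never finishes exiting stalls whoever is waiting on it.

```c
/*
 * Hypothetical sketch: poll for the postmaster's exit with a timeout.
 * A postmaster that is still waiting on its children never exits, so this
 * loop only times out and the caller has to escalate or report failure.
 */
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static bool
wait_for_postmaster_exit(pid_t pid, int timeout_secs)
{
	for (int elapsed = 0; elapsed < timeout_secs; elapsed++)
	{
		int		status;

		if (waitpid(pid, &status, WNOHANG) == pid)
			return true;	/* postmaster (and all its children) exited */
		sleep(1);			/* still running: poll again */
	}
	return false;			/* timed out; caller may force-kill or fail */
}
```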
ref #6969 Signed-off-by: Alex Chi Z <chi@neon.tech>
Why do we run the monitor for CI tests at all?
The problem is that postgres never exits. In
It's stuck at
The monitor is working as expected.
@skyzh Please add the details about the root cause. This is realistically not a problem in production, until and unless we suspend a compute with logical replication enabled.
A summary of the root cause: if there is a replication job in pg15, it won't exit by itself, and the postmaster waits forever. This does not cause any problems for us in prod, because we don't suspend the compute node if logical replication is enabled. If we really want to stop it, Kubernetes will (possibly) do a force kill on a timeout, so it can always be stopped in prod. The only problem is that the stop may take a long time, and during this time users cannot connect to the database. Mitigation: #6975, which closes the replication. This allows us to pass the test case. If we need to solve this issue properly, we will need to look into our pg15 source code modifications and how they correlate with this seemingly unrelated pull request: #6935
This pull request mitigates #6969, but the longer-term problem is that we cannot properly stop Postgres if there is a subscription. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>
test_neon_superuser / pg15-release cannot properly exit when replication is enabled
The issue can be reproduced on pg14-debug and pg15-debug after inserting a few log statements into the postmaster process on macOS.
The process is stuck in WalSndWaitStopping, and walsnd->state == WALSNDSTATE_STREAMING.
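For context, in stock PostgreSQL the shutdown sequence cannot finish until WalSndWaitStopping() returns, and it only returns once every live walsender has switched to WALSNDSTATE_STOPPING. A condensed sketch of that loop (paraphrased from walsender.c, not verbatim, and it requires the backend headers):

```c
/*
 * Condensed paraphrase of WalSndWaitStopping() in replication/walsender.c,
 * shown only to explain the hang: a walsender that stays in
 * WALSNDSTATE_STREAMING keeps this loop, and therefore the shutdown,
 * waiting forever.
 */
void
WalSndWaitStopping(void)
{
	for (;;)
	{
		bool		all_stopped = true;

		for (int i = 0; i < max_wal_senders; i++)
		{
			WalSnd	   *walsnd = &WalSndCtl->walsnds[i];

			SpinLockAcquire(&walsnd->mutex);
			if (walsnd->pid != 0 && walsnd->state != WALSNDSTATE_STOPPING)
				all_stopped = false;
			SpinLockRelease(&walsnd->mutex);

			if (!all_stopped)
				break;
		}

		if (all_stopped)
			return;			/* every walsender acknowledged the stop */

		pg_usleep(10000L);	/* 10 ms, then re-check */
	}
}
```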
Okay, the problem seems to be with safekeepers. The walproposer does not exit when there are logical replication slots open.
It seems that the safekeeper did not receive SIGUSR2 (got_SIGUSR2=false), and so it does not exit. Is signal handling in pg15 different from other versions? (pg14 is just flaky and sometimes passes; not sure if the root cause is the same.)
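For reference, in stock PostgreSQL's walsender.c SIGUSR2 means "do one last cycle, then stop": the handler sets got_SIGUSR2, and only then does the send loop call WalSndDone() and leave the streaming state. A condensed paraphrase (not the exact source, and the Neon walproposer may differ):

```c
/*
 * Paraphrased from replication/walsender.c (Postgres backend code, not
 * standalone): if this handler never runs, got_SIGUSR2 stays false and the
 * walsender never leaves WALSNDSTATE_STREAMING, which matches the hang above.
 */
static volatile sig_atomic_t got_SIGUSR2 = false;

/* SIGUSR2: postmaster asks for one last cycle and then a clean stop. */
static void
WalSndLastCycleHandler(SIGNAL_ARGS)
{
	got_SIGUSR2 = true;
	SetLatch(MyLatch);
}

/* Inside the walsender main loop, once all pending WAL has been sent: */
if (WalSndCaughtUp && !pq_is_send_pending())
{
	if (got_SIGUSR2)
		WalSndDone(send_data);	/* flush, switch to STOPPING, then exit */
}
```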
Interesting, what is the best way to reproduce? "inserting a few log statements into the postmaster process on macOS" -- in which place? #6975 is merged, so it happens on main now, right? Or alternatively, do you have full logs for a fresh failure?
This "don't suspend the compute node if logical replication is enabled" logic doesn't prevent the stop once we have actually decided to stop, i.e. SIGTERMed the postmaster, so the issue here is something different.
@arssher remove the last SQL (drop subscription) from This is the log mixed with my own printfs...
errno is not preserved in the signal handler. This pull request fixes it. Maybe related: #6969, but it does not fix the flaky test problem. Signed-off-by: Alex Chi Z <chi@neon.tech>
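The fix follows the usual PostgreSQL convention for signal handlers; a minimal sketch of that pattern (the actual handler and flag names in the patch may differ):

```c
/*
 * Generic sketch of the save/restore-errno convention for signal handlers;
 * SIGNAL_ARGS, SetLatch and MyLatch come from the PostgreSQL backend.
 */
#include <errno.h>

static volatile sig_atomic_t shutdown_requested = false;

static void
handle_shutdown_signal(SIGNAL_ARGS)
{
	int			save_errno = errno;	/* work done in the handler may clobber errno */

	shutdown_requested = true;
	SetLatch(MyLatch);

	errno = save_errno;				/* restore it so the interrupted code's
									 * errno checks still see the right value */
}
```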
After looking into it a little bit: the system is actually stuck on the walsender instead of the walproposer (so it's unrelated to the safekeeper team, haha). It seems to be related to our modifications to Postgres for logical replication persistence -> neondatabase/postgres#395
Fix #6969 Ref neondatabase/postgres#395 neondatabase/postgres#396 Postgres will get stuck on exit if the replication slot is not dropped before shutting down. This is caused by Neon's custom WAL record for recording replication slots. The pull requests in the postgres repo fix the problem, and this pull request bumps the postgres commit. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>
Still in progress, to be released this week.
Steps to reproduce
In #6935 this test will randomly fail. Error messages captured by @knizhnik:
Expected result
Actual result
Environment
Logs, links