
flaky test: test_neon_superuser / pg15-release cannot properly exit when replication is enabled #6969

Closed
skyzh opened this issue Feb 29, 2024 · 18 comments · Fixed by #7192
Labels: c/compute (Component: compute, excluding postgres itself), t/bug (Issue Type: Bug)

Comments

skyzh (Member) commented Feb 29, 2024

Steps to reproduce

In #6935, this test randomly fails. Error messages captured by @knizhnik:

PG:2024-02-28 12:38:59.697 GMT [308398] LOG:  logical decoding found consistent point at 0/2A8DAF8
PG:2024-02-28 12:38:59.697 GMT [308398] DETAIL:  There are no running transactions.
PG:2024-02-28 12:38:59.697 GMT [308398] STATEMENT:  START_REPLICATION SLOT "pg_16426_sync_16423_7340631415576970097" LOGICAL 0/2A8DB30 (proto_version '3', publication_names '"pub"')
PG:2024-02-28 12:38:59.697 GMT [308398] LOG:  Drop replication slot pg_replslot/pg_16426_sync_16423_7340631415576970097
PG:2024-02-28 12:38:59.697 GMT [308398] STATEMENT:  DROP_REPLICATION_SLOT pg_16426_sync_16423_7340631415576970097 WAIT
PG:2024-02-28 12:39:00.198 GMT [308143] LOG:  received fast shutdown request
PG:2024-02-28 12:39:00.198 GMT [308143] LOG:  aborting any active transactions
PG:2024-02-28 12:39:00.198 GMT [308162] FATAL:  terminating connection due to administrator command
PG:2024-02-28 12:39:00.198 GMT [308152] FATAL:  terminating background worker "WAL proposer" due to administrator command
PG:2024-02-28 12:39:00.198 GMT [308151] FATAL:  terminating background worker "Logical replication monitor" due to administrator command
PG:2024-02-28 12:39:00.199 GMT [308143] LOG:  background worker "Logical replication monitor" (PID 308151) exited with exit code 1
PG:2024-02-28 12:39:00.199 GMT [308143] LOG:  background worker "WAL proposer" (PID 308152) exited with exit code 1
PG:2024-02-28 12:39:00.199 GMT [308143] LOG:  background worker "logical replication launcher" (PID 308153) exited with exit code 1
PG:2024-02-28 12:39:00.199 GMT [308146] LOG:  shutting down
2024-02-28T12:39:00.691236Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-28T12:39:00.691249Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }


skyzh added the t/bug (Issue Type: Bug) and c/compute (Component: compute, excluding postgres itself) labels Feb 29, 2024
skyzh self-assigned this Feb 29, 2024
skyzh (Member, Author) commented Feb 29, 2024

Took a quick look; this seems to be caused by the server not being able to shut down when logical replication is enabled. It mostly fails on PG15. cc @save-buffer

skyzh (Member, Author) commented Feb 29, 2024

Or not related to logical replication... sorry for tagging

skyzh (Member, Author) commented Feb 29, 2024

Normal shutdown log:

PG:2024-02-29 15:18:06.771 GMT [157284] LOG:  received fast shutdown request
PG:2024-02-29 15:18:06.771 GMT [157284] LOG:  aborting any active transactions
PG:2024-02-29 15:18:06.771 GMT [157499] FATAL:  terminating connection due to administrator command
PG:2024-02-29 15:18:06.771 GMT [157295] FATAL:  terminating background worker "WAL proposer" due to administrator command
PG:2024-02-29 15:18:06.771 GMT [157294] FATAL:  terminating background worker "Logical replication monitor" due to administrator command
PG:2024-02-29 15:18:06.772 GMT [157284] LOG:  background worker "Logical replication monitor" (PID 157294) exited with exit code 1
PG:2024-02-29 15:18:06.772 GMT [157284] LOG:  background worker "WAL proposer" (PID 157295) exited with exit code 1
PG:2024-02-29 15:18:06.772 GMT [157284] LOG:  background worker "logical replication launcher" (PID 157296) exited with exit code 1
PG:2024-02-29 15:18:06.773 GMT [157289] LOG:  shutting down
PG:2024-02-29 15:18:06.783 GMT [157289] LOG:  checkpoint starting: shutdown immediate
PG:2024-02-29 15:18:06.784 GMT [157289] LOG:  checkpoint complete: wrote 10 buffers (7.8%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.001 s, sync=0.001 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=22460 kB, estimate=22460 kB
PG:2024-02-29 15:18:06.787 GMT [157284] LOG:  database system is shut down
2024-02-29T15:18:06.789135Z  INFO Postgres exited with code exit status: 0, shutting down
2024-02-29T15:18:06.789189Z  INFO syncing safekeepers on shutdown
PG:2024-02-29 15:18:06.791 GMT [159966] LOG:  skipping missing configuration file "/tmp/test_output/test_neon_superuser[debug-pg15]-2/repo/endpoints/ep-1/pgdata/compute_ctl_temp_override.conf"
PG:2024-02-29 15:18:06.791 GMT [159966] LOG:  skipping missing configuration file "/tmp/test_output/test_neon_superuser[debug-pg15]-2/repo/endpoints/ep-1/pgdata/compute_ctl_temp_override.conf"
PG:2024-02-29 15:18:06.798 GMT [159966] LOG:  Missing NEON_CONTROL_PLANE_TOKEN environment variable, forwarding will not be authenticated
PG:2024-02-29 15:18:06.798 GMT [159966] LOG:  [WP] connecting with node 127.0.0.1:28441
PG:2024-02-29 15:18:06.799 GMT [159966] LOG:  [WP] connected with node 127.0.0.1:28441
PG:2024-02-29 15:18:06.799 GMT [159966] LOG:  [WP] received AcceptorGreeting from safekeeper 127.0.0.1:28441, term=2
PG:2024-02-29 15:18:06.799 GMT [159966] LOG:  [WP] proposer connected to quorum (1) safekeepers, propTerm=3
PG:2024-02-29 15:18:06.799 GMT [159966] LOG:  [WP] requesting vote from 127.0.0.1:28441 for term 3
PG:2024-02-29 15:18:06.800 GMT [159966] LOG:  [WP] got VoteResponse from acceptor 127.0.0.1:28441, voteGiven=1, epoch=2, flushLsn=0/2A8E0F8, truncateLsn=0/2A8E0F8, timelineStartLsn=0/149F0D8
PG:2024-02-29 15:18:06.800 GMT [159966] LOG:  [WP] got votes from majority (1) of nodes, term 3, epochStartLsn 0/2A8E0F8, donor 127.0.0.1:28441, truncate_lsn 0/2A8E0F8
2024-02-29T15:18:06.801316Z  INFO synced safekeepers at lsn 0/2A8E0F8
2024-02-29T15:18:06.801347Z  INFO checking for core dumps in /tmp/test_output/test_neon_superuser[debug-pg15]-2/repo/endpoints/ep-1/pgdata
2024-02-29T15:18:06.801423Z  INFO shutting down tracing
OpenTelemetry trace error occurred. error sending request for url (http://localhost:4318/): error trying to connect: tcp connect error: Cannot assign requested address (os error 99)
2024-02-29T15:18:06.803573Z  INFO shutting down

Abnormal shutdown logs:

PG:2024-02-29 14:18:02.817 GMT [207739] LOG:  received fast shutdown request
PG:2024-02-29 14:18:02.817 GMT [207739] LOG:  aborting any active transactions
PG:2024-02-29 14:18:02.817 GMT [207958] FATAL:  terminating connection due to administrator command
PG:2024-02-29 14:18:02.817 GMT [207751] FATAL:  terminating background worker "WAL proposer" due to administrator command
PG:2024-02-29 14:18:02.817 GMT [207750] FATAL:  terminating background worker "Logical replication monitor" due to administrator command
PG:2024-02-29 14:18:02.818 GMT [207739] LOG:  background worker "WAL proposer" (PID 207751) exited with exit code 1
PG:2024-02-29 14:18:02.818 GMT [207739] LOG:  background worker "logical replication launcher" (PID 207752) exited with exit code 1
PG:2024-02-29 14:18:02.818 GMT [207739] LOG:  background worker "Logical replication monitor" (PID 207750) exited with exit code 1
2024-02-29T14:18:02.855588Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:02.855629Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
PG:2024-02-29 14:18:02.819 GMT [207745] LOG:  shutting down
2024-02-29T14:18:03.355814Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:03.355857Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:03.856089Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:03.856149Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:04.356371Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:04.356432Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:04.856613Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:04.856652Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:05.356837Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:05.356880Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:05.857099Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:05.857153Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:06.357359Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:06.357409Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:06.857606Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:06.857656Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:07.357837Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:07.357875Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:07.858070Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:07.858108Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:08.358300Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:08.358336Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:08.858516Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:08.858558Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:09.358756Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:09.358810Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:09.858971Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:09.859004Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:10.359204Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:10.359246Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:10.859441Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:10.859497Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:11.359679Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:11.359719Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:11.859911Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:11.859954Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:12.360136Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:12.360170Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:12.860359Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:12.863613Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:13.363830Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:13.363890Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:13.864065Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:13.864109Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:14.364295Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:14.364342Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:14.864540Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:14.864590Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:15.364767Z ERROR could not get backends state change: could not query backends: connection closed

The shutdown process takes about a minute. It seems that Postgres did not clean up all of its processes, and therefore compute_ctl does not consider it shut down.

2024-02-29 15:16:53.855 INFO [neon_fixtures.py:1429] Running command "/tmp/neon/bin/neon_local endpoint stop --mode=fast ep-1"
2024-02-29 15:17:55.28  INFO [neon_fixtures.py:1495] Run ['/tmp/neon/bin/neon_local', 'endpoint', 'stop', '--mode=fast', 'ep-1'] failed:

skyzh added a commit that referenced this issue Feb 29, 2024
ref #6969

Signed-off-by: Alex Chi Z <chi@neon.tech>
knizhnik (Contributor) commented:

Why do we run the monitor for CI tests at all?
This seems to be the only test where Neon is used as a subscriber in logical replication (the other tests use vanilla Postgres).
We prevent suspension (shutdown) of the compute node while LR is active. Could that be the source of the problem: the monitor prevents termination of the subscriber node?

skyzh (Member, Author) commented Feb 29, 2024

The problem is that postgres never exits. In compute_ctl.rs,

    // Wait for the child Postgres process forever. In this state Ctrl+C will
    // propagate to Postgres and it will be shut down as well.
    if let Some((mut pg, logs_handle)) = pg {
        // Startup is finished, exit the startup tracing span
        drop(startup_context_guard);

        let ecode = pg
            .wait()
            .expect("failed to start waiting on Postgres process");
        PG_PID.store(0, Ordering::SeqCst);

It's stuck at pg.wait().
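
For illustration, here is a minimal sketch of the difference between the unconditional pg.wait() above and a bounded wait. This is not the actual compute_ctl code; the command, timeout, and function name are hypothetical. With Postgres hanging on shutdown, even a polling loop like this would simply hit its deadline instead of returning:

```rust
use std::process::{Child, Command, ExitStatus};
use std::time::{Duration, Instant};

/// Poll the child instead of blocking forever; return its exit status,
/// or None if it is still alive when the deadline passes.
fn wait_with_deadline(child: &mut Child, deadline: Duration) -> std::io::Result<Option<ExitStatus>> {
    let start = Instant::now();
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status)); // Postgres exited on its own
        }
        if start.elapsed() >= deadline {
            return Ok(None); // still running: the situation this issue describes
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}

fn main() -> std::io::Result<()> {
    // Stand-in for the postgres child process (hypothetical command).
    let mut child = Command::new("sleep").arg("120").spawn()?;
    match wait_with_deadline(&mut child, Duration::from_secs(2))? {
        Some(status) => println!("child exited: {status}"),
        None => {
            println!("child did not exit in time; killing it");
            child.kill()?;
            child.wait()?; // reap the killed process
        }
    }
    Ok(())
}
```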

skyzh (Member, Author) commented Feb 29, 2024

The monitor is working as expected. compute_ctl exits using std::process::exit, so everything will be terminated.
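
A tiny illustration (not compute_ctl code) of why exiting via std::process::exit tears everything down: the call ends the whole process immediately, without joining background threads or running destructors, so a monitor thread cannot keep the process alive.

```rust
use std::{process, thread, time::Duration};

fn main() {
    // A background "monitor" thread that would otherwise run forever.
    thread::spawn(|| loop {
        println!("monitor tick");
        thread::sleep(Duration::from_millis(100));
    });

    thread::sleep(Duration::from_millis(250));
    // Ends the process immediately: background threads are not joined
    // and destructors do not run.
    process::exit(0);
}
```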

andreasscherbaum (Contributor) commented:

@skyzh Please add the details about the root cause.

This is realistically not a problem in production, until and unless we suspend compute with logical replication.

skyzh (Member, Author) commented Mar 5, 2024

A summary of the root cause: if there is a replication job in pg15, it won't exit by itself, and the postmaster waits forever.

This does not cause any problems for us in prod because we don't suspend the compute node if logical replication is enabled. If we really want to stop it, Kubernetes will (possibly) force-kill it after a timeout, so it can always be stopped in prod.

The only problem is that stopping may take a long time, and during that time users cannot connect to the database.

Mitigation: #6975, which closes the replication. This allows us to pass the test case.

If we need to solve this issue, we will need to look into our pg15 source code modifications and how they correlate with this seemingly unrelated pull request: #6935

skyzh added a commit that referenced this issue Mar 6, 2024
This pull request mitigates
#6969, but the longer-term
problem is that we cannot properly stop Postgres if there is a
subscription.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
skyzh changed the title from "flaky test: test_neon_superuser" to "flaky test: test_neon_superuser / pg15-release cannot properly exit when replication is enabled" Mar 6, 2024
skyzh (Member, Author) commented Mar 18, 2024

The issue can be reproduced on pg14-debug and pg15-debug after inserting a few log statements into the postmaster process on macOS.

skyzh (Member, Author) commented Mar 18, 2024

The process is stuck in WalSndWaitStopping

skyzh (Member, Author) commented Mar 18, 2024

and walsnd->state == WALSNDSTATE_STREAMING

skyzh (Member, Author) commented Mar 18, 2024

PG:2024-03-18 18:45:36.217 GMT [38518] LOG:  standby "walproposer" is now a synchronous standby with priority 1
PG:2024-03-18 18:45:38.939 GMT [38512] LOG:  still waiting walsnd=0, state=3, pid=38518

Okay, the problem seems to be with safekeepers. The walproposer does not exit when there are logical replication slots open.
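
For context, here is a rough model of what the shutdown path is doing in the "still waiting walsnd=0, state=3" log above. This is purely illustrative Rust with made-up names; the real logic lives in Postgres's C code. The shutdown path keeps polling until every walsender has left the streaming state, so a walsender that never advances past WALSNDSTATE_STREAMING (state=3) keeps the loop spinning and the postmaster never finishes shutting down:

```rust
// Purely illustrative model of the "wait for walsenders to stop" loop.
#[derive(Clone, Copy, PartialEq, Debug)]
enum WalSndState {
    Streaming, // "state=3" in the log above (WALSNDSTATE_STREAMING)
    Stopping,  // the state every walsender must reach before shutdown proceeds
}

fn wait_for_walsenders(walsenders: &mut Vec<WalSndState>) {
    let mut polls = 0;
    while walsenders.iter().any(|s| *s == WalSndState::Streaming) {
        println!("still waiting, states = {:?}", walsenders);
        std::thread::sleep(std::time::Duration::from_millis(100));
        polls += 1;
        // Simulate the walsender eventually honoring the shutdown request.
        // In the failing test this transition never happens, so the real
        // loop spins forever and compute_ctl's pg.wait() never returns.
        if polls >= 3 {
            for s in walsenders.iter_mut() {
                *s = WalSndState::Stopping;
            }
        }
    }
    println!("all walsenders stopped, shutdown can proceed");
}

fn main() {
    let mut walsenders = vec![WalSndState::Streaming];
    wait_for_walsenders(&mut walsenders);
}
```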

skyzh (Member, Author) commented Mar 18, 2024

It seems that the safekeeper did not receive SIGUSR2 (got_SIGUSR2=false), so it does not exit. Is signal handling in pg15 different from other versions? (pg14 is just flaky and sometimes passes; not sure if the root cause is the same.)

arssher (Contributor) commented Mar 19, 2024

Interesting, what is the best way to reproduce? "Inserting a few log statements into the postmaster process on macOS" -- in which place? #6975 is merged, so it happens on main now, right?

Or, alternatively, do you have full logs for a fresh failure?

arssher (Contributor) commented Mar 19, 2024

> This does not cause any problems for us in prod because we don't suspend the compute node if logical replication is enabled.

This "don't suspend the compute node if logical replication is enabled" logic doesn't prevent a stop once we have actually decided to stop, i.e. SIGTERMed the postmaster, so the issue here is something different.

skyzh (Member, Author) commented Mar 19, 2024

@arssher remove the last SQL statement (drop subscription) from test_neon_superuser, and this seems to be stably reproducible on macOS.

This is the log mixed with my own printfs...

compute.log

skyzh added a commit that referenced this issue Mar 20, 2024
errno is not preserved in the signal handler. This pull request fixes it.
Maybe related: #6969, but it does not fix the flaky test problem.

Signed-off-by: Alex Chi Z <chi@neon.tech>
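
For context, the pattern that commit refers to is the classic "save errno on entry to a signal handler, restore it before returning". Here is a hedged sketch of the idea using the libc crate on Linux; the actual fix is in Neon's C code, and this handler is made up for illustration:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static GOT_SIGUSR1: AtomicBool = AtomicBool::new(false);

unsafe extern "C" fn on_sigusr1(_sig: libc::c_int) {
    // errno belongs to the code that was interrupted; save it on entry and
    // restore it before returning so the interrupted code never sees a
    // value clobbered by work done inside the handler.
    let saved_errno = unsafe { *libc::__errno_location() };
    GOT_SIGUSR1.store(true, Ordering::SeqCst); // async-signal-safe work only
    unsafe { *libc::__errno_location() = saved_errno };
}

fn main() {
    let handler: unsafe extern "C" fn(libc::c_int) = on_sigusr1;
    unsafe {
        libc::signal(libc::SIGUSR1, handler as libc::sighandler_t);
        libc::raise(libc::SIGUSR1);
    }
    assert!(GOT_SIGUSR1.load(Ordering::SeqCst));
}
```

The C equivalent is the familiar `int save_errno = errno; ... errno = save_errno;` pair at the top and bottom of the handler, which Postgres uses throughout its own signal handlers.
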
skyzh (Member, Author) commented Mar 20, 2024

After looking into it a little bit: the system is actually stuck on the walsender instead of the walproposer (so it's unrelated to the safekeeper team, haha). It seems to be related to our modifications to postgres for logical replication persistence -> neondatabase/postgres#395

skyzh added a commit that referenced this issue Mar 28, 2024
Fix #6969

Ref neondatabase/postgres#395
neondatabase/postgres#396

Postgres gets stuck on exit if the replication slot is not dropped
before shutting down. This is caused by Neon's custom WAL record for
recording replication slots. The pull requests in the postgres repo fix
the problem, and this pull request bumps the postgres commit.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
andreasscherbaum (Contributor) commented:

Still in progress, to be released this week.
