
flaky test: test_neon_superuser / pg15-release cannot properly exit when replication is enabled #6969

Closed
skyzh opened this issue Feb 29, 2024 · 18 comments · Fixed by #7192
Labels: c/compute (Component: compute, excluding postgres itself), t/bug (Issue Type: Bug)

Comments

skyzh (Member) commented Feb 29, 2024

Steps to reproduce

In #6935, this test randomly fails. Error messages captured by @knizhnik:

PG:2024-02-28 12:38:59.697 GMT [308398] LOG:  logical decoding found consistent point at 0/2A8DAF8
PG:2024-02-28 12:38:59.697 GMT [308398] DETAIL:  There are no running transactions.
PG:2024-02-28 12:38:59.697 GMT [308398] STATEMENT:  START_REPLICATION SLOT "pg_16426_sync_16423_7340631415576970097" LOGICAL 0/2A8DB30 (proto_version '3', publication_names '"pub"')
PG:2024-02-28 12:38:59.697 GMT [308398] LOG:  Drop replication slot pg_replslot/pg_16426_sync_16423_7340631415576970097
PG:2024-02-28 12:38:59.697 GMT [308398] STATEMENT:  DROP_REPLICATION_SLOT pg_16426_sync_16423_7340631415576970097 WAIT
PG:2024-02-28 12:39:00.198 GMT [308143] LOG:  received fast shutdown request
PG:2024-02-28 12:39:00.198 GMT [308143] LOG:  aborting any active transactions
PG:2024-02-28 12:39:00.198 GMT [308162] FATAL:  terminating connection due to administrator command
PG:2024-02-28 12:39:00.198 GMT [308152] FATAL:  terminating background worker "WAL proposer" due to administrator command
PG:2024-02-28 12:39:00.198 GMT [308151] FATAL:  terminating background worker "Logical replication monitor" due to administrator command
PG:2024-02-28 12:39:00.199 GMT [308143] LOG:  background worker "Logical replication monitor" (PID 308151) exited with exit code 1
PG:2024-02-28 12:39:00.199 GMT [308143] LOG:  background worker "WAL proposer" (PID 308152) exited with exit code 1
PG:2024-02-28 12:39:00.199 GMT [308143] LOG:  background worker "logical replication launcher" (PID 308153) exited with exit code 1
PG:2024-02-28 12:39:00.199 GMT [308146] LOG:  shutting down
2024-02-28T12:39:00.691236Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-28T12:39:00.691249Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }


skyzh added the t/bug (Issue Type: Bug) and c/compute (Component: compute, excluding postgres itself) labels Feb 29, 2024
skyzh self-assigned this Feb 29, 2024
skyzh (Member, Author) commented Feb 29, 2024

Took a quick look; this seems to be caused by the server not being able to shut down when logical replication is enabled. It mostly fails on PG15. cc @save-buffer

skyzh (Member, Author) commented Feb 29, 2024

Or not related to logical replication... sorry for tagging

skyzh (Member, Author) commented Feb 29, 2024

Normal shutdown log:

PG:2024-02-29 15:18:06.771 GMT [157284] LOG:  received fast shutdown request
PG:2024-02-29 15:18:06.771 GMT [157284] LOG:  aborting any active transactions
PG:2024-02-29 15:18:06.771 GMT [157499] FATAL:  terminating connection due to administrator command
PG:2024-02-29 15:18:06.771 GMT [157295] FATAL:  terminating background worker "WAL proposer" due to administrator command
PG:2024-02-29 15:18:06.771 GMT [157294] FATAL:  terminating background worker "Logical replication monitor" due to administrator command
PG:2024-02-29 15:18:06.772 GMT [157284] LOG:  background worker "Logical replication monitor" (PID 157294) exited with exit code 1
PG:2024-02-29 15:18:06.772 GMT [157284] LOG:  background worker "WAL proposer" (PID 157295) exited with exit code 1
PG:2024-02-29 15:18:06.772 GMT [157284] LOG:  background worker "logical replication launcher" (PID 157296) exited with exit code 1
PG:2024-02-29 15:18:06.773 GMT [157289] LOG:  shutting down
PG:2024-02-29 15:18:06.783 GMT [157289] LOG:  checkpoint starting: shutdown immediate
PG:2024-02-29 15:18:06.784 GMT [157289] LOG:  checkpoint complete: wrote 10 buffers (7.8%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.001 s, sync=0.001 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=22460 kB, estimate=22460 kB
PG:2024-02-29 15:18:06.787 GMT [157284] LOG:  database system is shut down
2024-02-29T15:18:06.789135Z  INFO Postgres exited with code exit status: 0, shutting down
2024-02-29T15:18:06.789189Z  INFO syncing safekeepers on shutdown
PG:2024-02-29 15:18:06.791 GMT [159966] LOG:  skipping missing configuration file "/tmp/test_output/test_neon_superuser[debug-pg15]-2/repo/endpoints/ep-1/pgdata/compute_ctl_temp_override.conf"
PG:2024-02-29 15:18:06.791 GMT [159966] LOG:  skipping missing configuration file "/tmp/test_output/test_neon_superuser[debug-pg15]-2/repo/endpoints/ep-1/pgdata/compute_ctl_temp_override.conf"
PG:2024-02-29 15:18:06.798 GMT [159966] LOG:  Missing NEON_CONTROL_PLANE_TOKEN environment variable, forwarding will not be authenticated
PG:2024-02-29 15:18:06.798 GMT [159966] LOG:  [WP] connecting with node 127.0.0.1:28441
PG:2024-02-29 15:18:06.799 GMT [159966] LOG:  [WP] connected with node 127.0.0.1:28441
PG:2024-02-29 15:18:06.799 GMT [159966] LOG:  [WP] received AcceptorGreeting from safekeeper 127.0.0.1:28441, term=2
PG:2024-02-29 15:18:06.799 GMT [159966] LOG:  [WP] proposer connected to quorum (1) safekeepers, propTerm=3
PG:2024-02-29 15:18:06.799 GMT [159966] LOG:  [WP] requesting vote from 127.0.0.1:28441 for term 3
PG:2024-02-29 15:18:06.800 GMT [159966] LOG:  [WP] got VoteResponse from acceptor 127.0.0.1:28441, voteGiven=1, epoch=2, flushLsn=0/2A8E0F8, truncateLsn=0/2A8E0F8, timelineStartLsn=0/149F0D8
PG:2024-02-29 15:18:06.800 GMT [159966] LOG:  [WP] got votes from majority (1) of nodes, term 3, epochStartLsn 0/2A8E0F8, donor 127.0.0.1:28441, truncate_lsn 0/2A8E0F8
2024-02-29T15:18:06.801316Z  INFO synced safekeepers at lsn 0/2A8E0F8
2024-02-29T15:18:06.801347Z  INFO checking for core dumps in /tmp/test_output/test_neon_superuser[debug-pg15]-2/repo/endpoints/ep-1/pgdata
2024-02-29T15:18:06.801423Z  INFO shutting down tracing
OpenTelemetry trace error occurred. error sending request for url (http://localhost:4318/): error trying to connect: tcp connect error: Cannot assign requested address (os error 99)
2024-02-29T15:18:06.803573Z  INFO shutting down

Abnormal shutdown logs:

PG:2024-02-29 14:18:02.817 GMT [207739] LOG:  received fast shutdown request
PG:2024-02-29 14:18:02.817 GMT [207739] LOG:  aborting any active transactions
PG:2024-02-29 14:18:02.817 GMT [207958] FATAL:  terminating connection due to administrator command
PG:2024-02-29 14:18:02.817 GMT [207751] FATAL:  terminating background worker "WAL proposer" due to administrator command
PG:2024-02-29 14:18:02.817 GMT [207750] FATAL:  terminating background worker "Logical replication monitor" due to administrator command
PG:2024-02-29 14:18:02.818 GMT [207739] LOG:  background worker "WAL proposer" (PID 207751) exited with exit code 1
PG:2024-02-29 14:18:02.818 GMT [207739] LOG:  background worker "logical replication launcher" (PID 207752) exited with exit code 1
PG:2024-02-29 14:18:02.818 GMT [207739] LOG:  background worker "Logical replication monitor" (PID 207750) exited with exit code 1
2024-02-29T14:18:02.855588Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:02.855629Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
PG:2024-02-29 14:18:02.819 GMT [207745] LOG:  shutting down
2024-02-29T14:18:03.355814Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:03.355857Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:03.856089Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:03.856149Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:04.356371Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:04.356432Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:04.856613Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:04.856652Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:05.356837Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:05.356880Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:05.857099Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:05.857153Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:06.357359Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:06.357409Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:06.857606Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:06.857656Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:07.357837Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:07.357875Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:07.858070Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:07.858108Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:08.358300Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:08.358336Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:08.858516Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:08.858558Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:09.358756Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:09.358810Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:09.858971Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:09.859004Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:10.359204Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:10.359246Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:10.859441Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:10.859497Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:11.359679Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:11.359719Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:11.859911Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:11.859954Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:12.360136Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:12.360170Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:12.860359Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:12.863613Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:13.363830Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:13.363890Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:13.864065Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:13.864109Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:14.364295Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:14.364342Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:14.864540Z ERROR could not get backends state change: could not query backends: connection closed
2024-02-29T14:18:14.864590Z  WARN failed to get list of walsenders: Error { kind: Closed, cause: None }
2024-02-29T14:18:15.364767Z ERROR could not get backends state change: could not query backends: connection closed

The shutdown process takes about a minute. It seems that Postgres did not clean up all of its processes, and therefore compute_ctl does not consider it shut down.

2024-02-29 15:16:53.855 INFO [neon_fixtures.py:1429] Running command "/tmp/neon/bin/neon_local endpoint stop --mode=fast ep-1"
2024-02-29 15:17:55.28  INFO [neon_fixtures.py:1495] Run ['/tmp/neon/bin/neon_local', 'endpoint', 'stop', '--mode=fast', 'ep-1'] failed:

skyzh added a commit that referenced this issue Feb 29, 2024
ref #6969

Signed-off-by: Alex Chi Z <chi@neon.tech>
knizhnik (Contributor) commented:

Why do we run the monitor for CI tests at all?
This seems to be the only test where Neon is used as a subscriber in logical replication (the other tests use vanilla Postgres).
We prevent suspension (shutdown) of the compute node while LR is active. Could that be the source of the problem: the monitor prevents termination of the subscriber node?

skyzh (Member, Author) commented Feb 29, 2024

The problem is that postgres never exits. In compute_ctl.rs,

    // Wait for the child Postgres process forever. In this state Ctrl+C will
    // propagate to Postgres and it will be shut down as well.
    if let Some((mut pg, logs_handle)) = pg {
        // Startup is finished, exit the startup tracing span
        drop(startup_context_guard);

        let ecode = pg
            .wait()
            .expect("failed to start waiting on Postgres process");
        PG_PID.store(0, Ordering::SeqCst);

It's stuck at pg.wait().
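
For illustration, here is a minimal sketch of the difference between the unconditional pg.wait() above and a bounded wait. This is not the actual compute_ctl code; the command, timeout, and function name are hypothetical. With Postgres hanging on shutdown, even a polling loop like this would simply hit its deadline instead of returning:

```rust
use std::process::{Child, Command, ExitStatus};
use std::time::{Duration, Instant};

/// Poll the child instead of blocking forever; return its exit status,
/// or None if it is still alive when the deadline passes.
fn wait_with_deadline(child: &mut Child, deadline: Duration) -> std::io::Result<Option<ExitStatus>> {
    let start = Instant::now();
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status)); // Postgres exited on its own
        }
        if start.elapsed() >= deadline {
            return Ok(None); // still running: the situation this issue describes
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}

fn main() -> std::io::Result<()> {
    // Stand-in for the postgres child process (hypothetical command).
    let mut child = Command::new("sleep").arg("120").spawn()?;
    match wait_with_deadline(&mut child, Duration::from_secs(2))? {
        Some(status) => println!("child exited: {status}"),
        None => {
            println!("child did not exit in time; killing it");
            child.kill()?;
            child.wait()?; // reap the killed process
        }
    }
    Ok(())
}
```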

skyzh (Member, Author) commented Feb 29, 2024

The monitor is working as expected. compute_ctl exits using std::process::exit, so everything will be terminated.
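
A tiny illustration (not compute_ctl code) of why exiting via std::process::exit tears everything down: the call ends the whole process immediately, without joining background threads or running destructors, so a monitor thread cannot keep the process alive.

```rust
use std::{process, thread, time::Duration};

fn main() {
    // A background "monitor" thread that would otherwise run forever.
    thread::spawn(|| loop {
        println!("monitor tick");
        thread::sleep(Duration::from_millis(100));
    });

    thread::sleep(Duration::from_millis(250));
    // Ends the process immediately: background threads are not joined
    // and destructors do not run.
    process::exit(0);
}
```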

andreasscherbaum (Contributor) commented:

@skyzh Please add the details about the root cause.

This is realistically not a problem in production, until and unless we suspend compute with logical replication.

skyzh (Member, Author) commented Mar 5, 2024

A summary of the root cause: if there is a replication job in pg15, it won't exit by itself, and the postmaster waits forever.

This does not cause any problems for us in prod because we don't suspend the compute node if logical replication is enabled. If we really want to stop it, Kubernetes will (possibly) force-kill it after a timeout, so it can always be stopped in prod.

The only problem is that stopping may take a long time, and during that time users cannot connect to the database.

Mitigation: #6975, which closes the replication. This allows us to pass the test case.

If we need to solve this issue, we will need to look into our pg15 source code modifications and how they correlate with this seemingly unrelated pull request: #6935

skyzh added a commit that referenced this issue Mar 6, 2024
This pull request mitigates
#6969, but the longer-term
problem is that we cannot properly stop Postgres if there is a
subscription.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
skyzh changed the title from "flaky test: test_neon_superuser" to "flaky test: test_neon_superuser / pg15-release cannot properly exit when replication is enabled" Mar 6, 2024
skyzh (Member, Author) commented Mar 18, 2024

The issue can be reproduced on pg14-debug and pg15-debug after inserting a few log statements into the postmaster process on macOS.

skyzh (Member, Author) commented Mar 18, 2024

The process is stuck in WalSndWaitStopping

skyzh (Member, Author) commented Mar 18, 2024

and walsnd->state == WALSNDSTATE_STREAMING

skyzh (Member, Author) commented Mar 18, 2024

PG:2024-03-18 18:45:36.217 GMT [38518] LOG:  standby "walproposer" is now a synchronous standby with priority 1
PG:2024-03-18 18:45:38.939 GMT [38512] LOG:  still waiting walsnd=0, state=3, pid=38518

Okay, the problem seems to be with safekeepers. The walproposer does not exit when there are logical replication slots open.
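
For context, here is a rough model of what the shutdown path is doing in the "still waiting walsnd=0, state=3" log above. This is purely illustrative Rust with made-up names; the real logic lives in Postgres's C code. The shutdown path keeps polling until every walsender has left the streaming state, so a walsender that never advances past WALSNDSTATE_STREAMING (state=3) keeps the loop spinning and the postmaster never finishes shutting down:

```rust
// Purely illustrative model of the "wait for walsenders to stop" loop.
#[derive(Clone, Copy, PartialEq, Debug)]
enum WalSndState {
    Streaming, // "state=3" in the log above (WALSNDSTATE_STREAMING)
    Stopping,  // the state every walsender must reach before shutdown proceeds
}

fn wait_for_walsenders(walsenders: &mut Vec<WalSndState>) {
    let mut polls = 0;
    while walsenders.iter().any(|s| *s == WalSndState::Streaming) {
        println!("still waiting, states = {:?}", walsenders);
        std::thread::sleep(std::time::Duration::from_millis(100));
        polls += 1;
        // Simulate the walsender eventually honoring the shutdown request.
        // In the failing test this transition never happens, so the real
        // loop spins forever and compute_ctl's pg.wait() never returns.
        if polls >= 3 {
            for s in walsenders.iter_mut() {
                *s = WalSndState::Stopping;
            }
        }
    }
    println!("all walsenders stopped, shutdown can proceed");
}

fn main() {
    let mut walsenders = vec![WalSndState::Streaming];
    wait_for_walsenders(&mut walsenders);
}
```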

skyzh (Member, Author) commented Mar 18, 2024

It seems that the safekeeper did not receive SIGUSR2 (got_SIGUSR2=false), so it does not exit. Is signal handling in pg15 different from other versions? (pg14 is just flaky and sometimes passes; not sure if the root cause is the same.)

arssher (Contributor) commented Mar 19, 2024

Interesting, what is the best way to reproduce? "Inserting a few log statements into the postmaster process on macOS" -- in which place? #6975 is merged, so it happens on main now, right?

Or, alternatively, do you have full logs for a fresh failure?

arssher (Contributor) commented Mar 19, 2024

> This does not cause any problems for us in prod because we don't suspend the compute node if logical replication is enabled.

This "don't suspend the compute node if logical replication is enabled" logic doesn't prevent a stop once we have actually decided to stop, i.e. SIGTERMed the postmaster, so the issue here is something different.

skyzh (Member, Author) commented Mar 19, 2024

@arssher remove the last SQL statement (drop subscription) from test_neon_superuser, and this seems to be stably reproducible on macOS.

This is the log mixed with my own printfs...

compute.log

skyzh added a commit that referenced this issue Mar 20, 2024
errno is not preserved in the signal handler. This pull request fixes it.
Maybe related: #6969, but it does not fix the flaky test problem.

Signed-off-by: Alex Chi Z <chi@neon.tech>
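
For context, the pattern that commit refers to is the classic "save errno on entry to a signal handler, restore it before returning". Here is a hedged sketch of the idea using the libc crate on Linux; the actual fix is in Neon's C code, and this handler is made up for illustration:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static GOT_SIGUSR1: AtomicBool = AtomicBool::new(false);

unsafe extern "C" fn on_sigusr1(_sig: libc::c_int) {
    // errno belongs to the code that was interrupted; save it on entry and
    // restore it before returning so the interrupted code never sees a
    // value clobbered by work done inside the handler.
    let saved_errno = unsafe { *libc::__errno_location() };
    GOT_SIGUSR1.store(true, Ordering::SeqCst); // async-signal-safe work only
    unsafe { *libc::__errno_location() = saved_errno };
}

fn main() {
    let handler: unsafe extern "C" fn(libc::c_int) = on_sigusr1;
    unsafe {
        libc::signal(libc::SIGUSR1, handler as libc::sighandler_t);
        libc::raise(libc::SIGUSR1);
    }
    assert!(GOT_SIGUSR1.load(Ordering::SeqCst));
}
```

The C equivalent is the familiar `int save_errno = errno; ... errno = save_errno;` pair at the top and bottom of the handler, which Postgres uses throughout its own signal handlers.
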
skyzh (Member, Author) commented Mar 20, 2024

After looking into it a little bit: the system is actually stuck on the walsender instead of the walproposer (so it's unrelated to the safekeeper team, haha). It seems to be related to our modifications to postgres for logical replication persistence -> neondatabase/postgres#395

skyzh added a commit that referenced this issue Mar 28, 2024
Fix #6969

Ref neondatabase/postgres#395
neondatabase/postgres#396

Postgres gets stuck on exit if the replication slot is not dropped
before shutting down. This is caused by Neon's custom WAL record for
recording replication slots. The pull requests in the postgres repo fix
the problem, and this pull request bumps the postgres commit.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
andreasscherbaum (Contributor) commented:

Still in progress, to be released this week.
