[bitnami/postgresql-repmgr] always does a full resync of the database on standby node when coming up #52213

mzealey · 2023-10-27T11:55:28Z

Name and Version

bitnami/postgresql-repmgr:16.0.0-debian-11-r11

What architecture are you using?

amd64

What steps will reproduce the bug?

We have a pair of postgresql-repmgr instances running with the following config:

                "POSTGRESQL_PASSWORD=xx",
                "REPMGR_PASSWORD=xx",
                "REPMGR_PRIMARY_HOST=pubsub-01",
                "REPMGR_PRIMARY_PORT=5432",
                "REPMGR_PARTNER_NODES=pubsub-01,pubsub-02:5432",
                "REPMGR_NODE_NAME=pubsub-01",
                "REPMGR_NODE_NETWORK_NAME=pubsub-01",
                "REPMGR_PORT_NUMBER=5432",
                "REPMGR_USE_PGREWIND=yes",
                "POSTGRESQL_WAL_LEVEL=logical",
                "BITNAMI_DEBUG=true",

The same on the other instance. I have tried setting REPMGR_USE_PGREWIND to fix this issue but to no avail.

Restarting the standby instance causes the following logs:

postgresql-repmgr 11:41:28.37 
postgresql-repmgr 11:41:28.37 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 11:41:28.37 Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 11:41:28.38 Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 11:41:28.38 
postgresql-repmgr 11:41:28.39 INFO  ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 11:41:28.42 INFO  ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 11:41:28.42 INFO  ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 11:41:28.43 INFO  ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 11:41:28.44 DEBUG ==> Checking node 'pubsub-01:5432'...
postgresql-repmgr 11:41:28.50 DEBUG ==> Pretending primary role node - 'pubsub-01:5432'
postgresql-repmgr 11:41:28.50 DEBUG ==> Pretending primary set to 'pubsub-01:5432'!
postgresql-repmgr 11:41:28.51 DEBUG ==> Checking node 'pubsub-02:5432'...
psql: error: connection to server at "pubsub-02" (10.200.0.107), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
postgresql-repmgr 11:41:28.53 DEBUG ==> Skipping: failed to get primary from the node 'pubsub-02:5432'!
postgresql-repmgr 11:41:28.53 INFO  ==> Auto-detected primary node: 'pubsub-01:5432'
postgresql-repmgr 11:41:28.53 DEBUG ==> Primary node: 'pubsub-01:5432'
postgresql-repmgr 11:41:28.54 INFO  ==> Node configured as standby
postgresql-repmgr 11:41:28.55 INFO  ==> Preparing PostgreSQL configuration...
postgresql-repmgr 11:41:28.55 DEBUG ==> Injecting a new postgresql.conf file...
postgresql-repmgr 11:41:28.56 INFO  ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 11:41:28.70 DEBUG ==> Injecting a new pg_hba.conf file...
postgresql-repmgr 11:41:28.71 INFO  ==> Preparing repmgr configuration...
postgresql-repmgr 11:41:28.73 DEBUG ==> Node ID: '1002', Rol: 'standby', Primary Node: 'pubsub-01:5432'
postgresql-repmgr 11:41:28.73 INFO  ==> Initializing Repmgr...
postgresql-repmgr 11:41:28.74 INFO  ==> Waiting for primary node...
postgresql-repmgr 11:41:28.74 DEBUG ==> Wait for schema repmgr.repmgr on 'pubsub-01:5432', will try 6 times with 10 delay seconds (TIMEOUT=60)
postgresql-repmgr 11:41:28.78 DEBUG ==> Schema repmgr.repmgr exists!
postgresql-repmgr 11:41:28.78 INFO  ==> Rejoining node...
postgresql-repmgr 11:41:28.79 INFO  ==> Using pg_rewind to primary node...
postgresql-repmgr 11:41:28.79 INFO  ==> Running pg_rewind data to primary node...
pg_rewind: executing "/opt/bitnami/postgresql/bin/postgres" for target server to complete crash recovery
postgres: could not access the server configuration file "/bitnami/postgresql/data/postgresql.conf": No such file or directory
pg_rewind: error: postgres single-user mode in target cluster failed
pg_rewind: detail: Command was: /opt/bitnami/postgresql/bin/postgres --single -F -D /bitnami/postgresql/data template1 < /dev/null
postgresql-repmgr 11:41:28.85 WARN  ==> pg_rewind failed, resorting to data cloning
postgresql-repmgr 11:41:28.86 INFO  ==> Cloning data from primary node...
WARNING: following problems with command line parameters detected:
  -D/--pgdata will be ignored if a repmgr configuration file is provided
NOTICE: destination directory "/bitnami/postgresql/data" provided
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
WARNING: directory "/bitnami/postgresql/data" exists but is not empty
NOTICE: -F/--force provided - deleting existing data directory "/bitnami/postgresql/data"
NOTICE: starting backup (using pg_basebackup)...
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /bitnami/postgresql/data start
HINT: after starting the server, you need to re-register this standby with "repmgr standby register --force" to update the existing node record
[REPMGR EVENT] Node id: 1002; Event type: standby_clone; Success [1|0]: 1; Time: 2023-10-27 11:42:06.06608+00;  Details: cloned from host "pubsub-01", port 5432; backup method: pg_basebackup; --force: Y
Looking for the script: /opt/bitnami/repmgr/events/execs/standby_clone.sh
[REPMGR EVENT] no script '/opt/bitnami/repmgr/events/execs/standby_clone.sh' found. Skipping...
postgresql-repmgr 11:42:06.10 INFO  ==> Initializing PostgreSQL database...
postgresql-repmgr 11:42:06.10 DEBUG ==> Copying files from /bitnami/postgresql/conf to /opt/bitnami/postgresql/conf
postgresql-repmgr 11:42:06.11 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 11:42:06.11 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 11:42:06.12 DEBUG ==> Ensuring expected directories/files exist...
postgresql-repmgr 11:42:06.15 INFO  ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 11:42:06.19 INFO  ==> Configuring replication parameters
postgresql-repmgr 11:42:06.24 INFO  ==> Configuring fsync
postgresql-repmgr 11:42:06.27 INFO  ==> Setting up streaming replication slave...
postgresql-repmgr 11:42:06.30 INFO  ==> Starting PostgreSQL in background...

What is the expected behavior?

After a brief restart, I would not expect a full clone to be necessary.

It is also a bit strange that pg_rewind is not working, although I appreciate that this feature flag is not documented.

What do you see instead?

Per the logs above, a full pg_basebackup happens each time the container restarts.

Additional information

No response

javsalgar · 2023-10-30T08:45:08Z

Hi,

We added an experimental flag to use pg_rewind instead. You can try it so a full clone does not happen.

bitnami/charts#8933

mzealey · 2023-10-30T09:00:45Z

As I said in my original report, I have already tried this flag and it has problems of its own: see the log I posted, where pg_rewind fails because the postgres config file does not exist at the path it looks in.

Also, it seems a bit strange to me that even though nothing has happened during the restart it always wants to try to do a full resync and/or rewind?

fevisera · 2023-11-06T10:59:53Z

Hi @mzealey,

I'm currently working on reproducing the issue you have mentioned. Could you kindly share the docker-compose file that you are using?

Thanks.

github-actions · 2023-11-22T01:25:55Z

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

xtianus79 · 2023-11-27T04:35:35Z

This is still an issue; the flag causes the error below. I think I found the bug, based on the pg_rewind documentation.

postgresql-repmgr 04:00:49.65 INFO  ==> Rejoining node...
postgresql-repmgr 04:00:49.65 INFO  ==> Using pg_rewind to primary node...
postgresql-repmgr 04:00:49.65 INFO  ==> Running pg_rewind data to primary node...
pg_rewind: executing "/opt/bitnami/postgresql/bin/postgres" for target server to complete crash recovery
pg_rewind: executing "/opt/bitnami/postgresql/bin/postgres" for target server to complete crash recovery
postgres: could not access the server configuration file "/bitnami/postgresql/data/postgresql.conf": No such file or directory
pg_rewind: error: postgres single-user mode in target cluster failed
pg_rewind: detail: Command was: /opt/bitnami/postgresql/bin/postgres --single -F -D /bitnami/postgresql/data template1 < /dev/null
postgresql-repmgr 04:00:49.72 WARN  ==> pg_rewind failed, resorting to data cloning
postgresql-repmgr 04:00:49.72 INFO  ==> Cloning data from primary node...
WARNING: following problems with command line parameters detected:
  -D/--pgdata will be ignored if a repmgr configuration file is provided
NOTICE: destination directory "/bitnami/postgresql/data" provided
INFO: connecting to source node

The file location is here:

postgres=# SHOW config_file;
                 config_file
----------------------------------------------
 /opt/bitnami/postgresql/conf/postgresql.conf

Am I missing something?
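The mismatch can be made explicit with a small check. This is only a sketch: `conf_in_data_dir` is a hypothetical helper, not part of the Bitnami scripts, and the two paths are the ones shown in the logs and in the `SHOW config_file` output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the given config file sits directly
# inside the data directory, which is where the failing
# "postgres --single -D <datadir>" command in the log looked for it.
conf_in_data_dir() {
    local data_dir="${1%/}"   # drop a trailing slash, if any
    local conf_file="$2"
    [[ "$(dirname "$conf_file")" == "$data_dir" ]]
}

# Paths taken from the log and from SHOW config_file in this thread.
if conf_in_data_dir "/bitnami/postgresql/data" "/opt/bitnami/postgresql/conf/postgresql.conf"; then
    echo "config file is inside the data directory"
else
    echo "config file is outside the data directory"
fi
```

With the paths from this thread the second branch is taken, which is consistent with the "could not access the server configuration file" error.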

After some more review, I found this in the library script (librepmgr.sh):

########################
# Execute pg_rewind to get data from the primary node
# Globals:
#   REPMGR_*
# Arguments:
#   None
# Returns:
#   None
#########################
repmgr_pgrewind() {
    info "Running pg_rewind data to primary node..."
    local -r flags=("-D" "$POSTGRESQL_DATA_DIR" "--source-server" "host=${REPMGR_CURRENT_PRIMARY_HOST} port=${REPMGR_CURRENT_PRIMARY_PORT} user=${REPMGR_USERNAME} dbname=${REPMGR_DATABASE}")

    if [[ "$REPMGR_USE_PASSFILE" = "true" ]]; then
        PGPASSFILE="$REPMGR_PASSFILE_PATH" debug_execute "${POSTGRESQL_BIN_DIR}/pg_rewind" "${flags[@]}"
    else
        PGPASSWORD="$REPMGR_PASSWORD" debug_execute "${POSTGRESQL_BIN_DIR}/pg_rewind" "${flags[@]}"
    fi
}

The flags set here are just the defaults, with no option giving the location of the conf file. I think that with the current flags pg_rewind looks for postgresql.conf in the data directory and does not find the conf file that is actually in use.

local -r flags=("-D" "$POSTGRESQL_DATA_DIR" "--source-server" "host=${REPMGR_CURRENT_PRIMARY_HOST} port=${REPMGR_CURRENT_PRIMARY_PORT} user=${REPMGR_USERNAME} dbname=${REPMGR_DATABASE}")

There is an option in pg_rewind that would reference the correct conf file:

--config-file=filename
Use the specified main server configuration file for the target cluster. This affects pg_rewind when it uses internally the postgres command for the rewind operation on this cluster (when retrieving restore_command with the option -c/--restore-target-wal and when forcing a completion of crash recovery).
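Before relying on this option it may be worth confirming that the pg_rewind binary in the image actually lists it. The following is a minimal sketch; `help_mentions_config_file` is a hypothetical helper name, and the binary path in the usage comment is assumed from the `/opt/bitnami/postgresql/bin` directory seen in the logs above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "pg_rewind --help" text on stdin and succeeds
# if that text mentions the --config-file option.
help_mentions_config_file() {
    grep -q -- '--config-file'
}

# Possible use inside the container (path assumed from the logs above):
#   /opt/bitnami/postgresql/bin/pg_rewind --help | help_mentions_config_file \
#       && echo "--config-file is available"
```

The `--` after `grep -q` stops grep from treating the pattern itself as an option.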

This is the only conf file in the data directory:

I have no name!@alive-postgresql-ha-postgresql-1:/bitnami/postgresql/data$ cat post
postgresql.auto.conf  postmaster.opts       postmaster.pid

And this seems to be the one it wants:

I have no name!@xxxxx-postgresql-ha-postgresql-1:/bitnami/postgresql/data$ cat postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo = 'host=''xxxxx-postgresql-ha-postgresql-0.alive-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local'' port=5432 user=repmgr application_name=''xxxxx-postgresql-ha-postgresql-1'' password=''xxxxxxxxxxxxx'' connect_timeout=5'
primary_slot_name = 'repmgr_slot_1001'

Also, the documentation says that pg_rewind requires full_page_writes. That setting is commented out by default in the Bitnami setup, but the documentation assumes it is on by default. Is this a setting in values.yaml? Should we set it by default? Even though it is commented out, it still appears to be enabled, since on is the PostgreSQL default:

psql (16.1)
Type "help" for help.

postgres=# SHOW full_page_writes;
 full_page_writes
------------------
 on
(1 row)

postgres=#

pg_rewind requires that the target server either has the wal_log_hints option enabled in postgresql.conf or data checksums enabled when the cluster was initialized with initdb. Neither of these are currently on by default. full_page_writes must also be set to on, but is enabled by default.
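The prerequisites quoted above reduce to a simple rule: full_page_writes must be on, and at least one of wal_log_hints or data checksums must be on. A minimal sketch of that rule follows. Both function names are hypothetical, and `show_setting` assumes that `psql` can reach the server through the usual `PG*` environment variables.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given the values of wal_log_hints, data_checksums and
# full_page_writes as reported by SHOW ("on" or "off"), succeed only if the
# prerequisites quoted above are satisfied.
rewind_prereqs_ok() {
    local wal_log_hints="$1" data_checksums="$2" full_page_writes="$3"
    [[ "$full_page_writes" == "on" ]] || return 1
    [[ "$wal_log_hints" == "on" || "$data_checksums" == "on" ]]
}

# Hypothetical helper: read one setting from a running server.
# -A and -t make psql print the bare value without headers or alignment.
show_setting() {
    psql -At -c "SHOW $1"
}

# Example with the defaults described in the quoted documentation
# (wal_log_hints off, data checksums off, full_page_writes on):
if rewind_prereqs_ok off off on; then
    echo "pg_rewind prerequisites met"
else
    echo "pg_rewind prerequisites NOT met"
fi
```

Against a live node the three arguments could be filled in with `"$(show_setting wal_log_hints)"`, `"$(show_setting data_checksums)"` and `"$(show_setting full_page_writes)"`.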

I believe this would be the fix in librepmgr.sh:

local -r flags=("-D" "$POSTGRESQL_DATA_DIR" "--source-server" "host=${REPMGR_CURRENT_PRIMARY_HOST} port=${REPMGR_CURRENT_PRIMARY_PORT} user=${REPMGR_USERNAME} dbname=${REPMGR_DATABASE}" "--config-file=postgresql.auto.conf")
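A variant worth considering, offered only as an untested sketch: pass the full path of the main configuration file reported by `SHOW config_file` earlier in this thread, rather than `postgresql.auto.conf`. The option is documented as taking the main server configuration file, and a bare file name may not resolve to the data directory. `build_pgrewind_flags` is a hypothetical helper, and `POSTGRESQL_CONF_FILE` is an assumed variable name that may not match what the Bitnami scripts define.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the pg_rewind flags one per line, adding
# --config-file with the full path to the main configuration file.
# POSTGRESQL_CONF_FILE is an assumed variable name; its fallback is the
# path reported by SHOW config_file earlier in this thread.
build_pgrewind_flags() {
    local conf_file="${POSTGRESQL_CONF_FILE:-/opt/bitnami/postgresql/conf/postgresql.conf}"
    local -a flags=(
        "-D" "${POSTGRESQL_DATA_DIR:-/bitnami/postgresql/data}"
        "--source-server" "host=${REPMGR_CURRENT_PRIMARY_HOST} port=${REPMGR_CURRENT_PRIMARY_PORT} user=${REPMGR_USERNAME} dbname=${REPMGR_DATABASE}"
        "--config-file=${conf_file}"
    )
    printf '%s\n' "${flags[@]}"
}
```

Printing one flag per line keeps the connection string intact as a single argument, so the result can be read back into an array with `mapfile -t flags < <(build_pgrewind_flags)` before calling pg_rewind.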

github-actions · 2023-12-14T01:24:44Z

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions · 2023-12-19T01:24:58Z

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

mzealey added the tech-issues The user has a technical issue about an application label Oct 27, 2023

github-actions bot added the triage Triage is needed label Oct 27, 2023

bitnami-bot assigned javsalgar Oct 27, 2023

javsalgar changed the title ~~postgresql-repmgr always does a full resync of the database on standby node when coming up~~ [bitnami/postgresql-repmgr] always does a full resync of the database on standby node when coming up Oct 30, 2023

javsalgar added the postgresql-repmgr label Oct 30, 2023

github-actions bot added in-progress and removed triage Triage is needed labels Oct 31, 2023

bitnami-bot assigned fevisera and unassigned javsalgar Oct 31, 2023

github-actions bot added the stale 15 days without activity label Nov 22, 2023

github-actions bot removed the stale 15 days without activity label Nov 28, 2023

xtianus79 mentioned this issue Nov 28, 2023

[postgresql-ha] Multiple Primaries bitnami/charts#2610

Closed

github-actions bot added the stale 15 days without activity label Dec 14, 2023

github-actions bot added the solved label Dec 19, 2023

bitnami-bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 19, 2023

github-actions bot removed the in-progress label Dec 19, 2023

This was referenced Aug 30, 2024

[bitnami/postgresql-repmgr] Standby resyncs with Primary Node at every Restart #71493

Closed

Standby resyncs with Primary Node at every Restart EnterpriseDB/repmgr#858

Open
