[bitnami/postgresql-repmgr] always does a full resync of the database on standby node when coming up #52213

Closed
mzealey opened this issue Oct 27, 2023 · 7 comments
Assignees
Labels
postgresql-repmgr solved stale 15 days without activity tech-issues The user has a technical issue about an application

Comments

mzealey commented Oct 27, 2023

Name and Version

bitnami/postgresql-repmgr:16.0.0-debian-11-r11

What architecture are you using?

amd64

What steps will reproduce the bug?

We have a pair of postgresql-repmgr instances running with the following config:

                "POSTGRESQL_PASSWORD=xx",
                "REPMGR_PASSWORD=xx",
                "REPMGR_PRIMARY_HOST=pubsub-01",
                "REPMGR_PRIMARY_PORT=5432",
                "REPMGR_PARTNER_NODES=pubsub-01,pubsub-02:5432",
                "REPMGR_NODE_NAME=pubsub-01",
                "REPMGR_NODE_NETWORK_NAME=pubsub-01",
                "REPMGR_PORT_NUMBER=5432",
                "REPMGR_USE_PGREWIND=yes",
                "POSTGRESQL_WAL_LEVEL=logical",
                "BITNAMI_DEBUG=true",

The other instance uses the same config. I tried setting REPMGR_USE_PGREWIND=yes to fix this issue, but to no avail.

Restarting the standby instance causes the following logs:

postgresql-repmgr 11:41:28.37 
postgresql-repmgr 11:41:28.37 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 11:41:28.37 Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 11:41:28.38 Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 11:41:28.38 
postgresql-repmgr 11:41:28.39 INFO  ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 11:41:28.42 INFO  ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 11:41:28.42 INFO  ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 11:41:28.43 INFO  ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 11:41:28.44 DEBUG ==> Checking node 'pubsub-01:5432'...
postgresql-repmgr 11:41:28.50 DEBUG ==> Pretending primary role node - 'pubsub-01:5432'
postgresql-repmgr 11:41:28.50 DEBUG ==> Pretending primary set to 'pubsub-01:5432'!
postgresql-repmgr 11:41:28.51 DEBUG ==> Checking node 'pubsub-02:5432'...
psql: error: connection to server at "pubsub-02" (10.200.0.107), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
postgresql-repmgr 11:41:28.53 DEBUG ==> Skipping: failed to get primary from the node 'pubsub-02:5432'!
postgresql-repmgr 11:41:28.53 INFO  ==> Auto-detected primary node: 'pubsub-01:5432'
postgresql-repmgr 11:41:28.53 DEBUG ==> Primary node: 'pubsub-01:5432'
postgresql-repmgr 11:41:28.54 INFO  ==> Node configured as standby
postgresql-repmgr 11:41:28.55 INFO  ==> Preparing PostgreSQL configuration...
postgresql-repmgr 11:41:28.55 DEBUG ==> Injecting a new postgresql.conf file...
postgresql-repmgr 11:41:28.56 INFO  ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 11:41:28.70 DEBUG ==> Injecting a new pg_hba.conf file...
postgresql-repmgr 11:41:28.71 INFO  ==> Preparing repmgr configuration...
postgresql-repmgr 11:41:28.73 DEBUG ==> Node ID: '1002', Rol: 'standby', Primary Node: 'pubsub-01:5432'
postgresql-repmgr 11:41:28.73 INFO  ==> Initializing Repmgr...
postgresql-repmgr 11:41:28.74 INFO  ==> Waiting for primary node...
postgresql-repmgr 11:41:28.74 DEBUG ==> Wait for schema repmgr.repmgr on 'pubsub-01:5432', will try 6 times with 10 delay seconds (TIMEOUT=60)
postgresql-repmgr 11:41:28.78 DEBUG ==> Schema repmgr.repmgr exists!
postgresql-repmgr 11:41:28.78 INFO  ==> Rejoining node...
postgresql-repmgr 11:41:28.79 INFO  ==> Using pg_rewind to primary node...
postgresql-repmgr 11:41:28.79 INFO  ==> Running pg_rewind data to primary node...
pg_rewind: executing "/opt/bitnami/postgresql/bin/postgres" for target server to complete crash recovery
postgres: could not access the server configuration file "/bitnami/postgresql/data/postgresql.conf": No such file or directory
pg_rewind: error: postgres single-user mode in target cluster failed
pg_rewind: detail: Command was: /opt/bitnami/postgresql/bin/postgres --single -F -D /bitnami/postgresql/data template1 < /dev/null
postgresql-repmgr 11:41:28.85 WARN  ==> pg_rewind failed, resorting to data cloning
postgresql-repmgr 11:41:28.86 INFO  ==> Cloning data from primary node...
WARNING: following problems with command line parameters detected:
  -D/--pgdata will be ignored if a repmgr configuration file is provided
NOTICE: destination directory "/bitnami/postgresql/data" provided
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
WARNING: directory "/bitnami/postgresql/data" exists but is not empty
NOTICE: -F/--force provided - deleting existing data directory "/bitnami/postgresql/data"
NOTICE: starting backup (using pg_basebackup)...
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /bitnami/postgresql/data start
HINT: after starting the server, you need to re-register this standby with "repmgr standby register --force" to update the existing node record
[REPMGR EVENT] Node id: 1002; Event type: standby_clone; Success [1|0]: 1; Time: 2023-10-27 11:42:06.06608+00;  Details: cloned from host "pubsub-01", port 5432; backup method: pg_basebackup; --force: Y
Looking for the script: /opt/bitnami/repmgr/events/execs/standby_clone.sh
[REPMGR EVENT] no script '/opt/bitnami/repmgr/events/execs/standby_clone.sh' found. Skipping...
postgresql-repmgr 11:42:06.10 INFO  ==> Initializing PostgreSQL database...
postgresql-repmgr 11:42:06.10 DEBUG ==> Copying files from /bitnami/postgresql/conf to /opt/bitnami/postgresql/conf
postgresql-repmgr 11:42:06.11 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 11:42:06.11 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 11:42:06.12 DEBUG ==> Ensuring expected directories/files exist...
postgresql-repmgr 11:42:06.15 INFO  ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 11:42:06.19 INFO  ==> Configuring replication parameters
postgresql-repmgr 11:42:06.24 INFO  ==> Configuring fsync
postgresql-repmgr 11:42:06.27 INFO  ==> Setting up streaming replication slave...
postgresql-repmgr 11:42:06.30 INFO  ==> Starting PostgreSQL in background...

What is the expected behavior?

After a brief restart, I would not expect a full clone to be necessary.

It's also a bit strange that pg_rewind is not working, although I appreciate that this feature flag is not documented.

What do you see instead?

As shown in the log above, a full pg_basebackup happens each time the container restarts.

Additional information

No response

@mzealey mzealey added the tech-issues The user has a technical issue about an application label Oct 27, 2023
@github-actions github-actions bot added the triage Triage is needed label Oct 27, 2023
@javsalgar javsalgar changed the title postgresql-repmgr always does a full resync of the database on standby node when coming up [bitnami/postgresql-repmgr] always does a full resync of the database on standby node when coming up Oct 30, 2023
@javsalgar
Contributor

Hi,

We added an experimental flag to use pg_rewind instead. You can try it so a full clone does not happen.

bitnami/charts#8933

Author

mzealey commented Oct 30, 2023

As I mentioned in my report, I have already tried this flag and it has problems: the log I posted shows pg_rewind failing because the postgres config file does not exist in the data directory when it runs.

Also, it seems a bit strange to me that, even though nothing has happened during the restart, it always tries to do a full resync and/or rewind.

@github-actions github-actions bot added in-progress and removed triage Triage is needed labels Oct 31, 2023
@bitnami-bot bitnami-bot assigned fevisera and unassigned javsalgar Oct 31, 2023
Contributor

fevisera commented Nov 6, 2023

Hi @mzealey,

I'm currently working on reproducing the issue you have mentioned. Could you kindly share the docker-compose file that you are using?

Thanks.


This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale 15 days without activity label Nov 22, 2023

xtianus79 commented Nov 27, 2023

This is still an issue. The flag causes the error below. I think I found the bug, based on the pg_rewind docs.

postgresql-repmgr 04:00:49.65 INFO  ==> Rejoining node...
postgresql-repmgr 04:00:49.65 INFO  ==> Using pg_rewind to primary node...
postgresql-repmgr 04:00:49.65 INFO  ==> Running pg_rewind data to primary node...
pg_rewind: executing "/opt/bitnami/postgresql/bin/postgres" for target server to complete crash recovery
pg_rewind: executing "/opt/bitnami/postgresql/bin/postgres" for target server to complete crash recovery
postgres: could not access the server configuration file "/bitnami/postgresql/data/postgresql.conf": No such file or directory
pg_rewind: error: postgres single-user mode in target cluster failed
pg_rewind: detail: Command was: /opt/bitnami/postgresql/bin/postgres --single -F -D /bitnami/postgresql/data template1 < /dev/null
postgresql-repmgr 04:00:49.72 WARN  ==> pg_rewind failed, resorting to data cloning
postgresql-repmgr 04:00:49.72 INFO  ==> Cloning data from primary node...
WARNING: following problems with command line parameters detected:
  -D/--pgdata will be ignored if a repmgr configuration file is provided
NOTICE: destination directory "/bitnami/postgresql/data" provided
INFO: connecting to source node

The configuration file is actually located here:

postgres=# SHOW config_file;
                 config_file
----------------------------------------------
 /opt/bitnami/postgresql/conf/postgresql.conf

Am I missing something?
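
One way to make the mismatch explicit is to compare the directory that pg_rewind receives via -D with the path reported by SHOW config_file. The snippet below is a minimal sketch, assuming the two paths shown in the log and the psql output above; the function name conf_in_data_dir is hypothetical and not part of the Bitnami scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of librepmgr.sh): reports whether the main
# configuration file lives inside the data directory passed to pg_rewind -D.
conf_in_data_dir() {
    local data_dir="${1%/}"   # strip a trailing slash, if any
    local conf_file="$2"
    case "$conf_file" in
        "$data_dir"/*) echo "inside" ;;
        *)             echo "outside" ;;
    esac
}

# Paths taken from the log and from the SHOW config_file output above.
conf_in_data_dir "/bitnami/postgresql/data" "/opt/bitnami/postgresql/conf/postgresql.conf"
```

A result of "outside" is consistent with the error in the log: postgres started with only -D looks for postgresql.conf in the data directory and does not find it there.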

After some more review, I found this in the lib script:

########################
# Execute pg_rewind to get data from the primary node
# Globals:
#   REPMGR_*
# Arguments:
#   None
# Returns:
#   None
#########################
repmgr_pgrewind() {
    info "Running pg_rewind data to primary node..."
    local -r flags=("-D" "$POSTGRESQL_DATA_DIR" "--source-server" "host=${REPMGR_CURRENT_PRIMARY_HOST} port=${REPMGR_CURRENT_PRIMARY_PORT} user=${REPMGR_USERNAME} dbname=${REPMGR_DATABASE}")

    if [[ "$REPMGR_USE_PASSFILE" = "true" ]]; then
        PGPASSFILE="$REPMGR_PASSFILE_PATH" debug_execute "${POSTGRESQL_BIN_DIR}/pg_rewind" "${flags[@]}"
    else
        PGPASSWORD="$REPMGR_PASSWORD" debug_execute "${POSTGRESQL_BIN_DIR}/pg_rewind" "${flags[@]}"
    fi
}

The flags set here are just the defaults, with no option for the location of the conf file. I think that, with the current flags, pg_rewind is not finding the conf file that is actually in the data directory.

local -r flags=("-D" "$POSTGRESQL_DATA_DIR" "--source-server" "host=${REPMGR_CURRENT_PRIMARY_HOST} port=${REPMGR_CURRENT_PRIMARY_PORT} user=${REPMGR_USERNAME} dbname=${REPMGR_DATABASE}")

There is an option in pg_rewind that would reference the correct conf:

--config-file=filename
Use the specified main server configuration file for the target cluster. This affects pg_rewind when it uses internally the postgres command for the rewind operation on this cluster (when retrieving restore_command with the option -c/--restore-target-wal and when forcing a completion of crash recovery).

The only conf file in the data directory is this one:

I have no name!@alive-postgresql-ha-postgresql-1:/bitnami/postgresql/data$ cat post
postgresql.auto.conf  postmaster.opts       postmaster.pid

This seems to be the one it wants:

I have no name!@xxxxx-postgresql-ha-postgresql-1:/bitnami/postgresql/data$ cat postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo = 'host=''xxxxx-postgresql-ha-postgresql-0.alive-postgresql-ha-postgresql-headless.xxxxx-postgresql-ha.svc.cluster.local'' port=5432 user=repmgr application_name=''xxxxx-postgresql-ha-postgresql-1'' password=''xxxxxxxxxxxxx'' connect_timeout=5'
primary_slot_name = 'repmgr_slot_1001'

Also, the documentation says that pg_rewind requires full_page_writes. In the Bitnami setup this setting is commented out, while the docs assume it is on by default. Is this a setting in values.yaml? Should we set it explicitly? Even though it is commented out, it appears to be enabled, since on is the PostgreSQL default:

psql (16.1)
Type "help" for help.

postgres=# SHOW full_page_writes;
 full_page_writes
------------------
 on
(1 row)

postgres=#

pg_rewind requires that the target server either has the wal_log_hints option enabled in postgresql.conf or data checksums enabled when the cluster was initialized with initdb. Neither of these are currently on by default. full_page_writes must also be set to on, but is enabled by default.
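
The prerequisites quoted above can be checked before attempting a rewind. The following is a minimal sketch, assuming the three values have already been read from the target server (for example with SHOW, as in the psql session above); the function name rewind_prereqs_ok is hypothetical and not part of the Bitnami scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decides from three setting values whether the
# documented pg_rewind prerequisites are met.
# The values can be read on the target server with, for example:
#   psql -Atc "SHOW wal_log_hints"
#   psql -Atc "SHOW data_checksums"
#   psql -Atc "SHOW full_page_writes"
rewind_prereqs_ok() {
    local wal_log_hints="$1" data_checksums="$2" full_page_writes="$3"
    if [ "$full_page_writes" != "on" ]; then
        echo "full_page_writes must be on"
        return 1
    fi
    if [ "$wal_log_hints" != "on" ] && [ "$data_checksums" != "on" ]; then
        echo "enable wal_log_hints or initialize the cluster with data checksums"
        return 1
    fi
    echo "ok"
}

# Example with the defaults described in the quoted documentation:
# wal_log_hints=off, data_checksums=off, full_page_writes=on.
rewind_prereqs_ok off off on || true
```

With those defaults the check fails on the second condition, so it may be worth confirming the actual values on the standby before relying on REPMGR_USE_PGREWIND.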

I believe this would be the fix in librepmgr.sh:

local -r flags=("-D" "$POSTGRESQL_DATA_DIR" "--source-server" "host=${REPMGR_CURRENT_PRIMARY_HOST} port=${REPMGR_CURRENT_PRIMARY_PORT} user=${REPMGR_USERNAME} dbname=${REPMGR_DATABASE}" "--config-file=postgresql.auto.conf")
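
A variant worth considering: --config-file expects the main server configuration file, and SHOW config_file above reports /opt/bitnami/postgresql/conf/postgresql.conf rather than postgresql.auto.conf. The sketch below only builds and prints the argument list so it can be inspected without running pg_rewind. It is untested against the container; the variable POSTGRESQL_CONF_FILE, the repmgr user and database defaults, and the helper name pgrewind_flags are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: builds the pg_rewind argument list with --config-file set to
# the main configuration file. Defaults mirror values seen in this thread;
# POSTGRESQL_CONF_FILE and the repmgr user/database defaults are assumptions.
POSTGRESQL_DATA_DIR="${POSTGRESQL_DATA_DIR:-/bitnami/postgresql/data}"
POSTGRESQL_CONF_FILE="${POSTGRESQL_CONF_FILE:-/opt/bitnami/postgresql/conf/postgresql.conf}"
REPMGR_CURRENT_PRIMARY_HOST="${REPMGR_CURRENT_PRIMARY_HOST:-pubsub-01}"
REPMGR_CURRENT_PRIMARY_PORT="${REPMGR_CURRENT_PRIMARY_PORT:-5432}"
REPMGR_USERNAME="${REPMGR_USERNAME:-repmgr}"
REPMGR_DATABASE="${REPMGR_DATABASE:-repmgr}"

# Hypothetical helper: prints one pg_rewind argument per line.
pgrewind_flags() {
    local -r flags=(
        "-D" "$POSTGRESQL_DATA_DIR"
        "--source-server" "host=${REPMGR_CURRENT_PRIMARY_HOST} port=${REPMGR_CURRENT_PRIMARY_PORT} user=${REPMGR_USERNAME} dbname=${REPMGR_DATABASE}"
        "--config-file=${POSTGRESQL_CONF_FILE}"
    )
    printf '%s\n' "${flags[@]}"
}

pgrewind_flags
```

Using an absolute path avoids any dependence on the working directory from which pg_rewind is launched.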


This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale 15 days without activity label Dec 14, 2023

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

@bitnami-bot bitnami-bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 19, 2023
5 participants