Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade fails due to postmaster.pid #550

Open
eifrach opened this issue Feb 19, 2024 · 29 comments
Open

Upgrade fails due to postmaster.pid #550

eifrach opened this issue Feb 19, 2024 · 29 comments

Comments

@eifrach
Copy link
Contributor

eifrach commented Feb 19, 2024

Container platform

OCP 4, Podman/Docker

Version

12, 13

OS version of the container image

RHEL 7, RHEL 8, CentOS 7, CentOS Stream 8

Bugzilla, Jira

No response

Description

We are trying to work on and upgrade flow from version 12 to version 13
we are using k8s deployment pods
when we try to deploy the new image we get an error during the upgrade

==========  $PGDATA upgrade: 12 -> 13  ==========                                                                                                                                                 
                                                                                                                                                                                                  
                                                                                                                                                                                                  
===>  Starting old postgresql once again for a clean shutdown...                                                                                                                                  
                                                                                                                                                                                                  
pg_ctl: another server might be running; trying to start server anyway                                                                                                                            
waiting for server to start....2024-02-14 12:46:35.514 UTC [31] FATAL:  lock file "postmaster.pid" already exists                                                                                 
2024-02-14 12:46:35.514 UTC [31] HINT:  Is another postmaster (PID 1) running in data directory "/var/lib/pgsql/data/userdata"?           

It seems that the postmaster.pid is always there also on restart / redeployment and the service recovers without an issue. Only during the upgrade do we face the problem.

deleting the file and redeploying the upgrade solves it.

my question is,

  1. Is that the normal behavior ?
  2. What is the best approach to do deployment upgrades?

not sure if deleting a PID is good solution to run in Production enviorment

Reproducer

deploy version 12, and redeploy version 13 with the upgrade env

can be done with podman play or minikube

@fila43
Copy link
Member

fila43 commented Feb 28, 2024

Hello, thank you for your report. Since every layer (podman, openshift,..) complicates the debugging, it would be really helpful to provide a reproducer just using podman or on rpm level.

@eifrach
Copy link
Contributor Author

eifrach commented Feb 28, 2024

you can use this on minikube

  1. create storage
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: postgres-pv-claim
  labels:
    app: postgres
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  1. deploy postgres
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: quay.io/centos7/postgresql-12-centos7
          imagePullPolicy: "IfNotPresent"
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRESQL_UPGRADE
              value: "hardlink"
            - name: POSTGRESQL_UPGRADE_FORCE
              value: "true"
            - name: POSTGRESQL_DATABASE
              valueFrom:
                secretKeyRef:
                  name: assisted-installer-rds
                  key: db.name
            - name: POSTGRESQL_USER
              valueFrom:
                secretKeyRef:
                  name: assisted-installer-rds
                  key: db.user
            - name: POSTGRESQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: assisted-installer-rds
                  key: db.password
          volumeMounts:
            - mountPath: /var/lib/pgsql/data
              name: postgredb
          resources:
            limits:
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 400Mi
      volumes:
        - name: postgredb
          persistentVolumeClaim:
            claimName: postgres-pv-claim

than redeploy step 2 with a newer image quay.io/centos7/postgresql-13-centos7

This are the same steps that I'm running

@eifrach
Copy link
Contributor Author

eifrach commented Feb 28, 2024

BTW, when you restart the image you can see that the postmater.pid is there even without the upgrade
it's not been deleted when shutdown the postgres services

@fila43
Copy link
Member

fila43 commented Mar 12, 2024

It seems to be caused by the orchestration. I tested the scenario just by deploying the image by podman and then redeploying and the upgrade works fine.

@eifrach
Copy link
Contributor Author

eifrach commented Mar 12, 2024

  1. make sure you are using persistent data, could be you are not migrating the DB.
  2. make soure you have the postmaster.pid created before starting the upgrade.

but I did it with podman and minikube and openshift.

@fila43
Copy link
Member

fila43 commented Mar 12, 2024

Of course, I am using persistent data location. Without the upgrade option, it isn't possible to re-deploy it with the higher Postgresql version.
From my POW minikube and openshift block, somehow removal of postmaster.pid file.
Just off the top of my head, I see a few possibilities why the file persists:

  1. The container is stopped before postgresql removes the file
  2. postgresql is not allowed to remove the file from the openshift storage

@eifrach
Copy link
Contributor Author

eifrach commented Mar 13, 2024

so heres a step by step running it locally
using this deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: quay.io/centos7/postgresql-13-centos7
          imagePullPolicy: "IfNotPresent"
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRESQL_UPGRADE
              value: "hardlink"
            - name: POSTGRESQL_UPGRADE_FORCE
              value: "true"
            - name: POSTGRESQL_DATABASE
              value: testdb
            - name: POSTGRESQL_USER
              value: user1
            - name: POSTGRESQL_PASSWORD
              value: password 
          volumeMounts:
            - mountPath: /var/lib/pgsql/data
              name: postgresdb
          resources:
            limits:
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 400Mi
      volumes:
      - name: postgresdb
        persistentVolumeClaim:
          claimName: postgres-pv-claim
> # create volume
> podman volume create postgres-pv-claim 
postgres-pv-claim

> # starting pod
> podman kube play deployment.yaml 
Pod:
2e7286f9445a896a07a9f2f6ee0f8dc0a5438d2f63d98fa61c2fa98054076cb8
Container:
53021a21703d7b141c4c511dc71da6d97db738e3401c6bc8c402b758d466d4e3

>  # change the image
> sed -i 's/postgresql-12/postgresql-13/g' deployment.yaml 
> grep -Rn image: deployment.yaml 
19:          image: quay.io/centos7/postgresql-13-centos7

> # replace pod
> podman kube play deployment.yaml --replace 
Pods stopped:
7b45233e856809db7527ec4cb074bfaedec0f3d9c16a4954500980f34ddb0b67
Pods removed:
7b45233e856809db7527ec4cb074bfaedec0f3d9c16a4954500980f34ddb0b67
Secrets removed:
Volumes removed:
Trying to pull quay.io/centos7/postgresql-13-centos7:latest...
Getting image source signatures
Copying blob b929e0b929d2 done   | 
Copying blob 06c7e4737942 skipped: already exists  
Copying blob c61d16cfe03e skipped: already exists  
Copying config 7d69e95a7f done   | 
Writing manifest to image destination
Pod:
f96d847bf394168c141c8c5a5d9e63cb4c373eb7b0defaa1f6468431a61ccaf5
Container:
52e3699579e0398985e4d8d815059c42b0a8ed285471398c0bcb4f24ff4f41ed

> # logs...
> podman logs postgres-pod-postgres 

==========  $PGDATA upgrade: 12 -> 13  ==========


===>  Starting old postgresql once again for a clean shutdown...

pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2024-03-13 08:58:58.037 UTC [26] FATAL:  lock file "postmaster.pid" already exists
2024-03-13 08:58:58.037 UTC [26] HINT:  Is another postmaster (PID 1) running in data directory "/var/lib/pgsql/data/userdata"? 
 stopped waiting
pg_ctl: could not start server
Examine the log output.

p.s.
for some reason the --replace command doesn't work when running in user mode, please run it as root if you wanna recreate

@eifrach
Copy link
Contributor Author

eifrach commented Mar 13, 2024

FYI, when you run pg_ctl -D /var/lib/pgsql/data/userdata stop
it does remove the PID file.

I'm guessing that the SIGTERM is not shutting down the container properly.

@fila43
Copy link
Member

fila43 commented Mar 13, 2024

FYI, when you run pg_ctl -D /var/lib/pgsql/data/userdata stop it does remove the PID file.

I'm guessing that the SIGTERM is not shutting down the container properly.

I have the same opinion, it would make sense, that Postgresql stop is not handled correctly in the openshift environment

@eifrach
Copy link
Contributor Author

eifrach commented Mar 13, 2024

I see it in minikube / podman and openshift.

maybe a better way is to add PreStop command
but, this may casue some logs/events

      preStop:
        exec:
          command: [  pg_ctl, -D, /var/lib/pgsql/data/userdata, stop ]
     ```

@phracek
Copy link
Member

phracek commented Mar 13, 2024

@eifrach Can you please test your deployment with the defined namespace, that is unique to your switched project.

@eifrach
Copy link
Contributor Author

eifrach commented Mar 13, 2024

what do you mean? the orignal is in namespace this deployment just for tests
this is the original project which I've test on minikube with namespace

https://github.com/openshift/assisted-service/blob/bc5a2af1aced6ce9c716e80d24482e6af8fa3663/deploy/postgres/postgres-deployment.yaml

@zmiklank
Copy link
Contributor

I was reviewing the PR which should fix this issue: #554.

However the main communication seems to be here, so I'll ask here.

As you have discussed here, it indeed seems, like the container/pod is not terminating gracefully with SIGTERM. OCP/podman should send SIGKILL when SIGTERM does not manage to terminate the running container.

If this is really happening here - and the container is terminated by SIGKILL - then based on upstream postgresql documentation this is not safe, and should be avoided:

"It is best not to use SIGKILL to shut down the server. Doing so will prevent the server from releasing shared memory and semaphores. Furthermore, SIGKILL kills the postgres process without letting it relay the signal to its subprocesses, so it might be necessary to kill the individual subprocesses by hand as well."

In that case I think that the proposed fix could have some unwanted consequences.
We should find out how the process was terminated in this use case.

The SIGTERM seems to work correctly when running the container directly from podman. (podman run/podman kill)

@eifrach
Copy link
Contributor Author

eifrach commented Mar 18, 2024

thanks so much - i will try it out in a day or two and let you know

@eifrach
Copy link
Contributor Author

eifrach commented Mar 20, 2024

hey,
I tested the fix and seems to be working as design.

but I found a new problem, the repo has no longer have centos7 builds.

and centos8 with psql-13 doesn't have psql-12 to run the upgrade - is it a bug?
( i had to build with the old centos7 to validate the fix )

  • fedora psql13 same can't upgrade from 12

@fila43
Copy link
Member

fila43 commented Mar 20, 2024

I was reviewing the PR which should fix this issue: #554.

However the main communication seems to be here, so I'll ask here.

As you have discussed here, it indeed seems, like the container/pod is not terminating gracefully with SIGTERM. OCP/podman should send SIGKILL when SIGTERM does not manage to terminate the running container.

If this is really happening here - and the container is terminated by SIGKILL - then based on upstream postgresql documentation this is not safe, and should be avoided:

"It is best not to use SIGKILL to shut down the server. Doing so will prevent the server from releasing shared memory and semaphores. Furthermore, SIGKILL kills the postgres process without letting it relay the signal to its subprocesses, so it might be necessary to kill the individual subprocesses by hand as well."

In that case I think that the proposed fix could have some unwanted consequences. We should find out how the process was terminated in this use case.

I understand your point that sending SIGKILL to the server is not a good practice. We should definitely identify why it happens and try to avoid it. But anyway we also need a mechanism to "recover" after such an ugly shutdown. Proposed PR doesn't handle the cause but the consequence.

The SIGTERM seems to work correctly when running the container directly from podman. (podman run/podman kill)

So I would propose to merge it, if it works and doesn't case any other issue. And we can open an issue for further investigation to find out why the .pid file persists.

@zmiklank
Copy link
Contributor

hey, I tested the fix and seems to be working as design.

but I found a new problem, the repo has no longer have centos7 builds.

CentOS 7 images from SCLOrg are no longer being rebuilt, see more in commit 9ebea2b.

and centos8 with psql-13 doesn't have psql-12 to run the upgrade - is it a bug? ( i had to build with the old centos7 to validate the fix )

* fedora psql13 same can't upgrade from 12

I do not know, and even if this is a bug, it would not be in a container image layer, but rather rpm layer. @fila43 Do you have any info about this?

So I would propose to merge it, if it works and doesn't case any other issue. And we can open an issue for further investigation to find out why the .pid file persists.

@fila43 , Ok, I agree.

@eifrach
Copy link
Contributor Author

eifrach commented Mar 24, 2024

Now I tried to use the build to upgrade from 12 to 13 using the fedora image - seems that PSQL12 not found
quay.io/fedora/postgresql-13:latest

==========  $PGDATA upgrade: 12 -> 13  ==========

/usr/share/container-scripts/postgresql/common.sh: line 335: scl_source: No such file or directory

===>  Starting old postgresql once again for a clean shutdown...

/usr/share/container-scripts/postgresql/common.sh: line 353: /opt/rh/rh-postgresql12/root/usr/bin/pg_ctl: No such file or directory
Warning: Can't detect cpuset size from cgroups, will use nproc

@eifrach
Copy link
Contributor Author

eifrach commented Mar 24, 2024

just fyi, after playing a bit with the script
I managed to upgrade to 13 -- not sure this will fit all builds
but this is my workaround

9c25f36

@zmiklank
Copy link
Contributor

@eifrach Thanks for the analysis. We discussed this within the team and came to conclusion that some bigger changes of how upgrades in sclorg postgresql container images work would need to be incorporated. This could, however, take some time.

@eifrach
Copy link
Contributor Author

eifrach commented Apr 3, 2024

sry for the late response, but for RHEL images it's a big issue for some of openshift operators
and PSQL12 is EOL soon.
do you have any other solution ?

@fila43
Copy link
Member

fila43 commented Apr 3, 2024

At RPM level, we also have the Dump and Restore upgrade approach. It should also work for containers. More about here. Would it be suitable workaround for your use-case?

@eifrach
Copy link
Contributor Author

eifrach commented Apr 4, 2024

The app uses official Redhat images, which also have security updates and those very important to us.

For this I would need either to create a custom build or create some kind of a startup script for the migration.
which I don't want to add the upgrade to the application -- if I do, we will need to maintain / test. It adds complexity we are trying to avoid.

@eifrach
Copy link
Contributor Author

eifrach commented Apr 4, 2024

btw, this is how I got the upgrade to work

FROM quay.io/sclorg/postgresql-13-c8s:latest

USER root


ENV PQSL12_BIN=/usr/lib64/pgsql/postgresql-12
ENV PQSL12_DESTANATION=/opt/rh/rh-postgresql12/root/usr

RUN dnf install -y procps-ng util-linux postgresql-upgrade 

RUN mkdir -p $PQSL12_DESTANATION && \
    /usr/libexec/fix-permissions $PQSL12_BIN/bin && \
    ln -s $PQSL12_BIN/bin  $PQSL12_DESTANATION/bin 

# removing issues from current sctipt 
RUN sed -i 's#"${old_pgengine}/pg_isready"# #g' /usr/share/container-scripts/postgresql/common.sh && \
    sed -i 's#new_pgengine=/opt/rh/rh-postgresql${new_raw_version}/root/usr/bin#new_pgengine="/bin"#g' /usr/share/container-scripts/postgresql/common.sh && \
    sed -i 's#&& ! pgrep -f "postgres" > /dev/null# #g'  /usr/share/container-scripts/postgresql/common.sh

USER 26

ENTRYPOINT ["container-entrypoint"]
CMD ["run-postgresql"]

@fila43
Copy link
Member

fila43 commented Apr 10, 2024

Since we do not support in-place upgrades for images greater than centos7, as you mentioned, the container would need a couple of changes. This would lead to an increase in the size of the container. I recommend filing a feature request in JIra.

@carbonin
Copy link

Since we do not support in-place upgrades for images greater than centos7

Out of curiosity, why was this dropped? Seems like a rather large feature to remove.

@carbonin
Copy link

Looks like the docs at least were removed in e773fe8

@hhorak do you have more details around why in-place upgrade was removed? Would something like @eifrach's patch be adaquate to add support back?

@fila43
Copy link
Member

fila43 commented Apr 10, 2024

Looks like the docs at least were removed in e773fe8

@hhorak do you have more details around why in-place upgrade was removed? Would something like @eifrach's patch be adaquate to add support back?

More precisely it has never been removed. Between rhel7 and rhel8 there was a change in postgresql on the rpm level and the new design hasn't been adopted yet.

It's an area where we will welcome any kind of contribution. the proposed change is heading in the right direction but needs to be adjusted

@eifrach
Copy link
Contributor Author

eifrach commented Apr 11, 2024

@fila43 I've added a initial PR #562
I test it locally and seems to be working -- can you review and comment

thanks !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants