[postgresql-ha] Multiple Primaries #2610
Comments
Hi, do you remember any pattern that could lead to the issue? Did it happen after one instance crashed, maybe? Or did it happen randomly right after deploying?
I also had this bug.
postgresql-postgresql-ha-postgresql-0
postgresql-postgresql-ha-postgresql-1
postgresql-postgresql-ha-postgresql-2
postgresql-postgresql-ha-postgresql-2
postgresql-postgresql-ha-postgresql-0
Hi, so, if I understand correctly (in order to reproduce it), did you kill node 0 so that node 1 becomes the primary? If so, at which point did you perform this operation?
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
I have this problem too.
Hi, is there any pattern that could give us a clue? Did one of the pods crash, or several?
I don't know exactly when. I think it was when two or three of the pods crashed at the same time. Or maybe the connection between nodes was lost at the same time.
Hi, I see. I'm not sure there's an easy way to reproduce this kind of situation. Let's see if other users come across the same issue as well.
I ran into a similar issue with multiple primaries.
DB was created with these args:
Not sure what triggered this. There was a stability run which should have had a relatively stable query traffic rate. kubectl get pods reports that instance 0 restarted a number of times.
instance 0 had this log:
Around the time of the instance 0 promote, I see this in the instance 1 log:
Hi, I see some connection issues, but it's not clear to me whether they are related to the instance startup, after which everything gets back to normal. Did you see any more connection issues with other pods?
@javsalgar, it is possible there were some connectivity issues during the last startup of instance 0. I don't have much more data from when this event happened. Regardless, both instances remained primaries after the last instance 0 startup. I eventually deleted instance 0 and it recovered.
We have also had this issue, and I've actually found it pretty easy to reproduce, although, as another commenter said, the outcome does seem random. Start with 2 replicas, -0 being primary and -1 being standby. Then kubectl delete pod -0; in our case it automatically comes back up because it's in a StatefulSet. Doing this, I have seen three different outcomes, seemingly at random. Outcome 1 (expected and healthy): pod -1 becomes primary and pod -0 becomes standby when it comes back up. Once you're in the Outcome 3 state, you cannot get out of it unless you scale the replicas down to 1 (leaving only -0 as primary) and then scale back up to the desired number, which puts you back in the starting state (a rough command sketch follows below).
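To make that concrete, here is a minimal sketch of the reproduce/recover sequence described above; the StatefulSet and pod names are assumptions based on the default release naming seen earlier in this thread, so adjust them to your deployment:

# Delete the current primary pod; the StatefulSet recreates it automatically
kubectl delete pod postgresql-postgresql-ha-postgresql-0

# If you end up in the two-primaries state, the workaround described above is to
# scale down to a single replica so only -0 remains as primary...
kubectl scale statefulset postgresql-postgresql-ha-postgresql --replicas=1

# ...and then scale back up so the other pod rejoins as a standby
kubectl scale statefulset postgresql-postgresql-ha-postgresql --replicas=2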
Let me try this again, and I will let you know if I run into the issue.
FWIW, to add onto @aw381246's comment, we were able to reproduce the issue via the following steps (a rough command sketch follows after this list):
Install the chart
Verify the cluster status
Delete the primary replica
To simulate a future deployment that would need to terminate a pod, as well as to simulate a real pod termination due to unforeseen issues, we delete the primary pod.
Oddities
Verify the cluster status
From the original primary replica:
From the original standby replica:
NOTE that each replica reports itself as primary, albeit with slightly different verbiage than the original primary replica.
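For reference, the checks above can be run roughly like this; the pod names and the repmgr.conf path are assumptions based on the default release naming and the Bitnami image layout, so verify them against your deployment:

# Check the cluster state as seen by each replica (repmgr.conf path assumed)
kubectl exec postgresql-postgresql-ha-postgresql-0 -- \
  repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
kubectl exec postgresql-postgresql-ha-postgresql-1 -- \
  repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show

# Delete the primary pod to simulate the termination described in the steps above
kubectl delete pod postgresql-postgresql-ha-postgresql-0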
Thank you so much for the detailed report!
Hi, just a note to let you know that I was able to reproduce the issue after shutting down node 0 twice. I will open a task to investigate the issue in more detail. Thank you so much for reporting.
Thank you @javsalgar!
postgresql-repmgr 12:52:46.10 INFO ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 12:52:46.14 INFO ==> Validating settings in REPMGR_* env vars...
Thank you @jpsn123 for sharing your logs. We have an internal task to investigate PostgreSQL-HA split-brain scenarios and look for solutions to prevent this behavior: some proposed solutions were adding witness nodes and improving the method used to reintegrate the primary node when it recovers. If you would like to contribute, feel free to send a PR with your suggestions and we will be happy to review it.
@migruiz4 how can we add the witness node?
Hi @vishrantgupta, without getting much into the details, the chart ... And the ...
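For anyone else looking for this: if I remember correctly, newer chart versions expose a witness section in values.yaml, so something along these lines should enable it; the parameter name is an assumption here, so double-check it against the chart version you run:

# Enable the repmgr witness node when installing/upgrading the chart
# (witness.create is assumed; confirm with `helm show values bitnami/postgresql-ha`)
helm upgrade --install postgresql bitnami/postgresql-ha --set witness.create=true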
How do you recover to a proper state when this issue occurs? Is there a workaround?
Hi @pkExec, I'm sorry but I don't know the exact method to recover PostgreSQL. As a suggestion, performing a DB dump, redeploying the PostgreSQL chart, and restoring the dump might solve the problem. The following link may help you: https://www.postgresguide.com/utilities/backup-restore/ We are aware of this issue and are working on a solution to prevent split-brain scenarios. We are also open to any contributions that would help improve the chart.
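To make the dump-and-restore suggestion a bit more concrete, here is a rough sketch using standard PostgreSQL tools; the hosts, database name, and credentials below are placeholders:

# 1. Dump the database from the node whose data you want to keep (custom format)
pg_dump -h <node-with-good-data> -U postgres -d mydb -Fc -f mydb.dump

# 2. Redeploy the chart from scratch (fresh PVCs), for example:
#    helm uninstall postgresql && helm install postgresql bitnami/postgresql-ha

# 3. Restore the dump into the new deployment (typically via the Pgpool service)
pg_restore -h <pgpool-service> -U postgres -d mydb --clean --if-exists mydb.dump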
I think this issue should be reopened; I've been fighting this for some days. I thought it was my setup (I'm running a jsonnet version of this Helm chart), but then I tried the original Helm chart, and the issue is still reproducible. Steps to reproduce:
Get this:
Hi @colega, could you please share what version of the chart you are using? Does your chart or values.yaml include any modifications? I tried to reproduce the issue in two different scenarios using the latest version (11.5.2), but neither resulted in the error you shared. Can you reproduce it consistently, or does it only happen under certain circumstances (e.g. several restarts, manually deleting pods, a specific node promoting to primary)? In my first deployment, using default values, the chart restarted successfully:
In my second deployment, I added some extra settings to delay the restart in case my local environment was restarting faster:
postgresql:
  command:
    - /bin/bash
  args:
    - -ec
    - |
      sleep 60
      /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh /opt/bitnami/scripts/postgresql-repmgr/run.sh
  livenessProbe:
    initialDelaySeconds: 90
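In case it is useful for reproducing, the override above can be applied like this (the file and release names are just examples):

# Save the snippet above as values-delay.yaml, then install with it
helm install postgresql bitnami/postgresql-ha -f values-delay.yaml
# or apply it to an existing release:
helm upgrade postgresql bitnami/postgresql-ha -f values-delay.yaml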
I don't rule out a race condition during the failover process. In your case, does PostgreSQL get stuck in that status, or does it resolve after some time? Please provide all the information you can; it would help us troubleshoot the issue.
This is still an issue and should be reopened, because the witness doesn't do anything here to help, which is just odd. The primary comes back 40 seconds after the election happened on the other nodes, and it still sets itself up as the primary, causing a split brain. A. Why doesn't the witness help here? B. What does "pretending" mean in this log? It says that node 1 is "pretending" to be the primary when in fact it is. This seems to be the error.
Could you please create a separate issue, filling in the template with the specific information for your use case? Thanks.
I think this may be related?
@migruiz4 sorry for the delayed response on this (I'm not even using this Helm chart). I just got a new laptop, a fresh install on an M3 MBP, and I decided to give this a try. Fresh install of docker, kind, helm, etc. Current version is:
I followed exactly the same steps as in my previous post (except that I had to add the bitnami repo this time, plus the change to the StatefulSet name), and I could reproduce it after the second restart of the StatefulSet:
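For completeness, the environment setup boils down to roughly the following; the release and StatefulSet names are assumptions based on what is used elsewhere in this thread:

# Fresh local cluster
kind create cluster

# Add the Bitnami repo and install the chart
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install postgresql bitnami/postgresql-ha

# Restart the StatefulSet a couple of times to trigger the issue
kubectl rollout restart statefulset postgresql-postgresql-ha-postgresql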
@javsalgar Hi, I don't think this problem is solved. I reproduced it with bitnami/postgresql-ha version 16.3.0. I hope this issue can be reopened for discussion.
Which chart:
postgresql-ha 3.2.1
Describe the bug
Somehow both PostgreSQL instances end up in primary mode.
This results in Pgpool redirecting requests to both instances, and thus the two databases drift apart with different states.
Output of repmgr cluster show:
DB 0
DB 1
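For reference, you can confirm which instance considers itself primary independently of repmgr by checking pg_is_in_recovery() on each pod (a healthy cluster has exactly one node answering false); the pod names below assume the default release naming, and you may need to supply credentials:

# Each standby should return 't' (in recovery); only the real primary returns 'f'
kubectl exec postgresql-postgresql-ha-postgresql-0 -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"
kubectl exec postgresql-postgresql-ha-postgresql-1 -- \
  psql -U postgres -c "SELECT pg_is_in_recovery();"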
To Reproduce
Steps to reproduce the behavior:
This just seems to happen randomly.
Expected behavior
Only one node at a time being the primary node.
Version of Helm and Kubernetes:
helm version:
kubectl version: