Set wasShutdown=true during hot-standby replica startup only when primary is not alive (#364)

* Set wasShutdown=true during hot-standby replica startup only when primary is not alive
* Report fatal error if hot standby replica is started with oldestActiveXid=0

Postgres part of neondatabase/neon#6705
---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
knizhnik and Konstantin Knizhnik authored Feb 22, 2024
1 parent 0baccce commit be91d91
Showing 2 changed files with 50 additions and 1 deletion.
18 changes: 18 additions & 0 deletions src/backend/access/transam/xlog.c
@@ -5187,6 +5187,24 @@ StartupXLOG(void)
&haveBackupLabel, &haveTblspcMap);
checkPoint = ControlFile->checkPointCopy;

if (ZenithRecoveryRequested)
{
if (wasShutdown)
checkPoint.oldestActiveXid = InvalidTransactionId;
else if (!TransactionIdIsValid(checkPoint.oldestActiveXid))
{
/*
* This should not actually happen: the page server (PS) takes oldestActiveXid
* from running-xacts WAL records and includes it in the checkpoint
* sent in the basebackup.
* FirstNormalTransactionId is a conservative estimate of the oldest active XID, unless
* the current XID is greater than 2^31. So it is not a 100% safe solution either, but it is better than an assertion failure.
*/
elog(FATAL, "oldestActiveXid=%u", checkPoint.oldestActiveXid);
checkPoint.oldestActiveXid = FirstNormalTransactionId;
}
}

/* initialize shared memory variables from the checkpoint record */
ShmemVariableCache->nextXid = checkPoint.nextXid;
ShmemVariableCache->nextOid = checkPoint.nextOid;
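
The check added to StartupXLOG() above can be read as a small decision over two inputs, wasShutdown and checkPoint.oldestActiveXid. The following is a minimal standalone sketch of that decision, not the patch itself: `OldestXidOutcome`, `check_oldest_active_xid`, `xid_precedes` and `oldest_xid_self_check` are hypothetical names, and the typedef and macros are local stand-ins that mirror PostgreSQL's definitions so the file compiles on its own. The second helper illustrates why the comment calls FirstNormalTransactionId only a conservative estimate: XIDs are compared modulo 2^32, so XID 3 stops looking "older" once the current XID is far enough past 2^31.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring PostgreSQL's definitions (assumption: not the real headers). */
typedef uint32_t TransactionId;
#define InvalidTransactionId      ((TransactionId) 0)
#define FirstNormalTransactionId  ((TransactionId) 3)
#define TransactionIdIsValid(xid) ((xid) != InvalidTransactionId)

/* Hypothetical outcome type; in the XID_MISSING case the patch calls elog(FATAL). */
typedef enum { XID_CLEARED, XID_KEPT, XID_MISSING } OldestXidOutcome;

static OldestXidOutcome
check_oldest_active_xid(bool was_shutdown, TransactionId *oldest_active_xid)
{
	assert(oldest_active_xid != NULL);

	if (was_shutdown)
	{
		/* Clean-shutdown startup: no transactions can still be running. */
		*oldest_active_xid = InvalidTransactionId;
		return XID_CLEARED;
	}
	if (!TransactionIdIsValid(*oldest_active_xid))
		return XID_MISSING;		/* replica start without a usable value */
	return XID_KEPT;
}

/* Modulo-2^32 "a is older than b" test for normal XIDs, in the style of TransactionIdPrecedes(). */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/* Usage example: exercises the three outcomes and the wraparound caveat. */
void
oldest_xid_self_check(void)
{
	TransactionId xid = 1000;

	assert(check_oldest_active_xid(true, &xid) == XID_CLEARED);
	assert(xid == InvalidTransactionId);

	xid = 1000;
	assert(check_oldest_active_xid(false, &xid) == XID_KEPT);
	assert(xid == 1000);

	xid = InvalidTransactionId;
	assert(check_oldest_active_xid(false, &xid) == XID_MISSING);

	/* XID 3 is older than a small current XID ... */
	assert(xid_precedes(FirstNormalTransactionId, 1000));
	/* ... but not once the current XID is more than 2^31 ahead of it. */
	assert(!xid_precedes(FirstNormalTransactionId, UINT32_C(0x80000000) + 1000));
}
```

Returning an outcome instead of exiting keeps the sketch testable; the real code path terminates the process at that point.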
33 changes: 32 additions & 1 deletion src/backend/access/transam/xlogrecovery.c
@@ -546,6 +546,18 @@ XLogWaitForReplayOf(XLogRecPtr redoEndRecPtr)
ConditionVariableCancelSleep();
}


/*
* NEON: check whether the primary node is running.
* The corresponding GUC is received from the control plane.
*/
static bool
IsPrimaryAlive(void)
{
const char* val = GetConfigOption("neon.primary_is_running", true, false);
return val != NULL && strcmp(val, "on") == 0;
}

/*
* Prepare the system for WAL recovery, if needed.
*
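
The string test in IsPrimaryAlive() above treats anything other than the exact value "on" as "primary not running", including the case where the setting is missing and the lookup returns NULL. A minimal standalone sketch of that predicate follows; `primary_is_running_from_guc` and `primary_flag_self_check` are hypothetical names, and the input string stands in for whatever GetConfigOption() would return. It assumes neon.primary_is_running is reported in the usual "on"/"off" form, which the source does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: 'val' stands in for the result of
 * GetConfigOption("neon.primary_is_running", true, false),
 * which may be NULL when the option is not defined.
 */
static bool
primary_is_running_from_guc(const char *val)
{
	return val != NULL && strcmp(val, "on") == 0;
}

/* Usage example: only the exact string "on" counts as a running primary. */
void
primary_flag_self_check(void)
{
	assert(primary_is_running_from_guc("on"));
	assert(!primary_is_running_from_guc("off"));
	assert(!primary_is_running_from_guc(NULL));	/* option missing */
	assert(!primary_is_running_from_guc(""));
	assert(!primary_is_running_from_guc("true"));	/* not normalized in this sketch */
}
```

Defaulting to "not running" for a missing or unexpected value means the replica takes the clean-shutdown startup path in that case.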
@@ -802,7 +814,26 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
//EndRecPtr = ControlFile->checkPointCopy.redo;

memcpy(&checkPoint, &ControlFile->checkPointCopy, sizeof(CheckPoint));
wasShutdown = true;
// When a primary Neon compute node is started, we pretend that it started after a clean shutdown and
// that no recovery is needed. We don't need to do WAL replay; the page server does that on a page-by-page basis.
// When a read-only replica is started, PostgreSQL normally waits for a shutdown checkpoint or running-xacts
// record before enabling hot standby, to establish which transactions are still running in the primary
// and might still commit later. But if we know that the primary is not running, because the control plane
// says so, we can skip that. This avoids waiting indefinitely when the primary is not running, which is
// particularly important for Neon: we don't start recovery from a checkpoint record, so there is
// no guarantee on when we'll see the next checkpoint or running-xacts record, if ever. So if we know the primary is
// not currently running, also set wasShutdown to 'true'.
if (StandbyModeRequested &&
PrimaryConnInfo != NULL && *PrimaryConnInfo != '\0')
{
if (!IsPrimaryAlive())
wasShutdown = true;
else
wasShutdown = false;
}
else
wasShutdown = true;


/* Initialize expectedTLEs, like ReadRecord() does */
expectedTLEs = readTimeLineHistory(checkPoint.ThisTimeLineID);
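
The branch added to InitWalRecovery() above reduces to a single rule: wasShutdown is false only for a standby with a non-empty primary connection string whose primary is reported alive. A minimal standalone sketch of that rule follows; `decide_was_shutdown` and `was_shutdown_self_check` are hypothetical names, and the three parameters stand in for StandbyModeRequested, PrimaryConnInfo and the result of IsPrimaryAlive().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper mirroring the patched branch:
 * pretend a clean shutdown unless this is a replica of a live primary.
 */
static bool
decide_was_shutdown(bool standby_requested, const char *primary_conninfo,
					bool primary_alive)
{
	bool		is_replica = standby_requested &&
		primary_conninfo != NULL && *primary_conninfo != '\0';

	return !(is_replica && primary_alive);
}

/* Usage example: walk through the four interesting input combinations. */
void
was_shutdown_self_check(void)
{
	/* Primary compute node: always treated as a clean shutdown. */
	assert(decide_was_shutdown(false, NULL, true));
	/* Standby without a connection string: same as before the patch. */
	assert(decide_was_shutdown(true, "", true));
	/* Replica whose primary is down: skip waiting for running-xacts. */
	assert(decide_was_shutdown(true, "host=primary", false));
	/* Replica of a live primary: must wait for running-xacts information. */
	assert(!decide_was_shutdown(true, "host=primary", true));
}
```

The connection string "host=primary" is a placeholder; only whether the string is empty matters to the rule.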