PostgreSQL 时间点恢复(PITR)时查找wal record的顺序 - loop(pg_wal, restore_command, stream)
背景
PostgreSQL recovery时,如何获取需要的wal record呢?
流程
PostgreSQL recovery时,可以从三个地方获取wal record
1、pg_wal 目录
2、recovery.conf中配置的restore_command
3、recovery.conf中配置的stream replication
优先从1开始,如果找不到则返回FALSE,接下来去RESTORE_COMMAND中找,最后是stream。然后再次循环从1开始找。
代码如下
static bool
typedef enum
{
XLOG_FROM_ANY = 0, /* request to read WAL from any source */
XLOG_FROM_ARCHIVE, /* restored using restore_command */
XLOG_FROM_PG_WAL, /* existing file in pg_wal */
XLOG_FROM_STREAM /* streamed from master */
} XLogSource;
src/backend/access/transam/xlog.c
/*
* Open the WAL segment containing WAL location 'RecPtr'.
*
* The segment can be fetched via restore_command, or via walreceiver having
* streamed the record, or it can already be present in pg_wal. Checking
* pg_wal is mainly for crash recovery, but it will be polled in standby mode
* too, in case someone copies a new segment directly to pg_wal. That is not
* documented or recommended, though.
*
* If 'fetching_ckpt' is true, we're fetching a checkpoint record, and should
* prepare to read WAL starting from RedoStartLSN after this.
*
* 'RecPtr' might not point to the beginning of the record we're interested
* in, it might also point to the page or segment header. In that case,
* 'tliRecPtr' is the position of the WAL record we're interested in. It is
* used to decide which timeline to stream the requested WAL from.
*
* If the record is not immediately available, the function returns false
* if we're not in standby mode. In standby mode, waits for it to become
* available.
*
* When the requested record becomes available, the function opens the file
* containing it (if not open already), and returns true. When end of standby
* mode is triggered by the user, and there is no more WAL available, returns
* false.
*/
static bool
WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool fetching_ckpt, XLogRecPtr tliRecPtr)
{
static TimestampTz last_fail_time = 0;
TimestampTz now;
bool streaming_reply_sent = false;
/*-------
* Standby mode is implemented by a state machine:
*
* 1. Read from either archive or pg_wal (XLOG_FROM_ARCHIVE), or just
* pg_wal (XLOG_FROM_PG_WAL)
* 2. Check trigger file
* 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)
* 4. Rescan timelines
* 5. Sleep wal_retrieve_retry_interval milliseconds, and loop back to 1.
*
* Failure to read from the current source advances the state machine to
* the next state.
*
* 'currentSource' indicates the current state. There are no currentSource
* values for "check trigger", "rescan timelines", and "sleep" states,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*-------
*/
...........
for (;;)
{
int oldSource = currentSource;
/*
* First check if we failed to read from the current source, and
* advance the state machine if so. The failure to read might've
* happened outside this function, e.g when a CRC check fails on a
* record, or within this loop.
*/
if (lastSourceFailed)
{
switch (currentSource)
{
case XLOG_FROM_ARCHIVE:
case XLOG_FROM_PG_WAL:
/*
* Check to see if the trigger file exists. Note that we
* do this only after failure, so when you create the
* trigger file, we still finish replaying as much as we
* can from archive and pg_wal before failover.
*/
if (StandbyMode && CheckForStandbyTrigger())
{
ShutdownWalRcv();
return false;
}
/*
* Not in standby mode, and we've now tried the archive
* and pg_wal.
*/
if (!StandbyMode)
return false;
/*
* If primary_conninfo is set, launch walreceiver to try
* to stream the missing WAL.
*
* If fetching_ckpt is true, RecPtr points to the initial
* checkpoint location. In that case, we use RedoStartLSN
* as the streaming start position instead of RecPtr, so
* that when we later jump backwards to start redo at
* RedoStartLSN, we will have the logs streamed already.
*/
if (PrimaryConnInfo)
{
XLogRecPtr ptr;
TimeLineID tli;
if (fetching_ckpt)
{
ptr = RedoStartLSN;
tli = ControlFile->checkPointCopy.ThisTimeLineID;
}
else
{
ptr = tliRecPtr;
tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
if (curFileTLI > 0 && tli < curFileTLI)
elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
(uint32) (ptr >> 32), (uint32) ptr,
tli, curFileTLI);
}
curFileTLI = tli;
RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
PrimarySlotName);
receivedUpto = 0;
}
/*
* Move to XLOG_FROM_STREAM state in either case. We'll
* get immediate failure if we didn't launch walreceiver,
* and move on to the next state.
*/
currentSource = XLOG_FROM_STREAM;
break;
case XLOG_FROM_STREAM:
/*
* Failure while streaming. Most likely, we got here
* because streaming replication was terminated, or
* promotion was triggered. But we also get here if we
* find an invalid record in the WAL streamed from master,
* in which case something is seriously wrong. There's
* little chance that the problem will just go away, but
* PANIC is not good for availability either, especially
* in hot standby mode. So, we treat that the same as
* disconnection, and retry from archive/pg_wal again. The
* WAL in the archive should be identical to what was
* streamed, so it's unlikely that it helps, but one can
* hope...
*/
/*
* Before we leave XLOG_FROM_STREAM state, make sure that
* walreceiver is not active, so that it won't overwrite
* WAL that we restore from archive.
*/
if (WalRcvStreaming())
ShutdownWalRcv();
/*
* Before we sleep, re-scan for possible new timelines if
* we were requested to recover to the latest timeline.
*/
if (recoveryTargetIsLatest)
{
if (rescanLatestTimeLine())
{
currentSource = XLOG_FROM_ARCHIVE;
break;
}
}
/*
* XLOG_FROM_STREAM is the last state in our state
* machine, so we've exhausted all the options for
* obtaining the requested WAL. We're going to loop back
* and retry from the archive, but if it hasn't been long
* since last attempt, sleep wal_retrieve_retry_interval
* milliseconds to avoid busy-waiting.
*/
now = GetCurrentTimestamp();
if (!TimestampDifferenceExceeds(last_fail_time, now,
wal_retrieve_retry_interval))
{
long secs,
wait_time;
int usecs;
TimestampDifference(last_fail_time, now, &secs, &usecs);
wait_time = wal_retrieve_retry_interval -
(secs * 1000 + usecs / 1000);
WaitLatch(&XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
wait_time, WAIT_EVENT_RECOVERY_WAL_STREAM);
ResetLatch(&XLogCtl->recoveryWakeupLatch);
now = GetCurrentTimestamp();
}
last_fail_time = now;
currentSource = XLOG_FROM_ARCHIVE;
break;
default:
elog(ERROR, "unexpected WAL source %d", currentSource);
}
}
else if (currentSource == XLOG_FROM_PG_WAL)
{
/*
* We just successfully read a file in pg_wal. We prefer files in
* the archive over ones in pg_wal, so try the next file again
* from the archive first.
*/
if (InArchiveRecovery)
currentSource = XLOG_FROM_ARCHIVE;
}
if (currentSource != oldSource)
elog(DEBUG2, "switched WAL source from %s to %s after %s",
xlogSourceNames[oldSource], xlogSourceNames[currentSource],
lastSourceFailed ? "failure" : "success");
/*
* We've now handled possible failure. Try to read from the chosen
* source.
*/
lastSourceFailed = false;
switch (currentSource)
{
case XLOG_FROM_ARCHIVE:
case XLOG_FROM_PG_WAL:
/* Close any old file we might have open. */
if (readFile >= 0)
{
close(readFile);
readFile = -1;
}
/* Reset curFileTLI if random fetch. */
if (randAccess)
curFileTLI = 0;
/*
* Try to restore the file from archive, or read an existing
* file from pg_wal.
*/
readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
currentSource);
if (readFile >= 0)
return true; /* success! */
/*
* Nope, not found in archive or pg_wal.
*/
lastSourceFailed = true;
break;
case XLOG_FROM_STREAM:
{
bool havedata;
/*
* Check if WAL receiver is still active.
*/
if (!WalRcvStreaming())
{
lastSourceFailed = true;
break;
}
/*
* Walreceiver is active, so see if new data has arrived.
*
* We only advance XLogReceiptTime when we obtain fresh
* WAL from walreceiver and observe that we had already
* processed everything before the most recent "chunk"
* that it flushed to disk. In steady state where we are
* keeping up with the incoming data, XLogReceiptTime will
* be updated on each cycle. When we are behind,
* XLogReceiptTime will not advance, so the grace time
* allotted to conflicting queries will decrease.
*/
if (RecPtr < receivedUpto)
havedata = true;
else
{
XLogRecPtr latestChunkStart;
receivedUpto = GetWalRcvWriteRecPtr(&latestChunkStart, &receiveTLI);
if (RecPtr < receivedUpto && receiveTLI == curFileTLI)
{
havedata = true;
if (latestChunkStart <= RecPtr)
{
XLogReceiptTime = GetCurrentTimestamp();
SetCurrentChunkStartTime(XLogReceiptTime);
}
}
else
havedata = false;
}
if (havedata)
{
/*
* Great, streamed far enough. Open the file if it's
* not open already. Also read the timeline history
* file if we haven't initialized timeline history
* yet; it should be streamed over and present in
* pg_wal by now. Use XLOG_FROM_STREAM so that source
* info is set correctly and XLogReceiptTime isn't
* changed.
*/
if (readFile < 0)
{
if (!expectedTLEs)
expectedTLEs = readTimeLineHistory(receiveTLI);
readFile = XLogFileRead(readSegNo, PANIC,
receiveTLI,
XLOG_FROM_STREAM, false);
Assert(readFile >= 0);
}
else
{
/* just make sure source info is correct... */
readSource = XLOG_FROM_STREAM;
XLogReceiptSource = XLOG_FROM_STREAM;
return true;
}
break;
}
/*
* Data not here yet. Check for trigger, then wait for
* walreceiver to wake us up when new WAL arrives.
*/
if (CheckForStandbyTrigger())
{
/*
* Note that we don't "return false" immediately here.
* After being triggered, we still want to replay all
* the WAL that was already streamed. It's in pg_wal
* now, so we just treat this as a failure, and the
* state machine will move on to replay the streamed
* WAL from pg_wal, and then recheck the trigger and
* exit replay.
*/
lastSourceFailed = true;
break;
}
/*
* Since we have replayed everything we have received so
* far and are about to start waiting for more WAL, let's
* tell the upstream server our replay location now so
* that pg_stat_replication doesn't show stale
* information.
*/
if (!streaming_reply_sent)
{
WalRcvForceReply();
streaming_reply_sent = true;
}
/*
* Wait for more WAL to arrive. Time out after 5 seconds
* to react to a trigger file promptly.
*/
WaitLatch(&XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
5000L, WAIT_EVENT_RECOVERY_WAL_ALL);
ResetLatch(&XLogCtl->recoveryWakeupLatch);
break;
}
default:
elog(ERROR, "unexpected WAL source %d", currentSource);
}
/*
* This possibly-long loop needs to handle interrupts of startup
* process.
*/
HandleStartupProcInterrupts();
}
return false; /* not reached */
这里决定XLOG从哪里获取,会在pg_wal目录, resotre_command, stream三个SOURCE之间轮询,失败跳到下一个SOURCE获取WAL。
参考
src/backend/access/transam/xlog.c
recovery.conf
#restore_command = '' # e.g. 'cp /mnt/server/archivedir/%f %p'
#primary_conninfo = '' # e.g. 'host=localhost port=5432'