PostgreSQL 9.3 adds a fast promote mode that skips the checkpoint at end of recovery


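Background

PostgreSQL will add a mode option to promote: -m smart | fast

1. In smart mode, when a standby is promoted, it must complete a checkpoint after recovery ends before it is activated.

2. In fast mode, when a standby is promoted, it does not wait for a checkpoint to finish after recovery ends. Instead it writes an XLOG_END_OF_RECOVERY record to the XLOG and is then activated.

The original commit message reads: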

pg_ctl promote -m fast will skip the checkpoint at end of recovery so that we  
can achieve very fast failover when the apply delay is low. Write new WAL record  
XLOG_END_OF_RECOVERY to allow us to switch timeline correctly for downstream log  
readers. If we skip synchronous end of recovery checkpoint we request a normal  
spread checkpoint so that the window of re-recovery is low.  
  
Simon Riggs and Kyotaro Horiguchi, with input from Fujii Masao.  
Review by Heikki Linnakangas  
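
To make the difference between the two modes concrete, here is a toy model in C of what each mode does at the end of recovery. It is based only on the commit message above. PromoteMode, PromotePlan and plan_promotion are made-up names for illustration and do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not PostgreSQL's own types. */
typedef enum { PROMOTE_SMART, PROMOTE_FAST } PromoteMode;

typedef struct
{
    bool waits_for_checkpoint;       /* synchronous end-of-recovery checkpoint */
    bool writes_end_of_recovery;     /* XLOG_END_OF_RECOVERY record is written */
    bool requests_spread_checkpoint; /* normal spread checkpoint requested later */
} PromotePlan;

/* Describe the end-of-recovery steps for the given promote mode. */
PromotePlan
plan_promotion(PromoteMode mode)
{
    PromotePlan plan = {false, false, false};

    assert(mode == PROMOTE_SMART || mode == PROMOTE_FAST);

    if (mode == PROMOTE_FAST)
    {
        /* fast: skip the synchronous checkpoint, mark the end of recovery */
        plan.writes_end_of_recovery = true;
        plan.requests_spread_checkpoint = true;
    }
    else
    {
        /* smart: activation waits until the checkpoint has completed */
        plan.waits_for_checkpoint = true;
    }
    return plan;
}
```

Calling plan_promotion(PROMOTE_FAST) yields a plan that does not wait for a checkpoint but still requests one afterwards, which is what keeps the re-recovery window small.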

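Notes

1. In either mode, promotion has to wait until all XLOG already received has been replayed. So if replay on the standby lags far behind the amount of XLOG received, fast mode will not be much faster either.

2. The WAL receiver process is shut down from within the logic of the process that applies XLOG (the startup process), as shown below: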

src/backend/access/transam/xlog.c

/*  
 * Check to see whether the user-specified trigger file exists and whether a  
 * promote request has arrived.  If either condition holds, return true.  
 */  
static bool  
CheckForStandbyTrigger(void)  
{  
        struct stat stat_buf;  
        static bool triggered = false;  
  
        if (triggered)  
                return true;  
  
        if (IsPromoteTriggered())  
        {  
                ereport(LOG,  
                                (errmsg("received promote request")));  
                ResetPromoteTriggered();  
                triggered = true;  
                return true;  
        }  
  
        if (TriggerFile == NULL)  
                return false;  
  
        if (stat(TriggerFile, &stat_buf) == 0)  
        {  
                ereport(LOG,  
                                (errmsg("trigger file found: %s", TriggerFile)));  
                unlink(TriggerFile);  
                triggered = true;  
                return true;  
        }  
        return false;  
}  

So once the trigger file is detected, or promote_triggered is true (that is, a promote request from pg_ctl has been received), the WAL receiver process is shut down.
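
The trigger-file half of CheckForStandbyTrigger can be tried outside PostgreSQL. The following is a minimal standalone sketch, not the server code: TriggerState, check_trigger and create_trigger are hypothetical names, and the sticky flag lives in a struct rather than in a function-local static so that the behaviour is easy to test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical stand-in for the TriggerFile setting plus the sticky flag. */
typedef struct
{
    const char *path;      /* NULL means no trigger file is configured */
    bool        triggered; /* stays true once a trigger has been seen */
} TriggerState;

/* Return true once the trigger file has been seen; the result is sticky. */
bool
check_trigger(TriggerState *ts)
{
    struct stat st;

    assert(ts != NULL);

    if (ts->triggered)
        return true;
    if (ts->path == NULL)
        return false;
    if (stat(ts->path, &st) == 0)
    {
        unlink(ts->path);   /* consume the request, as the excerpt does */
        ts->triggered = true;
        return true;
    }
    return false;
}

/* Operator side: request promotion by creating an empty trigger file. */
bool
create_trigger(const char *path)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return false;
    return fclose(f) == 0;
}
```

After create_trigger has been called, the first check_trigger call removes the file and returns true, and every later call keeps returning true even though the file is gone.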

src/backend/access/transam/xlog.c

/*  
 * In standby mode, wait for WAL at position 'RecPtr' to become available, either  
 * via restore_command succeeding to restore the segment, or via walreceiver  
 * having streamed the record (or via someone copying the segment directly to  
 * pg_xlog, but that is not documented or recommended).  
 *  
 * If 'fetching_ckpt' is true, we're fetching a checkpoint record, and should  
 * prepare to read WAL starting from RedoStartLSN after this.  
 *  
 * 'RecPtr' might not point to the beginning of the record we're interested  
 * in, it might also point to the page or segment header. In that case,  
 * 'tliRecPtr' is the position of the WAL record we're interested in. It is  
 * used to decide which timeline to stream the requested WAL from.  
 *  
 * When the requested record becomes available, the function opens the file  
 * containing it (if not open already), and returns true. When end of standby  
 * mode is triggered by the user, and there is no more WAL available, returns  
 * false.  
 */  
static bool  
WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,  
                                                        bool fetching_ckpt, XLogRecPtr tliRecPtr)  
{  
        static pg_time_t last_fail_time = 0;  
        pg_time_t now;  
  
        /*-------  
         * Standby mode is implemented by a state machine:  
         *  
         * 1. Read from archive (XLOG_FROM_ARCHIVE)  
         * 2. Read from pg_xlog (XLOG_FROM_PG_XLOG)  
         * 3. Check trigger file  
         * 4. Read from primary server via walreceiver (XLOG_FROM_STREAM)  
         * 5. Rescan timelines  
         * 6. Sleep 5 seconds, and loop back to 1.  
         *  
         * Failure to read from the current source advances the state machine to  
         * the next state. In addition, successfully reading a file from pg_xlog  
         * moves the state machine from state 2 back to state 1 (we always prefer  
         * files in the archive over files in pg_xlog).  
         *  
         * 'currentSource' indicates the current state. There are no currentSource  
         * values for "check trigger", "rescan timelines", and "sleep" states,  
         * those actions are taken when reading from the previous source fails, as  
         * part of advancing to the next state.  
         *-------  
         */  
....  
(omitted)  
                                case XLOG_FROM_PG_XLOG:  
                                        /*  
                                         * Check to see if the trigger file exists. Note that we do  
                                         * this only after failure, so when you create the trigger  
                                         * file, we still finish replaying as much as we can from  
                                         * archive and pg_xlog before failover.  
                                         */  
                                        if (CheckForStandbyTrigger())  
                                        {  
                                                ShutdownWalRcv();  
                                                return false;  
                                        }  
(omitted)  
...  

src/backend/postmaster/startup.c

void  
ResetPromoteTriggered(void)  
{  
        promote_triggered = false;  
}  

References

1. https://github.com/postgres/postgres/commit/fd4ced5230162b50a5c9d33b4bf9cfb1231aa62e

2. src/bin/pg_ctl/pg_ctl.c

 printf(_("\nPromotion modes are:\n"));  
 printf(_("  smart       promote after performing a checkpoint\n"));  
 printf(_("  fast        promote quickly without waiting for checkpoint completion\n"));  

...  

static pgpid_t  
get_pgpid(void)  
{  
   FILE       *pidf;  
   long        pid;  
 
   pidf = fopen(pid_file, "r");  
   if (pidf == NULL)  
   {  
       /* No pid file, not an error on startup */  
       if (errno == ENOENT)  
           return 0;  
       else  
       {  
           write_stderr(_("%s: could not open PID file \"%s\": %s\n"),  
                        progname, pid_file, strerror(errno));  
           exit(1);  
       }  
   }  
   if (fscanf(pidf, "%ld", &pid) != 1)  
   {  
       /* Is the file empty? */  
       if (ftell(pidf) == 0 && feof(pidf))  
           write_stderr(_("%s: the PID file \"%s\" is empty\n"),  
                        progname, pid_file);  
       else  
           write_stderr(_("%s: invalid data in PID file \"%s\"\n"),  
                        progname, pid_file);  
       exit(1);  
   }  
   fclose(pidf);  
   return (pgpid_t) pid;  
}  

...  

/*  
* promote  
*/  
 
static void  
do_promote(void)  
{  
   FILE       *prmfile;  
   pgpid_t     pid;  
   struct stat statbuf;  
 
   pid = get_pgpid();  
 
   if (pid == 0)               /* no pid file */  
   {  
       write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file);  
       write_stderr(_("Is server running?\n"));  
       exit(1);  
   }  
   else if (pid < 0)           /* standalone backend, not postmaster */  
   {  
       pid = -pid;  
       write_stderr(_("%s: cannot promote server; "  
                      "single-user server is running (PID: %ld)\n"),  
                    progname, pid);  
       exit(1);  
   }  
 
   /* If recovery.conf doesn't exist, the server is not in standby mode */  
   if (stat(recovery_file, &statbuf) != 0)  
   {  
       write_stderr(_("%s: cannot promote server; "  
                      "server is not in standby mode\n"),  
                    progname);  
       exit(1);  
   }  
 
   /*  
    * Use two different kinds of promotion file so we can understand  
    * the difference between smart and fast promotion.  
    */  
   if (shutdown_mode >= FAST_MODE)  
       snprintf(promote_file, MAXPGPATH, "%s/fast_promote", pg_data);  
   else  
       snprintf(promote_file, MAXPGPATH, "%s/promote", pg_data);  
 
   if ((prmfile = fopen(promote_file, "w")) == NULL)  
   {  
       write_stderr(_("%s: could not create promote signal file \"%s\": %s\n"),  
                    progname, promote_file, strerror(errno));  
       exit(1);  
   }  
   if (fclose(prmfile))  
   {  
       write_stderr(_("%s: could not write promote signal file \"%s\": %s\n"),  
                    progname, promote_file, strerror(errno));  
       exit(1);  
   }  
 
   sig = SIGUSR1;  
   if (kill((pid_t) pid, sig) != 0)  
   {  
       write_stderr(_("%s: could not send promote signal (PID: %ld): %s\n"),  
                    progname, pid, strerror(errno));  
       if (unlink(promote_file) != 0)  
           write_stderr(_("%s: could not remove promote signal file \"%s\": %s\n"),  
                        progname, promote_file, strerror(errno));  
       exit(1);  
   }  
 
   print_msg(_("server promoting\n"));  
}  
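
The part of do_promote that distinguishes the two modes is just the name of the signal file. Below is a small standalone sketch of that choice. promote_signal_path is a hypothetical helper, not a pg_ctl function, and unlike the excerpt it reports truncation instead of ignoring it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Build "<data_dir>/fast_promote" or "<data_dir>/promote" in buf.
 * Returns false if the path does not fit into buflen bytes.
 */
bool
promote_signal_path(char *buf, size_t buflen, const char *data_dir, bool fast)
{
    int n;

    assert(buf != NULL && data_dir != NULL);

    n = snprintf(buf, buflen, "%s/%s", data_dir,
                 fast ? "fast_promote" : "promote");
    return n >= 0 && (size_t) n < buflen;
}
```

With a data directory of /data, fast mode produces /data/fast_promote and smart mode produces /data/promote. The postmaster then only needs to check which of the two files exists.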

3. src/backend/postmaster/postmaster.c

/*  
* sigusr1_handler - handle signal conditions from child processes  
*/  
static void  
sigusr1_handler(SIGNAL_ARGS)  
{  
   int         save_errno = errno;  
 
   PG_SETMASK(&BlockSig);  
 
   /*  
    * RECOVERY_STARTED and BEGIN_HOT_STANDBY signals are ignored in  
    * unexpected states. If the startup process quickly starts up, completes  
    * recovery, exits, we might process the death of the startup process  
    * first. We don't want to go back to recovery in that case.  
    */  
   if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) &&  
       pmState == PM_STARTUP && Shutdown == NoShutdown)  
   {  
       /* WAL redo has started. We're out of reinitialization. */  
       FatalError = false;  
 
       /*  
        * Crank up the background tasks.  It doesn't matter if this fails,  
        * we'll just try again later.  
        */  
       Assert(CheckpointerPID == 0);  
       CheckpointerPID = StartCheckpointer();  
       Assert(BgWriterPID == 0);  
       BgWriterPID = StartBackgroundWriter();  
 
       pmState = PM_RECOVERY;  
   }  
   if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&  
       pmState == PM_RECOVERY && Shutdown == NoShutdown)  
   {  
       /*  
        * Likewise, start other special children as needed.  
        */  
       Assert(PgStatPID == 0);  
       PgStatPID = pgstat_start();  
 
       ereport(LOG,  
       (errmsg("database system is ready to accept read only connections")));  
 
       pmState = PM_HOT_STANDBY;  
 
       /* Some workers may be scheduled to start now */  
       StartOneBackgroundWorker();  
   }  
 
   if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&  
       PgArchPID != 0)  
   {  
       /*  
        * Send SIGUSR1 to archiver process, to wake it up and begin archiving  
        * next transaction log file.  
        */  
       signal_child(PgArchPID, SIGUSR1);  
   }  
 
   if (CheckPostmasterSignal(PMSIGNAL_ROTATE_LOGFILE) &&  
       SysLoggerPID != 0)  
   {  
       /* Tell syslogger to rotate logfile */  
       signal_child(SysLoggerPID, SIGUSR1);  
   }  
 
   if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER) &&  
       Shutdown == NoShutdown)  
   {  
       /*  
        * Start one iteration of the autovacuum daemon, even if autovacuuming  
        * is nominally not enabled.  This is so we can have an active defense  
        * against transaction ID wraparound.  We set a flag for the main loop  
        * to do it rather than trying to do it here --- this is because the  
        * autovac process itself may send the signal, and we want to handle  
        * that by launching another iteration as soon as the current one  
        * completes.  
        */  
       start_autovac_launcher = true;  
   }  
 
   if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER) &&  
       Shutdown == NoShutdown)  
   {  
       /* The autovacuum launcher wants us to start a worker process. */  
       StartAutovacuumWorker();  
   }  
 
   if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) &&  
       WalReceiverPID == 0 &&  
       (pmState == PM_STARTUP || pmState == PM_RECOVERY ||  
        pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&  
       Shutdown == NoShutdown)  
   {  
       /* Startup Process wants us to start the walreceiver process. */  
       WalReceiverPID = StartWalReceiver();  
   }  
 
   if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE) &&  
       (pmState == PM_WAIT_BACKUP || pmState == PM_WAIT_BACKENDS))  
   {  
       /* Advance postmaster's state machine */  
       PostmasterStateMachine();  
   }  
 
   if (CheckPromoteSignal() && StartupPID != 0 &&  
       (pmState == PM_STARTUP || pmState == PM_RECOVERY ||  
        pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY))  
   {  
       /* Tell startup process to finish recovery */  
       signal_child(StartupPID, SIGUSR2);  
   }  
 
   PG_SETMASK(&UnBlockSig);  
 
   errno = save_errno;  
}  

4. src/backend/access/transam/xlog.c

/*  
* Check to see if a promote request has arrived. Should be  
* called by postmaster after receiving SIGUSR1.  
*/  
bool  
CheckPromoteSignal(void)  
{  
   struct stat stat_buf;  
 
   if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0 ||  
       stat(FAST_PROMOTE_SIGNAL_FILE, &stat_buf) == 0)  
       return true;  
 
   return false;  
}  

5. src/backend/postmaster/startup.c

/* SIGUSR2: set flag to finish recovery */  
static void  
StartupProcTriggerHandler(SIGNAL_ARGS)  
{  
    int         save_errno = errno;  

    promote_triggered = true;  
    WakeupRecovery();  

    errno = save_errno;  
}  
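
The handler above follows the usual pattern of doing almost nothing in signal context: set a flag and let the main loop act on it. Here is a self-contained sketch of that pattern. All names are hypothetical, and it uses plain ISO C signal() for brevity where PostgreSQL uses its own signal setup.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Flag written by the handler; sig_atomic_t keeps the write signal-safe. */
static volatile sig_atomic_t promote_requested = 0;

/* SIGUSR2 handler: only records that promotion was requested. */
static void
on_sigusr2(int signo)
{
    (void) signo;
    promote_requested = 1;
}

/* Install the handler; returns false if it could not be installed. */
bool
install_promote_handler(void)
{
    return signal(SIGUSR2, on_sigusr2) != SIG_ERR;
}

/* Polled by the main loop, like IsPromoteTriggered() in the excerpt. */
bool
promote_was_requested(void)
{
    return promote_requested != 0;
}

/* Clears the flag, like ResetPromoteTriggered() in the excerpt. */
void
reset_promote_request(void)
{
    promote_requested = 0;
}
```

In a single-threaded program, raise(SIGUSR2) runs the handler before it returns, so promote_was_requested() is true immediately afterwards.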

6. src/backend/access/transam/xlog.c

   /*  
    * recoveryWakeupLatch is used to wake up the startup process to continue  
    * WAL replay, if it is waiting for WAL to arrive or failover trigger file  
    * to appear.  
    */  
   Latch       recoveryWakeupLatch;  

...  

/*  
* Wake up startup process to replay newly arrived WAL, or to notice that  
* failover has been requested.  
*/  
void  
WakeupRecovery(void)  
{  
   SetLatch(&XLogCtl->recoveryWakeupLatch);  
}  

7. src/include/storage/latch.h

/*  
* Latch structure should be treated as opaque and only accessed through  
* the public functions. It is defined here to allow embedding Latches as  
* part of bigger structs.  
*/  
typedef struct  
{  
   sig_atomic_t is_set;  
   bool        is_shared;  
   int         owner_pid;  
#ifdef WIN32  
   HANDLE      event;  
#endif  
} Latch;  


Digoal.zhou
