whether pg_start_backup() force keep xlog in pg_xlog directory before execute pg_stop_backup() ?

7 minute read

背景

今天在群里讨论的一个问题 :

开了pg_start_backup() , 但是忘记使用pg_stop_backup()来关闭备份, XLOG会怎么样?

答案是

xlog不会怎么样,不会一直保持不做rotate,也就是说PG_XLOG目录不会一直膨胀。

但是需要注意的是, 数据库不允许同时有多个客户端执行pg_start_backup(), 所以要想再次执行必须先用pg_stop_backup()来停止.

我对pg_start_backup()的理解是,既然开始备份了,当然要保证这个备份是可以恢复的,所以当然需要用来恢复的XLOG是要保持住的,测试结果是,不会保持。

可能postgres不想在这个里面增加复杂度,他允许你在没有开启archive的时候使用pg_start_backup(),因为PG还有一个wal_keep_segments 参数是用来配置需要保持多少个segments的,所以只要KEEP足够,在备份完数据文件后再备份一下XLOG目录就可以了。

另外需要提的是,pg_start_backup()开始后,会再数据文件目录创建一个标签文件,另外在XLOG里面也会生成一个标签文件。直到pg_stop_backup()被调用,标签文件被清掉。

正文

一次测试结果

postgres@db-172-16-3-150-> ll  
total 2.3G  
-rw------- 1 postgres postgres 285 Oct 24 12:59 00000001000000050000003B.00000020.backup  
-rw------- 1 postgres postgres 64M Nov 30 14:08 000000010000002400000019  
-rw------- 1 postgres postgres 64M Nov 30 14:08 00000001000000240000001A  
-rw------- 1 postgres postgres 64M Nov 30 14:08 00000001000000240000001B  
-rw------- 1 postgres postgres 64M Nov 30 14:08 00000001000000240000001C  
-rw------- 1 postgres postgres 64M Nov 30 14:08 00000001000000240000001D  
-rw------- 1 postgres postgres 64M Nov 30 14:08 00000001000000240000001E  
-rw------- 1 postgres postgres 64M Nov 30 14:09 00000001000000240000001F  
-rw------- 1 postgres postgres 64M Nov 30 14:09 000000010000002400000020  
-rw------- 1 postgres postgres 64M Nov 30 14:09 000000010000002400000021  
-rw------- 1 postgres postgres 64M Nov 30 14:09 000000010000002400000022  
-rw------- 1 postgres postgres 64M Nov 30 14:09 000000010000002400000023  
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000024  
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000025  
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000026  
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000027  
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000028  
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000029  
-rw------- 1 postgres postgres 64M Nov 30 14:10 00000001000000240000002A  
-rw------- 1 postgres postgres 64M Nov 30 14:10 00000001000000240000002B  
-rw------- 1 postgres postgres 64M Nov 30 14:10 00000001000000240000002C  
-rw------- 1 postgres postgres 64M Nov 30 14:10 00000001000000240000002D  
-rw------- 1 postgres postgres 64M Nov 30 14:12 00000001000000240000002E  
-rw------- 1 postgres postgres 64M Nov 30 14:12 00000001000000240000002F  
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000030  
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000031  
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000032  
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000033  
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000034  
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000035  
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000036  
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000037  
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000038  
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000039  
-rw------- 1 postgres postgres 64M Nov 30 14:07 00000001000000240000003A  
-rw------- 1 postgres postgres 64M Nov 30 14:07 00000001000000240000003B  
-rw------- 1 postgres postgres 64M Nov 30 14:07 00000001000000240000003C  
drwx------ 2 postgres postgres 12K Oct 25 10:42 archive_status  

从上面的结果看显然pg_start_backup()没有强制KEEP xlog。

start_backup执行完之后产生的第一个XLOG 00000001000000050000003B.00000020.backup 已经被rotate掉了.

在PostgreSQL里面还分两种类型的数据文件级备份 :

/*  
 * do_pg_start_backup is the workhorse of the user-visible pg_start_backup()  
 * function. It creates the necessary starting checkpoint and constructs the  
 * backup label file.  
 *  
 * There are two kind of backups: exclusive and non-exclusive. An exclusive  
 * backup is started with pg_start_backup(), and there can be only one active  
 * at a time. The backup label file of an exclusive backup is written to  
 * $PGDATA/backup_label, and it is removed by pg_stop_backup().  
 *  
 * A non-exclusive backup is used for the streaming base backups (see  
 * src/backend/replication/basebackup.c). The difference to exclusive backups  
 * is that the backup label file is not written to disk. Instead, its would-be  
 * contents are returned in *labelfile, and the caller is responsible for  
 * including it in the backup archive as 'backup_label'. There can be many  
 * non-exclusive backups active at the same time, and they don't conflict  
 * with an exclusive backup either.  
 *  
 * Every successfully started non-exclusive backup must be stopped by calling  
 * do_pg_stop_backup() or do_pg_abort_backup().  
 */  

一种就是人机交互的模式如调用pg_start_backup(). 它不允许同时有多个客户端在执行pg_start_backup().

另一种是流复制用到的模式, 如pg_basebackup, 它允许同时使用, 就是说可以同时有多个客户端在执行pg_basebackup.

        /*  
         * exclusiveBackup is true if a backup started with pg_start_backup() is  
         * in progress, and nonExclusiveBackups is a counter indicating the number  
         * of streaming base backups currently in progress. forcePageWrites is set  
         * to true when either of these is non-zero. lastBackupStart is the latest  
         * checkpoint redo location used as a starting point for an online backup.  
         */  

最后人机交互的pg_start_backup()干了哪几件比较重要的事情呢?

1. 写备份标签文件到$PGDATA, 它包含了哪些信息呢?

The label file contains the user-supplied label string (typically this would be used to tell where the backup dump will be stored) and the starting time and starting WAL location for the dump.  
  
pg_start_backup(PG_FUNCTION_ARGS)  
{  
        text       *backupid = PG_GETARG_TEXT_P(0);  
        bool            fast = PG_GETARG_BOOL(1);  
        char       *backupidstr;  
        XLogRecPtr      startpoint;  
        char            startxlogstr[MAXFNAMELEN];  
  
        backupidstr = text_to_cstring(backupid);  
  
        startpoint = do_pg_start_backup(backupidstr, fast, NULL);  
  
        snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",  
                         startpoint.xlogid, startpoint.xrecoff);  
        PG_RETURN_TEXT_P(cstring_to_text(startxlogstr));  
}  

2. checkpoint之前强制切换xlog文件确保checkpoint之后的wal文件不包含前面的timeline IDs. timeline我们在讲主备角色切换的时候接触过, 备库激活后timeline会自增1, 详细的解释如下 :

       /*  
         * Force an XLOG file switch before the checkpoint, to ensure that the WAL  
         * segment the checkpoint is written to doesn't contain pages with old  
         * timeline IDs. That would otherwise happen if you called  
         * pg_start_backup() right after restoring from a PITR archive: the first  
         * WAL segment containing the startup checkpoint has pages in the  
         * beginning with the old timeline ID. That can cause trouble at recovery:  
         * we won't have a history file covering the old timeline if pg_xlog  
         * directory was not included in the base backup and the WAL archive was  
         * cleared too before starting the backup.  
         */  
        RequestXLogSwitch();  

3. 在shared memory中激活备份标记, 打开full-page WAL writes, 然后强制checkpoint, 把shared buffer里面的脏页面刷到磁盘上.

最终的目的是确保我们的数据文件备份和xlog备份可用于恢复数据库, 而不会有不一致的块(因为在拷贝的过程中BLOCK可能在变化, 可能拷贝了一个BLOCK的前半部分是改变前的, 后半部分是改变后的, 因此需要full-page WAL write来恢复这种块). 具体的解释如下 :

	/*  
         * Mark backup active in shared memory.  We must do full-page WAL writes  
         * during an on-line backup even if not doing so at other times, because  
         * it's quite possible for the backup dump to obtain a "torn" (partially  
         * written) copy of a database page if it reads the page concurrently with  
         * our write to the same page.  This can be fixed as long as the first  
         * write to the page in the WAL sequence is a full-page write. Hence, we  
         * turn on forcePageWrites and then force a CHECKPOINT, to ensure there  
         * are no dirty pages in shared memory that might get dumped while the  
         * backup is in progress without having a corresponding WAL record.  (Once  
         * the backup is complete, we need not force full-page writes anymore,  
         * since we expect that any pages not modified during the backup interval  
         * must have been correctly captured by the backup.)  
         *  
         * We must hold WALInsertLock to change the value of forcePageWrites, to  
         * ensure adequate interlocking against XLogInsert().  
         */  
        LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);  

那么为什么一定要在checkpoint前打开full-page WAL write呢?

checkpoint之后, BLOCK在第一次被改变的时候, 是写full page到XLOG文件的, 但是前提是full_page_writes参数=on.

因此如果我们的参数中full_page_writes=off了. 如果仅仅是checkpoint , 就不能确保 checkpoint之后, BLOCK在第一次被改变的时候, 是写full page到XLOG文件的 了.

所以pg_start_backup()中设计了一个打开full-page WAL write的过程(XLogCtl->Insert.forcePageWrites = true;). 并且这个过程必须在checkpoint发出之前. 而且full-page write要hold住直到备份结束.

从xlog的doPageWrites也可以看出在写page到WAL文件之前是需要先看看full_page_writes的值或者是Insert->forcePageWrites标记的值, 来判断此次写是否要full-page write.

/*  
         * Decide if we need to do full-page writes in this XLOG record: true if  
         * full_page_writes is on or we have a PITR request for it.  Since we  
         * don't yet have the insert lock, forcePageWrites could change under us,  
         * but we'll recheck it once we have the lock.  
         */  
        doPageWrites = fullPageWrites || Insert->forcePageWrites;  

其他

shared memory中用于XLOG控制的数据结构 :

/*----------  
 * Shared-memory data structures for XLOG control  
 *  
 * LogwrtRqst indicates a byte position that we need to write and/or fsync  
 * the log up to (all records before that point must be written or fsynced).  
 * LogwrtResult indicates the byte positions we have already written/fsynced.  
 * These structs are identical but are declared separately to indicate their  
 * slightly different functions.  
 *  
 * We do a lot of pushups to minimize the amount of access to lockable  
 * shared memory values.  There are actually three shared-memory copies of  
 * LogwrtResult, plus one unshared copy in each backend.  Here's how it works:  
 *              XLogCtl->LogwrtResult is protected by info_lck  
 *              XLogCtl->Write.LogwrtResult is protected by WALWriteLock  
 *              XLogCtl->Insert.LogwrtResult is protected by WALInsertLock  
 * One must hold the associated lock to read or write any of these, but  
 * of course no lock is needed to read/write the unshared LogwrtResult.  
 *  
 * XLogCtl->LogwrtResult and XLogCtl->Write.LogwrtResult are both "always  
 * right", since both are updated by a write or flush operation before  
 * it releases WALWriteLock.  The point of keeping XLogCtl->Write.LogwrtResult  
 * is that it can be examined/modified by code that already holds WALWriteLock  
 * without needing to grab info_lck as well.  
 *  
 * XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,  
 * but is updated when convenient.      Again, it exists for the convenience of  
 * code that is already holding WALInsertLock but not the other locks.  
 *  
 * The unshared LogwrtResult may lag behind any or all of these, and again  
 * is updated when convenient.  
 *  
 * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst  
 * (protected by info_lck), but we don't need to cache any copies of it.  
 *  
 * Note that this all works because the request and result positions can only  
 * advance forward, never back up, and so we can easily determine which of two  
 * values is "more up to date".  
 *  
 * info_lck is only held long enough to read/update the protected variables,  
 * so it's a plain spinlock.  The other locks are held longer (potentially  
 * over I/O operations), so we use LWLocks for them.  These locks are:  
 *  
 * WALInsertLock: must be held to insert a record into the WAL buffers.  
 *  
 * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or  
 * XLogFlush).  
 *  
 * ControlFileLock: must be held to read/update control file or create  
 * new log file.  
 *  
 * CheckpointLock: must be held to do a checkpoint or restartpoint (ensures  
 * only one checkpointer at a time; currently, with all checkpoints done by  
 * the bgwriter, this is just pro forma).  
 *  
 *----------  
 */  

参考

src/backend/access/transam/xlog.c

src/backend/replication/basebackup.c

Flag Counter

digoal’s 大量PostgreSQL文章入口