whether pg_start_backup() force keep xlog in pg_xlog directory before execute pg_stop_backup() ?
背景
今天在群里讨论的一个问题 :
开了pg_start_backup() , 但是忘记使用pg_stop_backup()来关闭备份, XLOG会怎么样?
答案是
xlog不会怎么样,不会一直保持不做rotate,也就是说PG_XLOG目录不会一直膨胀。
但是需要注意的是, 数据库不允许同时有多个客户端执行pg_start_backup(), 所以要想再次执行必须先用pg_stop_backup()来停止.
我对pg_start_backup()的理解是,既然开始备份了,当然要保证这个备份是可以恢复的,所以当然需要用来恢复的XLOG是要保持住的,测试结果是,不会保持。
可能postgres不想在这个里面增加复杂度,他允许你在没有开启archive的时候使用pg_start_backup(),因为PG还有一个wal_keep_segments 参数是用来配置需要保持多少个segments的,所以只要KEEP足够,在备份完数据文件后再备份一下XLOG目录就可以了。
另外需要提的是,pg_start_backup()开始后,会再数据文件目录创建一个标签文件,另外在XLOG里面也会生成一个标签文件。直到pg_stop_backup()被调用,标签文件被清掉。
正文
一次测试结果
postgres@db-172-16-3-150-> ll
total 2.3G
-rw------- 1 postgres postgres 285 Oct 24 12:59 00000001000000050000003B.00000020.backup
-rw------- 1 postgres postgres 64M Nov 30 14:08 000000010000002400000019
-rw------- 1 postgres postgres 64M Nov 30 14:08 00000001000000240000001A
-rw------- 1 postgres postgres 64M Nov 30 14:08 00000001000000240000001B
-rw------- 1 postgres postgres 64M Nov 30 14:08 00000001000000240000001C
-rw------- 1 postgres postgres 64M Nov 30 14:08 00000001000000240000001D
-rw------- 1 postgres postgres 64M Nov 30 14:08 00000001000000240000001E
-rw------- 1 postgres postgres 64M Nov 30 14:09 00000001000000240000001F
-rw------- 1 postgres postgres 64M Nov 30 14:09 000000010000002400000020
-rw------- 1 postgres postgres 64M Nov 30 14:09 000000010000002400000021
-rw------- 1 postgres postgres 64M Nov 30 14:09 000000010000002400000022
-rw------- 1 postgres postgres 64M Nov 30 14:09 000000010000002400000023
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000024
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000025
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000026
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000027
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000028
-rw------- 1 postgres postgres 64M Nov 30 14:10 000000010000002400000029
-rw------- 1 postgres postgres 64M Nov 30 14:10 00000001000000240000002A
-rw------- 1 postgres postgres 64M Nov 30 14:10 00000001000000240000002B
-rw------- 1 postgres postgres 64M Nov 30 14:10 00000001000000240000002C
-rw------- 1 postgres postgres 64M Nov 30 14:10 00000001000000240000002D
-rw------- 1 postgres postgres 64M Nov 30 14:12 00000001000000240000002E
-rw------- 1 postgres postgres 64M Nov 30 14:12 00000001000000240000002F
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000030
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000031
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000032
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000033
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000034
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000035
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000036
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000037
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000038
-rw------- 1 postgres postgres 64M Nov 30 14:12 000000010000002400000039
-rw------- 1 postgres postgres 64M Nov 30 14:07 00000001000000240000003A
-rw------- 1 postgres postgres 64M Nov 30 14:07 00000001000000240000003B
-rw------- 1 postgres postgres 64M Nov 30 14:07 00000001000000240000003C
drwx------ 2 postgres postgres 12K Oct 25 10:42 archive_status
从上面的结果看显然pg_start_backup()没有强制KEEP xlog。
start_backup执行完之后产生的第一个XLOG 00000001000000050000003B.00000020.backup 已经被rotate掉了.
在PostgreSQL里面还分两种类型的数据文件级备份 :
/*
* do_pg_start_backup is the workhorse of the user-visible pg_start_backup()
* function. It creates the necessary starting checkpoint and constructs the
* backup label file.
*
* There are two kind of backups: exclusive and non-exclusive. An exclusive
* backup is started with pg_start_backup(), and there can be only one active
* at a time. The backup label file of an exclusive backup is written to
* $PGDATA/backup_label, and it is removed by pg_stop_backup().
*
* A non-exclusive backup is used for the streaming base backups (see
* src/backend/replication/basebackup.c). The difference to exclusive backups
* is that the backup label file is not written to disk. Instead, its would-be
* contents are returned in *labelfile, and the caller is responsible for
* including it in the backup archive as 'backup_label'. There can be many
* non-exclusive backups active at the same time, and they don't conflict
* with an exclusive backup either.
*
* Every successfully started non-exclusive backup must be stopped by calling
* do_pg_stop_backup() or do_pg_abort_backup().
*/
一种就是人机交互的模式如调用pg_start_backup(). 它不允许同时有多个客户端在执行pg_start_backup().
另一种是流复制用到的模式, 如pg_basebackup, 它允许同时使用, 就是说可以同时有多个客户端在执行pg_basebackup.
/*
* exclusiveBackup is true if a backup started with pg_start_backup() is
* in progress, and nonExclusiveBackups is a counter indicating the number
* of streaming base backups currently in progress. forcePageWrites is set
* to true when either of these is non-zero. lastBackupStart is the latest
* checkpoint redo location used as a starting point for an online backup.
*/
最后人机交互的pg_start_backup()干了哪几件比较重要的事情呢?
1. 写备份标签文件到$PGDATA, 它包含了哪些信息呢?
The label file contains the user-supplied label string (typically this would be used to tell where the backup dump will be stored) and the starting time and starting WAL location for the dump.
pg_start_backup(PG_FUNCTION_ARGS)
{
text *backupid = PG_GETARG_TEXT_P(0);
bool fast = PG_GETARG_BOOL(1);
char *backupidstr;
XLogRecPtr startpoint;
char startxlogstr[MAXFNAMELEN];
backupidstr = text_to_cstring(backupid);
startpoint = do_pg_start_backup(backupidstr, fast, NULL);
snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",
startpoint.xlogid, startpoint.xrecoff);
PG_RETURN_TEXT_P(cstring_to_text(startxlogstr));
}
2. checkpoint之前强制切换xlog文件确保checkpoint之后的wal文件不包含前面的timeline IDs. timeline我们在讲主备角色切换的时候接触过, 备库激活后timeline会自增1, 详细的解释如下 :
/*
* Force an XLOG file switch before the checkpoint, to ensure that the WAL
* segment the checkpoint is written to doesn't contain pages with old
* timeline IDs. That would otherwise happen if you called
* pg_start_backup() right after restoring from a PITR archive: the first
* WAL segment containing the startup checkpoint has pages in the
* beginning with the old timeline ID. That can cause trouble at recovery:
* we won't have a history file covering the old timeline if pg_xlog
* directory was not included in the base backup and the WAL archive was
* cleared too before starting the backup.
*/
RequestXLogSwitch();
3. 在shared memory中激活备份标记, 打开full-page WAL writes, 然后强制checkpoint, 把shared buffer里面的脏页面刷到磁盘上.
最终的目的是确保我们的数据文件备份和xlog备份可用于恢复数据库, 而不会有不一致的块(因为在拷贝的过程中BLOCK可能在变化, 可能拷贝了一个BLOCK的前半部分是改变前的, 后半部分是改变后的, 因此需要full-page WAL write来恢复这种块). 具体的解释如下 :
/*
* Mark backup active in shared memory. We must do full-page WAL writes
* during an on-line backup even if not doing so at other times, because
* it's quite possible for the backup dump to obtain a "torn" (partially
* written) copy of a database page if it reads the page concurrently with
* our write to the same page. This can be fixed as long as the first
* write to the page in the WAL sequence is a full-page write. Hence, we
* turn on forcePageWrites and then force a CHECKPOINT, to ensure there
* are no dirty pages in shared memory that might get dumped while the
* backup is in progress without having a corresponding WAL record. (Once
* the backup is complete, we need not force full-page writes anymore,
* since we expect that any pages not modified during the backup interval
* must have been correctly captured by the backup.)
*
* We must hold WALInsertLock to change the value of forcePageWrites, to
* ensure adequate interlocking against XLogInsert().
*/
LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
那么为什么一定要在checkpoint前打开full-page WAL write呢?
checkpoint之后, BLOCK在第一次被改变的时候, 是写full page到XLOG文件的, 但是前提是full_page_writes参数=on.
因此如果我们的参数中full_page_writes=off了. 如果仅仅是checkpoint , 就不能确保 checkpoint之后, BLOCK在第一次被改变的时候, 是写full page到XLOG文件的 了.
所以pg_start_backup()中设计了一个打开full-page WAL write的过程(XLogCtl->Insert.forcePageWrites = true;). 并且这个过程必须在checkpoint发出之前. 而且full-page write要hold住直到备份结束.
从xlog的doPageWrites也可以看出在写page到WAL文件之前是需要先看看full_page_writes的值或者是Insert->forcePageWrites标记的值, 来判断此次写是否要full-page write.
/*
* Decide if we need to do full-page writes in this XLOG record: true if
* full_page_writes is on or we have a PITR request for it. Since we
* don't yet have the insert lock, forcePageWrites could change under us,
* but we'll recheck it once we have the lock.
*/
doPageWrites = fullPageWrites || Insert->forcePageWrites;
其他
shared memory中用于XLOG控制的数据结构 :
/*----------
* Shared-memory data structures for XLOG control
*
* LogwrtRqst indicates a byte position that we need to write and/or fsync
* the log up to (all records before that point must be written or fsynced).
* LogwrtResult indicates the byte positions we have already written/fsynced.
* These structs are identical but are declared separately to indicate their
* slightly different functions.
*
* We do a lot of pushups to minimize the amount of access to lockable
* shared memory values. There are actually three shared-memory copies of
* LogwrtResult, plus one unshared copy in each backend. Here's how it works:
* XLogCtl->LogwrtResult is protected by info_lck
* XLogCtl->Write.LogwrtResult is protected by WALWriteLock
* XLogCtl->Insert.LogwrtResult is protected by WALInsertLock
* One must hold the associated lock to read or write any of these, but
* of course no lock is needed to read/write the unshared LogwrtResult.
*
* XLogCtl->LogwrtResult and XLogCtl->Write.LogwrtResult are both "always
* right", since both are updated by a write or flush operation before
* it releases WALWriteLock. The point of keeping XLogCtl->Write.LogwrtResult
* is that it can be examined/modified by code that already holds WALWriteLock
* without needing to grab info_lck as well.
*
* XLogCtl->Insert.LogwrtResult may lag behind the reality of the other two,
* but is updated when convenient. Again, it exists for the convenience of
* code that is already holding WALInsertLock but not the other locks.
*
* The unshared LogwrtResult may lag behind any or all of these, and again
* is updated when convenient.
*
* The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
* (protected by info_lck), but we don't need to cache any copies of it.
*
* Note that this all works because the request and result positions can only
* advance forward, never back up, and so we can easily determine which of two
* values is "more up to date".
*
* info_lck is only held long enough to read/update the protected variables,
* so it's a plain spinlock. The other locks are held longer (potentially
* over I/O operations), so we use LWLocks for them. These locks are:
*
* WALInsertLock: must be held to insert a record into the WAL buffers.
*
* WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
* XLogFlush).
*
* ControlFileLock: must be held to read/update control file or create
* new log file.
*
* CheckpointLock: must be held to do a checkpoint or restartpoint (ensures
* only one checkpointer at a time; currently, with all checkpoints done by
* the bgwriter, this is just pro forma).
*
*----------
*/
参考
src/backend/access/transam/xlog.c
src/backend/replication/basebackup.c