PostgreSQL Checkpoint Performance Impact and Source Code Analysis - 2
Background
Where does database reliability come from?
After a crash, how does a database recover, and from what position does recovery start?
What is a database checkpoint?
What does a checkpoint do?
Why does a checkpoint have a noticeable performance impact when there is a lot of dirty data?
What is full page write?
These are questions anyone working on databases wants to understand.
This series of articles revolves around checkpoints: how they work, and why they can affect database performance when there is a lot of dirty data.
Main Text
Continuing from the previous article, this one walks through the heaviest checkpoint operation, CheckPointGuts@src/backend/access/transam/xlog.c.
http://blog.163.com/digoal@126/blog/static/163877040201542103933969/
The heavyweight function at the heart of a checkpoint:
CheckPointGuts@src/backend/access/transam/xlog.c
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
    CheckPointCLOG();                    // src/backend/access/transam/clog.c
    CheckPointSUBTRANS();                // src/backend/access/transam/subtrans.c
    CheckPointMultiXact();               // src/backend/access/transam/multixact.c
    CheckPointPredicate();               // src/backend/storage/lmgr/predicate.c
    CheckPointRelationMap();             // src/backend/utils/cache/relmapper.c
    CheckPointReplicationSlots();        // src/backend/replication/slot.c
    CheckPointSnapBuild();               // src/backend/replication/logical/snapbuild.c
    CheckPointLogicalRewriteHeap();      // src/backend/access/heap/rewriteheap.c
    CheckPointBuffers(flags);            // src/backend/storage/buffer/bufmgr.c
    CheckPointTwoPhase(checkPointRedo);  // src/backend/access/transam/twophase.c
}
Breaking it down, call by call:
1. Flush the dirty commit log (CLOG) pages in the buffer to the corresponding files under the pg_clog directory.
CheckPointCLOG@src/backend/access/transam/clog.c
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCLOG(void)
{
    /* Flush dirty CLOG pages to disk */
    TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
    SimpleLruFlush(ClogCtl, true);
    TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
}
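To get a sense of scale: CLOG stores two status bits per transaction, so one 8 KB page covers 32768 transactions and one 32-page segment file under pg_clog covers about one million. A minimal sketch of that arithmetic follows; the constants mirror those in clog.c and slru.h, but this is an illustration, not the actual PostgreSQL macros.
#include <stdio.h>

/* Constants mirroring clog.c / slru.h (illustrative, not the real macros) */
#define BLCKSZ                 8192
#define CLOG_XACTS_PER_BYTE    4     /* 2 status bits per transaction */
#define CLOG_XACTS_PER_PAGE    (BLCKSZ * CLOG_XACTS_PER_BYTE)   /* 32768 */
#define SLRU_PAGES_PER_SEGMENT 32

int
main(void)
{
    unsigned xid = 123456789;
    unsigned pageno = xid / CLOG_XACTS_PER_PAGE;       /* which CLOG page */
    unsigned segno = pageno / SLRU_PAGES_PER_SEGMENT;  /* which pg_clog file */

    /* segment files are named with four hex digits, e.g. pg_clog/0075 */
    printf("xid %u -> page %u -> segment file pg_clog/%04X\n",
           xid, pageno, segno);
    return 0;
}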
2. Flush the dirty subtransaction log pages in the buffer to the corresponding files under the pg_subtrans directory.
CheckPointSUBTRANS@src/backend/access/transam/subtrans.c
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointSUBTRANS(void)
{
    /*
     * Flush dirty SUBTRANS pages to disk
     *
     * This is not actually necessary from a correctness point of view. We do
     * it merely to improve the odds that writing of dirty pages is done by
     * the checkpoint process and not by backends.
     */
    TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(true);
    SimpleLruFlush(SubTransCtl, true);
    TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(true);
}
3. Flush the dirty MultiXact log pages in the buffer to the corresponding files under the pg_multixact directory.
CheckPointMultiXact@src/backend/access/transam/multixact.c
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointMultiXact(void)
{
    TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(true);
    /* Flush dirty MultiXact pages to disk */
    SimpleLruFlush(MultiXactOffsetCtl, true);
    SimpleLruFlush(MultiXactMemberCtl, true);
    TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
}
4. Truncate the predicate-lock SLRU (simple least recently used) and flush its remaining dirty pages to disk.
CheckPointPredicate@src/backend/storage/lmgr/predicate.c
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*
* We don't have any data that needs to survive a restart, but this is a
* convenient place to truncate the SLRU.
*/
void
CheckPointPredicate(void)
{
    int tailPage;

    LWLockAcquire(OldSerXidLock, LW_EXCLUSIVE);

    /* Exit quickly if the SLRU is currently not in use. */
    if (oldSerXidControl->headPage < 0)
    {
        LWLockRelease(OldSerXidLock);
        return;
    }

    if (TransactionIdIsValid(oldSerXidControl->tailXid))
    {
        /* We can truncate the SLRU up to the page containing tailXid */
        tailPage = OldSerXidPage(oldSerXidControl->tailXid);
    }
    else
    {
        /*
         * The SLRU is no longer needed. Truncate to head before we set head
         * invalid.
         *
         * XXX: It's possible that the SLRU is not needed again until XID
         * wrap-around has happened, so that the segment containing headPage
         * that we leave behind will appear to be new again. In that case it
         * won't be removed until XID horizon advances enough to make it
         * current again.
         */
        tailPage = oldSerXidControl->headPage;
        oldSerXidControl->headPage = -1;
    }

    LWLockRelease(OldSerXidLock);

    /* Truncate away pages that are no longer required */
    SimpleLruTruncate(OldSerXidSlruCtl, tailPage);

    /*
     * Flush dirty SLRU pages to disk
     *
     * This is not actually necessary from a correctness point of view. We do
     * it merely as a debugging aid.
     *
     * We're doing this after the truncation to avoid writing pages right
     * before deleting the file in which they sit, which would be completely
     * pointless.
     */
    SimpleLruFlush(OldSerXidSlruCtl, true);
}
All four calls above rely on SimpleLruFlush to do the actual flushing.
SimpleLruFlush@src/backend/access/transam/slru.c
/*
* Flush dirty pages to disk during checkpoint or database shutdown
*/
void
SimpleLruFlush(SlruCtl ctl, bool checkpoint)
{
    SlruShared shared = ctl->shared;
    SlruFlushData fdata;
    int slotno;
    int pageno = 0;
    int i;
    bool ok;

    /*
     * Find and write dirty pages
     */
    fdata.num_files = 0;

    LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE); // note: an exclusive lock is acquired every time

    for (slotno = 0; slotno < shared->num_slots; slotno++)
    {
        SlruInternalWritePage(ctl, slotno, &fdata); // this can be a relatively heavy operation

        /*
         * When called during a checkpoint, we cannot assert that the slot is
         * clean now, since another process might have re-dirtied it already.
         * That's okay.
         */
        Assert(checkpoint ||
               shared->page_status[slotno] == SLRU_PAGE_EMPTY ||
               (shared->page_status[slotno] == SLRU_PAGE_VALID &&
                !shared->page_dirty[slotno]));
    }

    LWLockRelease(shared->ControlLock);

    /*
     * Now fsync and close any files that were open
     */
    ok = true;
    for (i = 0; i < fdata.num_files; i++)
    {
        if (ctl->do_fsync && pg_fsync(fdata.fd[i]))
        {
            slru_errcause = SLRU_FSYNC_FAILED;
            slru_errno = errno;
            pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
            ok = false;
        }

        if (CloseTransientFile(fdata.fd[i]))
        {
            slru_errcause = SLRU_CLOSE_FAILED;
            slru_errno = errno;
            pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
            ok = false;
        }
    }
    if (!ok)
        SlruReportIOError(ctl, pageno, InvalidTransactionId);
}
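One design detail worth noting: SlruInternalWritePage only write()s each dirty page and records the segment's file descriptor in fdata, so the expensive fsync happens once per segment file in the loop at the end, not once per page. A self-contained sketch of that batching pattern follows (hypothetical file name, simplified error handling; not the actual slru.c code).
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_FLUSH_FILES 16

int
main(void)
{
    int  fds[MAX_FLUSH_FILES];
    int  num_files = 0;
    char page[8192];
    int  i;

    memset(page, 0, sizeof(page));

    /*
     * Write phase: several dirty pages may belong to the same segment file.
     * Open the file once, write all its pages, and remember the fd instead
     * of fsyncing after every page (this mirrors SlruFlushData).
     */
    fds[num_files] = open("/tmp/slru_seg_0000", O_RDWR | O_CREAT, 0600);
    if (fds[num_files] < 0)
        return 1;
    if (write(fds[num_files], page, sizeof(page)) < 0)  /* page 0 */
        perror("write");
    if (write(fds[num_files], page, sizeof(page)) < 0)  /* page 1, same file */
        perror("write");
    num_files++;

    /* Sync phase: exactly one fsync per file touched, then close it */
    for (i = 0; i < num_files; i++)
    {
        if (fsync(fds[i]) != 0)
            perror("fsync");
        close(fds[i]);
    }
    return 0;
}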
5. Write the relation mapper cache to its file. What is the rel mapper file?
The rel mapper stores the mapping between certain global database objects and their file IDs; for ordinary objects this mapping is stored in pg_class.relfilenode.
* For most tables, the physical file underlying the table is specified by
* pg_class.relfilenode. However, that obviously won't work for pg_class
* itself, nor for the other "nailed" catalogs for which we have to be able
* to set up working Relation entries without access to pg_class. It also
* does not work for shared catalogs, since there is no practical way to
* update other databases' pg_class entries when relocating a shared catalog.
* Therefore, for these special catalogs (henceforth referred to as "mapped
* catalogs") we rely on a separately maintained file that shows the mapping
* from catalog OIDs to filenode numbers. Each database has a map file for
* its local mapped catalogs, and there is a separate map file for shared
* catalogs. Mapped catalogs have zero in their pg_class.relfilenode entries.
Rel mapper file names:
each database has a pg_filenode.map file, and there is one more global pg_filenode.map.
They live under tablespace/database_oid/ and under global/, respectively.
/*
* The map file is critical data: we have no automatic method for recovering
* from loss or corruption of it. We use a CRC so that we can detect
* corruption. To minimize the risk of failed updates, the map file should
* be kept to no more than one standard-size disk sector (ie 512 bytes),
* and we use overwrite-in-place rather than playing renaming games.
* The struct layout below is designed to occupy exactly 512 bytes, which
* might make filesystem updates a bit more efficient.
*
* Entries in the mappings[] array are in no particular order. We could
* speed searching by insisting on OID order, but it really shouldn't be
* worth the trouble given the intended size of the mapping sets.
*/
#define RELMAPPER_FILENAME "pg_filenode.map"
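For reference, here is the 512-byte layout the comment above describes, paraphrased from relmapper.c with simplified field types (a sketch, not the exact declaration):
#include <stdint.h>

/*
 * Paraphrased from relmapper.c (types simplified): 62 mappings x 8 bytes
 * plus 16 bytes of header/trailer = exactly 512 bytes, one disk sector.
 */
#define MAX_MAPPINGS 62

typedef struct RelMapping
{
    uint32_t mapoid;       /* OID of a mapped catalog */
    uint32_t mapfilenode;  /* its current relfilenode */
} RelMapping;

typedef struct RelMapFile
{
    int32_t    magic;         /* format identifier */
    int32_t    num_mappings;  /* number of valid entries */
    RelMapping mappings[MAX_MAPPINGS];
    uint32_t   crc;           /* CRC of the preceding fields */
    int32_t    pad;           /* pad to exactly 512 bytes */
} RelMapFile;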
CheckPointRelationMap@src/backend/utils/cache/relmapper.c
/*
* CheckPointRelationMap
*
* This is called during a checkpoint. It must ensure that any relation map
* updates that were WAL-logged before the start of the checkpoint are
* securely flushed to disk and will not need to be replayed later. This
* seems unlikely to be a performance-critical issue, so we use a simple
* method: we just take and release the RelationMappingLock. This ensures
* that any already-logged map update is complete, because write_relmap_file
* will fsync the map file before the lock is released.
*/
void
CheckPointRelationMap(void)
{
    LWLockAcquire(RelationMappingLock, LW_SHARED); // implicit fsync: any in-flight map write has fsynced before this lock can be acquired
    LWLockRelease(RelationMappingLock);
}
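The acquire-and-release trick works because taking the lock in shared mode must wait for any exclusive holder, i.e. a write_relmap_file call that fsyncs before releasing; merely acquiring and releasing the lock is therefore a barrier. A generic sketch of the same pattern with a pthread rwlock (not PostgreSQL code):
#include <pthread.h>

static pthread_rwlock_t mapping_lock = PTHREAD_RWLOCK_INITIALIZER;

/*
 * Writer: holds the lock exclusively across write + fsync, the way
 * write_relmap_file holds RelationMappingLock.
 */
void
update_map_file(void)
{
    pthread_rwlock_wrlock(&mapping_lock);
    /* ... write the new map contents and fsync them here ... */
    pthread_rwlock_unlock(&mapping_lock);
}

/*
 * Checkpoint: acquiring the lock in shared mode waits for any writer
 * already inside update_map_file, so once we hold it, every previously
 * logged map update is safely on disk. Nothing else to do.
 */
void
checkpoint_relation_map(void)
{
    pthread_rwlock_rdlock(&mapping_lock);
    pthread_rwlock_unlock(&mapping_lock);
}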
6. Flush streaming replication slot state to the corresponding files under the pg_replslot directory.
CheckPointReplicationSlots@src/backend/replication/slot.c
/*
* Flush all replication slots to disk.
*
* This needn't actually be part of a checkpoint, but it's a convenient
* location.
*/
void
CheckPointReplicationSlots(void)
{
    int i;

    elog(DEBUG1, "performing replication slot checkpoint");

    /*
     * Prevent any slot from being created/dropped while we're active. As we
     * explicitly do *not* want to block iterating over replication_slots or
     * acquiring a slot we cannot take the control lock - but that's OK,
     * because holding ReplicationSlotAllocationLock is strictly stronger, and
     * enough to guarantee that nobody can change the in_use bits on us.
     */
    LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        char path[MAXPGPATH];

        if (!s->in_use)
            continue;

        /* save the slot to disk, locking is handled in SaveSlotToPath() */
        sprintf(path, "pg_replslot/%s", NameStr(s->data.name));
        SaveSlotToPath(s, path, LOG);
    }
    LWLockRelease(ReplicationSlotAllocationLock);
}
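SaveSlotToPath follows the classic durable-update pattern: write the new state into a temporary file, fsync it, rename it over the old file, then fsync the directory, so a crash leaves either the complete old state or the complete new state on disk. A generic POSIX sketch of that pattern (hypothetical function and file names, minimal error handling):
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Atomically replace "<dir>/state" with new contents, sketching the
 * approach SaveSlotToPath takes (hypothetical names).
 */
int
save_state_durably(const char *dir, const void *buf, size_t len)
{
    char tmp_path[1024];
    char final_path[1024];
    int  fd;
    int  dirfd;

    snprintf(tmp_path, sizeof(tmp_path), "%s/state.tmp", dir);
    snprintf(final_path, sizeof(final_path), "%s/state", dir);

    /* 1. write the new contents to a temp file and fsync it */
    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    /* 2. atomically replace the old file with the new one */
    if (rename(tmp_path, final_path) != 0)
        return -1;

    /* 3. fsync the directory so the rename itself survives a crash */
    dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    fsync(dirfd);
    close(dirfd);
    return 0;
}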
7. Clean up logical decoding snapshots: remove files under pg_logical/snapshots that no replication slot can need anymore.
CheckPointSnapBuild@src/backend/replication/logical/snapbuild.c
/*
* Remove all serialized snapshots that are not required anymore because no
* slot can need them. This doesn't actually have to run during a checkpoint,
* but it's a convenient point to schedule this.
*
* NB: We run this during checkpoints even if logical decoding is disabled so
* we cleanup old slots at some point after it got disabled.
*/
void
CheckPointSnapBuild(void)
{
    XLogRecPtr cutoff;
    XLogRecPtr redo;
    DIR *snap_dir;
    struct dirent *snap_de;
    char path[MAXPGPATH];

    /*
     * We start of with a minimum of the last redo pointer. No new replication
     * slot will start before that, so that's a safe upper bound for removal.
     */
    redo = GetRedoRecPtr();

    /* now check for the restart ptrs from existing slots */
    cutoff = ReplicationSlotsComputeLogicalRestartLSN();

    /* don't start earlier than the restart lsn */
    if (redo < cutoff)
        cutoff = redo;

    snap_dir = AllocateDir("pg_logical/snapshots");
    while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
    {
        uint32 hi;
        uint32 lo;
        XLogRecPtr lsn;
        struct stat statbuf;

        if (strcmp(snap_de->d_name, ".") == 0 ||
            strcmp(snap_de->d_name, "..") == 0)
            continue;

        snprintf(path, MAXPGPATH, "pg_logical/snapshots/%s", snap_de->d_name);

        if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
        {
            elog(DEBUG1, "only regular files expected: %s", path);
            continue;
        }

        /*
         * temporary filenames from SnapBuildSerialize() include the LSN and
         * everything but are postfixed by .$pid.tmp. We can just remove them
         * the same as other files because there can be none that are
         * currently being written that are older than cutoff.
         *
         * We just log a message if a file doesn't fit the pattern, it's
         * probably some editors lock/state file or similar...
         */
        if (sscanf(snap_de->d_name, "%X-%X.snap", &hi, &lo) != 2)
        {
            ereport(LOG,
                    (errmsg("could not parse file name \"%s\"", path)));
            continue;
        }

        lsn = ((uint64) hi) << 32 | lo;

        /* check whether we still need it */
        if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
        {
            elog(DEBUG1, "removing snapbuild snapshot %s", path);

            /*
             * It's not particularly harmful, though strange, if we can't
             * remove the file here. Don't prevent the checkpoint from
             * completing, that'd be cure worse than the disease.
             */
            if (unlink(path) < 0)
            {
                ereport(LOG,
                        (errcode_for_file_access(),
                         errmsg("could not remove file \"%s\": %m",
                                path)));
                continue;
            }
        }
    }
    FreeDir(snap_dir);
}
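The snapshot file name itself encodes the LSN at which it was serialized, which is what makes the cutoff comparison above possible. A small sketch of the round trip between an LSN and a .snap file name, using the same %X-%X format as the sscanf above (illustrative values):
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint64_t lsn = ((uint64_t) 0x1 << 32) | 0x2D000028;  /* LSN 1/2D000028 */
    char     name[64];
    unsigned hi;
    unsigned lo;

    /* serialize: split the 64-bit LSN into two 32-bit hex halves */
    snprintf(name, sizeof(name), "%X-%X.snap",
             (unsigned) (lsn >> 32), (unsigned) lsn);
    printf("file name: %s\n", name);  /* prints 1-2D000028.snap */

    /* parse it back, the same way CheckPointSnapBuild does */
    if (sscanf(name, "%X-%X.snap", &hi, &lo) == 2)
    {
        uint64_t parsed = ((uint64_t) hi) << 32 | lo;
        printf("parsed back: %s\n", parsed == lsn ? "matches" : "mismatch");
    }
    return 0;
}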
8. Handle logical rewrite mappings: remove obsolete files under pg_logical/mappings and fsync the remaining ones.
CheckPointLogicalRewriteHeap@src/backend/access/heap/rewriteheap.c
/* ---
* Perform a checkpoint for logical rewrite mappings
*
* This serves two tasks:
* 1) Remove all mappings not needed anymore based on the logical restart LSN
* 2) Flush all remaining mappings to disk, so that replay after a checkpoint
* only has to deal with the parts of a mapping that have been written out
* after the checkpoint started.
* ---
*/
void
CheckPointLogicalRewriteHeap(void)
{
    XLogRecPtr cutoff;
    XLogRecPtr redo;
    DIR *mappings_dir;
    struct dirent *mapping_de;
    char path[MAXPGPATH];

    /*
     * We start of with a minimum of the last redo pointer. No new decoding
     * slot will start before that, so that's a safe upper bound for removal.
     */
    redo = GetRedoRecPtr();

    /* now check for the restart ptrs from existing slots */
    cutoff = ReplicationSlotsComputeLogicalRestartLSN();

    /* don't start earlier than the restart lsn */
    if (cutoff != InvalidXLogRecPtr && redo < cutoff)
        cutoff = redo;

    mappings_dir = AllocateDir("pg_logical/mappings");
    while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
    {
        struct stat statbuf;
        Oid dboid;
        Oid relid;
        XLogRecPtr lsn;
        TransactionId rewrite_xid;
        TransactionId create_xid;
        uint32 hi,
               lo;

        if (strcmp(mapping_de->d_name, ".") == 0 ||
            strcmp(mapping_de->d_name, "..") == 0)
            continue;

        snprintf(path, MAXPGPATH, "pg_logical/mappings/%s", mapping_de->d_name);
        if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
            continue;

        /* Skip over files that cannot be ours. */
        if (strncmp(mapping_de->d_name, "map-", 4) != 0)
            continue;

        if (sscanf(mapping_de->d_name, LOGICAL_REWRITE_FORMAT,
                   &dboid, &relid, &hi, &lo, &rewrite_xid, &create_xid) != 6)
            elog(ERROR, "could not parse filename \"%s\"", mapping_de->d_name);

        lsn = ((uint64) hi) << 32 | lo;

        if (lsn < cutoff || cutoff == InvalidXLogRecPtr)
        {
            elog(DEBUG1, "removing logical rewrite file \"%s\"", path);
            if (unlink(path) < 0)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not remove file \"%s\": %m", path)));
        }
        else
        {
            int fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);

            /*
             * The file cannot vanish due to concurrency since this function
             * is the only one removing logical mappings and it's run while
             * CheckpointLock is held exclusively.
             */
            if (fd < 0)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not open file \"%s\": %m", path)));

            /*
             * We could try to avoid fsyncing files that either haven't
             * changed or have only been created since the checkpoint's start,
             * but it's currently not deemed worth the effort.
             */
            else if (pg_fsync(fd) != 0)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not fsync file \"%s\": %m", path)));

            CloseTransientFile(fd);
        }
    }
    FreeDir(mappings_dir);
}
9. Fsync prepared-transaction (2PC) state files under the pg_twophase directory.
If 2PC is not in use (max_prepared_transactions = 0, the default), this is a no-op.
CheckPointTwoPhase(checkPointRedo)@src/backend/access/transam/twophase.c
/*
* CheckPointTwoPhase -- handle 2PC component of checkpointing.
*
* We must fsync the state file of any GXACT that is valid and has a PREPARE
* LSN <= the checkpoint's redo horizon. (If the gxact isn't valid yet or
* has a later LSN, this checkpoint is not responsible for fsyncing it.)
*
* This is deliberately run as late as possible in the checkpoint sequence,
* because GXACTs ordinarily have short lifespans, and so it is quite
* possible that GXACTs that were valid at checkpoint start will no longer
* exist if we wait a little bit.
*
* If a GXACT remains valid across multiple checkpoints, it'll be fsynced
* each time. This is considered unusual enough that we don't bother to
* expend any extra code to avoid the redundant fsyncs. (They should be
* reasonably cheap anyway, since they won't cause I/O.)
*/
void
CheckPointTwoPhase(XLogRecPtr redo_horizon)
{
    TransactionId *xids;
    int nxids;
    char path[MAXPGPATH];
    int i;

    ......

    if (max_prepared_xacts <= 0)
        return; /* nothing to do */

    ......
10. Flush the dirty blocks in shared buffers that were dirtied before the checkpoint (strictly speaking this may also include blocks dirtied within a short window after the checkpoint started; see BufferSync@src/backend/storage/buffer/bufmgr.c) to disk. This can require scanning the entire buffer pool.
The reasons are covered in the next article; a simplified sketch of BufferSync's structure follows the listing below.
CheckPointBuffers(flags)@src/backend/storage/buffer/bufmgr.c
/*
* CheckPointBuffers
*
* Flush all dirty blocks in buffer pool to disk at checkpoint time.
*
* Note: temporary relations do not participate in checkpoints, so they don't
* need to be flushed.
*/
void
CheckPointBuffers(int flags)
{
    TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
    CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
    BufferSync(flags);
    CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
    TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
    smgrsync();
    CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
    TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
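As mentioned in step 10, BufferSync is where the full scan of the buffer pool happens: a first pass over every buffer header marks the currently dirty ones with BM_CHECKPOINT_NEEDED, then a second pass writes exactly those buffers, so pages dirtied later are left to the next checkpoint. A heavily simplified, self-contained sketch of that two-pass structure (the real code adds locking, write pacing via CheckpointWriteDelay, and error handling):
#include <stdio.h>

#define NBUFFERS 8  /* shared_buffers would be far larger */

#define BM_DIRTY             (1 << 0)
#define BM_CHECKPOINT_NEEDED (1 << 1)

typedef struct BufferDescSketch { int flags; } BufferDescSketch;

static BufferDescSketch buffers[NBUFFERS];

static void
write_buffer(int i)
{
    printf("writing buffer %d\n", i);
    buffers[i].flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
}

static void
buffer_sync(void)
{
    int i;
    int num_to_write = 0;

    /*
     * Pass 1: scan every buffer header and mark the ones dirty right now.
     * Anything dirtied after this point belongs to the next checkpoint.
     */
    for (i = 0; i < NBUFFERS; i++)
        if (buffers[i].flags & BM_DIRTY)
        {
            buffers[i].flags |= BM_CHECKPOINT_NEEDED;
            num_to_write++;
        }

    /* Pass 2: scan again and write only the marked buffers */
    for (i = 0; i < NBUFFERS; i++)
        if (buffers[i].flags & BM_CHECKPOINT_NEEDED)
            write_buffer(i);

    printf("checkpoint wrote %d buffers\n", num_to_write);
}

int
main(void)
{
    buffers[1].flags = BM_DIRTY;
    buffers[5].flags = BM_DIRTY;
    buffer_sync();
    return 0;
}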
Summary
1. From the standpoint of lock contention, the operations most likely to have a noticeable impact are the commit log flush and the shared buffer flush.
2. In terms of data volume and I/O, my subjective judgment is that apart from CheckPointBuffers(flags)@src/backend/storage/buffer/bufmgr.c, the other flush operations should all be fast, without much contention or impact.
But these are only subjective judgments and need test data to back them up:
track the number and duration of lock conflicts, and the time spent in each flush function.
That instrumentation is left for later articles.
References
1. http://blog.163.com/digoal@126/blog/static/163877040201542103933969/