PostgreSQL checkpoint performance impact and source code analysis - 3
Background
Where does database reliability come from?
After a crash, how does a database recover, and from which position does recovery start?
What is a database checkpoint?
What does a checkpoint do?
Why does a checkpoint affect performance when there are many dirty pages?
What is a full page write?
These are questions that anyone working with databases wants to understand.
This series of articles walks through checkpoints: how they work, and why they can noticeably affect database performance when there are many dirty pages.
Main text
Continuing from the previous post,
http://blog.163.com/digoal@126/blog/static/1638770402015463252387/
this one focuses on CheckPointBuffers(flags).
CheckPointBuffers(flags)@src/backend/storage/buffer/bufmgr.c
/*
* CheckPointBuffers
*
* Flush all dirty blocks in buffer pool to disk at checkpoint time.
*
* Note: temporary relations do not participate in checkpoints, so they don't
* need to be flushed.
*/
void
CheckPointBuffers(int flags)
{
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags); // probe: buffer checkpoint start
CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
BufferSync(flags); // the heavyweight step: one full scan of the buffer pool, locking each buffer header to set a flag, then another scan that flushes the previously marked dirty blocks to disk
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START(); // probe: buffer checkpoint sync start
smgrsync(); // the sync (fsync) step
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE(); // probe: buffer checkpoint done
}
BufferSync is a fairly heavyweight operation.
The first full scan of the buffer pool marks, in each dirty buffer's header, that the block needs to be flushed by this checkpoint.
The second scan flushes the blocks marked in the first pass to disk.
Note that the first pass keeps a count of the blocks it marked; during the second pass, the number of blocks written may reach that count early, in which case the second pass does not have to scan the whole buffer pool.
However, blocks marked in the first pass may in the meantime be written out (and unflagged) by other processes such as the bgwriter; those writes are not counted, so the count may never be reached and the second pass still ends up scanning the entire buffer pool.
(Why not record the buffer positions while marking them in the first pass, and have the second pass flush exactly those blocks, instead of scanning everything again?)
BufferSync@src/backend/storage/buffer/bufmgr.c
/*
* BufferSync -- Write out all dirty buffers in the pool.
*
* This is called at checkpoint time to write out all dirty shared buffers.
* The checkpoint request flags should be passed in. If CHECKPOINT_IMMEDIATE
* is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN,
* CHECKPOINT_END_OF_RECOVERY or CHECKPOINT_FLUSH_ALL is set, we write even
* unlogged buffers, which are otherwise skipped. The remaining flags
* currently have no effect here.
*/
static void
BufferSync(int flags)
{
int buf_id;
int num_to_scan;
int num_to_write;
int num_written;
int mask = BM_DIRTY; // dirty-block mask
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
/*
* Unless this is a shutdown checkpoint or we have been explicitly told,
* we write only permanent, dirty buffers. But at shutdown or end of
* recovery, we write all dirty buffers.
*/
if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
CHECKPOINT_FLUSH_ALL))))
mask |= BM_PERMANENT; // also require the permanent (logged) relation flag
/*
* Loop over all buffers, and mark the ones that need to be written with
* BM_CHECKPOINT_NEEDED. Count them as we go (num_to_write), so that we
* can estimate how much work needs to be done.
*
* This allows us to write only those pages that were dirty when the
* checkpoint began, and not those that get dirtied while it proceeds.
* Whenever a page with BM_CHECKPOINT_NEEDED is written out, either by us
* later in this function, or by normal backends or the bgwriter cleaning
* scan, the flag is cleared. Any buffer dirtied after this point won't
* have the flag set.
*
* Note that if we fail to write some buffer, we may leave buffers with
* BM_CHECKPOINT_NEEDED still set. This is OK since any such buffer would
* certainly need to be written for the next checkpoint attempt, too.
*/
num_to_write = 0; // count of buffers marked BM_CHECKPOINT_NEEDED
for (buf_id = 0; buf_id < NBuffers; buf_id++) // mark the blocks that are dirty right now as needing to be flushed by this checkpoint;
// blocks dirtied while the flush is in progress can be ignored
{
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
/*
* Header spinlock is enough to examine BM_DIRTY, see comment in
* SyncOneBuffer.
*/
LockBufHdr(bufHdr); // lock the buffer header
if ((bufHdr->flags & mask) == mask) // add BM_CHECKPOINT_NEEDED to buffers matching the mask (dirty, and also permanent unless this is a shutdown/end-of-recovery/flush-all checkpoint)
{
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
num_to_write++;
}
UnlockBufHdr(bufHdr);
}
if (num_to_write == 0)
return; /* nothing to do */
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write); // probe: buffer sync (write pass) start
/*
* Loop over all buffers again, and write the ones (still) marked with
* BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
* since we might as well dump soon-to-be-recycled buffers first.
*
* Note that we don't read the buffer alloc count here --- that should be
* left untouched till the next BgBufferSync() call.
*/
buf_id = StrategySyncStart(NULL, NULL);
num_to_scan = NBuffers;
num_written = 0;
while (num_to_scan-- > 0) // count down the number of buffers left to scan
{
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
/*
* We don't need to acquire the lock here, because we're only looking
* at a single bit. It's possible that someone else writes the buffer
* and clears the flag right after we check, but that doesn't matter
* since SyncOneBuffer will then do nothing. However, there is a
* further race condition: it's conceivable that between the time we
* examine the bit here and the time SyncOneBuffer acquires lock,
* someone else not only wrote the buffer but replaced it with another
* page and dirtied it. In that improbable case, SyncOneBuffer will
* write the buffer though we didn't need to. It doesn't seem worth
* guarding against this, though.
*/
if (bufHdr->flags & BM_CHECKPOINT_NEEDED) // check the flag: if BM_CHECKPOINT_NEEDED is still set, write this buffer
{
if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN) // call SyncOneBuffer to write the buffer out
{
TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id); // probe: this block was written
BgWriterStats.m_buf_written_checkpoints++;
num_written++;
/*
* We know there are at most num_to_write buffers with
* BM_CHECKPOINT_NEEDED set; so we can stop scanning if
* num_written reaches num_to_write.
*
* Note that num_written doesn't include buffers written by
* other backends, or by the bgwriter cleaning scan. That
* means that the estimate of how much progress we've made is
* conservative, and also that this test will often fail to
* trigger. But it seems worth making anyway.
*/
if (num_written >= num_to_write) // if every marked buffer has been written early, stop without scanning the rest of the pool
break;
/*
* Sleep to throttle our I/O rate.
*/
CheckpointWriteDelay(flags, (double) num_written / num_to_write); // pass the current completion ratio to CheckpointWriteDelay; if the checkpoint is ahead of schedule, it sleeps for 100 milliseconds
// example (a standalone sketch follows after this function): 1000 blocks need writing (num_to_write) and 100 have been written so far (num_written)
// CheckpointWriteDelay(flags, 0.1); assume CheckPointCompletionTarget is the default 0.5
// in IsCheckpointOnSchedule: progress *= CheckPointCompletionTarget; => 0.1 * 0.5 = 0.05
// elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments
// if progress < elapsed_xlogs, we are behind schedule and do not sleep
// progress can never exceed 0.5, because num_written / num_to_write is at most 1 and 1 * 0.5 = 0.5
// so the larger CheckPointCompletionTarget is, the more often the write pass sleeps, i.e. the more it is stretched out.
}
}
if (++buf_id >= NBuffers)
buf_id = 0;
}
/*
* Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
*/
CheckpointStats.ckpt_bufs_written += num_written;
TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_write); // probe: the write pass over the BM_CHECKPOINT_NEEDED blocks is finished
}
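To make the throttling arithmetic concrete, here is a minimal, self-contained sketch of the WAL-based part of the IsCheckpointOnSchedule() test that CheckpointWriteDelay relies on. It is an illustration only (the real function also checks progress against elapsed time and checkpoint_timeout), and the function and variable names below are my own.
#include <stdbool.h>
#include <stdio.h>
/*
 * Sketch: decide whether the checkpoint write pass is ahead of schedule,
 * judging only by how much WAL has been generated since the checkpoint began.
 */
static bool
on_schedule(double written_fraction,     /* num_written / num_to_write */
            double completion_target,    /* checkpoint_completion_target */
            double wal_segs_elapsed,     /* WAL segments written since checkpoint start */
            double checkpoint_segments)  /* checkpoint_segments GUC */
{
    double progress = written_fraction * completion_target;
    double elapsed_xlogs = wal_segs_elapsed / checkpoint_segments;
    /* progress < elapsed_xlogs means we are behind schedule: keep writing, no sleep */
    return progress >= elapsed_xlogs;
}
int
main(void)
{
    /* 100 of 1000 marked buffers written, 9.6 of 32 WAL segments consumed:
     * progress = 0.1 * 0.5 = 0.05, elapsed_xlogs = 0.3, so we keep writing. */
    printf("%s\n", on_schedule(100.0 / 1000.0, 0.5, 9.6, 32.0)
           ? "on schedule: sleep 100 ms" : "behind schedule: keep writing");
    return 0;
}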
SyncOneBuffer writes a single buffer and returns a bitmask; BUF_WRITTEN means the buffer was written out (to the kernel; the fsync happens later).
SyncOneBuffer@src/backend/storage/buffer/bufmgr.c
/*
* SyncOneBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
* buffers marked recently used, as these are not replacement candidates.
*
* Returns a bitmask containing the following flag bits:
* BUF_WRITTEN: we wrote the buffer.
* BUF_REUSABLE: buffer is available for replacement, ie, it has
* pin count 0 and usage count 0.
*
* (BUF_WRITTEN could be set in error if FlushBuffers finds the buffer clean
* after locking it, but we don't care all that much.)
*
* Note: caller must have done ResourceOwnerEnlargeBuffers.
*/
static int
SyncOneBuffer(int buf_id, bool skip_recently_used)
{
volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
int result = 0;
/*
* Check whether buffer needs writing.
*
* We can make this check without taking the buffer content lock so long
* as we mark pages dirty in access methods *before* logging changes with
* XLogInsert(): if someone marks the buffer dirty just after our check we
* don't worry because our checkpoint.redo points before log record for
* upcoming changes and so we are not required to write such dirty buffer.
*/
LockBufHdr(bufHdr);
if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
result |= BUF_REUSABLE;
else if (skip_recently_used)
{
/* Caller told us not to write recently-used buffers */
UnlockBufHdr(bufHdr);
return result;
}
if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
{
/* It's clean, so nothing to do */
UnlockBufHdr(bufHdr);
return result;
}
/*
* Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
* buffer is clean by the time we've locked it.)
*/
PinBuffer_Locked(bufHdr);
LWLockAcquire(bufHdr->content_lock, LW_SHARED);
FlushBuffer(bufHdr, NULL); // call FlushBuffer to write the buffer out
LWLockRelease(bufHdr->content_lock);
UnpinBuffer(bufHdr, true);
return result | BUF_WRITTEN;
}
FlushBuffer passes the buffer contents to the kernel; the kernel decides when to actually write them to disk. Before the checkpoint's WAL record is written, however, these writes must be forced to disk (by the fsync phase below).
FlushBuffer@src/backend/storage/buffer/bufmgr.c
/*
* FlushBuffer
* Physically write out a shared buffer.
*
* NOTE: this actually just passes the buffer contents to the kernel; the
* real write to disk won't happen until the kernel feels like it. This
* is okay from our point of view since we can redo the changes from WAL.
* However, we will need to force the changes to disk via fsync before
* we can checkpoint WAL. (That is, the buffer must reach disk before the checkpoint WAL record is written.)
*
* The caller must hold a pin on the buffer and have share-locked the
* buffer contents. (Note: a share-lock does not prevent updates of
* hint bits in the buffer, so the page could change while the write
* is in progress, but we assume that that will not invalidate the data
* written.)
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
*/
static void
FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
instr_time io_start,
io_time;
Block bufBlock;
char *bufToWrite;
/*
* Acquire the buffer's io_in_progress lock. If StartBufferIO returns
* false, then someone else flushed the buffer before we could, so we need
* not do anything.
*/
if (!StartBufferIO(buf, false))
return;
/* Setup error traceback support for ereport() */
errcallback.callback = shared_buffer_write_error_callback;
errcallback.arg = (void *) buf;
errcallback.previous = error_context_stack;
error_context_stack = &errcallback;
/* Find smgr relation for buffer */
if (reln == NULL)
reln = smgropen(buf->tag.rnode, InvalidBackendId);
TRACE_POSTGRESQL_BUFFER_FLUSH_START(buf->tag.forkNum,
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
reln->smgr_rnode.node.dbNode,
reln->smgr_rnode.node.relNode);
LockBufHdr(buf);
/*
* Run PageGetLSN while holding header lock, since we don't have the
* buffer locked exclusively in all cases.
*/
recptr = BufferGetLSN(buf); // note: the buffer header spinlock is held again here
/* To check if block content changes while flushing. - vadim 01/17/97 */
buf->flags &= ~BM_JUST_DIRTIED;
UnlockBufHdr(buf);
/*
* Force XLOG flush up to buffer's LSN. This implements the basic WAL // force XLOG out up to this buffer's LSN,
* rule that log updates must hit disk before any of the data-file changes // ensuring that all WAL describing earlier changes to this block is already on disk.
* they describe do.
*
* However, this rule does not apply to unlogged relations, which will be
* lost after a crash anyway. Most unlogged relation pages do not bear
* LSNs since we never emit WAL records for them, and therefore flushing
* up through the buffer LSN would be useless, but harmless. However,
* GiST indexes use LSNs internally to track page-splits, and therefore
* unlogged GiST pages bear "fake" LSNs generated by
* GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake
* LSN counter could advance past the WAL insertion point; and if it did
* happen, attempting to flush WAL through that location would fail, with
* disastrous system-wide consequences. To make sure that can't happen,
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
XLogFlush(recptr);
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
* we have the io_in_progress lock.
*/
bufBlock = BufHdrGetBlock(buf);
/*
* Update page checksum if desired. Since we have only shared lock on the
* buffer, other processes might be updating hint bits in it, so we must
* copy the page to private storage if we do checksumming.
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
*/
smgrwrite(reln, // hand the buffer to the storage manager (a write() into the kernel)
buf->tag.forkNum,
buf->tag.blockNum,
bufToWrite,
false);
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
INSTR_TIME_SUBTRACT(io_time, io_start);
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
}
pgBufferUsage.shared_blks_written++;
/*
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
* end the io_in_progress state.
*/
TerminateBufferIO(buf, true, 0);
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum, // probe: flush of this single buffer done
buf->tag.blockNum,
reln->smgr_rnode.node.spcNode,
reln->smgr_rnode.node.dbNode,
reln->smgr_rnode.node.relNode);
/* Pop the error context stack */
error_context_stack = errcallback.previous;
}
Write the supplied buffer out.
smgrwrite@src/backend/storage/smgr/smgr.c
/*
* smgrwrite() -- Write the supplied buffer out.
*
* This is to be used only for updating already-existing blocks of a
* relation (ie, those before the current EOF). To extend a relation,
* use smgrextend().
*
* This is not a synchronous write -- the block is not necessarily
* on disk at return, only dumped out to the kernel. However,
* provisions will be made to fsync the write before the next checkpoint.
*
* skipFsync indicates that the caller will make other provisions to
* fsync the relation, so we needn't bother. Temporary relations also
* do not require fsync.
*/
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
char *buffer, bool skipFsync)
{
(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
buffer, skipFsync);
}
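For ordinary relations the md (magnetic disk) storage manager is used, so smgr_write resolves to mdwrite() in src/backend/storage/smgr/md.c. The sketch below is a simplified view of that function (segment lookup details, error handling and retries are omitted): the block goes to the kernel through the virtual file descriptor layer, and unless skipFsync is set, register_dirty_segment() queues an fsync request, which is how the checkpointer's later smgrsync()/mdsync() knows which files still need to be fsync'd.
/*
 * Simplified sketch of mdwrite() (md.c); the real code checks return values,
 * reports errors, and handles writes past EOF.
 */
void
mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
        char *buffer, bool skipFsync)
{
    /* locate the 1GB segment file that contains this block */
    MdfdVec *v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_FAIL);
    off_t    seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));

    FileSeek(v->mdfd_vfd, seekpos, SEEK_SET);
    FileWrite(v->mdfd_vfd, buffer, BLCKSZ);   /* write() into the kernel page cache */

    /* remember that this segment must be fsync'd at the next checkpoint */
    if (!skipFsync && !SmgrIsTemp(reln))
        register_dirty_segment(reln, forknum, v);
}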
The final step is to sync the preceding writes to disk.
smgrsync@src/backend/storage/smgr/smgr.c
/*
* smgrsync() -- Sync files to disk during checkpoint.
*/
void
smgrsync(void)
{
int i;
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_sync)
(*(smgrsw[i].smgr_sync)) ();
}
}
For md, the smgr_sync routine actually invoked is
mdsync@src/backend/storage/smgr/md.c
/*
* mdsync() -- Sync previous writes to stable storage.
*/
void
mdsync(void)
{
static bool mdsync_in_progress = false;
HASH_SEQ_STATUS hstat;
PendingOperationEntry *entry;
int absorb_counter;
/* Statistics on sync times */
int processed = 0;
instr_time sync_start,
sync_end,
sync_diff;
uint64 elapsed;
uint64 longest = 0;
uint64 total_elapsed = 0;
/*
* This is only called during checkpoints, and checkpoints should only
* occur in processes that have created a pendingOpsTable.
*/
if (!pendingOpsTable)
elog(ERROR, "cannot sync without a pendingOpsTable");
/*
* If we are in the checkpointer, the sync had better include all fsync
* requests that were queued by backends up to this point. The tightest
* race condition that could occur is that a buffer that must be written
* and fsync'd for the checkpoint could have been dumped by a backend just
* before it was visited by BufferSync(). We know the backend will have
* queued an fsync request before clearing the buffer's dirtybit, so we
* are safe as long as we do an Absorb after completing BufferSync().
*/
AbsorbFsyncRequests();
/*
* To avoid excess fsync'ing (in the worst case, maybe a never-terminating
* checkpoint), we want to ignore fsync requests that are entered into the
* hashtable after this point --- they should be processed next time,
* instead. We use mdsync_cycle_ctr to tell old entries apart from new
* ones: new ones will have cycle_ctr equal to the incremented value of
* mdsync_cycle_ctr.
*
* In normal circumstances, all entries present in the table at this point
* will have cycle_ctr exactly equal to the current (about to be old)
* value of mdsync_cycle_ctr. However, if we fail partway through the
* fsync'ing loop, then older values of cycle_ctr might remain when we
* come back here to try again. Repeated checkpoint failures would
* eventually wrap the counter around to the point where an old entry
* might appear new, causing us to skip it, possibly allowing a checkpoint
* to succeed that should not have. To forestall wraparound, any time the
* previous mdsync() failed to complete, run through the table and
* forcibly set cycle_ctr = mdsync_cycle_ctr.
*
* Think not to merge this loop with the main loop, as the problem is
* exactly that that loop may fail before having visited all the entries.
* From a performance point of view it doesn't matter anyway, as this path
* will never be taken in a system that's functioning normally.
*/
if (mdsync_in_progress)
{
/* prior try failed, so update any stale cycle_ctr values */
hash_seq_init(&hstat, pendingOpsTable);
while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
{
entry->cycle_ctr = mdsync_cycle_ctr;
}
}
/* Advance counter so that new hashtable entries are distinguishable */
mdsync_cycle_ctr++;
/* Set flag to detect failure if we don't reach the end of the loop */
mdsync_in_progress = true;
/* Now scan the hashtable for fsync requests to process */
absorb_counter = FSYNCS_PER_ABSORB;
hash_seq_init(&hstat, pendingOpsTable);
while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
{
ForkNumber forknum;
/*
* If the entry is new then don't process it this time; it might
* contain multiple fsync-request bits, but they are all new. Note
* "continue" bypasses the hash-remove call at the bottom of the loop.
*/
if (entry->cycle_ctr == mdsync_cycle_ctr)
continue;
/* Else assert we haven't missed it */
Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
/*
* Scan over the forks and segments represented by the entry.
*
* The bitmap manipulations are slightly tricky, because we can call
* AbsorbFsyncRequests() inside the loop and that could result in
* bms_add_member() modifying and even re-palloc'ing the bitmapsets.
* This is okay because we unlink each bitmapset from the hashtable
* entry before scanning it. That means that any incoming fsync
* requests will be processed now if they reach the table before we
* begin to scan their fork.
*/
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
Bitmapset *requests = entry->requests[forknum];
int segno;
entry->requests[forknum] = NULL;
entry->canceled[forknum] = false;
while ((segno = bms_first_member(requests)) >= 0)
{
int failures;
/*
* If fsync is off then we don't have to bother opening the
* file at all. (We delay checking until this point so that
* changing fsync on the fly behaves sensibly.)
*/
if (!enableFsync)
continue;
/*
* If in checkpointer, we want to absorb pending requests
* every so often to prevent overflow of the fsync request
* queue. It is unspecified whether newly-added entries will
* be visited by hash_seq_search, but we don't care since we
* don't need to process them anyway.
*/
if (--absorb_counter <= 0)
{
AbsorbFsyncRequests();
absorb_counter = FSYNCS_PER_ABSORB;
}
/*
* The fsync table could contain requests to fsync segments
* that have been deleted (unlinked) by the time we get to
* them. Rather than just hoping an ENOENT (or EACCES on
* Windows) error can be ignored, what we do on error is
* absorb pending requests and then retry. Since mdunlink()
* queues a "cancel" message before actually unlinking, the
* fsync request is guaranteed to be marked canceled after the
* absorb if it really was this case. DROP DATABASE likewise
* has to tell us to forget fsync requests before it starts
* deletions.
*/
for (failures = 0;; failures++) /* loop exits at "break" */
{
SMgrRelation reln;
MdfdVec *seg;
char *path;
int save_errno;
/*
* Find or create an smgr hash entry for this relation.
* This may seem a bit unclean -- md calling smgr? But
* it's really the best solution. It ensures that the
* open file reference isn't permanently leaked if we get
* an error here. (You may say "but an unreferenced
* SMgrRelation is still a leak!" Not really, because the
* only case in which a checkpoint is done by a process
* that isn't about to shut down is in the checkpointer,
* and it will periodically do smgrcloseall(). This fact
* justifies our not closing the reln in the success path
* either, which is a good thing since in non-checkpointer
* cases we couldn't safely do that.)
*/
reln = smgropen(entry->rnode, InvalidBackendId);
/* Attempt to open and fsync the target segment */
seg = _mdfd_getseg(reln, forknum,
(BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
false, EXTENSION_RETURN_NULL);
INSTR_TIME_SET_CURRENT(sync_start);
if (seg != NULL &&
FileSync(seg->mdfd_vfd) >= 0)
{
/* Success; update statistics about sync timing */
INSTR_TIME_SET_CURRENT(sync_end);
sync_diff = sync_end;
INSTR_TIME_SUBTRACT(sync_diff, sync_start);
elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
if (elapsed > longest)
longest = elapsed;
total_elapsed += elapsed;
processed++;
if (log_checkpoints)
elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
processed,
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
break; /* out of retry loop */
}
/* Compute file name for use in message */
save_errno = errno;
path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
errno = save_errno;
/*
* It is possible that the relation has been dropped or
* truncated since the fsync request was entered.
* Therefore, allow ENOENT, but only if we didn't fail
* already on this file. This applies both for
* _mdfd_getseg() and for FileSync, since fd.c might have
* closed the file behind our back.
*
* XXX is there any point in allowing more than one retry?
* Don't see one at the moment, but easy to change the
* test here if so.
*/
if (!FILE_POSSIBLY_DELETED(errno) ||
failures > 0)
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m",
path)));
else
ereport(DEBUG1,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\" but retrying: %m",
path)));
pfree(path);
/*
* Absorb incoming requests and check to see if a cancel
* arrived for this relation fork.
*/
AbsorbFsyncRequests();
absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
if (entry->canceled[forknum])
break;
} /* end retry loop */
}
bms_free(requests);
}
/*
* We've finished everything that was requested before we started to
* scan the entry. If no new requests have been inserted meanwhile,
* remove the entry. Otherwise, update its cycle counter, as all the
* requests now in it must have arrived during this cycle.
*/
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
{
if (entry->requests[forknum] != NULL)
break;
}
if (forknum <= MAX_FORKNUM)
entry->cycle_ctr = mdsync_cycle_ctr;
else
{
/* Okay to remove it */
if (hash_search(pendingOpsTable, &entry->rnode,
HASH_REMOVE, NULL) == NULL)
elog(ERROR, "pendingOpsTable corrupted");
}
} /* end loop over hashtable entries */
/* Return sync performance metrics for report at checkpoint end */
CheckpointStats.ckpt_sync_rels = processed;
CheckpointStats.ckpt_longest_sync = longest;
CheckpointStats.ckpt_agg_sync_time = total_elapsed;
/* Flag successful completion of mdsync */
mdsync_in_progress = false;
}
Summary
The checkpointer flushes the buffer pool in several steps:
1. Scan the shared buffer pool and set the "need checkpoint" flag (BM_CHECKPOINT_NEEDED) on the blocks that are dirty at that moment.
2. Scan the shared buffer pool again and write out the blocks flagged in the previous step; before each write, the WAL up to that buffer's LSN must already be flushed to disk.
3. Sync (fsync) the preceding writes to persistent storage.
The time spent in each phase can be observed through the probes fired along the way, or through the checkpoint log output.
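For example, with log_checkpoints = on, the breakdown appears in the server log roughly as follows (the figures are purely illustrative; write=/sync= correspond to the ckpt_write_t / ckpt_sync_t / ckpt_sync_end_t timestamps recorded in CheckPointBuffers above):
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 5671 buffers (34.6%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=134.752 s, sync=1.221 s, total=136.004 s; sync files=12, longest=0.876 s, average=0.101 s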
The next post will cover tracing checkpoints.
References
1. http://blog.163.com/digoal@126/blog/static/1638770402015463252387/