PostgreSQL Checkpoint Performance Impact and Source Code Analysis - 3

Background

Where does database reliability come from?

How does a database recover after a crash, and from what position does recovery start?

What is a database checkpoint?

What does a checkpoint do?

Why does a checkpoint have a noticeable performance impact when there is a lot of dirty data?

What is a full page write?

These are questions that anyone working on databases wants to understand.

The next series of articles walks through checkpoints: how they work, and why they can hurt database performance when there is a lot of dirty data.

Main text

Following on from the previous article,

http://blog.163.com/digoal@126/blog/static/1638770402015463252387/

this one mainly looks at CheckPointBuffers(flags).

CheckPointBuffers(flags)@src/backend/storage/buffer/bufmgr.c

/*  
 * CheckPointBuffers  
 *  
 * Flush all dirty blocks in buffer pool to disk at checkpoint time.  
 *  
 * Note: temporary relations do not participate in checkpoints, so they don't  
 * need to be flushed.  
 */  
void  
CheckPointBuffers(int flags)  
{  
        TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);  // probe: buffer checkpoint start  
        CheckpointStats.ckpt_write_t = GetCurrentTimestamp();  
        BufferSync(flags);  //  the heavyweight part: one full scan over the buffer pool, locking each buffer header to set a flag; then a second scan flushes the blocks flagged earlier to disk.  
        CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();  
        TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();   // probe: buffer checkpoint sync start  
        smgrsync();  // sync the earlier writes to disk  
        CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();  
        TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();   // probe: buffer checkpoint done  
}   

BufferSync is a fairly heavy operation.

The first full scan over the buffer pool marks the headers of the currently dirty blocks as blocks that this checkpoint needs to flush.

The second scan flushes the blocks marked in the first pass to disk.

Note that the first pass keeps a count of the blocks it marks as needing a checkpoint; the second pass may reach that count early, so flushing the dirty blocks does not necessarily require scanning the whole buffer pool.

However, blocks marked in the first pass may also be written out in the meantime by other processes such as the bgwriter; in that case the second pass never reaches the count and still has to scan the entire buffer pool.

(Why not remember the positions of the dirty blocks while marking them in the first pass, and flush exactly those blocks in the second pass instead of scanning again? See the sketch below.)
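
A natural answer to the question above is to record the buffer IDs while marking them in the first pass and then flush exactly that list, trading an extra array of up to NBuffers ints for the second full scan (for what it's worth, later PostgreSQL releases, starting with 9.6, do collect the to-be-written buffer IDs and sort them before flushing). Below is a minimal, hypothetical sketch of that idea reusing the names from the BufferSync code that follows; it is not how 9.4/9.5 actually behave:

/* Hypothetical variant of the two passes in BufferSync(): remember which  
 * buffers were flagged in the first pass, so the second pass touches only  
 * those buffers instead of rescanning the whole pool.  */  
int        *ckpt_buf_ids = (int *) palloc(NBuffers * sizeof(int));  
int         num_to_write = 0;  
int         buf_id;  
int         i;  
  
for (buf_id = 0; buf_id < NBuffers; buf_id++)  
{  
        volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];  
  
        LockBufHdr(bufHdr);  
        if ((bufHdr->flags & mask) == mask)  
        {  
                bufHdr->flags |= BM_CHECKPOINT_NEEDED;  
                ckpt_buf_ids[num_to_write++] = buf_id;  /* remember the position */  
        }  
        UnlockBufHdr(bufHdr);  
}  
  
/* Second pass: flush only the remembered buffers. */  
for (i = 0; i < num_to_write; i++)  
        (void) SyncOneBuffer(ckpt_buf_ids[i], false);  
  
pfree(ckpt_buf_ids);  

The cost is NBuffers * sizeof(int) of extra memory, plus losing the clock-sweep starting point that the real second scan uses so that soon-to-be-recycled buffers get written first.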

BufferSync@src/backend/storage/buffer/bufmgr.c

/*  
 * BufferSync -- Write out all dirty buffers in the pool.  
 *  
 * This is called at checkpoint time to write out all dirty shared buffers.  
 * The checkpoint request flags should be passed in.  If CHECKPOINT_IMMEDIATE  
 * is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN,  
 * CHECKPOINT_END_OF_RECOVERY or CHECKPOINT_FLUSH_ALL is set, we write even  
 * unlogged buffers, which are otherwise skipped.  The remaining flags  
 * currently have no effect here.  
 */  
static void  
BufferSync(int flags)  
{  
        int                     buf_id;  
        int                     num_to_scan;  
        int                     num_to_write;  
        int                     num_written;  
        int                     mask = BM_DIRTY;  // mask for dirty blocks  
  
        /* Make sure we can handle the pin inside SyncOneBuffer */  
        ResourceOwnerEnlargeBuffers(CurrentResourceOwner);  
  
        /*  
         * Unless this is a shutdown checkpoint or we have been explicitly told,  
         * we write only permanent, dirty buffers.  But at shutdown or end of  
         * recovery, we write all dirty buffers.  
         */  
        if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |  
                                        CHECKPOINT_FLUSH_ALL))))  
                mask |= BM_PERMANENT;  // also require the permanent-relation bit  
  
        /*  
         * Loop over all buffers, and mark the ones that need to be written with  
         * BM_CHECKPOINT_NEEDED.  Count them as we go (num_to_write), so that we  
         * can estimate how much work needs to be done.  
         *  
         * This allows us to write only those pages that were dirty when the  
         * checkpoint began, and not those that get dirtied while it proceeds.  
         * Whenever a page with BM_CHECKPOINT_NEEDED is written out, either by us  
         * later in this function, or by normal backends or the bgwriter cleaning  
         * scan, the flag is cleared.  Any buffer dirtied after this point won't  
         * have the flag set.  
         *  
         * Note that if we fail to write some buffer, we may leave buffers with  
         * BM_CHECKPOINT_NEEDED still set.  This is OK since any such buffer would  
         * certainly need to be written for the next checkpoint attempt, too.  
         */  
        num_to_write = 0;  // count of buffers flagged BM_CHECKPOINT_NEEDED  
        for (buf_id = 0; buf_id < NBuffers; buf_id++)   //  mark the blocks that are dirty right now as needing to be flushed by this checkpoint;  
                                                        //  blocks dirtied while the flush is in progress can be ignored  
        {  
                volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];  
  
                /*  
                 * Header spinlock is enough to examine BM_DIRTY, see comment in  
                 * SyncOneBuffer.  
                 */  
                LockBufHdr(bufHdr);  // lock the buffer header  
  
                if ((bufHdr->flags & mask) == mask)   // buffers that are dirty (and, unless flushing all, permanent) get the BM_CHECKPOINT_NEEDED flag  
                {  
                        bufHdr->flags |= BM_CHECKPOINT_NEEDED;    
                        num_to_write++;  
                }  
  
                UnlockBufHdr(bufHdr);  
        }  
  
        if (num_to_write == 0)  
                return;                                 /* nothing to do */  
  
        TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);  // probe: buffer sync (flushing) starts  
  
        /*  
         * Loop over all buffers again, and write the ones (still) marked with  
         * BM_CHECKPOINT_NEEDED.  In this loop, we start at the clock sweep point  
         * since we might as well dump soon-to-be-recycled buffers first.  
         *  
         * Note that we don't read the buffer alloc count here --- that should be  
         * left untouched till the next BgBufferSync() call.  
         */  
        buf_id = StrategySyncStart(NULL, NULL);  
        num_to_scan = NBuffers;  
        num_written = 0;  
        while (num_to_scan-- > 0)  // count down the buffers left to scan  
        {  
                volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];  
  
                /*  
                 * We don't need to acquire the lock here, because we're only looking  
                 * at a single bit. It's possible that someone else writes the buffer  
                 * and clears the flag right after we check, but that doesn't matter  
                 * since SyncOneBuffer will then do nothing.  However, there is a  
                 * further race condition: it's conceivable that between the time we  
                 * examine the bit here and the time SyncOneBuffer acquires lock,  
                 * someone else not only wrote the buffer but replaced it with another  
                 * page and dirtied it.  In that improbable case, SyncOneBuffer will  
                 * write the buffer though we didn't need to.  It doesn't seem worth  
                 * guarding against this, though.  
                 */  
                if (bufHdr->flags & BM_CHECKPOINT_NEEDED)  // if the BM_CHECKPOINT_NEEDED flag is (still) set, flush this buffer  
                {  
                        if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)  // call SyncOneBuffer to write the buffer out  
                        {  
                                TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);  //  probe: this block was written successfully  
                                BgWriterStats.m_buf_written_checkpoints++;  
                                num_written++;  
  
                                /*  
                                 * We know there are at most num_to_write buffers with  
                                 * BM_CHECKPOINT_NEEDED set; so we can stop scanning if  
                                 * num_written reaches num_to_write.  
                                 *  
                                 * Note that num_written doesn't include buffers written by  
                                 * other backends, or by the bgwriter cleaning scan. That  
                                 * means that the estimate of how much progress we've made is  
                                 * conservative, and also that this test will often fail to  
                                 * trigger.  But it seems worth making anyway.  
                                 */  
                                if (num_written >= num_to_write)  // if all flagged buffers are written early, stop without scanning the whole pool  
                                        break;  
  
                                /*  
                                 * Sleep to throttle our I/O rate.  
                                 */  
                                CheckpointWriteDelay(flags, (double) num_written / num_to_write);   // pass the current completion fraction to CheckpointWriteDelay; if the checkpoint is on schedule, it sleeps for 100 ms to throttle I/O.  
                                //  Example: suppose 1000 blocks need to be flushed (num_to_write) and 100 have already been written (num_written).  
                                //  Then CheckpointWriteDelay(flags, 0.1) is called; assume CheckPointCompletionTarget is the default 0.5.  
                                //  Inside IsCheckpointOnSchedule: progress *= CheckPointCompletionTarget, i.e. 0.1 * 0.5 = 0.05.  
                                //  elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments  
                                //  If progress < elapsed_xlogs, we are behind schedule, so no sleep.  
                                //  progress can never exceed 0.5, because num_written / num_to_write is at most 1, and 1 * 0.5 is still 0.5.  
                                //  So the larger CheckPointCompletionTarget is, the more opportunities there are to sleep, and the further the writes are spread out.  
                        }  
                }  
  
                if (++buf_id >= NBuffers)  
                        buf_id = 0;  
        }  
  
        /*  
         * Update checkpoint statistics. As noted above, this doesn't include  
         * buffers written by other backends or bgwriter scan.  
         */  
        CheckpointStats.ckpt_bufs_written += num_written;  
  
        TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_write);  // probe: all dirty blocks flagged BM_CHECKPOINT_NEEDED have been flushed  
}  
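
The throttling comments above boil down to one comparison: the fraction of the checkpoint's write work already done, scaled by checkpoint_completion_target, versus the fraction of the checkpoint interval already consumed (measured here in WAL segments). Below is a minimal sketch of that test, assuming the interval is driven purely by WAL volume; the real IsCheckpointOnSchedule() in checkpointer.c also compares elapsed time against checkpoint_timeout:

/* Sketch only: may the checkpointer sleep between buffer writes?  
 *   written_fraction     = num_written / num_to_write  
 *   completion_target    = checkpoint_completion_target (default 0.5 in these versions)  
 *   wal_bytes_since_redo = recptr - ckpt_start_recptr  
 */  
static bool  
on_schedule(double written_fraction, double completion_target,  
            double wal_bytes_since_redo, double xlog_seg_size, int checkpoint_segments)  
{  
        double          progress = written_fraction * completion_target;  
        double          elapsed_xlogs = (wal_bytes_since_redo / xlog_seg_size) / checkpoint_segments;  
  
        if (progress < elapsed_xlogs)  
                return false;   /* behind schedule: keep writing, no sleep */  
  
        return true;            /* on schedule: CheckpointWriteDelay may sleep ~100 ms */  
}  

With checkpoint_completion_target = 0.5, progress never exceeds 0.5, which is why raising the target spreads the writes over a larger fraction of the checkpoint interval.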

SyncOneBuffer flushes a single buffer and returns a bitmask; BUF_WRITTEN indicates the buffer was written out.

SyncOneBuffer@src/backend/storage/buffer/bufmgr.c

/*  
 * SyncOneBuffer -- process a single buffer during syncing.  
 *  
 * If skip_recently_used is true, we don't write currently-pinned buffers, nor  
 * buffers marked recently used, as these are not replacement candidates.  
 *  
 * Returns a bitmask containing the following flag bits:  
 *      BUF_WRITTEN: we wrote the buffer.  
 *      BUF_REUSABLE: buffer is available for replacement, ie, it has  
 *              pin count 0 and usage count 0.  
 *  
 * (BUF_WRITTEN could be set in error if FlushBuffers finds the buffer clean  
 * after locking it, but we don't care all that much.)  
 *  
 * Note: caller must have done ResourceOwnerEnlargeBuffers.  
 */  
static int  
SyncOneBuffer(int buf_id, bool skip_recently_used)  
{  
        volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];  
        int                     result = 0;  
  
        /*  
         * Check whether buffer needs writing.  
         *  
         * We can make this check without taking the buffer content lock so long  
         * as we mark pages dirty in access methods *before* logging changes with  
         * XLogInsert(): if someone marks the buffer dirty just after our check we  
         * don't worry because our checkpoint.redo points before log record for  
         * upcoming changes and so we are not required to write such dirty buffer.  
         */  
        LockBufHdr(bufHdr);  
  
        if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)     
                result |= BUF_REUSABLE;  
        else if (skip_recently_used)  
        {  
                /* Caller told us not to write recently-used buffers */  
                UnlockBufHdr(bufHdr);  
                return result;  
        }  
  
        if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))  
        {  
                /* It's clean, so nothing to do */  
                UnlockBufHdr(bufHdr);  
                return result;  
        }  
  
        /*  
         * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the  
         * buffer is clean by the time we've locked it.)  
         */  
        PinBuffer_Locked(bufHdr);  
        LWLockAcquire(bufHdr->content_lock, LW_SHARED);  
  
        FlushBuffer(bufHdr, NULL);  // call FlushBuffer to write the buffer out  
  
        LWLockRelease(bufHdr->content_lock);  
        UnpinBuffer(bufHdr, true);  
  
        return result | BUF_WRITTEN;  
}  

FlushBuffer hands the buffer contents to the kernel; the kernel is responsible for the actual write to disk. Before the checkpoint's WAL record can be written, however, these writes must be forced to disk.

FlushBuffer@src/backend/storage/buffer/bufmgr.c

/*  
 * FlushBuffer  
 *              Physically write out a shared buffer.  
 *  
 * NOTE: this actually just passes the buffer contents to the kernel; the  
 * real write to disk won't happen until the kernel feels like it.  This  
 * is okay from our point of view since we can redo the changes from WAL.  
 * However, we will need to force the changes to disk via fsync before  
 * we can checkpoint WAL.  (That is, before the checkpoint WAL record is written, the buffer must reach disk.)  
 *  
 * The caller must hold a pin on the buffer and have share-locked the  
 * buffer contents.  (Note: a share-lock does not prevent updates of  
 * hint bits in the buffer, so the page could change while the write  
 * is in progress, but we assume that that will not invalidate the data  
 * written.)  
 *  
 * If the caller has an smgr reference for the buffer's relation, pass it  
 * as the second parameter.  If not, pass NULL.  
 */  
static void  
FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)  
{  
        XLogRecPtr      recptr;  
        ErrorContextCallback errcallback;  
        instr_time      io_start,  
                                io_time;  
        Block           bufBlock;  
        char       *bufToWrite;  
  
        /*  
         * Acquire the buffer's io_in_progress lock.  If StartBufferIO returns  
         * false, then someone else flushed the buffer before we could, so we need  
         * not do anything.  
         */  
        if (!StartBufferIO(buf, false))  
                return;  
  
        /* Setup error traceback support for ereport() */  
        errcallback.callback = shared_buffer_write_error_callback;  
        errcallback.arg = (void *) buf;  
        errcallback.previous = error_context_stack;  
        error_context_stack = &errcallback;  
  
        /* Find smgr relation for buffer */  
        if (reln == NULL)  
                reln = smgropen(buf->tag.rnode, InvalidBackendId);  
  
        TRACE_POSTGRESQL_BUFFER_FLUSH_START(buf->tag.forkNum,  
                                                                                buf->tag.blockNum,  
                                                                                reln->smgr_rnode.node.spcNode,  
                                                                                reln->smgr_rnode.node.dbNode,  
                                                                                reln->smgr_rnode.node.relNode);  
  
        LockBufHdr(buf);  
  
        /*  
         * Run PageGetLSN while holding header lock, since we don't have the  
         * buffer locked exclusively in all cases.  
         */  
        recptr = BufferGetLSN(buf);  // the buffer header spinlock is held here again (taken just above)  
  
        /* To check if block content changes while flushing. - vadim 01/17/97 */  
        buf->flags &= ~BM_JUST_DIRTIED;  
        UnlockBufHdr(buf);  
  
        /*  
         * Force XLOG flush up to buffer's LSN.  This implements the basic WAL  //  force the XLOG out up to the buffer's LSN,  
         * rule that log updates must hit disk before any of the data-file changes  // ensuring that all WAL generated by earlier changes to this block is already on disk.  
         * they describe do.  
         *  
         * However, this rule does not apply to unlogged relations, which will be  
         * lost after a crash anyway.  Most unlogged relation pages do not bear  
         * LSNs since we never emit WAL records for them, and therefore flushing  
         * up through the buffer LSN would be useless, but harmless.  However,  
         * GiST indexes use LSNs internally to track page-splits, and therefore  
         * unlogged GiST pages bear "fake" LSNs generated by  
         * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake  
         * LSN counter could advance past the WAL insertion point; and if it did  
         * happen, attempting to flush WAL through that location would fail, with  
         * disastrous system-wide consequences.  To make sure that can't happen,  
         * skip the flush if the buffer isn't permanent.  
         */  
        if (buf->flags & BM_PERMANENT)  
                XLogFlush(recptr);  
  
        /*  
         * Now it's safe to write buffer to disk. Note that no one else should  
         * have been able to write it while we were busy with log flushing because  
         * we have the io_in_progress lock.  
         */  
        bufBlock = BufHdrGetBlock(buf);    
  
        /*  
         * Update page checksum if desired.  Since we have only shared lock on the  
         * buffer, other processes might be updating hint bits in it, so we must  
         * copy the page to private storage if we do checksumming.  
         */  
        bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);  
  
        if (track_io_timing)  
                INSTR_TIME_SET_CURRENT(io_start);  
  
        /*  
         * bufToWrite is either the shared buffer or a copy, as appropriate.  
         */  
        smgrwrite(reln,               //  write the buffer out  
                          buf->tag.forkNum,  
                          buf->tag.blockNum,  
                          bufToWrite,  
                          false);  
  
        if (track_io_timing)  
        {  
                INSTR_TIME_SET_CURRENT(io_time);  
                INSTR_TIME_SUBTRACT(io_time, io_start);  
                pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));  
                INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);  
        }  
  
        pgBufferUsage.shared_blks_written++;  
  
        /*  
         * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and  
         * end the io_in_progress state.  
         */  
        TerminateBufferIO(buf, true, 0);  
  
        TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,    // probe: flush of this single buffer is done  
                                                                           buf->tag.blockNum,  
                                                                           reln->smgr_rnode.node.spcNode,  
                                                                           reln->smgr_rnode.node.dbNode,  
                                                                           reln->smgr_rnode.node.relNode);  
  
        /* Pop the error context stack */  
        error_context_stack = errcallback.previous;  
}  

Write the supplied buffer out.

smgrwrite@src/backend/storage/smgr/smgr.c

/*  
 *      smgrwrite() -- Write the supplied buffer out.  
 *  
 *              This is to be used only for updating already-existing blocks of a  
 *              relation (ie, those before the current EOF).  To extend a relation,  
 *              use smgrextend().  
 *  
 *              This is not a synchronous write -- the block is not necessarily  
 *              on disk at return, only dumped out to the kernel.  However,  
 *              provisions will be made to fsync the write before the next checkpoint.  
 *  
 *              skipFsync indicates that the caller will make other provisions to  
 *              fsync the relation, so we needn't bother.  Temporary relations also  
 *              do not require fsync.  
 */  
void  
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,  
                  char *buffer, bool skipFsync)  
{  
        (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,  
                                                                                          buffer, skipFsync);  
}  
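
smgrwrite() itself performs no I/O here: it dispatches through a table of function pointers, and with the only built-in storage manager (magnetic disk, md.c) the smgr_write slot points at mdwrite(), which writes the 8 kB block into the appropriate segment file (1 GB by default) through the fd.c layer. A rough sketch of the dispatch pattern; the real f_smgr struct in smgr.c has one slot per smgr operation:

/* Rough sketch of the storage-manager dispatch table (see f_smgr in smgr.c).  */  
typedef struct f_smgr  
{  
        void            (*smgr_write) (SMgrRelation reln, ForkNumber forknum,  
                                       BlockNumber blocknum, char *buffer, bool skipFsync);  
        void            (*smgr_sync) (void);  
        /* ... plus open, close, read, extend, unlink, truncate, ... */  
} f_smgr;  
  
static const f_smgr smgrsw[] = {  
        /* magnetic disk */  
        {mdwrite, mdsync /* , ... */}  
};  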

The final step is to sync the earlier writes down to disk.
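
Which files need that fsync? Whenever a dirty block is written out, whether by the checkpointer, the bgwriter, or an ordinary backend, an fsync request for the affected relation segment is queued; the checkpointer absorbs those requests into a hash table, pendingOpsTable, and mdsync() further below walks that table. A simplified sketch of one entry, based only on the fields the code references (the authoritative definitions are in src/backend/storage/smgr/md.c):

/* Simplified sketch of a pendingOpsTable entry as consumed by mdsync().  */  
typedef uint16 CycleCtr;        /* small counter that wraps; hence the wraparound handling in mdsync() */  
  
typedef struct  
{  
        RelFileNode     rnode;                          /* hash key: which relation */  
        CycleCtr        cycle_ctr;                      /* mdsync cycle in which the requests arrived */  
        Bitmapset  *requests[MAX_FORKNUM + 1];          /* segment numbers to fsync, per fork */  
        bool            canceled[MAX_FORKNUM + 1];      /* fork was dropped/truncated; forget its requests */  
} PendingOperationEntry;  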

smgrsync@src/backend/storage/smgr/smgr.c

/*  
 *      smgrsync() -- Sync files to disk during checkpoint.  
 */  
void  
smgrsync(void)  
{  
        int                     i;  
  
        for (i = 0; i < NSmgr; i++)  
        {  
                if (smgrsw[i].smgr_sync)  
                        (*(smgrsw[i].smgr_sync)) ();  
        }  
}  

smgr_sync actually resolves to

mdsync@src/backend/storage/smgr/md.c

/*  
 *      mdsync() -- Sync previous writes to stable storage.  
 */  
void  
mdsync(void)  
{  
        static bool mdsync_in_progress = false;  
  
        HASH_SEQ_STATUS hstat;  
        PendingOperationEntry *entry;  
        int                     absorb_counter;  
  
        /* Statistics on sync times */  
        int                     processed = 0;  
        instr_time      sync_start,  
                                sync_end,  
                                sync_diff;  
        uint64          elapsed;  
        uint64          longest = 0;  
        uint64          total_elapsed = 0;  
        /*  
         * This is only called during checkpoints, and checkpoints should only  
         * occur in processes that have created a pendingOpsTable.  
         */  
        if (!pendingOpsTable)  
                elog(ERROR, "cannot sync without a pendingOpsTable");  
  
        /*  
         * If we are in the checkpointer, the sync had better include all fsync  
         * requests that were queued by backends up to this point.  The tightest  
         * race condition that could occur is that a buffer that must be written  
         * and fsync'd for the checkpoint could have been dumped by a backend just  
         * before it was visited by BufferSync().  We know the backend will have  
         * queued an fsync request before clearing the buffer's dirtybit, so we  
         * are safe as long as we do an Absorb after completing BufferSync().  
         */  
        AbsorbFsyncRequests();  
  
        /*  
         * To avoid excess fsync'ing (in the worst case, maybe a never-terminating  
         * checkpoint), we want to ignore fsync requests that are entered into the  
         * hashtable after this point --- they should be processed next time,  
         * instead.  We use mdsync_cycle_ctr to tell old entries apart from new  
         * ones: new ones will have cycle_ctr equal to the incremented value of  
         * mdsync_cycle_ctr.  
         *  
         * In normal circumstances, all entries present in the table at this point  
         * will have cycle_ctr exactly equal to the current (about to be old)  
         * value of mdsync_cycle_ctr.  However, if we fail partway through the  
         * fsync'ing loop, then older values of cycle_ctr might remain when we  
         * come back here to try again.  Repeated checkpoint failures would  
         * eventually wrap the counter around to the point where an old entry  
         * might appear new, causing us to skip it, possibly allowing a checkpoint  
         * to succeed that should not have.  To forestall wraparound, any time the  
         * previous mdsync() failed to complete, run through the table and  
         * forcibly set cycle_ctr = mdsync_cycle_ctr.  
         *  
         * Think not to merge this loop with the main loop, as the problem is  
         * exactly that that loop may fail before having visited all the entries.  
         * From a performance point of view it doesn't matter anyway, as this path  
         * will never be taken in a system that's functioning normally.  
         */  
        if (mdsync_in_progress)  
        {  
                /* prior try failed, so update any stale cycle_ctr values */  
                hash_seq_init(&hstat, pendingOpsTable);  
                while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)  
                {  
                        entry->cycle_ctr = mdsync_cycle_ctr;  
                }  
        }  
  
        /* Advance counter so that new hashtable entries are distinguishable */  
        mdsync_cycle_ctr++;  
  
        /* Set flag to detect failure if we don't reach the end of the loop */  
        mdsync_in_progress = true;  
  
        /* Now scan the hashtable for fsync requests to process */  
        absorb_counter = FSYNCS_PER_ABSORB;  
        hash_seq_init(&hstat, pendingOpsTable);  
        while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)  
        {  
                ForkNumber      forknum;  
  
                /*  
                 * If the entry is new then don't process it this time; it might  
                 * contain multiple fsync-request bits, but they are all new.  Note  
                 * "continue" bypasses the hash-remove call at the bottom of the loop.  
                 */  
                if (entry->cycle_ctr == mdsync_cycle_ctr)  
                        continue;  
  
                /* Else assert we haven't missed it */  
                Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);  
  
                /*  
                 * Scan over the forks and segments represented by the entry.  
                 *  
                 * The bitmap manipulations are slightly tricky, because we can call  
                 * AbsorbFsyncRequests() inside the loop and that could result in  
                 * bms_add_member() modifying and even re-palloc'ing the bitmapsets.  
                 * This is okay because we unlink each bitmapset from the hashtable  
                 * entry before scanning it.  That means that any incoming fsync  
                 * requests will be processed now if they reach the table before we  
                 * begin to scan their fork.  
                 */  
                for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)  
                {  
                        Bitmapset  *requests = entry->requests[forknum];  
                        int                     segno;  
  
                        entry->requests[forknum] = NULL;  
                        entry->canceled[forknum] = false;  
  
                        while ((segno = bms_first_member(requests)) >= 0)  
                        {  
                                int                     failures;  
  
                                /*  
                                 * If fsync is off then we don't have to bother opening the  
                                 * file at all.  (We delay checking until this point so that  
                                 * changing fsync on the fly behaves sensibly.)  
                                 */  
                                if (!enableFsync)  
                                        continue;  
                                /*  
                                 * If in checkpointer, we want to absorb pending requests  
                                 * every so often to prevent overflow of the fsync request  
                                 * queue.  It is unspecified whether newly-added entries will  
                                 * be visited by hash_seq_search, but we don't care since we  
                                 * don't need to process them anyway.  
                                 */  
                                if (--absorb_counter <= 0)  
                                {  
                                        AbsorbFsyncRequests();  
                                        absorb_counter = FSYNCS_PER_ABSORB;  
                                }  
  
                                /*  
                                 * The fsync table could contain requests to fsync segments  
                                 * that have been deleted (unlinked) by the time we get to  
                                 * them. Rather than just hoping an ENOENT (or EACCES on  
                                 * Windows) error can be ignored, what we do on error is  
                                 * absorb pending requests and then retry.  Since mdunlink()  
                                 * queues a "cancel" message before actually unlinking, the  
                                 * fsync request is guaranteed to be marked canceled after the  
                                 * absorb if it really was this case. DROP DATABASE likewise  
                                 * has to tell us to forget fsync requests before it starts  
                                 * deletions.  
                                 */  
                                for (failures = 0;; failures++) /* loop exits at "break" */  
                                {  
                                        SMgrRelation reln;  
                                        MdfdVec    *seg;  
                                        char       *path;  
                                        int                     save_errno;  
  
                                        /*  
                                         * Find or create an smgr hash entry for this relation.  
                                         * This may seem a bit unclean -- md calling smgr?      But  
                                         * it's really the best solution.  It ensures that the  
                                         * open file reference isn't permanently leaked if we get  
                                         * an error here. (You may say "but an unreferenced  
                                         * SMgrRelation is still a leak!" Not really, because the  
                                         * only case in which a checkpoint is done by a process  
                                         * that isn't about to shut down is in the checkpointer,  
                                         * and it will periodically do smgrcloseall(). This fact  
                                         * justifies our not closing the reln in the success path  
                                         * either, which is a good thing since in non-checkpointer  
                                         * cases we couldn't safely do that.)  
                                         */  
                                        reln = smgropen(entry->rnode, InvalidBackendId);  
  
                                        /* Attempt to open and fsync the target segment */  
                                        seg = _mdfd_getseg(reln, forknum,  
                                                         (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,  
                                                                           false, EXTENSION_RETURN_NULL);  
  
                                        INSTR_TIME_SET_CURRENT(sync_start);  
  
                                        if (seg != NULL &&  
                                                FileSync(seg->mdfd_vfd) >= 0)  
                                        {  
                                                /* Success; update statistics about sync timing */  
                                                INSTR_TIME_SET_CURRENT(sync_end);  
                                                sync_diff = sync_end;  
                                                INSTR_TIME_SUBTRACT(sync_diff, sync_start);  
                                                elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);  
                                                if (elapsed > longest)  
                                                        longest = elapsed;  
                                                total_elapsed += elapsed;  
                                                processed++;  
                                                if (log_checkpoints)  
                                                        elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",  
                                                                 processed,  
                                                                 FilePathName(seg->mdfd_vfd),  
                                                                 (double) elapsed / 1000);  
  
                                                break;  /* out of retry loop */  
                                        }  
                                        /* Compute file name for use in message */  
                                        save_errno = errno;  
                                        path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);  
                                        errno = save_errno;  
  
                                        /*  
                                         * It is possible that the relation has been dropped or  
                                         * truncated since the fsync request was entered.  
                                         * Therefore, allow ENOENT, but only if we didn't fail  
                                         * already on this file.  This applies both for  
                                         * _mdfd_getseg() and for FileSync, since fd.c might have  
                                         * closed the file behind our back.  
                                         *  
                                         * XXX is there any point in allowing more than one retry?  
                                         * Don't see one at the moment, but easy to change the  
                                         * test here if so.  
                                         */  
                                        if (!FILE_POSSIBLY_DELETED(errno) ||  
                                                failures > 0)  
                                                ereport(ERROR,  
                                                                (errcode_for_file_access(),  
                                                                 errmsg("could not fsync file \"%s\": %m",  
                                                                                path)));  
                                        else  
                                                ereport(DEBUG1,  
                                                                (errcode_for_file_access(),  
                                                errmsg("could not fsync file \"%s\" but retrying: %m",  
                                                           path)));  
                                        pfree(path);  
  
                                        /*  
                                         * Absorb incoming requests and check to see if a cancel  
                                         * arrived for this relation fork.  
                                         */  
                                        AbsorbFsyncRequests();  
                                        absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */  
                                        if (entry->canceled[forknum])  
                                                break;  
                                }                               /* end retry loop */  
                        }  
                        bms_free(requests);  
                }  
  
                /*  
                 * We've finished everything that was requested before we started to  
                 * scan the entry.  If no new requests have been inserted meanwhile,  
                 * remove the entry.  Otherwise, update its cycle counter, as all the  
                 * requests now in it must have arrived during this cycle.  
                 */  
                for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)  
                {  
                        if (entry->requests[forknum] != NULL)  
                                break;  
                }  
                if (forknum <= MAX_FORKNUM)  
                        entry->cycle_ctr = mdsync_cycle_ctr;  
                else  
                {  
                        /* Okay to remove it */  
                        if (hash_search(pendingOpsTable, &entry->rnode,  
                                                        HASH_REMOVE, NULL) == NULL)  
                                elog(ERROR, "pendingOpsTable corrupted");  
                }  
        }                                                       /* end loop over hashtable entries */  
  
        /* Return sync performance metrics for report at checkpoint end */  
        CheckpointStats.ckpt_sync_rels = processed;  
        CheckpointStats.ckpt_longest_sync = longest;  
        CheckpointStats.ckpt_agg_sync_time = total_elapsed;  
  
        /* Flag successful completion of mdsync */  
        mdsync_in_progress = false;  
}  

Summary

The checkpointer's flushing of the buffer cache breaks down into a few steps:

1. Scan the shared buffer pool and flag the currently dirty blocks with BM_CHECKPOINT_NEEDED ("needs checkpoint").

2. Scan the shared buffer pool again and write the blocks flagged in the previous step to disk; before each write, the XLOG up to that buffer's LSN must already have been flushed to disk.

3. Sync (fsync) the earlier writes to persistent storage.

The time spent in each phase can be observed via the probes along the way, or from the checkpoint log output.

The next article will look at tracing checkpoints.

References

1. http://blog.163.com/digoal@126/blog/static/1638770402015463252387/
