PostgreSQL Checkpoint Performance Impact and Source Code Analysis - 7


Background

Where does database reliability come from?

How does a database recover after a crash, and from what position does recovery start?

What is a database checkpoint?

What exactly does a checkpoint do?

Why does a checkpoint hurt performance when there is a lot of dirty data?

What is a full page write?

I believe these are questions everyone who works on databases wants to understand.

The next series of articles revolves around checkpoints: how they work, and why they can noticeably affect database performance when there is a lot of dirty data.

Main Content

This article shows how to smooth out checkpoint writes by modifying IsCheckpointOnSchedule.

If you want to understand PostgreSQL's checkpoint mechanism, please refer to my earlier checkpoint source code analysis articles:

PostgreSQL Checkpoint Performance Impact and Source Code Analysis - 1

PostgreSQL Checkpoint Performance Impact and Source Code Analysis - 2

PostgreSQL Checkpoint Performance Impact and Source Code Analysis - 3

PostgreSQL Checkpoint Performance Impact and Source Code Analysis - 4

PostgreSQL Checkpoint Performance Impact and Source Code Analysis - 5

PostgreSQL Checkpoint Performance Impact and Source Code Analysis - 6

While writing the earlier articles I noticed a small problem in PostgreSQL's checkpoint scheduling: with asynchronous commit enabled, the buffer writes are not spread evenly over the whole checkpoint_completion_target window.

For example:

My checkpoint scheduling configuration is as follows:

checkpoint_segments = 512  
checkpoint_timeout = 5min  
checkpoint_completion_target = 0.9  

In theory, the checkpoint's buffer writes should be spread evenly over 512*0.9 xlog segments, or over 5*0.9 = 4.5 minutes.
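Concretely, assuming the default 16MB xlog segment size (a compile-time setting in this PostgreSQL version), that window works out to roughly:

512 * 0.9 = 460.8 xlog segments, 460.8 * 16 MB ≈ 7.2 GB of WAL  
5 min * 0.9 = 4.5 min = 270 seconds  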

But what happens if asynchronous commit is enabled (synchronous_commit=off)?

Trace it with stap (these static probes are only available if PostgreSQL was built with --enable-dtrace):

# stap -DMAXSKIPPED=100000 -v 11111 -e '  
global s_var, e_var, stat_var;  
  
/* probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */  
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__start") {  
  s_var[pid(),1] = gettimeofday_us()  
}  
  
/* probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */  
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__done") {  
  e_var[pid(),1] = gettimeofday_us()  
  if ( s_var[pid(),1] > 0 )  
    stat_var[pid(),1] <<< e_var[pid(),1] - s_var[pid(),1]  
}  
  
/* probe smgr__md__write__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */  
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__start") {  
  s_var[pid(),2] = gettimeofday_us()  
}  
  
/* probe smgr__md__write__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */  
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__done") {  
  e_var[pid(),2] = gettimeofday_us()  
  if ( s_var[pid(),2] > 0 )  
    stat_var[pid(),2] <<< e_var[pid(),2] - s_var[pid(),2]  
}  
  
probe process("/opt/pgsql/bin/postgres").mark("buffer__sync__start") {  
  printf("buffer__sync__start num_buffers: %d, dirty_buffers: %d\n", $NBuffers, $num_to_write)  
}  
  
probe process("/opt/pgsql/bin/postgres").mark("checkpoint__start") {  
  printf("checkpoint start\n")  
}  
  
probe process("/opt/pgsql/bin/postgres").mark("checkpoint__done") {  
  printf("checkpoint done\n")  
}  
  
probe timer.s(1) {  
  foreach ([v1,v2] in stat_var +) {  
    if ( @count(stat_var[v1,v2]) >0 ) {  
      printf("r1_or_w2 %d, pid: %d, min: %d, max: %d, avg: %d, sum: %d, count: %d\n", v2, v1, @min(stat_var[v1,v2]), @max(stat_var[v1,v2]), @avg(stat_var[v1,v2]), @sum(stat_var[v1,v2]), @count(stat_var[v1,v2]))  
    }  
  }  
  printf("----------------------------------end-----------------------------\n")  
  delete s_var  
  delete e_var  
  delete stat_var  
}'  

We keep using the test case and test data from my earlier articles on checkpoint internals.

$ vi test.sql  
\setrandom id 1 50000000  
update tbl set crt_time=now() where id = :id ;  
  
$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 100000000  

tps while the checkpoint is in progress:  
progress: 255.0 s, 58152.2 tps, lat 0.462 ms stddev 0.504  
progress: 256.0 s, 31382.8 tps, lat 0.844 ms stddev 2.331  
progress: 257.0 s, 14615.5 tps, lat 1.863 ms stddev 4.554  
progress: 258.0 s, 16258.4 tps, lat 1.652 ms stddev 4.139  
progress: 259.0 s, 17814.7 tps, lat 1.526 ms stddev 4.035  
progress: 260.0 s, 14573.8 tps, lat 1.825 ms stddev 5.592  
progress: 261.0 s, 16736.6 tps, lat 1.600 ms stddev 5.018  
progress: 262.0 s, 19060.5 tps, lat 1.448 ms stddev 4.818  
progress: 263.0 s, 20553.2 tps, lat 1.290 ms stddev 4.146  
progress: 264.0 s, 26223.0 tps, lat 1.042 ms stddev 3.711  
progress: 265.0 s, 31953.0 tps, lat 0.836 ms stddev 2.837  
progress: 266.0 s, 43396.1 tps, lat 0.627 ms stddev 1.615  
progress: 267.0 s, 50487.8 tps, lat 0.533 ms stddev 0.647  
progress: 268.0 s, 53537.7 tps, lat 0.502 ms stddev 0.598  
progress: 269.0 s, 54259.3 tps, lat 0.496 ms stddev 0.624  
progress: 270.0 s, 56139.8 tps, lat 0.479 ms stddev 0.524  

stap trace output:

When the checkpoint occurs, the checkpointer starts writing out buffers. But notice that the writes are not even: many buffers are written at the beginning and fewer towards the end.

Ideally, writing 156467 buffers evenly over 5*0.9 minutes, i.e. 270 seconds, would mean about 579 dirty blocks per second.

In reality, 8596 dirty blocks were written in the first second, and the rate stayed above the even, time-based pace for the next dozen or so seconds. The reason is that the XLOG advances faster than the clock, so IsCheckpointOnSchedule() keeps reporting the checkpointer as behind schedule and it never gets a chance to sleep between writes:

checkpoint start  
buffer__sync__start num_buffers: 262144, dirty_buffers: 156467  
r1_or_w2 2, pid: 19848, min: 41, max: 1471, avg: 49, sum: 425291, count: 8596  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 41, max: 153, avg: 49, sum: 450597, count: 9078  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 41, max: 643, avg: 51, sum: 429193, count: 8397  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 41, max: 1042, avg: 55, sum: 449091, count: 8097  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 41, max: 254, avg: 52, sum: 296668, count: 5617  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 39, max: 171, avg: 54, sum: 321027, count: 5851  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 41, max: 138, avg: 60, sum: 300056, count: 4953  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 42, max: 1217, avg: 65, sum: 312859, count: 4748  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 41, max: 1371, avg: 56, sum: 353905, count: 6304  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 41, max: 358, avg: 58, sum: 236254, count: 4038  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 34, max: 1239, avg: 63, sum: 296906, count: 4703  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 31, max: 17408, avg: 63, sum: 415234, count: 6534  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 31, max: 5486, avg: 57, sum: 190345, count: 3318  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 29, max: 510, avg: 53, sum: 136221, count: 2563  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 32, max: 733, avg: 52, sum: 108327, count: 2070  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 34, max: 382, avg: 53, sum: 96157, count: 1812  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 43, max: 327, avg: 53, sum: 83641, count: 1571  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 33, max: 102, avg: 54, sum: 79991, count: 1468  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 35, max: 88, avg: 53, sum: 74338, count: 1389  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 32, max: 86, avg: 52, sum: 65710, count: 1243  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 30, max: 347, avg: 52, sum: 66866, count: 1263  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 31, max: 93, avg: 54, sum: 75642, count: 1398  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 19848, min: 33, max: 100, avg: 51, sum: 62302, count: 1216  
......  

Cause analysis:

With full page writes enabled, the first time any data block is dirtied after a checkpoint, a full page image must be written into the WAL, so the amount of WAL written spikes right after the checkpoint (in our test case of fully random updates on a large table, a very large number of blocks are dirtied for the first time, so the spike is even bigger).
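A simplified sketch of that rule (not the actual PostgreSQL code; the types and values below are stand-ins): a WAL record must carry a full image of the modified page when the page's LSN still precedes the redo pointer of the current checkpoint, i.e. when this is the first change to that page since the checkpoint started.

/*
 * Simplified sketch of the full-page-write rule (stand-in types and values).
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

static bool       fullPageWrites = true;   /* full_page_writes = on */
static XLogRecPtr RedoRecPtr     = 1000;   /* redo pointer of the current checkpoint */

/* Does the next WAL record for this page need a full page image? */
static bool
needs_full_page_image(XLogRecPtr page_lsn)
{
	/* page not modified since the checkpoint began -> log the whole 8KB page */
	return fullPageWrites && page_lsn <= RedoRecPtr;
}

int
main(void)
{
	printf("cold page (lsn  900): %s\n", needs_full_page_image(900)  ? "full page image" : "small record");
	printf("hot page  (lsn 1200): %s\n", needs_full_page_image(1200) ? "full page image" : "small record");
	return 0;
}

In a workload of fully random updates on a huge table, almost every update right after the checkpoint hits a "cold" page, so almost every WAL record carries a full 8KB image instead of a change record of a few dozen bytes.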

Therefore, if the checkpointer's buffer writes are paced by the amount of XLOG written, more buffers get written in the early phase and fewer in the later phase.
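To see how XLOG-based pacing front-loads the writes, here is a toy, self-contained simulation. The WAL-rate curve (8 segments/second for the first 30 seconds, then 1/second) and the I/O ceiling are assumptions for illustration, not measurements or PostgreSQL code; the pacing rule mirrors IsCheckpointOnSchedule, writing only fast enough to keep up with max(elapsed_xlogs, elapsed_time).

#include <stdio.h>

int
main(void)
{
	const double total_buffers = 156467;     /* dirty buffers to write (from the trace above) */
	const double window = 270.0;             /* 0.9 * checkpoint_timeout, in seconds */
	const double wal_budget = 460.8;         /* 0.9 * checkpoint_segments */
	const double max_write_per_sec = 10000;  /* assumed I/O ceiling, buffers/second */

	double written = 0.0, wal = 0.0;

	for (int sec = 1; sec <= (int) window && written < total_buffers; sec++)
	{
		/* assumed WAL rate: much faster while full page images dominate */
		double wal_rate = (sec <= 30) ? 8.0 : 1.0;   /* segments per second */
		wal += wal_rate;

		double elapsed_time  = sec / window;
		double elapsed_xlogs = wal / wal_budget;
		double target = (elapsed_xlogs > elapsed_time) ? elapsed_xlogs : elapsed_time;
		if (target > 1.0)
			target = 1.0;

		/* write just enough buffers to be "on schedule" again */
		double need = target * total_buffers - written;
		if (need < 0.0)
			need = 0.0;
		if (need > max_write_per_sec)
			need = max_write_per_sec;
		written += need;

		if (sec <= 5 || sec % 60 == 0)
			printf("second %3d: wrote %6.0f buffers (total %6.0f)\n", sec, need, written);
	}
	return 0;
}

With the assumed burst of full-page-write WAL in the first 30 seconds, the simulation writes a few thousand buffers per second early on and only a few hundred per second later, which is the same shape as the stap trace above.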

We can correct this by modifying IsCheckpointOnSchedule: when full page writes are enabled, skip the XLOG-based scheduling and use only the time-based scheduling. Of course this has its own downside: if the interval is long and checkpoint_segments is exceeded, the checkpoint may not finish writing buffers before the next checkpoint is requested.

src/backend/postmaster/checkpointer.c

/*  
 * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint  
 *               in time?  
 *  
 * Compares the current progress against the time/segments elapsed since last  
 * checkpoint, and returns true if the progress we've made this far is greater  
 * than the elapsed time/segments.  
 */  
static bool  
IsCheckpointOnSchedule(double progress)  
{  
        XLogRecPtr      recptr;  
        struct timeval now;  
        double          elapsed_xlogs,  
                                elapsed_time;  
  
        Assert(ckpt_active);  
  
        /* Scale progress according to checkpoint_completion_target. */  
        progress *= CheckPointCompletionTarget;  
  
        /*  
         * Check against the cached value first. Only do the more expensive  
         * calculations once we reach the target previously calculated. Since  
         * neither time or WAL insert pointer moves backwards, a freshly  
         * calculated value can only be greater than or equal to the cached value.  
         */  
        if (progress < ckpt_cached_elapsed)  
                return false;  
  
        /*  
         * Check progress against WAL segments written and checkpoint_segments.  
         *  
         * We compare the current WAL insert location against the location  
         * computed before calling CreateCheckPoint. The code in XLogInsert that  
         * actually triggers a checkpoint when checkpoint_segments is exceeded  
         * compares against RedoRecptr, so this is not completely accurate.  
         * However, it's good enough for our purposes, we're only calculating an  
         * estimate anyway.  
         */  // The problem is in how progress is measured against CheckPointSegments: with asynchronous commit enabled, XLOG is generated faster than buffers are written,  
             // so in the early phase the buffer writes cannot keep up with the XLOG, which forces more buffer writes early and fewer later.  
        if (!RecoveryInProgress())  
        {  
                recptr = GetInsertRecPtr();  
                elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;  
  
                if (progress < elapsed_xlogs)  
                {  
                        ckpt_cached_elapsed = elapsed_xlogs;  
                        return false;  
                }  
        }  
           // The time-based check does not suffer from this problem: time advances evenly, so the checkpoint's buffer-write pace can simply follow the clock.  
        /*  
         * Check progress against time elapsed and checkpoint_timeout.  
         */  
        gettimeofday(&now, NULL);  
        elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +  
                                        now.tv_usec / 1000000.0) / CheckPointTimeout;  
  
        if (progress < elapsed_time)  
        {  
                ckpt_cached_elapsed = elapsed_time;  
                return false;  
        }  
  
        /* It looks like we're on schedule. */  
        return true;  
}  

A crude but simple fix is to modify the code as follows:

vi src/backend/postmaster/checkpointer.c

/*  
 * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint  
 *               in time?  
 *  
 * Compares the current progress against the time/segments elapsed since last  
 * checkpoint, and returns true if the progress we've made this far is greater  
 * than the elapsed time/segments.  
 */  
static bool  
IsCheckpointOnSchedule(double progress)  
{  
        XLogRecPtr      recptr;  
        struct timeval now;  
        double          elapsed_xlogs,  
                                elapsed_time;  
  
        Assert(ckpt_active);  
  
        /* Scale progress according to checkpoint_completion_target. */  
        progress *= CheckPointCompletionTarget;  
  
        /*  
         * Check against the cached value first. Only do the more expensive  
         * calculations once we reach the target previously calculated. Since  
         * neither time or WAL insert pointer moves backwards, a freshly  
         * calculated value can only be greater than or equal to the cached value.  
         */  
        if (progress < ckpt_cached_elapsed)  
                return false;  
  
        /*  
         * Check progress against WAL segments written and checkpoint_segments.  
         *  
         * We compare the current WAL insert location against the location  
         * computed before calling CreateCheckPoint. The code in XLogInsert that  
         * actually triggers a checkpoint when checkpoint_segments is exceeded  
         * compares against RedoRecptr, so this is not completely accurate.  
         * However, it's good enough for our purposes, we're only calculating an  
         * estimate anyway.  
         */  
        if (!RecoveryInProgress())  
        {  
                if (!fullPageWrites)  
                {  
                    recptr = GetInsertRecPtr();  
                    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;  
  
                    if (progress < elapsed_xlogs)  
                    {  
                        ckpt_cached_elapsed = elapsed_xlogs;  
                        return false;  
                    }  
                 }  
        }  
  
        /*  
         * Check progress against time elapsed and checkpoint_timeout.  
         */  
        gettimeofday(&now, NULL);  
        elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +  
                                        now.tv_usec / 1000000.0) / CheckPointTimeout;  
  
        if (progress < elapsed_time)  
        {  
                ckpt_cached_elapsed = elapsed_time;  
                return false;  
        }  
  
        /* It looks like we're on schedule. */  
        return true;  
}  
# gmake && gmake install  

Re-test:

pg_ctl restart -m fast  

tps while the checkpoint is in progress:

progress: 291.0 s, 63144.9 tps, lat 0.426 ms stddev 0.383  
progress: 292.0 s, 55063.7 tps, lat 0.480 ms stddev 1.433  
progress: 293.0 s, 12225.3 tps, lat 2.238 ms stddev 4.460  
progress: 294.0 s, 16436.4 tps, lat 1.621 ms stddev 4.043  
progress: 295.0 s, 18516.5 tps, lat 1.444 ms stddev 3.286  
progress: 296.0 s, 21983.7 tps, lat 1.251 ms stddev 2.941  
progress: 297.0 s, 25759.7 tps, lat 1.034 ms stddev 2.356  
progress: 298.0 s, 33139.4 tps, lat 0.821 ms stddev 1.676  
progress: 299.0 s, 41904.9 tps, lat 0.644 ms stddev 1.134  
progress: 300.0 s, 52432.9 tps, lat 0.513 ms stddev 0.470  
progress: 301.0 s, 57115.4 tps, lat 0.471 ms stddev 0.325  
progress: 302.0 s, 59422.1 tps, lat 0.452 ms stddev 0.297  
progress: 303.0 s, 59860.5 tps, lat 0.449 ms stddev 0.309  

stap trace output:

checkpoint start  
----------------------------------end-----------------------------  
buffer__sync__start num_buffers: 262144, dirty_buffers: 156761  
r1_or_w2 2, pid: 22334, min: 51, max: 137, avg: 60, sum: 52016, count: 860  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 51, max: 108, avg: 58, sum: 35526, count: 604  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 51, max: 145, avg: 71, sum: 39779, count: 559  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 52, max: 172, avg: 79, sum: 47279, count: 594  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 44, max: 160, avg: 63, sum: 36907, count: 581  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 61, sum: 33895, count: 552  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 51, max: 116, avg: 61, sum: 38177, count: 617  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 62, sum: 34199, count: 550  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 53, max: 109, avg: 65, sum: 39842, count: 606  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 50, max: 118, avg: 64, sum: 35099, count: 545  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 50, max: 107, avg: 64, sum: 39027, count: 606  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 51, max: 114, avg: 62, sum: 34054, count: 545  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 47, max: 106, avg: 63, sum: 38573, count: 605  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 48, max: 101, avg: 62, sum: 38051, count: 607  
----------------------------------end-----------------------------  
r1_or_w2 2, pid: 22334, min: 42, max: 103, avg: 61, sum: 33596, count: 545  
......  

Buffer writes are now fairly even. However, even though the checkpointer now writes buffers evenly, pgbench still shows a sharp performance drop in roughly the first 10 seconds after the checkpoint starts, followed by a gradual recovery.

That is because we have only looked at the I/O impact of the checkpoint's buffer writes here; other factors also contribute to the drop. They are related to the xlog, and I will cover them in a later article rather than here.

If your system's I/O capability is weak and you use asynchronous commit, checkpoints may have a noticeable I/O impact. In that case, consider adjusting this part of the code and also increasing checkpoint_completion_target (I would not go beyond 0.9); after the change, observe whether the I/O impact drops to a level that no longer disturbs your workload.

Why does WAL-based scheduling become uneven after a checkpoint, or in other words, why does the WAL grow faster? Because our stress test performs a huge number of updates, and right after a checkpoint it generates a lot of WAL full page writes (when full_page_writes=on). So another way to tune this is: when full page writes are enabled, apply an exponent to the WAL-based progress to even things out, as in the snippet below. The exponent is an empirical value, so it is not perfect either; personally I still lean towards scheduling purely by time.

		/*  
		 * Immediately after a checkpoint, a lot more WAL is generated when  
		 * full_page_write is enabled, because every WAL record has to include  
		 * a full image of the modified page. It levels off as time passes and  
		 * more updates fall on pages that have already been modified since  
		 * the last checkpoint.  
		 *  
		 * To correct for that effect, apply a corrective factor on the  
		 * amount of WAL consumed so far.  
		 */  
		if (fullPageWrites)  
			elapsed_xlogs = pow(elapsed_xlogs, 1.5);  
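To see what the exponent does: during a well-paced checkpoint, elapsed_xlogs is a fraction between 0 and 1, and for such values pow(x, 1.5) < x, so the correction relaxes the WAL-based check most strongly in the early, full-page-write-heavy phase. For example:

pow(0.2, 1.5) ≈ 0.089  
pow(0.5, 1.5) ≈ 0.354  
pow(0.8, 1.5) ≈ 0.716  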

Looked at from another angle, the faster the XLOG is generated, the more I/O it demands. At such moments, if we want to reduce the contention between the checkpointer's buffer writes and the XLOG I/O, the checkpointer should actually sleep; yet the current policy makes it write more buffers the faster the XLOG is generated. So pacing the checkpointer's buffer writes purely by time is probably the more reliable approach. Or perhaps a good algorithm could let the checkpointer rest as much as possible while XLOG is being generated quickly, and write faster while XLOG generation is light.
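As a purely hypothetical illustration of that last idea (assumed inputs, not PostgreSQL code), the schedule check could be skewed by the recent WAL load, so the checkpointer rests while WAL generation is heavy and catches up when it is light:

/*
 * Toy illustration only: wal_load is the recent WAL write rate divided by
 * the long-term average (> 1.0 means the system is busy writing WAL).
 */
#include <stdbool.h>
#include <stdio.h>

/*
 * progress      fraction of this checkpoint's dirty buffers already written
 * elapsed_time  fraction of the checkpoint window already elapsed
 */
static bool
on_schedule_adaptive(double progress, double elapsed_time, double wal_load)
{
	if (wal_load > 1.0)
		elapsed_time /= wal_load;   /* busy: pretend less time has passed, so we rest */
	return progress >= elapsed_time;
}

int
main(void)
{
	/* same progress (10%) and elapsed time (20%), different WAL load */
	printf("quiet system: %s\n", on_schedule_adaptive(0.10, 0.20, 0.5) ? "sleep" : "write");
	printf("busy  system: %s\n", on_schedule_adaptive(0.10, 0.20, 3.0) ? "sleep" : "write");
	return 0;
}

With the same progress and elapsed time, the quiet system keeps writing while the busy one sleeps; whether such a heuristic behaves well in practice would of course need testing.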

References

1. src/backend/postmaster/checkpointer.c

2. http://blog.163.com/digoal@126/blog/static/163877040201542103933969/

3. http://blog.163.com/digoal@126/blog/static/1638770402015463252387/

4. http://blog.163.com/digoal@126/blog/static/16387704020154651655783/

5. http://blog.163.com/digoal@126/blog/static/16387704020154653422892/
