PostgreSQL Checkpoint Performance Impact and Source Code Analysis - 6

17 minute read

Background

Where does database reliability come from?

How does a database recover after a crash, and from what point does recovery start?

What is a database checkpoint?

What does a checkpoint actually do?

Why do checkpoints hurt performance when there are many dirty pages?

What is a full page write?

These are questions anyone who works with databases wants to understand.

This series of articles walks through checkpoints: how they work, and why they have a noticeable performance impact when there is a lot of dirty data.

Main text

This article uncovers the cause of the performance impact brought by checkpoints. It is mainly related to writing full-page XLOG records, and only slightly related to I/O (as discussed before, if your system has strong I/O capability the I/O effect can be ignored; see http://blog.163.com/digoal@126/blog/static/163877040201541134842882/).

What this article focuses on is the impact of full page writes, and that impact is in fact very large.

1. The essence of writing xlog: before the database flushes a dirty block from shared buffers, it must make sure the REDO records for that block have already been written to XLOG successfully; only then may the dirty block be flushed.

2. What is a PostgreSQL checkpoint for?

When the database needs to recover after a crash, or for any other reason, recovery starts from the last checkpoint, reading and replaying XLOG from there.

Given these two points, guaranteeing data consistency requires more than a successful XLOG write; the data blocks themselves must also be consistent (that is, flushing a data block must be atomic, so the on-disk block never contains a mix of old and new data).

full_page_writes is designed for file systems that cannot guarantee such atomic writes. The approach: before a dirty block is flushed, if the block became dirty for the first time since the last checkpoint, the entire content of the block is recorded in XLOG. During recovery it then no longer matters whether flushing the block was atomic, because the WAL contains the full block image, and recovery overwrites the block in the data file with the full page recorded in xlog.
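To make this concrete, here is a minimal demonstration that can be run in psql (a hedged sketch: it assumes PostgreSQL 9.4 function names, superuser rights for CHECKPOINT, psql's \gset, and the test table tbl with columns id and crt_time that is used later in this post). The first update of a page after a checkpoint should generate roughly one full block more WAL than a second update of the same page:

checkpoint;
select pg_current_xlog_insert_location() as lsn0 \gset
update tbl set crt_time=now() where id=1;   -- first touch of this page after the checkpoint: a full page image goes into WAL
select pg_xlog_location_diff(pg_current_xlog_insert_location(), :'lsn0') as wal_bytes_first_update;
select pg_current_xlog_insert_location() as lsn1 \gset
update tbl set crt_time=now() where id=1;   -- same page again: only the row-level change is logged
select pg_xlog_location_diff(pg_current_xlog_insert_location(), :'lsn1') as wal_bytes_second_update;

With full_page_writes = on, wal_bytes_first_update should exceed wal_bytes_second_update by roughly block_size; with it off, the two figures should be similar.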

The parameter is documented as follows:

full_page_writes (boolean)  
When this parameter is on, the PostgreSQL server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint. 
This is needed because a page write that is in process during an operating system crash might be only partially completed, leading to an on-disk page that contains a mix of old and new data. 
The row-level change data normally stored in WAL will not be enough to completely restore such a page during post-crash recovery. 
Storing the full page image guarantees that the page can be correctly restored, but at the price of increasing the amount of data that must be written to WAL. 
(Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the checkpoint interval parameters.)  
  
Turning this parameter off speeds normal operation, but might lead to either unrecoverable data corruption, or silent data corruption, after a system failure. 
The risks are similar to turning off fsync, though smaller, and it should be turned off only based on the same circumstances recommended for that parameter.  
  
Turning off this parameter does not affect use of WAL archiving for point-in-time recovery (PITR) (see Section 24.3).  
  
This parameter can only be set in the postgresql.conf file or on the server command line. The default is on.  
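To see where full_page_writes and the checkpoint-interval parameters mentioned above currently stand on a 9.4 instance (checkpoint_segments was removed in later releases), a quick look at pg_settings is enough:

select name, setting, unit, context
from pg_settings
where name in ('full_page_writes', 'checkpoint_segments', 'checkpoint_timeout', 'checkpoint_completion_target');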

Because of this, the larger the data block, the bigger the impact of writing full pages; and the more frequent the checkpoints, the more WAL content is generated (because there are more full page writes).

So how can full_page_writes be turned off safely?

1. The file system prevents partial writes for us. If the file system holding the data files can guarantee that partial writes never happen (for example, writing a 32K block is atomic), the database can turn off full_page_writes. ZFS, for instance, offers this kind of guarantee for its writes.

2. The hardware provides an atomic (full page) write interface. FusionIO, and the PCI-E SSD products from 宝存 (Shannon Systems), expose an atomic write API (or their native atomic write unit is already larger than the database block size). As long as PostgreSQL's block_size is less than or equal to the atomic write size the hardware can provide, and the corresponding API is used, full_page_writes can be turned off; a quick check of the relevant sizes follows below.
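Before relying on such hardware or file-system guarantees, it is worth double-checking the page sizes this particular cluster was built with (both are fixed at compile/initdb time and read-only at runtime):

show block_size;      -- data page size; 8kB by default, 32kB in some examples in this post
show wal_block_size;  -- WAL page size; 8kB by default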

What if full_page_writes is off, a hardware fault or some other problem crashes the operating system, and a partial write (data corruption) hits a block that was dirtied for the first time after a checkpoint (for example, part of a 32KB block still holds the old contents and part holds the new)? What then?

If that really happens, queries that touch such a block may fail with an error; you can set zero_damaged_pages to skip over these blocks.

zero_damaged_pages (boolean)  
Detection of a damaged page header normally causes PostgreSQL to report an error, aborting the current transaction. 
Setting zero_damaged_pages to on causes the system to instead report a warning, zero out the damaged page in memory, and continue processing. 
This behavior will destroy data, namely all the rows on the damaged page. 
However, it does allow you to get past the error and retrieve rows from any undamaged pages that might be present in the table. 
It is useful for recovering data if corruption has occurred due to a hardware or software error. 
You should generally not set this on until you have given up hope of recovering data from the damaged pages of a table. 
Zeroed-out pages are not forced to disk so it is recommended to recreate the table or the index before turning this parameter off again. 
The default setting is off, and it can only be changed by a superuser.  
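As a last-resort illustration (a hedged sketch only: the table name and output path are placeholders), salvaging whatever is still readable from a table with damaged pages could look like this:

set zero_damaged_pages = on;    -- superuser only; damaged pages are zeroed in memory and their rows are lost
select count(*) from tbl;       -- damaged pages now produce warnings instead of aborting the query
copy tbl to '/tmp/tbl_salvaged.csv' with (format csv);   -- dump the surviving rows
set zero_damaged_pages = off;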

On another note, XLOG must be written sequentially, so it is protected by WAL insertion locks: copying a WAL record into the WAL buffers requires holding one of them.

/*  
 * Inserting to WAL is protected by a small fixed number of WAL insertion  
 * locks. To insert to the WAL, you must hold one of the locks - it doesn't  
 * matter which one. To lock out other concurrent insertions, you must hold  
 * all of them. Each WAL insertion lock consists of a lightweight lock, plus an  
 * indicator of how far the insertion has progressed (insertingAt).  
 *  
 * The insertingAt values are read when a process wants to flush WAL from  
 * the in-memory buffers to disk, to check that all the insertions to the  
 * region the process is about to write out have finished. You could simply  
 * wait for all currently in-progress insertions to finish, but the  
 * insertingAt indicator allows you to ignore insertions to later in the WAL,  
 * so that you only wait for the insertions that are modifying the buffers  
 * you're about to write out.  
 *  
 * This isn't just an optimization. If all the WAL buffers are dirty, an  
 * inserter that's holding a WAL insert lock might need to evict an old WAL  
 * buffer, which requires flushing the WAL. If it's possible for an inserter  
 * to block on another inserter unnecessarily, deadlock can arise when two  
 * inserters holding a WAL insert lock wait for each other to finish their  
 * insertion.  
 *  
 * Small WAL records that don't cross a page boundary never update the value,  
 * the WAL record is just copied to the page and the lock is released. But  
 * to avoid the deadlock-scenario explained above, the indicator is always  
 * updated before sleeping while holding an insertion lock.  
 */  
typedef struct  
{  
        LWLock          lock;  
        XLogRecPtr      insertingAt;  
} WALInsertLock;  

In other words, the slower WAL writes are, the lower the TPS, even with asynchronous commit (asynchronous commit does not have to wait for the WAL flush to disk, but it still has to finish copying the WAL record into the WAL buffers).

Let's run a test. Restart the database: right after startup it performs a startup checkpoint, so blocks dirtied for the first time afterwards need full page writes.

To avoid interference from shared buffer misses, first use pg_prewarm to load the data into shared buffers:

$ pg_ctl start  
  
$ psql  
digoal=> select * from pg_prewarm('tbl');  
 pg_prewarm   
------------  
     122933  
(1 row)  
digoal=> select * from pg_prewarm('tbl_pkey');  
 pg_prewarm   
------------  
      34023  
(1 row)  

Start the test:

$ cat test.sql  
\setrandom id 1 50000000  
update tbl set crt_time=now() where id = :id ;  

As you can see, the result looks just like what happens around a checkpoint: performance climbs gradually, because at the beginning the WAL has to carry full page images, which hurts performance.

$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 1000000  
progress: 1.0 s, 14744.4 tps, lat 1.865 ms stddev 4.009  
progress: 2.0 s, 16811.3 tps, lat 1.668 ms stddev 4.304  
progress: 3.0 s, 19606.2 tps, lat 1.415 ms stddev 3.358  
progress: 4.0 s, 23229.7 tps, lat 1.214 ms stddev 2.922  
progress: 5.0 s, 27691.0 tps, lat 1.001 ms stddev 2.356  
progress: 6.0 s, 34756.0 tps, lat 0.810 ms stddev 1.685  
progress: 7.0 s, 46214.1 tps, lat 0.604 ms stddev 0.839  
progress: 8.0 s, 54729.7 tps, lat 0.510 ms stddev 0.358  
progress: 9.0 s, 56774.4 tps, lat 0.491 ms stddev 0.324  
progress: 10.0 s, 57470.0 tps, lat 0.485 ms stddev 0.330  
progress: 11.0 s, 57342.1 tps, lat 0.487 ms stddev 0.349  
progress: 12.0 s, 58505.2 tps, lat 0.477 ms stddev 0.323  
progress: 13.0 s, 58999.8 tps, lat 0.473 ms stddev 0.315  
progress: 14.0 s, 59607.8 tps, lat 0.468 ms stddev 0.310  
progress: 15.0 s, 59006.1 tps, lat 0.473 ms stddev 0.328  
progress: 16.0 s, 59519.8 tps, lat 0.468 ms stddev 0.315  
progress: 17.0 s, 59787.6 tps, lat 0.466 ms stddev 0.310  
progress: 18.0 s, 59188.9 tps, lat 0.471 ms stddev 0.324  
progress: 19.0 s, 59626.9 tps, lat 0.468 ms stddev 0.321  
progress: 20.0 s, 61206.3 tps, lat 0.456 ms stddev 0.297  
progress: 21.0 s, 59606.7 tps, lat 0.468 ms stddev 0.318  
progress: 22.0 s, 60024.8 tps, lat 0.465 ms stddev 0.316  

The more hot data there is, the more pronounced this "warm-up" effect becomes.

Now turn full_page_writes off and test again:

full_page_writes = off  
  
pg_ctl restart -m fast  
  
$ psql  
digoal=> select * from pg_prewarm('tbl');  
 pg_prewarm   
------------  
     122933  
(1 row)  
  
digoal=> select * from pg_prewarm('tbl_pkey');  
 pg_prewarm   
------------  
      34023  
(1 row)  

As you can see, the "warm-up" effect is gone, because the WAL no longer needs to carry full pages.

$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 1000000  
progress: 1.0 s, 73337.2 tps, lat 0.375 ms stddev 0.350  
progress: 2.0 s, 68862.4 tps, lat 0.405 ms stddev 0.324  
progress: 3.0 s, 64543.7 tps, lat 0.432 ms stddev 0.321  
progress: 4.0 s, 62050.6 tps, lat 0.449 ms stddev 0.325  
progress: 5.0 s, 61312.0 tps, lat 0.455 ms stddev 0.316  
progress: 6.0 s, 60668.8 tps, lat 0.460 ms stddev 0.318  
progress: 7.0 s, 59014.3 tps, lat 0.473 ms stddev 0.331  
progress: 8.0 s, 60419.3 tps, lat 0.462 ms stddev 0.307  
progress: 9.0 s, 60216.9 tps, lat 0.463 ms stddev 0.310  
progress: 10.0 s, 59478.2 tps, lat 0.469 ms stddev 0.318  
progress: 11.0 s, 60376.4 tps, lat 0.462 ms stddev 0.301  
progress: 12.0 s, 59792.6 tps, lat 0.466 ms stddev 0.318  
progress: 13.0 s, 59747.6 tps, lat 0.467 ms stddev 0.318  
progress: 14.0 s, 60387.0 tps, lat 0.462 ms stddev 0.304  
progress: 15.0 s, 59698.8 tps, lat 0.467 ms stddev 0.316  
progress: 16.0 s, 59928.5 tps, lat 0.465 ms stddev 0.313  
progress: 17.0 s, 60510.5 tps, lat 0.460 ms stddev 0.302  

We can also observe this effect with SystemTap (stap), timing each XLogInsert call between the checkpoint start and checkpoint done probe points:

stap -DMAXSKIPPED=100000 -v 11111 -e '  
global f_start, stat_var;  
  
probe process("/opt/pgsql/bin/postgres").function("XLogInsert@/opt/soft_bak/postgresql-9.4.1/src/backend/access/transam/xlog.c").call {   
  f_start[pid(), tid()] = gettimeofday_ns()  
}  
  
probe process("/opt/pgsql/bin/postgres").function("XLogInsert@/opt/soft_bak/postgresql-9.4.1/src/backend/access/transam/xlog.c").return {  
  t=gettimeofday_ns()  
  a=pid()  
  b=tid()  
  if (f_start[a,b]) {  
    stat_var[a,b] <<< t - f_start[a,b]  
  }  
}  
  
probe process("/opt/pgsql/bin/postgres").mark("checkpoint__start") {  
  printf("checkpoint start\n")  
}  
  
probe process("/opt/pgsql/bin/postgres").mark("checkpoint__done") {  
  printf("checkpoint done\n")  
  foreach ([v1,v2] in stat_var +) {  
    if ( @count(stat_var[v1,v2]) >0 ) {  
      printf("pid: %d, tid: %d, min: %d, max: %d, avg: %d, sum: %d, count: %d\n", v1, v2, @min(stat_var[v1,v2]), @max(stat_var[v1,v2]), @avg(stat_var[v1,v2]), @sum(stat_var[v1,v2]), @count(stat_var[v1,v2]))  
    }  
  }  
}'  

With full_page_writes off:

checkpoint start  
checkpoint done  
pid: 7976, tid: 7976, min: 17660, max: 17660, avg: 17660, sum: 17660, count: 1  
pid: 8722, tid: 8722, min: 3902, max: 1795060, avg: 8797, sum: 2339179954, count: 265896  
pid: 8737, tid: 8737, min: 3804, max: 2397915, avg: 8921, sum: 2373901808, count: 266084  
pid: 8726, tid: 8726, min: 3812, max: 1544051, avg: 8876, sum: 2364795050, count: 266418  
pid: 8736, tid: 8736, min: 3932, max: 1585877, avg: 8833, sum: 2354577217, count: 266553  
pid: 8706, tid: 8706, min: 3812, max: 2242369, avg: 8776, sum: 2340338511, count: 266651  
pid: 8739, tid: 8739, min: 3913, max: 1533554, avg: 8890, sum: 2371167014, count: 266720  
pid: 8698, tid: 8698, min: 3872, max: 2141824, avg: 8759, sum: 2337391283, count: 266851  
pid: 8732, tid: 8732, min: 3834, max: 1426147, avg: 8896, sum: 2375078067, count: 266953  
pid: 8735, tid: 8735, min: 3876, max: 4253655, avg: 8908, sum: 2378517468, count: 266998  
pid: 8695, tid: 8695, min: 3872, max: 2256848, avg: 8853, sum: 2364436879, count: 267057  
pid: 8711, tid: 8711, min: 3883, max: 2840955, avg: 8792, sum: 2349657964, count: 267224  
pid: 8694, tid: 8694, min: 3947, max: 2684147, avg: 8819, sum: 2357223023, count: 267266  
pid: 8718, tid: 8718, min: 3846, max: 1666267, avg: 8924, sum: 2385454634, count: 267279  
pid: 8734, tid: 8734, min: 3835, max: 2663756, avg: 8798, sum: 2352532736, count: 267382  
pid: 8693, tid: 8693, min: 3830, max: 1915645, avg: 8764, sum: 2345468816, count: 267619  
pid: 8738, tid: 8738, min: 3876, max: 2187521, avg: 8895, sum: 2380585848, count: 267620  
pid: 8705, tid: 8705, min: 3906, max: 2579393, avg: 8964, sum: 2399871667, count: 267717  
pid: 8728, tid: 8728, min: 3872, max: 2405755, avg: 8839, sum: 2366833087, count: 267749  
pid: 8729, tid: 8729, min: 3853, max: 1613303, avg: 8947, sum: 2396649611, count: 267854  
pid: 8730, tid: 8730, min: 3906, max: 2590461, avg: 8903, sum: 2385215913, count: 267903  
pid: 8719, tid: 8719, min: 3819, max: 2737461, avg: 8696, sum: 2329969230, count: 267918  
pid: 8708, tid: 8708, min: 3909, max: 1614057, avg: 8947, sum: 2398100004, count: 268029  
pid: 8717, tid: 8717, min: 3857, max: 2158133, avg: 8798, sum: 2359353315, count: 268151  
pid: 8733, tid: 8733, min: 3831, max: 1694714, avg: 8889, sum: 2386096329, count: 268426  
pid: 8709, tid: 8709, min: 3857, max: 1501952, avg: 8930, sum: 2402379420, count: 268997  
pid: 8704, tid: 8704, min: 3857, max: 1593996, avg: 8873, sum: 2389259952, count: 269254  
pid: 8714, tid: 8714, min: 3850, max: 1428079, avg: 8869, sum: 2388105216, count: 269263  
pid: 8701, tid: 8701, min: 3860, max: 1637398, avg: 8894, sum: 2396702871, count: 269470  

With full_page_writes on, the average time spent in XLogInsert is much longer than with full_page_writes off:

checkpoint start  
checkpoint done  
pid: 8887, tid: 8887, min: 20388, max: 20388, avg: 20388, sum: 20388, count: 1  
pid: 7976, tid: 7976, min: 17660, max: 18440, avg: 18050, sum: 36100, count: 2  
pid: 9004, tid: 9004, min: 11571, max: 226607, avg: 93843, sum: 563063, count: 6  
pid: 8805, tid: 8805, min: 3545, max: 1847920, avg: 8823, sum: 1280653201, count: 145139  
pid: 8914, tid: 8914, min: 3478, max: 19553712, avg: 46633, sum: 10678936832, count: 228995  
pid: 8965, tid: 8965, min: 3857, max: 33328682, avg: 37567, sum: 9228772302, count: 245661  
pid: 8986, tid: 8986, min: 3805, max: 32983019, avg: 37369, sum: 9208748829, count: 246421  
pid: 8969, tid: 8969, min: 3824, max: 33472947, avg: 37242, sum: 9185048728, count: 246626  
pid: 8990, tid: 8990, min: 3902, max: 32383832, avg: 37270, sum: 9202617869, count: 246912  
pid: 8954, tid: 8954, min: 3902, max: 32199797, avg: 37151, sum: 9175503504, count: 246976  
pid: 8991, tid: 8991, min: 3797, max: 33125798, avg: 37156, sum: 9179089256, count: 247040  
pid: 8952, tid: 8952, min: 3782, max: 33192867, avg: 37446, sum: 9251197245, count: 247049  
pid: 8958, tid: 8958, min: 3801, max: 33094778, avg: 37337, sum: 9232970224, count: 247287  
pid: 8979, tid: 8979, min: 3842, max: 33224595, avg: 37079, sum: 9171570876, count: 247351  
pid: 8984, tid: 8984, min: 3786, max: 33242997, avg: 37090, sum: 9176002582, count: 247396  
pid: 8945, tid: 8945, min: 3819, max: 32980512, avg: 37063, sum: 9170007762, count: 247413  
pid: 8978, tid: 8978, min: 3838, max: 33609199, avg: 37452, sum: 9270596686, count: 247530  
pid: 8981, tid: 8981, min: 3759, max: 33190956, avg: 36915, sum: 9140303488, count: 247598  
pid: 8993, tid: 8993, min: 3864, max: 33163898, avg: 37178, sum: 9209639310, count: 247715  
pid: 8980, tid: 8980, min: 3838, max: 33197079, avg: 37071, sum: 9184138362, count: 247739  
pid: 8973, tid: 8973, min: 3808, max: 32933431, avg: 37014, sum: 9174981958, count: 247876  
pid: 8962, tid: 8962, min: 3838, max: 33110329, avg: 37188, sum: 9218350426, count: 247880  
pid: 8982, tid: 8982, min: 3820, max: 33107999, avg: 36969, sum: 9173436495, count: 248136  
pid: 8974, tid: 8974, min: 3834, max: 32363107, avg: 37128, sum: 9213153086, count: 248140  
pid: 8949, tid: 8949, min: 3838, max: 32520297, avg: 37266, sum: 9250689544, count: 248229  
pid: 8983, tid: 8983, min: 3827, max: 32591534, avg: 37085, sum: 9208350649, count: 248303  
pid: 8992, tid: 8992, min: 3854, max: 33391344, avg: 37197, sum: 9240135638, count: 248407  
pid: 8988, tid: 8988, min: 3775, max: 33189136, avg: 36888, sum: 9166448605, count: 248489  
pid: 8985, tid: 8985, min: 3823, max: 32310200, avg: 36714, sum: 9124542692, count: 248528  
pid: 8987, tid: 8987, min: 3793, max: 32861463, avg: 37067, sum: 9214883458, count: 248599  
pid: 8964, tid: 8964, min: 3812, max: 32963347, avg: 36984, sum: 9196461264, count: 248654  
pid: 8989, tid: 8989, min: 3790, max: 33236923, avg: 36882, sum: 9171642674, count: 248674  
pid: 8955, tid: 8955, min: 3838, max: 32400840, avg: 36901, sum: 9210718295, count: 249601  
pid: 8722, tid: 8722, min: 3880, max: 4580443, avg: 8828, sum: 3783136090, count: 428524  
pid: 8737, tid: 8737, min: 3804, max: 3539772, avg: 8888, sum: 3821238551, count: 429931  
pid: 8706, tid: 8706, min: 3812, max: 4387445, avg: 8777, sum: 3774082384, count: 429974  
pid: 8718, tid: 8718, min: 3846, max: 2775268, avg: 8949, sum: 3849426661, count: 430127  
pid: 8726, tid: 8726, min: 3782, max: 2775655, avg: 8851, sum: 3809751808, count: 430398  
pid: 8736, tid: 8736, min: 3913, max: 2712147, avg: 8786, sum: 3781992446, count: 430441  
pid: 8711, tid: 8711, min: 3868, max: 4244141, avg: 8761, sum: 3774796287, count: 430843  
pid: 8732, tid: 8732, min: 3834, max: 2296943, avg: 8887, sum: 3829627946, count: 430900  
pid: 8728, tid: 8728, min: 3868, max: 3326763, avg: 8803, sum: 3794778852, count: 431045  
pid: 8694, tid: 8694, min: 3944, max: 3331770, avg: 8837, sum: 3809602504, count: 431094  
pid: 8693, tid: 8693, min: 3820, max: 2340163, avg: 8745, sum: 3770891213, count: 431177  
pid: 8739, tid: 8739, min: -1310, max: 2893860, avg: 8881, sum: 3829692548, count: 431180  
pid: 8729, tid: 8729, min: 3781, max: 2710005, avg: 8929, sum: 3851335374, count: 431306  
pid: 8734, tid: 8734, min: 3808, max: 3303660, avg: 8766, sum: 3781832008, count: 431400  
pid: 8698, tid: 8698, min: 3868, max: 2141824, avg: 8699, sum: 3754114431, count: 431508  
pid: 8735, tid: 8735, min: 3856, max: 7316670, avg: 8899, sum: 3841025315, count: 431579  
pid: 8708, tid: 8708, min: 3857, max: 2156062, avg: 8958, sum: 3867736316, count: 431763  
pid: 8705, tid: 8705, min: 3868, max: 2579393, avg: 8908, sum: 3847521523, count: 431890  
pid: 8695, tid: 8695, min: 3823, max: 2256848, avg: 8825, sum: 3812586458, count: 431990  
pid: 8738, tid: 8738, min: 3876, max: 2187521, avg: 8833, sum: 3821431064, count: 432596  
pid: 8733, tid: 8733, min: 3831, max: 3051621, avg: 8879, sum: 3841527531, count: 432620  
pid: 8719, tid: 8719, min: 3812, max: 4460423, avg: 8708, sum: 3767820516, count: 432669  
pid: 8704, tid: 8704, min: 3857, max: 3500486, avg: 8885, sum: 3844851608, count: 432716  
pid: 8730, tid: 8730, min: 3857, max: 2942251, avg: 8861, sum: 3838738375, count: 433191  
pid: 8717, tid: 8717, min: 3845, max: 2727852, avg: 8790, sum: 3808255298, count: 433241  
pid: 8701, tid: 8701, min: -14361, max: 5013506, avg: 8887, sum: 3853784088, count: 433605  
pid: 8714, tid: 8714, min: 3846, max: 3206576, avg: 8846, sum: 3838045173, count: 433846  
pid: 8709, tid: 8709, min: -22632, max: 2128161, avg: 8916, sum: 3869261657, count: 433926  

Finally, even if we turn full_page_writes off, full pages are still written in one situation: during an online backup, which begins with the checkpoint taken by pg_start_backup and keeps page writes forced on for the duration of the backup.

The relevant source code:

/*  
 * Shared state data for XLogInsert.  
 */  
typedef struct XLogCtlInsert  
{  
        slock_t         insertpos_lck;  /* protects CurrBytePos and PrevBytePos */  
  
        /*  
         * CurrBytePos is the end of reserved WAL. The next record will be  
         * inserted at that position. PrevBytePos is the start position of the  
         * previously inserted (or rather, reserved) record - it is copied to the  
         * prev-link of the next record. These are stored as "usable byte  
         * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).  
         */  
        uint64          CurrBytePos;  
        uint64          PrevBytePos;  
  
        /*  
         * Make sure the above heavily-contended spinlock and byte positions are  
         * on their own cache line. In particular, the RedoRecPtr and full page  
         * write variables below should be on a different cache line. They are  
         * read on every WAL insertion, but updated rarely, and we don't want  
         * those reads to steal the cache line containing Curr/PrevBytePos.  
         */  
        char            pad[PG_CACHE_LINE_SIZE];  
  
        /*  
         * fullPageWrites is the master copy used by all backends to determine  
         * whether to write full-page to WAL, instead of using process-local one.  
         * This is required because, when full_page_writes is changed by SIGHUP,  
         * we must WAL-log it before it actually affects WAL-logging by backends.  
         * Checkpointer sets at startup or after SIGHUP.  
         *  
         * To read these fields, you must hold an insertion lock. To modify them,  
         * you must hold ALL the locks.  
         */  
        XLogRecPtr      RedoRecPtr;             /* current redo point for insertions */  
        bool            forcePageWrites;        /* forcing full-page writes for PITR? */  // this is the variable that forces full page writes while an online backup is in progress  
        bool            fullPageWrites;  
  
        /*  
         * exclusiveBackup is true if a backup started with pg_start_backup() is  
         * in progress, and nonExclusiveBackups is a counter indicating the number  
         * of streaming base backups currently in progress. forcePageWrites is set  
         * to true when either of these is non-zero. lastBackupStart is the latest  
         * checkpoint redo location used as a starting point for an online backup.  
         */  
        bool            exclusiveBackup;  
        int                     nonExclusiveBackups;  
        XLogRecPtr      lastBackupStart;  
  
        /*  
         * WAL insertion locks.  
         */  
        WALInsertLockPadded *WALInsertLocks;  
        LWLockTranche WALInsertLockTranche;  
        int                     WALInsertLockTrancheId;  
} XLogCtlInsert;  
  
XLogRecPtr  
XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)  
{  
......  
        /*  
         * Decide if we need to do full-page writes in this XLOG record: true if  
         * full_page_writes is on or we have a PITR request for it.  Since we  
         * don't yet have an insertion lock, fullPageWrites and forcePageWrites  
         * could change under us, but we'll recheck them once we have a lock.  
         */  
        doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;  
...  
}  
  
/*  
 * do_pg_start_backup is the workhorse of the user-visible pg_start_backup()  
 * function. It creates the necessary starting checkpoint and constructs the  
 * backup label file.  
 *  
 * There are two kind of backups: exclusive and non-exclusive. An exclusive  
 * backup is started with pg_start_backup(), and there can be only one active  
 * at a time. The backup label file of an exclusive backup is written to  
 * $PGDATA/backup_label, and it is removed by pg_stop_backup().  
 *  
 * A non-exclusive backup is used for the streaming base backups (see  
 * src/backend/replication/basebackup.c). The difference to exclusive backups  
 * is that the backup label file is not written to disk. Instead, its would-be  
 * contents are returned in *labelfile, and the caller is responsible for  
 * including it in the backup archive as 'backup_label'. There can be many  
 * non-exclusive backups active at the same time, and they don't conflict  
 * with an exclusive backup either.  
 *  
 * Returns the minimum WAL position that must be present to restore from this  
 * backup, and the corresponding timeline ID in *starttli_p.  
 *  
 * Every successfully started non-exclusive backup must be stopped by calling  
 * do_pg_stop_backup() or do_pg_abort_backup().  
 *  
 * It is the responsibility of the caller of this function to verify the  
 * permissions of the calling user!  
 */  
XLogRecPtr  
do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,  
                                   char **labelfile)  
{  
......  
        XLogCtl->Insert.forcePageWrites = true;  
.....  
}  

So, as a final test to verify this: after a backup is started, performance should drop again because the WAL has to write full pages:

full_page_writes = off  
wal_level = hot_standby  
archive_mode = on  
archive_command = '/bin/date'  

Start the online (hot) backup:

digoal=# select pg_start_backup('now()');  

Because the backup forces full page writes back on, the impact reappears:

progress: 38.0 s, 60170.0 tps, lat 0.463 ms stddev 0.339  
progress: 39.0 s, 48121.9 tps, lat 0.580 ms stddev 1.518  
progress: 40.0 s, 15061.9 tps, lat 1.839 ms stddev 4.240  
progress: 41.0 s, 17799.1 tps, lat 1.585 ms stddev 3.630  
progress: 42.0 s, 20799.1 tps, lat 1.337 ms stddev 3.151  
progress: 43.0 s, 24623.8 tps, lat 1.143 ms stddev 2.766  
progress: 44.0 s, 28381.1 tps, lat 0.976 ms stddev 2.944  
progress: 45.0 s, 38196.2 tps, lat 0.737 ms stddev 1.874  
progress: 46.0 s, 45302.8 tps, lat 0.608 ms stddev 1.484  
progress: 47.0 s, 64550.1 tps, lat 0.438 ms stddev 0.653  
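Once the backup is finished, stopping it should clear forcePageWrites again (assuming no other backup is still in progress), and TPS can be expected to climb back to the earlier level:

digoal=# select pg_stop_backup();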

Finally, some tuning suggestions:

1. Configure a reasonable shared_buffers: about 1/4 of RAM, but preferably no larger than the hot data set.

2. If asynchronous commit is enabled, adjust the scheduled-checkpoint algorithm; see my earlier articles.

3. Configure reasonable values for checkpoint_segments, checkpoint_timeout and checkpoint_completion_target.

checkpoint_segments should cover about as much WAL as shared_buffers, e.g. with shared_buffers=8G and a 16MB wal segment, checkpoint_segments = 8G/16MB = 512 (worked through in the example after this list).  
checkpoint_timeout should be set longer than the time it takes to generate checkpoint_segments (i.e. 512) xlog files.  

4. Adjust checkpoint_completion_target according to your I/O capability, aiming for checkpoints that do not disturb the workload.

The larger checkpoint_completion_target is, the smaller the I/O impact, but the longer each checkpoint takes; you have to weigh this trade-off yourself.  

5. If you cannot tolerate the full page writes that checkpoints add to the WAL, consider hardware that supports atomic writes, or a file system that can guarantee atomic (full page) writes.

If your consistency requirements are not that strict, you can also take the risk of simply turning full_page_writes off; only a partial write to a block dirtied for the first time after a checkpoint would leave that block inconsistent.  

6. For large-memory scenarios, a 32KB block size is recommended.

7. The initdb -k option only enables data block checksums; it does not prevent partial writes, it merely detects whether a block has suffered a partial write or some other anomaly. It can also be used to detect block-level tampering. It is generally not recommended, because the performance cost is noticeable.

8. How do you estimate the performance impact of full page writes? It essentially comes down to how fast several consecutive WAL blocks can be written. If block_size=32K and wal_block_size=8K, one dirty block's full page write takes 4 wal_block_size writes. Assuming the WAL fsync capability is 10000 8K blocks per second, then if every write after a checkpoint lands on a different data block (as in the update test above), the reduced WAL write capability can push TPS below 2500. With asynchronous WAL, what gets exercised instead is the in-memory ability to copy those 4 wal_block_size worth of data (see the arithmetic below).
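The sizing in point 3 and the back-of-the-envelope bound in point 8 can be checked directly in SQL (the 8GB shared_buffers, 16MB segment size, 32kB/8kB page sizes and the 10000 fsyncs/s figure are simply the example numbers used above):

-- point 3: checkpoint_segments sized to cover shared_buffers
select (8 * 1024) / 16 as suggested_checkpoint_segments;   -- 8GB / 16MB = 512

-- point 8: one 32kB full page image spans 4 WAL blocks of 8kB; at ~10000 fsync'd
-- 8kB WAL writes per second that caps TPS near 2500 when every transaction
-- dirties a page for the first time after a checkpoint
select 32768 / 8192 as wal_blocks_per_full_page,
       10000 / (32768 / 8192) as approx_tps_ceiling;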

[References]
1. http://blog.163.com/digoal@126/blog/static/163877040201542103933969/

2. http://blog.163.com/digoal@126/blog/static/1638770402015463252387/

3. http://blog.163.com/digoal@126/blog/static/16387704020154651655783/

4. http://blog.163.com/digoal@126/blog/static/16387704020154653422892/

5. http://blog.163.com/digoal@126/blog/static/16387704020154811421484/

6. http://blog.163.com/digoal@126/blog/static/163877040201541134842882/

7. src/backend/access/transam/xlog.c

8. http://wenku.it168.com/d_001613581.shtml

9. http://www.postgresql.org/docs/9.4/static/runtime-config-wal.html#GUC-FULL-PAGE-WRITES

10. http://en.wikipedia.org/wiki/Journaling_file_system

11. http://dba.stackexchange.com/questions/98515/is-it-safe-to-disable-innodb-doublewrite-buffering-when-setting-innodb-page-size/98603

12. http://lwn.net/Articles/552095/

13. http://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf

14. http://en.wikipedia.org/wiki/Data_corruption
