ZFS snapshot increment size vs. the XLOG volume generated by PostgreSQL


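Background

ZFS incremental snapshots closely resemble Oracle RMAN incremental backups. They are a boon for database products that lack an RMAN-style, database-level incremental backup, PostgreSQL among them (note that pg_rman and pg_probackup now both support block-level incremental backup).

Let's measure how much space a ZFS snapshot increment actually takes.

In theory, the upper bound of a snapshot increment is the current size of the ZFS file system. That bound is reached when every block referenced by the snapshot has been modified after the snapshot was taken.

So the larger the file system and the more widely spread the updates, the larger a snapshot can grow.

When does this happen?

For example, in a workload where every row inserted into the database is later updated once, snapshots will be very large.

Test case: TPC-B.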

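Environment

PostgreSQL 9.5rc1  
  
$PGDATA directory  
zp1/data01            4.1T  1.1T  3.1T  25% /data01  
  
pg_xlog directory  
Mounted on a directory outside ZFS.  

Parameters

Full page writes (FPW) are turned off (the ZFS dataset holding $PGDATA is copy-on-write, so FPW is not needed).

Data volume: 7 billion rows; more than 1TB including indexes.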

pgbench -i -s 70000  

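With this data set the benchmark covers a range of about 1TB. Each transaction contains three UPDATEs, one SELECT, and one INSERT, and the updates are spread across that 1TB range, so a single snapshot could in principle grow by as much as 1TB.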
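How many blocks a random-update workload dirties can be approximated with a simple occupancy estimate. The sketch below is an illustration, not part of the original test: it assumes updates land uniformly at random, and the function name, the sample update counts, and the 128 KiB block size (the ZFS default `recordsize`; the post does not state the value used for this dataset) are all assumptions. Indexes, compression, and metadata are ignored.

```python
def expected_dirty_bytes(dataset_bytes, block_bytes, random_updates):
    """Expected bytes of distinct blocks touched by uniformly random updates."""
    blocks = dataset_bytes // block_bytes
    if blocks == 0:
        return 0
    # Probability that one given block is missed by every update.
    missed = (1 - 1 / blocks) ** random_updates
    return blocks * (1 - missed) * block_bytes

TIB = 1 << 40
BLOCK = 128 * 1024  # assumed: ZFS default recordsize, not confirmed by the post

few = expected_dirty_bytes(TIB, BLOCK, 1_000_000)
many = expected_dirty_bytes(TIB, BLOCK, 100_000_000)
print(f"1M updates:   {few / TIB:.2%} of the dataset")
print(f"100M updates: {many / TIB:.2%} of the dataset")
```

The estimate approaches the full dataset size as the number of updates grows, which matches the 1TB upper bound described above.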

Test 1

Before the test starts, run a checkpoint and record the XLOG position.

postgres=# checkpoint;  
CHECKPOINT  
postgres=# select pg_current_xlog_insert_location();  
 pg_current_xlog_insert_location   
---------------------------------  
 592/9D563C70  
(1 row)  

Create a snapshot

#zfs snapshot zp1/data01@2016010401  

Run the benchmark for 600 seconds

pgbench -M prepared -n -r -P 1 -c 48 -j 48 -T 600  
  
transaction type: TPC-B (sort of)  
scaling factor: 70000  
query mode: prepared  
number of clients: 48  
number of threads: 48  
duration: 600 s  
number of transactions actually processed: 2866811  
latency average: 10.044 ms  
latency stddev: 16.135 ms  
tps = 4777.542920 (including connections establishing)  
tps = 4777.781687 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.004060        \set nbranches 1 * :scale  
        0.001105        \set ntellers 10 * :scale  
        0.000861        \set naccounts 100000 * :scale  
        0.001687        \setrandom aid 1 :naccounts  
        0.001017        \setrandom bid 1 :nbranches  
        0.000961        \setrandom tid 1 :ntellers  
        0.001000        \setrandom delta -5000 5000  
        0.052541        BEGIN;  
        9.318532        UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
        0.138619        SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
        0.169681        UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
        0.147535        UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
        0.115753        INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
        0.079338        END;  

Run a checkpoint and record the XLOG position

postgres=# checkpoint;  
CHECKPOINT  
  
postgres=# select pg_current_xlog_insert_location();  
 pg_current_xlog_insert_location   
---------------------------------  
 592/EC6AEB38  
(1 row)  

Compute the amount of XLOG generated

postgres=# select pg_size_pretty(pg_xlog_location_diff('592/EC6AEB38', '592/9D563C70'));  
 pg_size_pretty   
----------------  
 1265 MB  
(1 row)  
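The same difference can be checked by hand. An XLOG location is a 64-bit byte position printed as two hexadecimal halves separated by a slash, so the subtraction is plain integer arithmetic. This is a minimal sketch with hypothetical helper names, not a replacement for `pg_xlog_location_diff`:

```python
def lsn_to_int(lsn):
    """Convert an 'X/Y' XLOG location (two hex numbers) to a byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lsn_diff(end, start):
    """Bytes of XLOG written between two locations."""
    return lsn_to_int(end) - lsn_to_int(start)

delta = lsn_diff("592/EC6AEB38", "592/9D563C70")
print(delta, "bytes =", delta // (1024 * 1024), "MiB")
# prints: 1326755528 bytes = 1265 MiB
```

Note that `pg_size_pretty` labels 1024-based units as MB and GB, so its "1265 MB" is 1265 MiB.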

Create a snapshot

#zfs snapshot zp1/data01@2016010402  

Compute the snapshot increment: about 120GB, far larger than the XLOG. The reason is analyzed later.

#zfs send -n -P -v -i zp1/data01@2016010401 zp1/data01@2016010402  
incremental     2016010401      zp1/data01@2016010402   124825285344  
size    124825285344  
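To put the two numbers on the same scale, it helps to divide both by the number of transactions in this run. The figures below are copied from the output above; the variable names are only illustrative.

```python
transactions = 2866811          # from the pgbench output of test 1
xlog_bytes = 1326755528         # 592/9D563C70 -> 592/EC6AEB38
snapshot_bytes = 124825285344   # "size" line of zfs send -n -P

xlog_per_tx = xlog_bytes / transactions
snap_per_tx = snapshot_bytes / transactions
print(f"XLOG per transaction:     {xlog_per_tx:.0f} bytes")
print(f"snapshot per transaction: {snap_per_tx / 1024:.1f} KiB")
print(f"snapshot / XLOG ratio:    {snapshot_bytes / xlog_bytes:.1f}x")
```

Each transaction logs a few hundred bytes of XLOG but costs tens of KiB of snapshot increment, which is consistent with a small row change dirtying whole blocks.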

Test 2

Record the XLOG position

postgres=# select pg_current_xlog_insert_location();  
 pg_current_xlog_insert_location   
---------------------------------  
 592/ED000098  
(1 row)  

Run the benchmark for 600 seconds

pgbench -M prepared -n -r -P 1 -c 48 -j 48 -T 600  
transaction type: TPC-B (sort of)  
scaling factor: 70000  
query mode: prepared  
number of clients: 48  
number of threads: 48  
duration: 600 s  
number of transactions actually processed: 1837930  
latency average: 15.667 ms  
latency stddev: 34.785 ms  
tps = 3062.922412 (including connections establishing)  
tps = 3063.082480 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.004004        \set nbranches 1 * :scale  
        0.001107        \set ntellers 10 * :scale  
        0.000849        \set naccounts 100000 * :scale  
        0.001673        \setrandom aid 1 :naccounts  
        0.000973        \setrandom bid 1 :nbranches  
        0.000928        \setrandom tid 1 :ntellers  
        0.000969        \setrandom delta -5000 5000  
        0.052813        BEGIN;  
        14.959468       UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
        0.131955        SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
        0.156640        UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
        0.140038        UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
        0.125661        INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
        0.079363        END;  

Run a checkpoint and record the XLOG position

postgres=# checkpoint;  
CHECKPOINT  
postgres=# select pg_current_xlog_insert_location();  
 pg_current_xlog_insert_location   
---------------------------------  
 593/1FAEC728  
(1 row)  

Amount of XLOG generated

postgres=# select pg_size_pretty(pg_xlog_location_diff('593/1FAEC728', '592/ED000098'));  
 pg_size_pretty   
----------------  
 811 MB  
(1 row)  

Create a snapshot

#zfs snapshot zp1/data01@2016010403  

Compute the snapshot increment: about 80GB

#zfs send -n -P -v -i zp1/data01@2016010402 zp1/data01@2016010403  
incremental     2016010402      zp1/data01@2016010403   83633656888  
size    83633656888  

Cumulative increment across all intermediate snapshots (-I): about 200GB in total

#zfs send -n -P -v -I zp1/data01@2016010401 zp1/data01@2016010403  
incremental     2016010401      zp1/data01@2016010402   124825285344  
incremental     2016010402      zp1/data01@2016010403   83633656888  
size    208458942232  
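When this estimate feeds a script, the output can be parsed directly. The sketch below assumes the whitespace-separated layout shown above (`incremental <from> <to> <bytes>` lines followed by a `size` line); the function name is hypothetical and the sample text is copied from this run.

```python
def parse_send_estimate(text):
    """Return (per-increment sizes, reported total) from `zfs send -n -P` output."""
    increments, total = {}, None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "incremental":
            # fields: incremental <from> <to> <bytes>
            increments[(fields[1], fields[2])] = int(fields[3])
        elif fields[0] == "size":
            total = int(fields[1])
    return increments, total

sample = """\
incremental     2016010401      zp1/data01@2016010402   124825285344
incremental     2016010402      zp1/data01@2016010403   83633656888
size    208458942232
"""
increments, total = parse_send_estimate(sample)
assert sum(increments.values()) == total
print(f"{total / 1e9:.1f} GB in {len(increments)} increments")
```

The assertion confirms that the reported total is the sum of the two per-snapshot increments.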

Space occupied by each snapshot

#zfs list -t snapshot  
NAME                    USED  AVAIL  REFER  MOUNTPOINT  
zp1/data01@2016010401   121G      -  1.01T  -  
zp1/data01@2016010402  20.9G      -  1.01T  -  
zp1/data01@2016010403   200K      -  1.01T  -  
  
#df -h  
Filesystem            Size  Used Avail Use% Mounted on  
zp1/data01            4.9T  1.1T  3.9T  21% /data01  

Destroying these three snapshots reclaims about 140G of space.

#zfs destroy zp1/data01@2016010401  
#zfs destroy zp1/data01@2016010402  
#zfs destroy zp1/data01@2016010403  
  
#df -h  
Filesystem            Size  Used Avail Use% Mounted on  
zp1/data01            5.1T  1.1T  4.1T  20% /data01  
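The 140G figure can be cross-checked by summing the USED column of the snapshot listing. This is a small sketch with hypothetical helper names; it assumes the 1024-based K/M/G/T suffixes that `zfs list` prints, and the listing is the one captured before the snapshots were destroyed.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}

def parse_size(text):
    """Convert a zfs-style size such as '20.9G' or '200K' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)

def total_used(listing):
    """Sum the USED column of `zfs list -t snapshot` output."""
    total = 0.0
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2 and "@" in fields[0]:
            total += parse_size(fields[1])
    return total

listing = """\
NAME                    USED  AVAIL  REFER  MOUNTPOINT
zp1/data01@2016010401   121G      -  1.01T  -
zp1/data01@2016010402  20.9G      -  1.01T  -
zp1/data01@2016010403   200K      -  1.01T  -
"""
print(f"{total_used(listing) / (1 << 30):.1f}G")
# prints: 141.9G
```

USED counts only the blocks unique to each snapshot, so the sum is a lower bound on what destroying all of them frees; blocks shared by several snapshots are released only when the last snapshot holding them is gone.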

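Because ZFS snapshots are copy-on-write, their size follows the set of blocks that changed. The tests above spread updates over a 1TB range, and the two 600-second runs changed about 120GB and 80GB of blocks respectively.

If the active data set shrinks, the snapshots shrink too, as shown below.

Reduce the active data to about 1GB and test again.

Test 3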

pgbench -i -s 100  digoal  
  
#zfs snapshot zp1/data01@2016010403  

In this run TPS reached about 45,700, yet the snapshot is much smaller because the affected blocks fall within a range of roughly 1GB.

pgbench -M prepared -n -r -P 1 -c 48 -j 48 -T 600  digoal  
transaction type: TPC-B (sort of)  
scaling factor: 100  
query mode: prepared  
number of clients: 48  
number of threads: 48  
duration: 600 s  
number of transactions actually processed: 27425009  
latency average: 1.048 ms  
latency stddev: 0.530 ms  
tps = 45704.065177 (including connections establishing)  
tps = 45706.628863 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.003538        \set nbranches 1 * :scale  
        0.001089        \set ntellers 10 * :scale  
        0.000896        \set naccounts 100000 * :scale  
        0.001683        \setrandom aid 1 :naccounts  
        0.001225        \setrandom bid 1 :nbranches  
        0.001526        \setrandom tid 1 :ntellers  
        0.002240        \setrandom delta -5000 5000  
        0.084777        BEGIN;  
        0.217070        UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
        0.118606        SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
        0.154885        UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
        0.204343        UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
        0.117456        INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
        0.129509        END;  
  
#zfs snapshot zp1/data01@2016010404  
  
#zfs list -t snapshot  
NAME                    USED  AVAIL  REFER  MOUNTPOINT  
zp1/data01@2016010401   131G      -  1.01T  -  
zp1/data01@2016010402   352K      -  1.01T  -  
zp1/data01@2016010403  1.47G      -  1.01T  -  
zp1/data01@2016010404  15.6M      -  1.01T  -  

The snapshot increment is only about 3GB.

#zfs send -n -P -v -i zp1/data01@2016010403 zp1/data01@2016010404  
incremental     2016010403      zp1/data01@2016010404   3245256520  
size    3245256520  
  
postgres=# \l+  
                                                               List of databases  
   Name    |  Owner   | Encoding | Collate | Ctype |   Access privileges   |  Size   | Tablespace |                Description                   
-----------+----------+----------+---------+-------+-----------------------+---------+------------+--------------------------------------------  
 digoal    | postgres | UTF8     | C       | C     |                       | 2971 MB | pg_default |   

Test 4

Turn full page writes on, then compare the XLOG volume with the ZFS snapshot increment again.

#zfs snapshot zp1/data01@2016010405  
  
postgres=# select pg_current_xlog_insert_location();  
 pg_current_xlog_insert_location   
---------------------------------  
 596/830000D0  
(1 row)  
  
pgbench -M prepared -n -r -P 1 -c 48 -j 48 -T 600  
transaction type: TPC-B (sort of)  
scaling factor: 70000  
query mode: prepared  
number of clients: 48  
number of threads: 48  
duration: 600 s  
number of transactions actually processed: 1772299  
latency average: 16.247 ms  
latency stddev: 31.695 ms  
tps = 2953.504271 (including connections establishing)  
tps = 2953.662892 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.004457        \set nbranches 1 * :scale  
        0.001219        \set ntellers 10 * :scale  
        0.000958        \set naccounts 100000 * :scale  
        0.001777        \setrandom aid 1 :naccounts  
        0.001084        \setrandom bid 1 :nbranches  
        0.001013        \setrandom tid 1 :ntellers  
        0.001071        \setrandom delta -5000 5000  
        0.062108        BEGIN;  
        14.041890       UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
        0.153864        SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
        0.870157        UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
        0.478328        UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
        0.362079        INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
        0.255290        END;  
  
postgres=# checkpoint;  
CHECKPOINT  
postgres=# select pg_current_xlog_insert_location();  
 pg_current_xlog_insert_location   
---------------------------------  
 5A7/3DF47450  
(1 row)  
postgres=# select pg_size_pretty(pg_xlog_location_diff('5A7/3DF47450', '596/830000D0'));  
 pg_size_pretty   
----------------  
 67 GB  
(1 row)  
  
#zfs snapshot zp1/data01@2016010406  
  
#zfs send -n -P -v -i zp1/data01@2016010405 zp1/data01@2016010406  
incremental     2016010405      zp1/data01@2016010406   80259589256  
size    80259589256  
  
postgres=# select pg_size_pretty(80259589256);  
 pg_size_pretty   
----------------  
 75 GB  
(1 row)  
  
#zfs list -t snapshot  
NAME                    USED  AVAIL  REFER  MOUNTPOINT  
zp1/data01@2016010401   131G      -  1.01T  -  
zp1/data01@2016010402   352K      -  1.01T  -  
zp1/data01@2016010403  1.47G      -  1.01T  -  
zp1/data01@2016010404  2.16G      -  1.01T  -  
zp1/data01@2016010405  11.0G      -  1.01T  -  
zp1/data01@2016010406   424K      -  1.01T  -  
  
#zfs destroy zp1/data01@2016010401  
#zfs destroy zp1/data01@2016010402  
#zfs destroy zp1/data01@2016010403  
#zfs destroy zp1/data01@2016010404  
  
#zfs list -t snapshot  
NAME                    USED  AVAIL  REFER  MOUNTPOINT  
zp1/data01@2016010405  80.0G      -  1.01T  -  
zp1/data01@2016010406   424K      -  1.01T  -  

Test 5

Test with full page writes on and the active data reduced to about 1GB:

 pgbench -M prepared -n -r -P 1 -c 48 -j 48 -T 600 digoal  
  
postgres=# select pg_size_pretty(pg_xlog_location_diff('5AA/7ED591C0', '5A7/3DF47450'));  
 pg_size_pretty   
----------------  
 13 GB  
(1 row)  
  
#zfs snapshot zp1/data01@2016010407  
  
#zfs send -n -P -v -i zp1/data01@2016010406 zp1/data01@2016010407  
incremental     2016010406      zp1/data01@2016010407   3605632832  
size    3605632832  
  
postgres=# select pg_size_pretty(3605632832);  
 pg_size_pretty   
----------------  
 3439 MB  
(1 row)  
  
#zfs list -t snapshot  
NAME                    USED  AVAIL  REFER  MOUNTPOINT  
zp1/data01@2016010405  80.0G      -  1.01T  -  
zp1/data01@2016010406   840K      -  1.01T  -  
zp1/data01@2016010407   248K      -  1.01T  -  
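The runs that recorded both an XLOG range and a `zfs send -n -P` size can be lined up side by side. The sketch below only re-derives numbers already shown above; the labels and the helper name are illustrative, and test 3 is omitted because no XLOG positions were recorded for it.

```python
def lsn_bytes(lsn):
    """Byte offset of an 'X/Y' XLOG location."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

# (label, start LSN, end LSN, zfs send -n -P size in bytes), all from the tests above
runs = [
    ("test 1: FPW off, ~1TB active", "592/9D563C70", "592/EC6AEB38", 124825285344),
    ("test 2: FPW off, ~1TB active", "592/ED000098", "593/1FAEC728", 83633656888),
    ("test 4: FPW on,  ~1TB active", "596/830000D0", "5A7/3DF47450", 80259589256),
    ("test 5: FPW on,  ~1GB active", "5A7/3DF47450", "5AA/7ED591C0", 3605632832),
]

for label, start, end, snap in runs:
    xlog = lsn_bytes(end) - lsn_bytes(start)
    print(f"{label}: xlog={xlog / 2**30:6.2f} GiB  "
          f"snapshot={snap / 2**30:6.2f} GiB  ratio={snap / xlog:.2f}")
```

With FPW off the snapshot increment is tens of times larger than the XLOG; with FPW on the two are close for the large data set, and the snapshot is the smaller of the two for the small one.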

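Summary

1. ZFS snapshots work at block granularity, so with FPW off the snapshot increment is far larger than the XLOG; turning FPW off in this example also shrinks the XLOG further.

(With full page writes on, the XLOG volume and the ZFS snapshot increment come very close: 67GB vs. 75GB in test 4. For a small database the snapshot is much smaller than the XLOG: about 3.4GB vs. 13GB in test 5.)

2. The space a ZFS snapshot occupies depends on block changes: as soon as a block is modified in any way, its old version takes up snapshot space.

In PostgreSQL, data blocks change very often, for example:

tuple hint bits may be set during a query,

when vacuum reclaims dead tuples,

during vacuum freeze,

when data is updated,

when indexes change.

All of these operations modify the corresponding data blocks and therefore make the snapshot larger.

Note that no matter how many times a block is updated, it takes up only one block of space in a given snapshot.

Newly added blocks do not take up snapshot space; only changes to old blocks do.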
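The two rules in point 2 can be illustrated with a toy model. This is a deliberately simplified sketch, not how ZFS is implemented: the class, its fields, and the block numbers are all hypothetical, and each block is just a version counter.

```python
class CowDataset:
    """Toy copy-on-write dataset: block number -> version counter."""

    def __init__(self):
        self.live = {}        # current block versions
        self.snapshot = None  # block versions frozen at snapshot time

    def write(self, block):
        # Every write produces a new version; the old one is never overwritten.
        self.live[block] = self.live.get(block, 0) + 1

    def take_snapshot(self):
        self.snapshot = dict(self.live)

    def snapshot_used(self):
        """Blocks held only by the snapshot (old versions no longer live)."""
        return sum(1 for block, version in self.snapshot.items()
                   if self.live.get(block) != version)

ds = CowDataset()
for block in range(4):
    ds.write(block)          # four existing blocks
ds.take_snapshot()

for _ in range(100):
    ds.write(0)              # one old block rewritten 100 times
ds.write(99)                 # a brand-new block
print(ds.snapshot_used())
# prints: 1
```

Rewriting block 0 a hundred times pins only one old version, and the new block 99 pins nothing, because the snapshot never referenced it.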

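3. What should you watch when using ZFS snapshots as PostgreSQL backups?

Monitor how much space the snapshots occupy.

Delete old snapshots that are no longer needed promptly to free space.

Control how often snapshots are created.

Turn off full page writes in the database to reduce the amount of log generated.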
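Deleting old snapshots is easy to automate once the names carry a date. The sketch below is one possible approach under stated assumptions: snapshot names follow the YYYYMMDD plus two-digit sequence pattern used in this post, the 7-day retention and the `2015122001` snapshot are made up for illustration, and the script only prints the `zfs destroy` commands instead of running them.

```python
from datetime import date, datetime, timedelta

def expired_snapshots(names, today, keep_days):
    """Return snapshots whose name-encoded date is older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        tag = name.split("@", 1)[1]
        # Assumed naming scheme: YYYYMMDD followed by a two-digit sequence number.
        taken = datetime.strptime(tag[:8], "%Y%m%d").date()
        if taken < cutoff:
            expired.append(name)
    return expired

names = ["zp1/data01@2015122001", "zp1/data01@2016010401", "zp1/data01@2016010402"]
for name in expired_snapshots(names, today=date(2016, 1, 5), keep_days=7):
    print("zfs destroy", name)  # printed only; review before running
```

Keeping the selection logic separate from the destroy step makes it possible to review the list before anything is removed.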
