PostgreSQL 9.6 vacuum freeze大幅性能提升 代码浅析
背景
PostgreSQL的tuple(即记录)头信息中有两个字段分别为XMIN,XMAX用于标记行产生与变更的事务号,以标示记录的版本号,事务的可见性等。
这个事务号是32BIT的长度,因此PG设计了一个事务存活的最长时间是约20亿,如果超过20亿必须将这个事务置为frozen。
被置为frozen状态的记录,对所有的事务可见,从而解决了32BIT的XID可以用于实现MVCC的目的。
因此PostgreSQL 需要周期性的对表进行扫描,检查是否需要将记录置为frozen。
PostgreSQL 9.4以前的版本,FROZEN是通过一个等于2的XID来表示的。
从9.4开始改成了通过tuple 头部的t_infomask中的两个互斥的比特位来表示HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID。 |
参见
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=37484ad2aacef5ec794f4dd3d5cf814475180a78
有些插入操作,也可以直接将记录置为freeze,例如大批量的COPY数据。 insert into等。
src/include/access/heapam.h
/* "options" flag bits for heap_insert */
#define HEAP_INSERT_SKIP_WAL 0x0001
#define HEAP_INSERT_SKIP_FSM 0x0002
#define HEAP_INSERT_FROZEN 0x0004
#define HEAP_INSERT_SPECULATIVE 0x0008
表的全局年龄则记录在pg_class中,即使表没有任何变化,在年龄到达一定的值(参数配置)后,也需要发起frozen的动作,对表的记录进行扫描。
9.6对这块做了改进,当数据页中的所有记录已经是FROZEN状态时,在发起vacuum freeze时会跳过这个页的扫描,从而大幅提升静态数据的freeze操作,减少IO扫描。
目前这个page frozen标记放在表对应的VM文件中。
vacuum freeze改进代码分析
1. 如何将记录标记为freeze状态
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=37484ad2aacef5ec794f4dd3d5cf814475180a78
Change the way we mark tuples as frozen.
Instead of changing the tuple xmin to FrozenTransactionId, the combination
of HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID, which were previously never
set together, is now defined as HEAP_XMIN_FROZEN. A variety of previous
proposals to freeze tuples opportunistically before vacuum_freeze_min_age
is reached have foundered on the objection that replacing xmin by
FrozenTransactionId might hinder debugging efforts when things in this
area go awry; this patch is intended to solve that problem by keeping
the XID around (but largely ignoring the value to which it is set).
Third-party code that checks for HEAP_XMIN_INVALID on tuples where
HEAP_XMIN_COMMITTED might be set will be broken by this change. To fix,
use the new accessor macros in htup_details.h rather than consulting the
bits directly. HeapTupleHeaderGetXmin has been modified to return
FrozenTransactionId when the infomask bits indicate that the tuple is
frozen; use HeapTupleHeaderGetRawXmin when you already know that the
tuple isn't marked commited or frozen, or want the raw value anyway.
We currently do this in routines that display the xmin for user consumption,
in tqual.c where it's known to be safe and important for the avoidance of
extra cycles, and in the function-caching code for various procedural
languages, which shouldn't invalidate the cache just because the tuple
gets frozen.
Robert Haas and Andres Freund
src/include/access/htup_details.h
/*
* information stored in t_infomask:
*/
#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
2. 9.6在VM文件中,新增了对FROZEN页的标记,当整页数据都是静态的时,并且都已经是freeze状态,则该页标记为FROZEN,在执行vacuum freeze或自动触发freeze时,就会跳过这个页的扫描。
src/backend/access/heap/visibilitymap.c
* The visibility map is a bitmap with two bits (all-visible and all-frozen)
* per heap page. A set all-visible bit means that all tuples on the page are
* known visible to all transactions, and therefore the page doesn't need to
* be vacuumed. A set all-frozen bit means that all tuples on the page are
* completely frozen, and therefore the page doesn't need to be vacuumed even
* if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
* The all-frozen bit must be set only when the page is already all-visible.
src/include/access/visibilitymap.h
/* Number of bits for one heap page */
#define BITS_PER_HEAPBLOCK 2
/* Flags for bit map */
#define VISIBILITYMAP_ALL_VISIBLE 0x01
#define VISIBILITYMAP_ALL_FROZEN 0x02
#define VISIBILITYMAP_VALID_BITS 0x03 /* OR of all valid
* visiblitymap flags bits */
/* Macros for visibilitymap test */
#define VM_ALL_VISIBLE(r, b, v) \
((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_VISIBLE) != 0)
#define VM_ALL_FROZEN(r, b, v) \
((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
3. 如果vm页损坏了,我们可以通过vacuum DISABLE_PAGE_SKIPPING强制扫描所有的页。
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ede62e56fbe809baa1a7bc3873d82f12ffe7540b
If you really want to vacuum every single page in the relation,
regardless of apparent visibility status or anything else, you can use
this option. In previous releases, this behavior could be achieved
using VACUUM (FREEZE), but because we can now recognize all-frozen
pages as not needing to be frozen again, that no longer works. There
should be no need for routine use of this option, but maybe bugs or
disaster recovery will necessitate its use.
postgres=# \h vacuum
Command: VACUUM
Description: garbage-collect and optionally analyze a database
Syntax:
VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE | DISABLE_PAGE_SKIPPING } [, ...] ) ] [ table_name [ (column_name [, ...] ) ] ]
9.6 vs 9.5 FREEZE操作
测试用例
postgres=# create table test(id int, info text);
CREATE TABLE
插入6400万记录
$ vi test.sql
insert into test values (1,'test');
$ pgbench -M prepared -n -r -f ./test.sql -c 64 -j 64 -t 1000000
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
number of transactions per client: 1000000
number of transactions actually processed: 64000000/64000000
latency average = 0.164 ms
tps = 389383.803477 (including connections establishing)
tps = 389393.063237 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.162 insert into test values (1,'test');
1. 9.6测试结果
生成vm文件
postgres=# vacuum analyze test;
VACUUM
Time: 8629.180 ms
确认VM文件已生成
postgres=# select * from pg_stat_file(pg_relation_filepath('test')||'_vm');
size | access | modification | change | creation | isdir
-------+------------------------+------------------------+------------------------+----------+-------
90112 | 2016-10-02 13:34:24+08 | 2016-10-02 13:34:32+08 | 2016-10-02 13:34:32+08 | | f
(1 row)
记录当前XLOG位置
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
3/87E3C58
(1 row)
执行第一次vacuum freeze
postgres=# vacuum freeze test;
VACUUM
Time: 3487.945 ms
记录XLOG位置
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
3/2E3C3F30
(1 row)
计算第一次VACUUM FREEZE产生了多少XLOG
postgres=# select pg_size_pretty(pg_xlog_location_diff('3/2E3C3F30','3/87E3C58'));
pg_size_pretty
----------------
604 MB
(1 row)
不产生数据变更,开始接下来的vacuum freeze。
9.6接下来的VACUUM FREEZE非常快,已经自动跳过了frozen page,并且不会产生XLOG。
postgres=# vacuum freeze test;
VACUUM
Time: 16.581 ms
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
3/2E3C3F30
(1 row)
Time: 0.079 ms
postgres=# vacuum freeze test;
VACUUM
Time: 16.555 ms
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
3/2E3C3F30
(1 row)
Time: 0.115 ms
postgres=# vacuum freeze verbose test;
INFO: vacuuming "public.test"
INFO: "test": found 0 removable, 0 nonremovable row versions in 0 out of 345957 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: vacuuming "pg_toast.pg_toast_16415"
INFO: index "pg_toast_16415_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_16415": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 16.566 ms
消耗几个事务后,重新vacuum freeze依旧很快,产生非常少量的xlog(这些xlog实际上是变更test表的pg_class.relfrozenxid字段产生的)。
postgres=# select txid_current();
txid_current
--------------
64001925
(1 row)
postgres=# select txid_current();
txid_current
--------------
64001926
(1 row)
postgres=# select txid_current();
txid_current
--------------
64001927
(1 row)
postgres=# vacuum freeze verbose test;
INFO: vacuuming "public.test"
INFO: "test": found 0 removable, 0 nonremovable row versions in 0 out of 345957 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: vacuuming "pg_toast.pg_toast_16415"
INFO: index "pg_toast_16415_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_16415": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 18.020 ms
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
3/2E3C4130
(1 row)
postgres=# select pg_size_pretty(pg_xlog_location_diff('3/2E3C4130', '3/2E3C3F30'));
pg_size_pretty
----------------
512 bytes
(1 row)
使用强制vacuum freeze,不跳过froze pages.
当VM文件损坏时,可以这样使用。
postgres=# vacuum (freeze, verbose, DISABLE_PAGE_SKIPPING) test;
INFO: vacuuming "public.test"
INFO: "test": found 0 removable, 64000000 nonremovable row versions in 345957 out of 345957 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
2 pages are entirely empty.
CPU 0.68s/3.70u sec elapsed 4.38 sec.
INFO: vacuuming "pg_toast.pg_toast_16415"
INFO: index "pg_toast_16415_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_16415": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 4397.821 ms
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
3/2E3C42F0
(1 row)
postgres=# select pg_size_pretty(pg_xlog_location_diff('3/2E3C42F0','3/2E3C4130'));
pg_size_pretty
----------------
448 bytes
(1 row)
通过测试,我们可以看到PostgreSQL 9.6在vacuum freeze这块改进非常实用,特别是当数据库很庞大,并且存储了大量的历史静态数据时,在发生vacuum freeze时,IO不会像以前那样飙升了。
2. 9.5测试结果
生成vm文件
postgres=# vacuum analyze test;
VACUUM
Time: 17441.652 ms
确认VM文件已生成
postgres=# select * from pg_stat_file(pg_relation_filepath('test')||'_vm');
size | access | modification | change | creation | isdir
-------+------------------------+------------------------+------------------------+----------+-------
49152 | 2016-10-02 14:11:17+08 | 2016-10-02 14:11:34+08 | 2016-10-02 14:11:34+08 | | f
(1 row)
记录当前XLOG位置
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
5/A7FD5AC8
(1 row)
执行第一次vacuum freeze
postgres=# vacuum freeze test;
VACUUM
Time: 9889.702 ms
记录XLOG位置
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
5/CCB2DAB8
(1 row)
计算第一次VACUUM FREEZE产生了多少XLOG
postgres=# select pg_size_pretty(pg_xlog_location_diff('5/CCB2DAB8', '5/A7FD5AC8'));
pg_size_pretty
----------------
587 MB
(1 row)
不产生数据变更,开始接下来的vacuum freeze。
9.5的版本,接下来的VACUUM FREEZE同样需要扫描全表,因为VM中未记录frozen pages。
postgres=# vacuum freeze test;
VACUUM
Time: 7191.695 ms
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
5/CCB2DAB8
(1 row)
postgres=# vacuum freeze test;
VACUUM
Time: 7159.769 ms
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
5/CCB2DAB8
(1 row)
postgres=# vacuum freeze verbose test;
INFO: vacuuming "public.test"
INFO: "test": found 0 removable, 64000000 nonremovable row versions in 346248 out of 346248 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/7.18u sec elapsed 7.19 sec.
INFO: vacuuming "pg_toast.pg_toast_16682"
INFO: index "pg_toast_16682_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_16682": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 7206.649 ms
消耗几个事务后,重新vacuum freeze。
postgres=# select txid_current();
txid_current
--------------
128001924
(1 row)
postgres=# select txid_current();
txid_current
--------------
128001925
(1 row)
postgres=# select txid_current();
txid_current
--------------
128001926
(1 row)
postgres=# select txid_current();
txid_current
--------------
128001927
(1 row)
postgres=# vacuum freeze verbose test;
INFO: vacuuming "public.test"
INFO: "test": found 0 removable, 64000000 nonremovable row versions in 346248 out of 346248 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/7.18u sec elapsed 7.19 sec.
INFO: vacuuming "pg_toast.pg_toast_16682"
INFO: index "pg_toast_16682_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_16682": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 7208.527 ms
postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
5/CCB2DEA8
(1 row)
postgres=# select pg_size_pretty(pg_xlog_location_diff('5/CCB2DEA8', '5/CCB2DAB8'));
pg_size_pretty
----------------
1008 bytes
(1 row)
pg_visibility插件
使用pg_visibility插件,可以查看VM文件的内容。检测VM文件是否损坏,truncate损坏的vm文件等操作。
https://www.postgresql.org/docs/9.6/static/pgvisibility.html
postgres=# create extension pg_visibility;
CREATE EXTENSION
postgres=# select * from pg_visibility_map('test', 0);
all_visible | all_frozen
-------------+------------
t | t
(1 row)
postgres=# select * from pg_visibility_map('test');
blkno | all_visible | all_frozen
-------+-------------+------------
0 | t | t
1 | t | t
2 | t | t
.......
postgres=# select * from pg_visibility_map_summary('test');
all_visible | all_frozen
-------------+------------
44248 | 44248
(1 row)
postgres=# select * from pg_visibility('test', 0);
all_visible | all_frozen | pd_all_visible
-------------+------------+----------------
t | t | t
(1 row)
postgres=# select * from pg_visibility('test');
blkno | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
0 | t | t | t
1 | t | t | t
2 | t | t | t
3 | t | t | t
4 | t | t | t
5 | t | t | t
6 | t | t | t
......
检查VM文件是否损坏,如果有记录返回,说明数据库发生了crash,并且VM文件已损坏。
原理是检查页级别为all frozen或者all visibility的页,如果里面有不是freeze或者visibility的记录,则说明页级标记与行级标记不一致。需要修复。
postgres=# select * from pg_check_frozen('test');
t_ctid
--------
(0 rows)
postgres=# select * from pg_check_visible('test');
t_ctid
--------
(0 rows)
修复VM文件
postgres=# select * from pg_truncate_visibility_map('test');
pg_truncate_visibility_map
----------------------------
(1 row)
postgres=# vacuum analyze test;
VACUUM
出现不一致的原因
The page-level PD_ALL_VISIBLE bit has the same meaning as the all-visible bit in the visibility map, but is stored within the data page itself rather than a separate data structure.
These will normally agree, but the page-level bit can sometimes be set while the visibility map bit is clear after a crash recovery;
小结
PostgreSQL 9.6对VM文件的功能进行了扩展,实用2个比特位标记该页的记录数是否是clean的,对所有事务可见;以及标记该页的所有记录数是否都是FREEZE的。
在进行freeze操作时,或者触发autovacuum freeze prevent wrapped object时,会自动跳过标记为frozen的page,从而大幅减少vacuum freeze的IO。
对于数据库中有大量静态数据,并且又有高并发的写事务并行存在时,特别实用。
同时PostgreSQL 9.6还提供了强制vacuum freeze的接口,不跳过任何页。 保留这样的接口,目的是在vm文件损坏时可用,VM文件也可以通过这种方法自动修复。