为什么PostgreSQL查询语句也可能产生 xlog, 并且可能对buffer有write操作 ? hint bits
背景
本文还可以回答另一个问题,为什么数据库开启wal_hint_log后或者使用initdb -k参数开启checksum后,产生的XLOG变多了。
PostgreSQL 查询是可能产生XLOG的,这需要从PostgreSQL tuple上的hint bits说起。
什么是hint bits?你可以参考这个页面:
https://wiki.postgresql.org/wiki/Hint_Bits
hint bits是tuple头部的infomask里的2个BIT。用来表示该tuple的事务状态。
src/include/access/htup_details.h
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed 256 */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted 512 */
#define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed 1024 */
#define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted 2048 */
hint bits含义:
如果XMIN对应的BIT都没有设置,有两种可能,事务未结束,或者事务已结束,但是BIT还未被第一个检查该TUPLE HINT BIT的会话设置。
If neither of the XMIN bits is set, then either:
1. The creating transaction is still in progress, which you can check by examining the list of running transactions in shared memory;
2. You are the first one to check since it ended, in which case you need to consult pg_clog to know the transaction’s status, and you can update the hint bits if you find out its final state.
XMAX表示被删除的记录,原理一样。
If the tuple has been marked deleted, then similar remarks apply to the XMAX bits.
假设没有hint bits,我们只能从pg_clog中或者PGXACT内存结构中(未结束的或未清除的事务信息内存)得知该tuple对应的事务提交状态,显然如果每条tuple都要查询pg_clog的话,性能一定会很差。
所以为了提升性能,PostgreSQL在tuple的头部t_infomask中通过4个比特位来存储事务的提交状态。从而我们不需要查询pg_clog来获得事务信息。
但是请注意,并不是在事务结束时设置t_infomask的hint bits。而是在后面的DML或者DQL,VACUUM等SQL扫描到对应的TUPLE时,触发SET BITS的操作。
设置hint bits的代码如下:
SetHintBits@src/backend/utils/time/tqual.c
src/backend/utils/time/tqual.c
* tqual.c
* POSTGRES "time qualification" code, ie, tuple visibility rules.
*
* NOTE: all the HeapTupleSatisfies routines will update the tuple's
* "hint" status bits if we see that the inserting or deleting transaction
* has now committed or aborted (and it is safe to set the hint bits).
* If the hint bits are changed, MarkBufferDirtyHint is called on
* the passed-in buffer. The caller must hold not only a pin, but at least
* shared buffer content lock on the buffer containing the tuple.
*
* NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
* before TransactionIdDidCommit/TransactionIdDidAbort (which look in
* pg_clog). Otherwise we have a race condition: we might decide that a
* just-committed transaction crashed, because none of the tests succeed.
* xact.c is careful to record commit/abort in pg_clog before it unsets
* MyPgXact->xid in PGXACT array. That fixes that problem, but it also
* means there is a window where TransactionIdIsInProgress and
* TransactionIdDidCommit will both return true. If we check only
* TransactionIdDidCommit, we could consider a tuple committed when a
* later GetSnapshotData call will still think the originating transaction
* is in progress, which leads to application-level inconsistency. The
* upshot is that we gotta check TransactionIdIsInProgress first in all
* code paths, except for a few cases where we are looking at
* subtransactions of our own main transaction and so there can't be any
* race condition.
*
* Summary of visibility functions:
*
* HeapTupleSatisfiesMVCC()
* visible to supplied snapshot, excludes current command
* HeapTupleSatisfiesUpdate()
* visible to instant snapshot, with user-supplied command
* counter and more complex result
* HeapTupleSatisfiesSelf()
* visible to instant snapshot and current command
* HeapTupleSatisfiesDirty()
* like HeapTupleSatisfiesSelf(), but includes open transactions
* HeapTupleSatisfiesVacuum()
* visible to any running transaction, used by VACUUM
* HeapTupleSatisfiesToast()
* visible unless part of interrupted vacuum, used for TOAST
* HeapTupleSatisfiesAny()
* all tuples are visible
*
* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* IDENTIFICATION
* src/backend/utils/time/tqual.c
......
/*
* SetHintBits()
*
* Set commit/abort hint bits on a tuple, if appropriate at this time.
*
* It is only safe to set a transaction-committed hint bit if we know the
* transaction's commit record has been flushed to disk, or if the table is
* temporary or unlogged and will be obliterated by a crash anyway. We
* cannot change the LSN of the page here because we may hold only a share
* lock on the buffer, so we can't use the LSN to interlock this; we have to
* just refrain from setting the hint bit until some future re-examination
* of the tuple.
*
* We can always set hint bits when marking a transaction aborted. (Some
* code in heapam.c relies on that!)
*
* Also, if we are cleaning up HEAP_MOVED_IN or HEAP_MOVED_OFF entries, then
* we can always set the hint bits, since pre-9.0 VACUUM FULL always used
* synchronous commits and didn't move tuples that weren't previously
* hinted. (This is not known by this subroutine, but is applied by its
* callers.) Note: old-style VACUUM FULL is gone, but we have to keep this
* module's support for MOVED_OFF/MOVED_IN flag bits for as long as we
* support in-place update from pre-9.0 databases.
*
* Normal commits may be asynchronous, so for those we need to get the LSN
* of the transaction and then check whether this is flushed.
*
* The caller should pass xid as the XID of the transaction to check, or
* InvalidTransactionId if no check is needed.
*/
static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid)
{
if (TransactionIdIsValid(xid))
{
/* NB: xid must be known committed here! */
XLogRecPtr commitLSN = TransactionIdGetCommitLSN(xid); // 获取事务对应的commitLSN
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer)) // 在设置hint bits前,必须确保事务对应的xlog 已经flush到磁盘,否则可能出现不一致的情况。例如数据恢复时xlog没有,但是CLOG显示已提交。
return; /* not flushed yet, so don't set hint */
}
tuple->t_infomask |= infomask; // 设置hint bits
MarkBufferDirtyHint(buffer, true); // 将buffer标记为dirty,当initdb 打开了checksum或者使用了wal_log_hints时,如果它刚好是checkpoint后的第一个脏页,则写full page到WAL。
}
我们可以来验证一下:
> truncate t;
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
5497
(1 row)
stap跟踪
[root@digoal ~]# cat trc.stp
global f_start[999999]
probe process("/opt/pgsql/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
# printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
# printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $$params$$)
printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
[root@digoal ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 5497
postgres=# insert into t values (1);
INSERT 0 1
insert事务结束后,没有跟踪到SetHintBits,因为事务结束时不设置hint bits。
postgres=# select * from t;
id
----
1
(1 row)
执行select时跟踪到了sethintbits,参数infomask=256,即HEAP_XMIN_COMMITTED 。
71259406 postgres(5497): <- time:1441448520839, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734170, .t_xmax=0, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734170, .datum_typmod=0, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=1}, .t_infomask2=1, .t_infomask=2048, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=256 xid=390734170
71259458 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
再次插入一条记录
postgres=# insert into t values (2);
INSERT 0 1
无SetHintBits
更新2条记录,你会发现有一次SetHintBits
postgres=# update t set id=3;
UPDATE 2
这次SetHintBits实际上对应的是values(2)的那条记录,而values(1)那条已经在前面的select *被调用了SetHintBits。
5356459357 postgres(5497): <- time:1441453806039, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734178, .t_xmax=0, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734178, .datum_typmod=0, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=2}, .t_infomask2=1, .t_infomask=2048, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=256 xid=390734178
5356459410 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
那么现在实际上有4条记录,2条是老的版本,2条是新的版本。
再次执行select,全表扫描,会扫到所有的tuple,而且这4条TUPLE都是在最后一次事务后没有被设置SetHintBits的。所以会产生4次SetHintBits
postgres=# select * from t;
id
----
3
3
(2 rows)
5464475078 postgres(5497): <- time:1441453914055, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734177, .t_xmax=390734179, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734177, .datum_typmod=390734179, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=3}, .t_infomask2=16385, .t_infomask=256, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=1024 xid=390734179
5464475132 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
5464475156 postgres(5497): <- time:1441453914055, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734178, .t_xmax=390734179, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734178, .datum_typmod=390734179, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=4}, .t_infomask2=16385, .t_infomask=256, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=1024 xid=390734179
5464475190 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
前2条SetHintBits的infomask参数=1024对应HEAP_XMAX_COMMITTED,对应的是老的版本。
5464475210 postgres(5497): <- time:1441453914055, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734179, .t_xmax=0, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734179, .datum_typmod=0, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=3}, .t_infomask2=32769, .t_infomask=10240, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=256 xid=390734179
5464475243 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
5464475263 postgres(5497): <- time:1441453914055, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").call, par:tuple={.t_choice={.t_heap={.t_xmin=390734179, .t_xmax=0, .t_field3={.t_cid=0, .t_xvac=0}}, .t_datum={.datum_len_=390734179, .datum_typmod=0, .datum_typeid=0}}, .t_ctid={.ip_blkid={.bi_hi=0, .bi_lo=0}, .ip_posid=4}, .t_infomask2=32769, .t_infomask=10240, .t_hoff='\030', .t_bits=""} buffer=3657 infomask=256 xid=390734179
5464475294 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SetHintBits@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/time/tqual.c:110").return
前2条SetHintBits的infomask参数=256对应HEAP_XMIN_COMMITTED,对应的是新的版本。
接下来我们看看SetHintBits是不是产生了xlog,需要关注SetHintBits里的这个调用:
MarkBufferDirtyHint(buffer, true)@src/backend/storage/buffer/bufmgr.c
/*
* MarkBufferDirtyHint
*
* Mark a buffer dirty for non-critical changes.
*
* This is essentially the same as MarkBufferDirty, except:
*
* 1. The caller does not write WAL; so if checksums are enabled, we may need
* to write an XLOG_HINT WAL record to protect against torn pages.
* 2. The caller might have only share-lock instead of exclusive-lock on the
* buffer's content lock.
* 3. This function does not guarantee that the buffer is always marked dirty
* (due to a race condition), so it cannot be used for important changes.
*/
void
MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
......
仅仅当设置了wal_log_hints或初始化数据库initdb -k打开了checksum时,
它会调用XLogSaveBufferForHint来写WAL,当然不是所有的set hint都会写WAL,只有当checkpoint后,如果这个数据块是第一次被修改才需要写。
if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
{
...
MyPgXact->delayChkpt = delayChkpt = true;
lsn = XLogSaveBufferForHint(buffer, buffer_std);
......
什么情况下的set hint bits操作需要写WAL?
src/include/access/xlog.h
#define XLogHintBitIsNeeded() (DataChecksumsEnabled() || wal_log_hints)
src/backend/access/transam/xlog.c
/*
* Are checksums enabled for data pages?
*/
bool
DataChecksumsEnabled(void)
{
Assert(ControlFile != NULL);
return (ControlFile->data_checksum_version > 0);
}
......
XLogSaveBufferForHint(Buffer buffer, bool buffer_std)@src/backend/access/transam/xlog.c
/*
* Write a backup block if needed when we are setting a hint. Note that
* this may be called for a variety of page types, not just heaps.
*
* Callable while holding just share lock on the buffer content.
*
* We can't use the plain backup block mechanism since that relies on the
* Buffer being exclusively locked. Since some modifications (setting LSN, hint
* bits) are allowed in a sharelocked buffer that can lead to wal checksum
* failures. So instead we copy the page and insert the copied data as normal
* record data.
*
* We only need to do something if page has not yet been full page written in
* this checkpoint round. The LSN of the inserted wal record is returned if we
* had to write, InvalidXLogRecPtr otherwise.
*
* It is possible that multiple concurrent backends could attempt to write WAL
* records. In that case, multiple copies of the same block would be recorded
* in separate WAL records by different backends, though that is still OK from
* a correctness perspective. // 可能写多次哦
*/
XLogRecPtr
XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
{
通过XLogCheckBuffer来判断是否需要写XLOG。
/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*/
static bool
XLogCheckBuffer(XLogRecData *rdata, bool holdsExclusiveLock,
XLogRecPtr *lsn, BkpBlock *bkpb)
{
验证:
[root@digoal ~]# cat trc.stp
global f_start[999999]
probe process("/opt/pgsql/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
# printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}
probe process("/opt/pgsql/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b]) {
printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $return$$)
# printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
}
}
观察检查点后的没有发生full page write的数据块上的sethintbits操作:
postgres=# update t set id=4;
UPDATE 2
postgres=# checkpoint;
CHECKPOINT
postgres=# select * from t;
id
----
4
4
(2 rows)
跟踪到XLogCheckBuffer的返回结果为true,因为这个块是在checkpoint后第一次被修改,所以还没有FPW,因此这次SETHINTBITs需要写XLOG。
0 postgres(5497): -> time:1441457600685, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").call
30 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").return, par:'\001'
再次更新
postgres=# update t set id=5;
UPDATE 2
跟踪到XLogCheckBuffer返回FALSE,因为这个数据块在CHECKPOINT后,已经调用了FPW,所以这次SETHINTBIT不需要写XLOG。
0 postgres(5497): -> time:1441457627431, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").call
27 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").return, par:'\000'
0 postgres(5497): -> time:1441457627431, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").call
20 postgres(5497): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("XLogCheckBuffer@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/xlog.c:2031").return, par:'\000'
因此我们可以认为,当发生SetHintBits时,并不是每次都需要写XLOG,仅仅当开启了wal_log_hints或者initdb时开启了checksum,并且这个被sethintbits的数据块是在checkpoint后第一次被修改(改 bit也算一种修改。),这种情况下才需要写full page到WAL。其他情况不会写WAL。
同时,还有一种可能性,因为SetHintBits是针对单条TUPLE的,所以当有并行的会话在对一个PAGE的多个TUPLE进行SetHintBits操作时,可能导致这个PAGE在多次checkpoint时被写多次到WAL。或者在2个checkpoint之间,多次被bgwriter刷到OS dirty page,可能造成多次OS IO。
用gdb可以跟踪到XLogSaveBufferForHint 调用recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata)来写XLOG。
其他例子,当发生行级别锁冲突时,等待锁的会话在获得锁后,也会对修改的TUPLE设置HINT BITS。
例如两个update语句更新同一条记录时,后获得锁的事务会在先获得锁的事务提交后,对它所修改的记录设置hint bits。
/*
* UpdateXmaxHintBits - update tuple hint bits after xmax transaction ends
*
* This is called after we have waited for the XMAX transaction to terminate.
* If the transaction aborted, we guarantee the XMAX_INVALID hint bit will
* be set on exit. If the transaction committed, we set the XMAX_COMMITTED
* hint bit if possible --- but beware that that may not yet be possible,
* if the transaction committed asynchronously.
*
* Note that if the transaction was a locker only, we set HEAP_XMAX_INVALID
* even if it commits.
*
* Hence callers should look only at XMAX_INVALID.
*
* Note this is not allowed for tuples whose xmax is a multixact.
*/
static void
UpdateXmaxHintBits(HeapTupleHeader tuple, Buffer buffer, TransactionId xid)
{
Assert(TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple), xid));
Assert(!(tuple->t_infomask & HEAP_XMAX_IS_MULTI));
if (!(tuple->t_infomask & (HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID)))
{
if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
TransactionIdDidCommit(xid))
HeapTupleSetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
xid);
else
HeapTupleSetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
InvalidTransactionId);
}
}
参考
https://wiki.postgresql.org/wiki/Hint_Bits
src/include/access/htup_details.h
src/backend/utils/time/tqual.c
src/backend/storage/buffer/bufmgr.c
src/include/access/xlog.h
src/backend/access/transam/xlog.c