pg_clog的原子操作与pg_subtrans(子事务)

7 minute read

背景

如果没有子事务,其实很容易保证pg_clog的原子操作,但是,如果加入了子事务并为子事务分配了XID,并且某些子事务XID和父事务的XID不在同一个CLOG PAGE时,保证事务一致性就涉及CLOG的原子写了。

PostgreSQL是通过2PC来实现CLOG的原子写的。

1. 首先将主事务以外的CLOG PAGE中的子事务设置为sub-committed状态。

2. 然后将主事务所在的CLOG PAGE中的子事务设置为sub-committed,同时设置主事务为committed状态,将同页的子事务设置为committed状态。

3. 将其他CLOG PAGE中的子事务设置为committed状态。

代码如下:

src/backend/access/transam/clog.c

/*  
 * TransactionIdSetTreeStatus  
 *  
 * Record the final state of transaction entries in the commit log for  
 * a transaction and its subtransaction tree. Take care to ensure this is  
 * efficient, and as atomic as possible.  
 *  
 * xid is a single xid to set status for. This will typically be  
 * the top level transactionid for a top level commit or abort. It can  
 * also be a subtransaction when we record transaction aborts.  
 *  
 * subxids is an array of xids of length nsubxids, representing subtransactions  
 * in the tree of xid. In various cases nsubxids may be zero.  
 *  
 * lsn must be the WAL location of the commit record when recording an async  
 * commit.  For a synchronous commit it can be InvalidXLogRecPtr, since the  
 * caller guarantees the commit record is already flushed in that case.  It  
 * should be InvalidXLogRecPtr for abort cases, too.  
 *  
 * In the commit case, atomicity is limited by whether all the subxids are in  
 * the same CLOG page as xid.  If they all are, then the lock will be grabbed  
 * only once, and the status will be set to committed directly.  Otherwise  
 * we must  
 *       1. set sub-committed all subxids that are not on the same page as the  
 *              main xid  
 *       2. atomically set committed the main xid and the subxids on the same page  
 *       3. go over the first bunch again and set them committed  
 * Note that as far as concurrent checkers are concerned, main transaction  
 * commit as a whole is still atomic.  
 *  
 * Example:  
 *              TransactionId t commits and has subxids t1, t2, t3, t4  
 *              t is on page p1, t1 is also on p1, t2 and t3 are on p2, t4 is on p3  
 *              1. update pages2-3:  
 *                                      page2: set t2,t3 as sub-committed  
 *                                      page3: set t4 as sub-committed  
 *              2. update page1:  
 *                                      set t1 as sub-committed,  
 *                                      then set t as committed,  
                                        then set t1 as committed  
 *              3. update pages2-3:  
 *                                      page2: set t2,t3 as committed  
 *                                      page3: set t4 as committed  
 *  
 * NB: this is a low-level routine and is NOT the preferred entry point  
 * for most uses; functions in transam.c are the intended callers.  
 *  
 * XXX Think about issuing FADVISE_WILLNEED on pages that we will need,  
 * but aren't yet in cache, as well as hinting pages not to fall out of  
 * cache yet.  
 */  

实际调用的入口代码在transam.c。subtrans.c中是一些低级接口。

那么什么是subtrans?

当我们使用savepoint时,会产生子事务。子事务和父事务一样,可能消耗XID。一旦为子事务分配了XID,那么就涉及CLOG的原子操作了。因为要保证父事务和所有的子事务的CLOG一致性。

当不消耗XID时,需要通过SubTransactionId来区分子事务。

src/backend/access/transam/README

Transaction and Subtransaction Numbering  
----------------------------------------  
  
事务和子事务都可以有XID,子事务和事务一样,在真正需要XID的时候才会分配XID,  
  
也就是说,一个事务,如果它有子事务,可能消耗多个XID。  
  
另外需要注意,如果子事务要分配XID,必须先给它的父事务分配一个XID,才能给子事务分配XID,因为要确保子事务的XID是在父事务后分配的。  
  
Transactions and subtransactions are assigned permanent XIDs only when/if  
they first do something that requires one --- typically, insert/update/delete  
a tuple, though there are a few other places that need an XID assigned.  
If a subtransaction requires an XID, we always first assign one to its  
parent.  This maintains the invariant that child transactions have XIDs later  
than their parents, which is assumed in a number of places.  
  
The subsidiary actions of obtaining a lock on the XID and entering it into  
pg_subtrans and PG_PROC are done at the time it is assigned.  
  
A transaction that has no XID still needs to be identified for various  
purposes, notably holding locks.  For this purpose we assign a "virtual  
transaction ID" or VXID to each top-level transaction.  VXIDs are formed from  
two fields, the backendID and a backend-local counter; this arrangement allows  
assignment of a new VXID at transaction start without any contention for  
shared memory.  To ensure that a VXID isn't re-used too soon after backend  
exit, we store the last local counter value into shared memory at backend  
exit, and initialize it from the previous value for the same backendID slot  
at backend start.  All these counters go back to zero at shared memory  
re-initialization, but that's OK because VXIDs never appear anywhere on-disk.  
  
子事务没有分配事务号时,如何区分各个子事务呢?  
  
这里用到了SubTransactionId数据类型,从父事务开始SubTransactionId=1,后面的子事务递增。SubTransactionId是uint32的类型。  
  
Internally, a backend needs a way to identify subtransactions whether or not  
they have XIDs; but this need only lasts as long as the parent top transaction  
endures.  Therefore, we have SubTransactionId, which is somewhat like  
CommandId in that it's generated from a counter that we reset at the start of  
each top transaction.  The top-level transaction itself has SubTransactionId 1,  
and subtransactions have IDs 2 and up.  (Zero is reserved for  
InvalidSubTransactionId.)  Note that subtransactions do not have their  
own VXIDs; they use the parent top transaction's VXID.  
  
因为一个子事务要消耗4个字节,而且主事务默认会分配一个子事务号,所以和CLOG每事务消耗2BIT相比,pg_subtrans中会产生更多的文件。  
  
另外需要注意的是,子事务不一定会分配事务号,所以对于未分配事务号的子事务,在CLOG中是没有记录的。而在pg_subtrans中一定有记录并占空间。  

src/backend/access/transam/subtrans.c

/*  
 * Defines for SubTrans page sizes.  A page is the same BLCKSZ as is used  
 * everywhere else in Postgres.  
 *  
 * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,  
 * SubTrans page numbering also wraps around at  
 * 0xFFFFFFFF/SUBTRANS_XACTS_PER_PAGE, and segment numbering at  
 * 0xFFFFFFFF/SUBTRANS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no  
 * explicit notice of that fact in this module, except when comparing segment  
 * and page numbers in TruncateSUBTRANS (see SubTransPagePrecedes).  
 */  
  
/* We need four bytes per xact */  
#define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / sizeof(TransactionId))  
  
#define TransactionIdToPage(xid) ((xid) / (TransactionId) SUBTRANS_XACTS_PER_PAGE)  
#define TransactionIdToEntry(xid) ((xid) % (TransactionId) SUBTRANS_XACTS_PER_PAGE)  

验证

postgres@digoal-> psql  
psql (9.4.4)  
Type "help" for help.  
postgres=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
           5749  
(1 row)  

跟踪:

[root@digoal ~]# cat trc.stp   
global f_start[999999]  
  
probe process("/opt/pgsql/bin/postgres").function("*@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c").call {   
   f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()  
   printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)  
   # printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )  
}  
  
probe process("/opt/pgsql/bin/postgres").function("*@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c").return {  
  t=gettimeofday_ms()  
  a=execname()  
  b=cpu()  
  c=pid()  
  d=pp()  
  e=tid()  
  if (f_start[a,c,e,b]) {  
  printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $$locals$$)  
  # printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)  
  }  
}  

执行如下SQL:

postgres@digoal-> psql  
psql (9.4.4)  
Type "help" for help.  
postgres=# begin;  // 主事务开始,但是不分配事务号。  
BEGIN  
postgres=# select txid_current();  // 主事务调用DML函数,分配一个事务号。  
 txid_current   
--------------  
    607466850  
(1 row)  
postgres=# savepoint a;  // 开启子事务,但是不分配事务号,父事务号为607466850  
SAVEPOINT  
postgres=# \dt  
        List of relations  
 Schema | Name | Type  |  Owner     
--------+------+-------+----------  
 public | t    | table | postgres  
 public | test | table | postgres  
(2 rows)  
postgres=# delete from t;  // 子事务中调用DML,分配事务号607466851  
DELETE 2  
postgres=# rollback to a;  //  回滚子事务,创建新的子事务,但是不分配事务号,父事务号为607466850  
ROLLBACK  
postgres=# delete from t;  // 子事务中调用DML,分配事务号607466852  
DELETE 2  
postgres=# rollback to a;  //  回滚子事务,创建新的子事务,但是不分配事务号,父事务号为607466850  
ROLLBACK  
postgres=# delete from t; // 子事务中调用DML,分配事务号607466853  
DELETE 2  
postgres=# rollback to a;  //  回滚子事务,创建新的子事务,但是不分配事务号,父事务号为607466850  
ROLLBACK  
postgres=# insert into t values (1);    // 子事务中调用DML,分配事务号607466854  
INSERT 0 1  
postgres=# insert into t values (1);  
INSERT 0 1  
postgres=# insert into t values (1);  
INSERT 0 1  
postgres=# savepoint b;   // 开启子事务,但是不分配事务号,父事务号为607466854  
SAVEPOINT  
postgres=# insert into t values (1);   // 子事务中调用DML,分配事务号607466855  
INSERT 0 1  
postgres=# insert into t values (1);  
INSERT 0 1  
postgres=# savepoint c;  // 开启子事务,但是不分配事务号,父事务号为607466855  
SAVEPOINT  
postgres=# insert into t values (1);   // 子事务中调用DML,分配事务号607466856  
INSERT 0 1  
postgres=# savepoint d;  // 开启子事务,但是不分配事务号,父事务号为607466856  
SAVEPOINT  
postgres=# insert into t values (1);   // 子事务中调用DML,分配事务号607466857  
INSERT 0 1  
postgres=# rollback to a;  //  回滚子事务,创建新的子事务,但是不分配事务号,父事务号为607466850  
ROLLBACK  
postgres=# insert into t values (1);   // 子事务中调用DML,分配事务号607466858  
INSERT 0 1  
postgres=# select txid_current();   // 查看主事务的事务号  
 txid_current   
--------------  
    607466850  
(1 row)  

跟踪结果

[root@digoal ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 5749  
Pass 1: parsed user script and 112 library script(s) using 209284virt/36876res/3172shr/34504data kb, in 110usr/90sys/192real ms.  
Pass 2: analyzed script: 36 probe(s), 33 function(s), 4 embed(s), 27 global(s) using 223660virt/51416res/4248shr/48880data kb, in 0usr/130sys/134real ms.  
Pass 3: using cached /root/.systemtap/cache/28/stap_282339931bbfe754a24af75ea3476930_35559.c  
Pass 4: using cached /root/.systemtap/cache/28/stap_282339931bbfe754a24af75ea3476930_35559.ko  
Pass 5: starting run.  
     0 postgres(5749): -> time:1441519748850, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466848  
    22 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
20726607 postgres(5749): -> time:1441519769576, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466849  
20726671 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
69692931 postgres(5749): -> time:1441519818543, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466850  
69692991 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
85924642 postgres(5749): -> time:1441519834774, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466851  
85924720 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
85924766 postgres(5749): -> time:1441519834774, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466851 parent=607466850 overwriteOK='\000'  
85924838 postgres(5749): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466851 ptr=0  
102973659 postgres(5749): -> time:1441519851823, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466852  
102973718 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
102973746 postgres(5749): -> time:1441519851823, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466852 parent=607466850 overwriteOK='\000'  
102973782 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466852 ptr=0  
112206905 postgres(5749): -> time:1441519861057, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466853  
112206964 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
112206992 postgres(5749): -> time:1441519861057, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466853 parent=607466850 overwriteOK='\000'  
112207028 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466853 ptr=0  
152610154 postgres(5749): -> time:1441519901460, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466854  
152610212 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
152610238 postgres(5749): -> time:1441519901460, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466854 parent=607466850 overwriteOK='\000'  
152610275 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466854 ptr=0  
167139858 postgres(5749): -> time:1441519915990, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466855  
167139929 postgres(5749): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
167139958 postgres(5749): -> time:1441519915990, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466855 parent=607466854 overwriteOK='\000'  
167139995 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466855 ptr=0  
184727823 postgres(5749): -> time:1441519933578, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466856  
184727849 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
184727859 postgres(5749): -> time:1441519933578, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466856 parent=607466855 overwriteOK='\000'  
184727872 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466856 ptr=0  
228240429 postgres(5749): -> time:1441519977090, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466857  
228240493 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
228240520 postgres(5749): -> time:1441519977090, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466857 parent=607466856 overwriteOK='\000'  
228240557 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466857 ptr=0  
316079437 postgres(5749): -> time:1441520064929, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466858  
316079496 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?  
316079524 postgres(5749): -> time:1441520064929, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466858 parent=607466850 overwriteOK='\000'  
316079560 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466858 ptr=0  

重新开一个会话,你会发现,子事务也消耗了XID。因为重新分配的XID已经从607466859开始了。

postgres@digoal-> psql  
psql (9.4.4)  
Type "help" for help.  
postgres=# select txid_current();  
 txid_current   
--------------  
    607466859  
(1 row)  

参考

src/backend/access/transam/clog.c

src/backend/access/transam/subtrans.c

src/backend/access/transam/transam.c

src/backend/access/transam/README

src/include/c.h:typedef uint32 SubTransactionId;

Flag Counter

digoal’s 大量PostgreSQL文章入口