并发事务, 共享行锁管理 - pg_multixact manager for shared-row-lock implementation

5 minute read

背景

multixact是管理共享行锁的基础代码,一个MultiXactId对应一个MultiXactMember数组,每个MultiXactMember包含事务号以及该事务号对应的flag(MultiXactStatus)。

src/include/access/multixact.h

typedef struct MultiXactMember  
{  
        TransactionId xid;  
        MultiXactStatus status;  
} MultiXactMember;  
  
typedef struct xl_multixact_create  
{  
        MultiXactId mid;                        /* new MultiXact's ID */  
        MultiXactOffset moff;           /* its starting offset in members file */  
        int32           nmembers;               /* number of member XIDs */  
        MultiXactMember members[FLEXIBLE_ARRAY_MEMBER];  
} xl_multixact_create;  

src/backend/access/transam/multixact.c

/*-------------------------------------------------------------------------  
 *  
 * multixact.c  
 *              PostgreSQL multi-transaction-log manager  
 *  
 * The pg_multixact manager is a pg_clog-like manager that stores an array of  
 * MultiXactMember for each MultiXactId.  It is a fundamental part of the  
 * shared-row-lock implementation.  Each MultiXactMember is comprised of a  
 * TransactionId and a set of flag bits.  The name is a bit historical:  
 * originally, a MultiXactId consisted of more than one TransactionId (except  
 * in rare corner cases), hence "multi".  Nowadays, however, it's perfectly  
 * legitimate to have MultiXactIds that only include a single Xid.  
 *  
 * The meaning of the flag bits is opaque to this module, but they are mostly  
 * used in heapam.c to identify lock modes that each of the member transactions  
 * is holding on any given tuple.  This module just contains support to store  
 * and retrieve the arrays.  
 *  
 * We use two SLRU areas, one for storing the offsets at which the data  
 * starts for each MultiXactId in the other one.  This trick allows us to  
 * store variable length arrays of TransactionIds.  (We could alternatively  
 * use one area containing counts and TransactionIds, with valid MultiXactId  
 * values pointing at slots containing counts; but that way seems less robust  
 * since it would get completely confused if someone inquired about a bogus  
 * MultiXactId that pointed to an intermediate slot containing an XID.)  
 *  
 * XLOG interactions: this module generates an XLOG record whenever a new  
 * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record  
 * whenever a new MultiXactId is defined.  This allows us to completely  
 * rebuild the data entered since the last checkpoint during XLOG replay.  
 * Because this is possible, we need not follow the normal rule of  
 * "write WAL before data"; the only correctness guarantee needed is that  
 * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a  
 * checkpoint is considered complete.  If a page does make it to disk ahead  
 * of corresponding WAL records, it will be forcibly zeroed before use anyway.  
 * Therefore, we don't need to mark our pages with LSN information; we have  
 * enough synchronization already.  
 *  
 * Like clog.c, and unlike subtrans.c, we have to preserve state across  
 * crashes and ensure that MXID and offset numbering increases monotonically  
 * across a crash.  We do this in the same way as it's done for transaction  
 * IDs: the WAL record is guaranteed to contain evidence of every MXID we  
 * could need to worry about, and we just make sure that at the end of  
 * replay, the next-MXID and next-offset counters are at least as large as  
 * anything we saw during replay.  
 *  
 * We are able to remove segments no longer necessary by carefully tracking  
 * each table's used values: during vacuum, any multixact older than a certain  
 * value is removed; the cutoff value is stored in pg_class.  The minimum value  
 * across all tables in each database is stored in pg_database, and the global  
 * minimum across all databases is part of pg_control and is kept in shared  
 * memory.  At checkpoint time, after the value is known flushed in WAL, any  
 * files that correspond to multixacts older than that value are removed.  
 * (These files are also removed when a restartpoint is executed.)  
 *  
 * When new multixactid values are to be created, care is taken that the  
 * counter does not fall within the wraparound horizon considering the global  
 * minimum value.  
 *  
 * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group  
 * Portions Copyright (c) 1994, Regents of the University of California  
 *  
 * src/backend/access/transam/multixact.c  
 *  
 *-------------------------------------------------------------------------  
 */  

因为每个MultiXactMember包含事务号xid以及该事务号对应的flag(MultiXactStatus),PG使用1个字节存储flag,为了对其,每4个字节和4个XID作为一个MultiXactMember组(刚好20个字节,含4个MultiXactMember)

同样multixact page和BLOCKSZ一样大,所以如果是8K的block size,可以包含409个MultiXactMember组。

如果你的某一条记录被409个事务同时加共享行锁,需要消耗一个8KB的multixact page。

/*  
 * The situation for members is a bit more complex: we store one byte of  
 * additional flag bits for each TransactionId.  To do this without getting  
 * into alignment issues, we store four bytes of flags, and then the  
 * corresponding 4 Xids.  Each such 5-word (20-byte) set we call a "group", and  
 * are stored as a whole in pages.  Thus, with 8kB BLCKSZ, we keep 409 groups  
 * per page.  This wastes 12 bytes per page, but that's OK -- simplicity (and  
 * performance) trumps space efficiency here.  
 *  
 * Note that the "offset" macros work with byte offset, not array indexes, so  
 * arithmetic must be done using "char *" pointers.  
 */  
/* We need eight bits per xact, so one xact fits in a byte */  
#define MXACT_MEMBER_BITS_PER_XACT                      8  
#define MXACT_MEMBER_FLAGS_PER_BYTE                     1  
#define MXACT_MEMBER_XACT_BITMASK       ((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)  
  
/* how many full bytes of flags are there in a group? */  
#define MULTIXACT_FLAGBYTES_PER_GROUP           4  
#define MULTIXACT_MEMBERS_PER_MEMBERGROUP       \  
        (MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)  
/* size in bytes of a complete group */  
#define MULTIXACT_MEMBERGROUP_SIZE \  
        (sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)  
#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)  
#define MULTIXACT_MEMBERS_PER_PAGE      \  
        (MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)  

multixact头文件:

有效的multixactid从1开始,最大为0xFFFFFFFF(和xid一致)。

src/include/access/multixact.h

/*  
 * The first two MultiXactId values are reserved to store the truncation Xid  
 * and epoch of the first segment, so we start assigning multixact values from  
 * 2.  
 */  
#define InvalidMultiXactId      ((MultiXactId) 0)  
#define FirstMultiXactId        ((MultiXactId) 1)  
#define MaxMultiXactId          ((MultiXactId) 0xFFFFFFFF)  
  
#define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)  
  
#define MaxMultiXactOffset      ((MultiXactOffset) 0xFFFFFFFF)  
  
/* Number of SLRU buffers to use for multixact */  
#define NUM_MXACTOFFSET_BUFFERS         8  
#define NUM_MXACTMEMBER_BUFFERS         16  

目前使用1个字节存储flag,即MultiXactStatus,现在支持6个状态值。

可以看到这些状态涉及6种行锁操作。

FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE  
an update that doesn't touch "key" columns  
other updates, and delete  
/*  
 * Possible multixact lock modes ("status").  The first four modes are for  
 * tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the  
 * next two are used for update and delete modes.  
 */  
typedef enum  
{  
        MultiXactStatusForKeyShare = 0x00,  
        MultiXactStatusForShare = 0x01,  
        MultiXactStatusForNoKeyUpdate = 0x02,  
        MultiXactStatusForUpdate = 0x03,  
        /* an update that doesn't touch "key" columns */  
        MultiXactStatusNoKeyUpdate = 0x04,  
        /* other updates, and delete */  
        MultiXactStatusUpdate = 0x05  
} MultiXactStatus;  

参考

src/include/access/multixact.h

src/backend/access/transam/multixact.c

Flag Counter

digoal’s 大量PostgreSQL文章入口