PostgreSQL xid(int4) to txid(int8)
背景
这篇BLOG的起源还是MVCC,PostgreSQL很多东西都是围绕 MVCC展开的。所以建议可以阅读一下手册的相关部分,另外我之前也写过几篇和MVCC相关的BLOG,有兴趣的朋友可以浏览一下。
今天我的同事francs发现我们昨天出现BUG的那个数据库的事务ID到100多亿了,正常情况下应该是在2^32约40亿以内的。
一开始我也非常惊讶,怎么会出现这样的情况。
到数据库里面查询,果真如此 :
digoal=# select txid_current();
txid_current
--------------
11658174917
(1 row)
那么怎么会这样呢?
我们先来看看数据库里面xid类型的定义 :
digoal=# select * from pg_type where typname='xid';
typname | typnamespace | typowner | typlen | typbyval | typtype | typcategory | typispreferred | typisdefined | typdelim | typrelid
| typelem | typarray | typinput | typoutput | typreceive | typsend | typmodin | typmodout | typanalyze | typalign | typstorage | ty
pnotnull | typbasetype | typtypmod | typndims | typdefaultbin | typdefault
---------+--------------+----------+--------+----------+---------+-------------+----------------+--------------+----------+---------
-+---------+----------+----------+-----------+------------+---------+----------+-----------+------------+----------+------------+---
---------+-------------+-----------+----------+---------------+------------
xid | 11 | 10 | 4 | t | b | U | f | t | , | 0
| 0 | 1011 | xidin | xidout | xidrecv | xidsend | - | - | - | i | p | f
| 0 | -1 | 0 | |
(1 row)
显然长度是4字节,32位没错的。
然后来看看 txid_current()这个函数返回的结果是什么类型 :
digoal=# \df+ txid_current
List of functions
Schema | Name | Result data type | Argument data types | Type | Volatility | Owner | Language | Source code |
Description
------------+--------------+------------------+---------------------+--------+------------+----------+----------+--------------+----
------------------------
pg_catalog | txid_current | bigint | | normal | stable | postgres | internal | txid_current | get
current transaction ID
(1 row)
结果是bigint,占据64个比特位.
那么为什么系统要使用这样一个超出XID范围的结果类型呢?
手册中有指出 :
The internal transaction ID type (xid) is 32 bits wide and wraps around every 4 billion transactions. However, these functions export a 64-bit format that is extended with an"epoch" counter so it will not wrap around during the life of an installation.
具体可以参考我另一篇BLOG
Get txid from pg_controldata’s output
所以txid_current()这个函数和tuple里面查到的xmin,xmax就没有太大的可比性了,计算年龄还是使用系统提供的age()函数比较靠谱。
关于这个函数返回的值,源码文件txid.c有这样一段 :
* Export internal transaction IDs to user level.
*
* Note that only top-level transaction IDs are ever converted to TXID.
* This is important because TXIDs frequently persist beyond the global
* xmin horizon, or may even be shipped to other machines, so we cannot
* rely on being able to correlate subtransaction IDs with their parents
* via functions such as SubTransGetTopmostTransaction().
*
/* txid will be signed int8 in database, so must limit to 63 bits */
#define MAX_TXID UINT64CONST(0x7FFFFFFFFFFFFFFF)
/*
* do a TransactionId -> txid conversion for an XID near the given epoch
*/
static txid
convert_xid(TransactionId xid, const TxidEpoch *state)
{
uint64 epoch;
/* return special xid's as-is */
if (!TransactionIdIsNormal(xid))
return (txid) xid;
/* xid can be on either side when near wrap-around */
epoch = (uint64) state->epoch;
if (xid > state->last_xid &&
TransactionIdPrecedes(xid, state->last_xid))
epoch--;
else if (xid < state->last_xid &&
TransactionIdFollows(xid, state->last_xid))
epoch++;
return (epoch << 32) | xid;
}
关于freeze的示例
digoal=# create table txid_test(id int);
CREATE TABLE
digoal=# insert into txid_test values (1);
INSERT 0 1
digoal=# insert into txid_test values (2);
INSERT 0 1
digoal=# insert into txid_test values (3);
INSERT 0 1
digoal=# select relfrozenxid,age(relfrozenxid) from pg_class where relname='txid_test';
relfrozenxid | age
--------------+--------
3070449946 | 361154
(1 row)
digoal=# select id,age(xmin),xmin,txid_current() from txid_test ;
id | age | xmin | txid_current
----+-------+------------+--------------
1 | 16578 | 3070798309 | 11660749479
2 | 14431 | 3070800456 | 11660749479
3 | 13595 | 3070801292 | 11660749479
(3 rows)
digoal=# vacuum full txid_test ;
VACUUM
digoal=# select relfrozenxid,age(relfrozenxid) from pg_class where relname='txid_test';
relfrozenxid | age
--------------+--------
3070449946 | 386485
(1 row)
digoal=# select id,age(xmin),xmin,txid_current() from txid_test ;
id | age | xmin | txid_current
----+-------+------------+--------------
1 | 46133 | 3070798309 | 11660779034
2 | 43986 | 3070800456 | 11660779034
3 | 43150 | 3070801292 | 11660779034
(3 rows)
vacuum full并没有对xmin进行修改,原因是这个表的relfrozenxid 的年龄未达到触发freeze事件 (默认是5000W).
下面是强制使用freeze。
digoal=# vacuum full freeze txid_test ;
VACUUM
digoal=# select relfrozenxid from pg_class where relname='txid_test';
relfrozenxid
--------------
3070104853
(1 row)
digoal=# select id,age(xmin),xmin,txid_current() from txid_test ;
id | age | xmin | txid_current
----+------------+------+--------------
1 | 2147483647 | 2 | 11660042643
2 | 2147483647 | 2 | 11660042643
3 | 2147483647 | 2 | 11660042643
(3 rows)
所以xmin发生了变化,
另外需要注意的是,freeze后的tuple, xmin 变成了一个FrozenXID,这个XID不参与比较,始终认为这个ID比其他的XID老,所以用 age 去计算始终显示 2147483647 。
术语解释请参考后面的术语部分。
关于freeze的示例2
这个示例和前面是一样的,只是凸显了set vacuum_freeze_min_age=0;的效果。
digoal=# delete from txid_test ;
DELETE 3
digoal=# alter table txid_test set (autovacuum_freeze_min_age=0);
ALTER TABLE
digoal=# insert into txid_test values (1);
INSERT 0 1
digoal=# insert into txid_test values (2);
INSERT 0 1
digoal=# insert into txid_test values (3);
INSERT 0 1
digoal=# select relfrozenxid,age(relfrozenxid) from pg_class where relname='txid_test';
relfrozenxid | age
--------------+--------
3070449946 | 852352
(1 row)
digoal=# vacuum full txid_test;
VACUUM
digoal=# select relfrozenxid,age(relfrozenxid) from pg_class where relname='txid_test';
relfrozenxid | age
--------------+--------
3070449946 | 859567
(1 row)
digoal=# select id,age(xmin),xmin,txid_current() from txid_test ;
id | age | xmin | txid_current
----+-------+------------+--------------
1 | 19165 | 3071293757 | 11661247514
2 | 17484 | 3071295438 | 11661247514
3 | 16902 | 3071296020 | 11661247514
(3 rows)
显然,没有XMIN没有发生变化,原因是autovacuum_freeze_min_age=0是autovacuum触发时生效的。我们手工的vacuum不会去读这个属性。
digoal=# set vacuum_freeze_min_age=0;
SET
digoal=# vacuum full txid_test ;
VACUUM
digoal=# select id,age(xmin),xmin,txid_current() from txid_test ;
id | age | xmin | txid_current
----+------------+------+--------------
1 | 2147483647 | 2 | 11661255866
2 | 2147483647 | 2 | 11661255866
3 | 2147483647 | 2 | 11661255866
(3 rows)
这里就生效了。
术语
参数 :
vacuum_freeze_min_age (integer)
Specifies the cutoff age (in transactions) that VACUUM should use to decide whether to replace transaction IDs with FrozenXID while scanning a table.
The default is 50 million transactions.
vacuum 加上freeze参数解释 :
FREEZE
Selects aggressive "freezing" of tuples. Specifying FREEZE is equivalent to performing VACUUM with the vacuum_freeze_min_age parameter set to zero.
系统表pg_class的relfrozenxid列的解释 :
relfrozenxid
xid
All transaction IDs before this one have been replaced with a permanent ("frozen") transaction ID in this table.
This is used to track whether the table needs to be vacuumed in order to prevent transaction ID wraparound or to allow pg_clog to be shrunk.
Zero (InvalidTransactionId) if the relation is not a table.
涉及到txid的系统函数 :
The functions shown in Table 9-52 provide server transaction information in an exportable form.
he main use of these functions is to determine which transactions were committed between two snapshots.
Table 9-52. Transaction IDs and Snapshots
Name | Return Type | Description |
---|---|---|
txid_current() | bigint | get current transaction ID |
txid_current_snapshot() | txid_snapshot | get current snapshot |
txid_snapshot_xip(txid_snapshot) | setof bigint | get in-progress transaction IDs in snapshot |
txid_snapshot_xmax(txid_snapshot) | bigint | get xmax of snapshot |
txid_snapshot_xmin(txid_snapshot) | bigint | get xmin of snapshot |
txid_visible_in_snapshot(bigint, txid_snapshot) | boolean | is transaction ID visible in snapshot? (do not use with subtransaction ids) |
The internal transaction ID type (xid) is 32 bits wide and wraps around every 4 billion transactions.
However, these functions export a 64-bit format that is extended with an”epoch” counter so it will not wrap around during the life of an installation.
The data type used by these functions, txid_snapshot, stores information about transaction ID visibility at a particular moment in time.
Its components are described in Table 9-53.
Table 9-53. Snapshot Components
Name | Description |
---|---|
xmin | Earliest transaction ID (txid) that is still active. All earlier transactions will either be committed and visible, or rolled back and dead. |
xmax | First as-yet-unassigned txid. All txids greater than or equal to this are not yet started as of the time of the snapshot, and thus invisible. |
xip_list | Active txids at the time of the snapshot. The list includes only those active txids between xmin and xmax; there might be active txids higher than xmax. A txid that isxmin <= txid < xmax and not in this list was already completed at the time of the snapshot, and thus either visible or dead according to its commit status. The list does not include txids of subtransactions. |
txid_snapshot's textual representation is xmin:xmax:xip_list.
For example 10:20:10,14,15 means xmin=10, xmax=20, xip_list=10, 14, 15.
截取自pg_resetxlog man page :
The transaction ID epoch is not actually stored anywhere in the database except in the field that is set by
pg_resetxlog, so any value will work so far as the database itself is concerned. You might need to adjust
this value to ensure that replication systems such as Slony-I work correctly — if so, an appropriate value
should be obtainable from the state of the downstream replicated database.
那么如何得知数据库集群的epoch呢? 这个数据在控制文件中.
pg_controldata $PGDATA
.....
Latest checkpoint's NextXID: 0/136923
.....
这里了的第一个斜线前面的部分0就是当前数据库集群的epoch.
对应源码
src/bin/pg_controldata/pg_controldata.c
printf(_("Latest checkpoint's NextXID: %u/%u\n"),
ControlFile.checkPointCopy.nextXidEpoch,
ControlFile.checkPointCopy.nextXid);
src/include/catalog/pg_control.h
typedef struct CheckPoint
{
XLogRecPtr redo; /* next RecPtr available when we began to
* create CheckPoint (i.e. REDO start point) */
TimeLineID ThisTimeLineID; /* current TLI */
TimeLineID PrevTimeLineID; /* previous TLI, if this record begins a new
* timeline (equals ThisTimeLineID otherwise) */
bool fullPageWrites; /* current full_page_writes */
uint32 nextXidEpoch; /* higher-order bits of nextXid */
参考
http://www.postgresql.org/docs/9.1/static/functions-info.html#FUNCTIONS-TXID-SNAPSHOT-PARTS
src/backend/utils/adt/txid.c
man pg_resetxlog
http://www.postgresql.org/docs/9.1/static/routine-vacuuming.html