PostgreSQL XLOG fsync simple introduce

15 minute read

背景

一、where is the tuples? datafile, blocks.

从上到下的顺序查找.

1. PostgreSQL shared buffer

2. OS FileSystem cache

3. storage cache(raid card or controller’s cache)

4. disk cache (almost disabled)

5. disk

二、How point to tuples?

Index(optional)  ->  Heap page -> itemid -> item(tuple).  

三、which tuples can be saw by client? must satisfied two factors below at the same time.

1. xmin , xmax in tuple. – client can see the tuples xid before the client’s xid or database’s current xid.

2. transaction (xmin or xmax) status (t_infomask). – client can see tuples has been commited success.

后面将举例使用pageinspect来举例说明这两点.

四、transaction Durable(ACID - Durable) :

async commit & sync commit

1. the transaction’s redo info (xlog buffer) fsynced to disk .

后面将举例使用strace 跟踪系统接口write和fsync的调用.

五、which process does “wal fsync”?

1. wal writer process

2. backend process

后面将举例使用strace 跟踪系统接口write和fsync的调用.

六、WAL fsync parameters

http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS

举例

1. pageinspect的例子, 观察xmin, xmax, t_infomask的值. t_infomask的值见参考部分.

安装postgresql, 这里使用PostgreSQL 9.3, 从github上下载.

useradd pgdev  
vi /home/pgdev/.bash_profile  
export PS1="$USER@`/bin/hostname -s`-> "  
export PGPORT=9300  
export PGUSER=postgres  
export PGDATA=/data04/pgdev/pg_root  
export LANG=en_US.utf8  
export PGHOME=/home/pgdev/pgsql9.3  
export PGHOST=$PGDATA  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib  
export DATE=`date +"%Y%m%d%H%M"`  
export PATH=$PGHOME/bin:$PATH:.  
export MANPATH=$PGHOME/share/man:$MANPATH  
alias rm='rm -i'  
alias ll='ls -lh'  
  
wget https://nodeload.github.com/postgres/postgres/zipball/master  
unzip postgresql-snapshot.zip  
cd postgresql  
./configure --prefix=/home/pgdev/pgsql9.3 --with-pgport=9300 --with-perl --with-python --with-tcl --with-openssl --with-pam--without-ldap --with-libxml --with-libxslt --enable-thread-safety --with-wal-blocksize=16 --enable-debug && gmake  
gmake install  
cd postgresql/contrib/pageinspect  
. /home/pgdev/.bash_profile  
gmake  
gmake install  
su - pgdev  
initdb  

创建测试库, 表.

postgres=# create role digoal nosuperuser nocreatedb nocreaterole noinherit login encrypted password 'DIGOAL';  
CREATE ROLE  
postgres=# create database digoal with owner digoal;  
CREATE DATABASE  
postgres=# \c digoal postgres  
You are now connected to database "digoal" as user "postgres".  
digoal=# create extension pageinspect;  
CREATE EXTENSION  
digoal=# \c digoal digoal  
You are now connected to database "digoal" as user "digoal".  
digoal=> create schema digoal ;  
CREATE SCHEMA  
digoal=> create table t (id serial primary key, info text);  
CREATE TABLE  

使用pageinspect对heap信息讲解 :

digoal=> insert into t(info) values ('digoal');  
INSERT 0 1  
digoal=> select ctid,* from t;  
 ctid  | id |  info    
-------+----+--------  
 (0,1) |  1 | digoal  
(1 row)  
digoal=> \c digoal postgres  
You are now connected to database "digoal" as user "postgres".  
digoal=# select * from heap_page_items(get_raw_page('digoal.t',0));  
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid   
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------  
  1 |   8152 |        1 |     35 |   1746 |      0 |        0 | (0,1)  |           2 |       2306 |     24 |        |        
(1 row)  

信息 :

lp = 1 ( 对应 ctid(0,1) 里面的 1 )

t_infomask = 2306 ( = 0x0902 = 0x0002 + 0x0100 + 0x0800 )  
HEAP_HASVARWIDTH 0x0002  
HEAP_XMIN_COMMITTED 0x0100  
HEAP_XMAX_INVALID 0x0800  
t_xmin = 1746  
t_xmax = 0  

解释 :

这条记录的t_infomax由3个值相加得到, 分别是HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID. 表示这条记录的xmin已经提交, 同时xmax无效, 表示是新插入的记录. HEAP_HASVARWIDTH表示这条记录有变长字段.

所以这条记录能被后起的会话(分配了更新的xid的会话)看见. 如下(xid = 1747, 较t_xmin=1746 更新) :

digoal=# select txid_current();  
 txid_current   
--------------  
         1747  
(1 row)  
digoal=# select * from digoal.t;  
 id |  info    
----+--------  
  1 | digoal  
(1 row)  

接下来举一个插入并回滚的例子 :

session 1:

digoal=# begin;  
BEGIN  
digoal=# insert into digoal.t(info) values('rollback');  
INSERT 0 1  

session 2:

digoal=# select txid_current();  
 txid_current   
--------------  
         1749  
(1 row)  
digoal=# select * from heap_page_items(get_raw_page('digoal.t',0));  
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid   
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------  
  1 |   8152 |        1 |     35 |   1746 |      0 |        0 | (0,1)  |           2 |       2306 |     24 |        |        
  2 |   8112 |        1 |     37 |   1748 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |        
(2 rows)  

注意, 使用pageinspect分析的是块信息, 所以能见度较高, 无视xmin,xmax. 这里看到了会话1新插入的行. 信息如下 :

lp = 2 ( 对应 ctid(0,2) 里面的 2 )

t_infomask = 2050 ( = 0x0802 = 0x0002 + 0x0800 )  
HEAP_HASVARWIDTH 0x0002  
HEAP_XMAX_INVALID 0x0800  
t_xmin = 1748  
t_xmax = 0  

从以上信息可以分析, 这条记录的事务号是1748. 当前这个tuple的事务还未提交, 因为少了HEAP_XMIN_COMMITTED标记.

所以即使会话2的txid比会话1更新, 也看不到这条记录 (这个和事务隔离级别有关, PostgreSQL最低事务隔离级别=read committed)

所以在会话2中现在看不到这条记录.

digoal=# select * from digoal.t;  
 id |  info    
----+--------  
  1 | digoal  
(1 row)  

session 1:

回滚 :

digoal=# rollback;  
ROLLBACK  

session 2:

digoal=# select * from heap_page_items(get_raw_page('digoal.t',0));  
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid   
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------  
  1 |   8152 |        1 |     35 |   1746 |      0 |        0 | (0,1)  |           2 |       2306 |     24 |        |        
  2 |   8112 |        1 |     37 |   1748 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |        
(2 rows)  

PostgreSQL的回滚操作非常迅速, 从heap_page_items得到的信息中也可以验证, 显然, 回滚时, 数据块没有变化. 这条记录的t_infomask维持0x0802.

接下来举一个更新的例子 :

session 1:

digoal=# begin;  
BEGIN  
digoal=# update digoal.t set info='new' where id=1;  
UPDATE 1  

session 2:

digoal=# select * from heap_page_items(get_raw_page('digoal.t',0));  
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid   
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------  
  1 |   8152 |        1 |     35 |   1746 |   1750 |        0 | (0,3)  |       16386 |        258 |     24 |        |        
  2 |   8112 |        1 |     37 |   1748 |      0 |        0 | (0,2)  |           2 |       2050 |     24 |        |        
  3 |   8080 |        1 |     32 |   1750 |      0 |        0 | (0,3)  |       32770 |      10242 |     24 |        |        
(3 rows)  

解释:

ctid=(0,1)的记录t_xmax变成了1750, t_infomask变成258 = 0x0102 (HEAP_HASVARWIDTH, HEAP_XMIN_COMMITTED) 说明此时xmin和xmax都是有效的, 但是xmax还未提交. 因此其他会话能看到ctid=(0,1)这条. t_ctid=(0,3), 表示这条记录同时指向ctid=(0,3)的记录.

同时新增了ctid=(0,3)的记录, t_xmin=1750, t_infomask = 10242 = 0x2802 (HEAP_UPDATED, HEAP_XMAX_INVALID, HEAP_HASVARWIDTH) 说明这条记录的xmin是有效的, 但是还未提交.

session 1:

COMMIT;  

session 2:

digoal=# select * from digoal.t;  
 id | info   
----+------  
  1 | new  
(1 row)  
digoal=# select * from heap_page_items(get_raw_page('digoal.t',0));  
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid   
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------  
  1 |   8152 |        1 |     35 |   1746 |   1750 |        0 | (0,3)  |       16386 |       1282 |     24 |        |        
  2 |   8112 |        1 |     37 |   1748 |      0 |        0 | (0,2)  |           2 |       2562 |     24 |        |        
  3 |   8080 |        1 |     32 |   1750 |      0 |        0 | (0,3)  |       32770 |      10498 |     24 |        |        
(3 rows)  

session1提交后,

ctid=(0,1)的记录t_infomask变成1282 = 0x0502 (HEAP_HASVARWIDTH, HEAP_XMIN_COMMITTED, HEAP_XMAX_COMMITTED) 说明此时xmin和xmax都是有效并提交的状态. t_ctid=(0,3), 表示这条记录同时指向ctid=(0,3)的记录. (对于repeatable read和serializeable read的隔离级别, 读取哪条tuple, 就要分辨t_xmin和t_xmax了. 还有vacuum进程也需要看这两个值)

ctid=(0,3)的记录, t_infomask = 10498 = 0x2902 (HEAP_UPDATED, HEAP_XMIN_COMMITTED, HEAP_XMAX_INVALID, HEAP_HASVARWIDTH) 说明这条记录的xmin是有效的, 并已提交.

2. 使用strace跟踪fdatasync调用举例

下面使用的是PostgreSQL 9.3 devel版本进行的测试.

session 1 :

digoal=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
           9499  
(1 row)  
digoal=# show synchronous_commit;  
 synchronous_commit   
--------------------  
 on  
(1 row)  
digoal=# show wal_sync_method;  
 wal_sync_method   
-----------------  
 fdatasync  
(1 row)  

session 2 :

pgdev@db-172-16-3-150-> ps -ewf|grep pgdev  
pgdev     8393     1  0 08:26 pts/5    00:00:00 /home/pgdev/pgsql9.3/bin/postgres  
pgdev     8395  8393  0 08:26 ?        00:00:00 postgres: checkpointer process     
pgdev     8396  8393  0 08:26 ?        00:00:00 postgres: writer process           
pgdev     8397  8393  0 08:26 ?        00:00:00 postgres: wal writer process       
pgdev     8398  8393  0 08:26 ?        00:00:00 postgres: autovacuum launcher process     
pgdev     8399  8393  0 08:26 ?        00:00:00 postgres: stats collector process     
pgdev     9499  8393  0 09:08 ?        00:00:00 postgres: postgres digoal [local] idle  

跟踪wal writer 进程

strace -p 8397  

整个过程中没有跟踪到wal writer进程有系统调用操作.

session 3 :

跟踪backend 进程

strace -p 9499  

回滚测试 :

# session 1 (begin;)  
recvfrom(8, "Q\0\0\0\vbegin;\0", 8192, 0, NULL, NULL) = 12  
sendto(8, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 17, 0, NULL, 0) = 17  
  
  
# session 1 (insert into digoal.t (info) values ('test');)  
recvfrom(8, "Q\0\0\0001insert into digoal.t (info)"..., 8192, 0, NULL, NULL) = 50  
sendto(8, "C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T", 22, 0, NULL, 0) = 22  
  
  
# session 1 (rollback;)  
recvfrom(8, "Q\0\0\0\16rollback;\0", 8192, 0, NULL, NULL) = 15  
kill(8397, SIGUSR1)                     = 0  
sendto(7, "\2\0\0\0`\1\0\0006@\0\0\3\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 352, 0, NULL, 0) = 352  
sendto(8, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I", 20, 0, NULL, 0) = 20  

注意到, rollback时backend process和wal writer process都没有写xlog文件的操作(write或者fdatasync都没看到). (但是, 并不代表rollback掉的事务就不产生XLOG信息了, 产生xlog信息是在每个SQL执行过程中产生的, commit只是触发了fsync操作. 这些XLOG信息其实已经在wal buffer里面了, 后面会有例子来讲这个.)

更加详细的调用需要使用gdb来跟踪. strace很多信息都跟踪不到.

提交测试 :

# session 1 (begin;)  
recvfrom(8, "Q\0\0\0\vbegin;\0", 8192, 0, NULL, NULL) = 12  
sendto(8, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 17, 0, NULL, 0) = 17  
  
  
# session 1 (insert into digoal.t (info) values ('test');)  
recvfrom(8, "Q\0\0\0001insert into digoal.t (info)"..., 8192, 0, NULL, NULL) = 50  
sendto(8, "C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T", 22, 0, NULL, 0) = 22  
  
# session 1 (commit;)  
recvfrom(8, "Q\0\0\0\fcommit;\0", 8192, 0, NULL, NULL) = 13  
lseek(16, 327680, SEEK_SET)             = 327680  
write(16, "u\320\5\0\1\0\0\0\0\0\5\2\0\0\0\0O\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384  
fdatasync(16)                           = 0  
sendto(7, "\2\0\0\0`\1\0\0006@\0\0\3\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 352, 0, NULL, 0) = 352  
sendto(8, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 18, 0, NULL, 0) = 18  

注意到, commit时, backend process写xlog文件(查看该进程的fd)并调用fdatasync写入磁盘.

查看fd=16打开的是什么文件?

cd /proc/9499/fd  
ll  
lrwx------ 1 pgdev pgdev 64 Oct 16 09:21 16 -> /data04/pgdev/pg_root/pg_xlog/000000010000000000000002  

3. 使用strace跟踪fdatasync调用举例

以下使用的是PostgreSQL 9.1.3的版本进行的测试 :

session 1 :

postgres@db-172-16-3-40-> psql -h 127.0.0.1  
psql (9.1.3)  
Type "help" for help.  
postgres=# show synchronous_commit;  
 synchronous_commit   
--------------------  
 on  
(1 row)  
postgres=# show wal_sync_method;  
 wal_sync_method   
-----------------  
 fdatasync  
(1 row)  
postgres=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
          26479  
(1 row)  

session 2 :

postgres@db-172-16-3-40-> ps -ewf|grep postgres  
postgres 21941     1  0 Oct15 pts/5    00:00:29 /opt/pgsql/bin/postgres  
postgres 21948 21941  0 Oct15 ?        00:00:11 postgres: logger process     
postgres 21950 21941  0 Oct15 ?        00:00:00 postgres: writer process     
postgres 21951 21941  0 Oct15 ?        00:00:00 postgres: wal writer process     
postgres 21952 21941  0 Oct15 ?        00:00:00 postgres: autovacuum launcher process     
postgres 21953 21941  0 Oct15 ?        00:00:00 postgres: stats collector process     
postgres 26479 21941  0 09:02 ?        00:00:00 postgres: postgres postgres 127.0.0.1(17658) idle  

跟踪wal writer 进程

strace -p 21951  

回滚测试 :

# session 1 (insert into digoal.t (info) values ('test');)  
lseek(5, 7520256, SEEK_SET)             = 7520256  
write(5, "f\320\1\0\1\0\0\0\0\0\0\0\0\300rH\22\0\0\0pi\1\0\0\0\0\0\0\0\0\0"..., 8192) = 8192  
fdatasync(5)                            = 0  
  
  
# session 1 (rollback;)  
write(5, "f\320\1\0\1\0\0\0\0\0\0\0\0\340rH\210\1\0\0\377\377\1\0\345/\0\0enab"..., 8192) = 8192  
fdatasync(5)                            = 0  

fd信息如下 :

[root@db-172-16-3-40 fd]# cd /proc/26479/fd  
[root@db-172-16-3-40 fd]# ll 5  
lrwx------ 1 postgres postgres 64 Oct 16 09:09 5 -> /pgdata/digoal/1921/data02/pg_root/base/12699/12505  
postgres=# select pg_relation_filepath('t'::regclass);  
 pg_relation_filepath   
----------------------  
 base/12699/16923  
(1 row)  

wal writer进程写的居然是pg_am这个系统表, 暂时还不知道啥原因.

postgres=# select relname,relkind,pg_relation_filepath(oid) from pg_class where pg_relation_filepath(oid) ~ '12505';  
 relname | relkind | pg_relation_filepath   
---------+---------+----------------------  
 pg_am   | r       | base/12699/12505  
(1 row)  

session 3 :

跟踪backend 进程

strace -p 26479  

回滚测试 :

# session 1 (begin;)  
recvfrom(7, "Q\0\0\0\vbegin;\0", 8192, 0, NULL, NULL) = 12  
close(40)                               = 0  
sendto(7, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 17, 0, NULL, 0) = 17  
  
  
# session 1 (insert into digoal.t (info) values ('test');)  
recvfrom(7, "Q\0\0\0,insert into t (info) values"..., 8192, 0, NULL, NULL) = 45  
open("base/12699/16923_fsm", O_RDWR)    = -1 ENOENT (No such file or directory)  
open("base/12699/16923", O_RDWR)        = 18  
lseek(18, 0, SEEK_END)                  = 0  
lseek(18, 0, SEEK_END)                  = 0  
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192  
open("base/12699/16930", O_RDWR)        = 40  
read(40, "\0\0\0\0\0\0\0\0\1\0\0\0000\0\360\37\360\37\4 \0\0\0\0b1\5\0\2\0\0\0"..., 8192) = 8192  
open("base/12699/16930_fsm", O_RDWR)    = -1 ENOENT (No such file or directory)  
lseek(40, 0, SEEK_END)                  = 8192  
write(40, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192  
close(40)                               = 0  
sendto(7, "C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T", 22, 0, NULL, 0) = 22  
  
# session 1 (rollback;)  
recvfrom(7, "Q\0\0\0\16rollback;\0", 8192, 0, NULL, NULL) = 15  
sendto(6, "\2\0\0\0\300\3\0\0\2331\0\0\t\0\0\0\0\0\0\0\1\0\0\0\353\4\0\0\0\0\0\0"..., 960, 0, NULL, 0) = 960  
sendto(6, "\2\0\0\0\270\1\0\0\2331\0\0\4\0\0\0\0\0\0\0\0\0\0\0v\n\0\0\0\0\0\0"..., 440, 0, NULL, 0) = 440  
sendto(7, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I", 20, 0, NULL, 0) = 20  

注意到, rollback时backend process和wal writer process都没有写xlog文件的操作(write或者fdatasync都没看到). 但是有write fsm文件的操作.

更加详细的调用需要使用gdb来跟踪. strace很多信息都跟踪不到.

提交测试 :

# session 1 (begin;)  
recvfrom(7, "Q\0\0\0\vbegin;\0", 8192, 0, NULL, NULL) = 12  
sendto(7, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 17, 0, NULL, 0) = 17  
  
  
# session 1 (insert into digoal.t (info) values ('test');)  
recvfrom(7, "Q\0\0\0,insert into t (info) values"..., 8192, 0, NULL, NULL) = 45  
sendto(7, "C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T", 22, 0, NULL, 0) = 22  
  
  
# session 1 (commit;)  
recvfrom(7, "Q\0\0\0\fcommit;\0", 8192, 0, NULL, NULL) = 13  
write(38, "f\320\1\0\1\0\0\0\0\0\0\0\0\340rH\210\1\0\0\377\377\1\0\345/\0\0enab"..., 8192) = 8192  
fdatasync(38)                           = 0  
sendto(6, "\2\0\0\0P\1\0\0\2331\0\0\3\0\0\0\1\0\0\0\0\0\0\0\33B\0\0\0\0\0\0"..., 336, 0, NULL, 0) = 336  
sendto(7, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 18, 0, NULL, 0) = 18  

注意到, commit时, backend process写xlog文件(查看该进程的fd)并调用fdatasync写入磁盘.

查看fd=38打开的是什么文件?

postgres@db-172-16-3-40-> cd /proc/26479/fd  
postgres@db-172-16-3-40-> ll 38  
lrwx------ 1 postgres postgres 64 Oct 16 09:09 38 -> /pgdata/digoal/1921/data01/pg_xlog/000000010000000000000012  

4. gdb

以下使用PostgreSQL 9.3 devel版本进行测试.

session 1 :

pgdev@db-172-16-3-150-> psql  
psql (9.3devel)  
Type "help" for help.  
  
postgres=# \c digoal postgres  
You are now connected to database "digoal" as user "postgres".  
digoal=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
          12872  
(1 row)  

session 2 :

pgdev@db-172-16-3-150-> gdb  
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-37.el5)  
Copyright (C) 2009 Free Software Foundation, Inc.  
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>  
This is free software: you are free to change and redistribute it.  
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"  
and "show warranty" for details.  
This GDB was configured as "x86_64-redhat-linux-gnu".  
For bug reporting instructions, please see:  
<http://www.gnu.org/software/gdb/bugs/>.  
(gdb) attach 12872  
Attaching to process 12872  
Reading symbols from /home/pgdev/pgsql9.3/bin/postgres...done.  
Reading symbols from /usr/lib64/libxslt.so.1...(no debugging symbols found)...done.  
Loaded symbols for /usr/lib64/libxslt.so.1  
Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols found)...done.  
Loaded symbols for /usr/lib64/libxml2.so.2  
Reading symbols from /lib64/libssl.so.6...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libssl.so.6  
Reading symbols from /lib64/libcrypto.so.6...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libcrypto.so.6  
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libcrypt.so.1  
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libdl.so.2  
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libm.so.6  
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libc.so.6  
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libz.so.1  
Reading symbols from /usr/lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.  
Loaded symbols for /usr/lib64/libgssapi_krb5.so.2  
Reading symbols from /usr/lib64/libkrb5.so.3...(no debugging symbols found)...done.  
Loaded symbols for /usr/lib64/libkrb5.so.3  
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libcom_err.so.2  
Reading symbols from /usr/lib64/libk5crypto.so.3...(no debugging symbols found)...done.  
Loaded symbols for /usr/lib64/libk5crypto.so.3  
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.  
Loaded symbols for /lib64/ld-linux-x86-64.so.2  
Reading symbols from /usr/lib64/libkrb5support.so.0...(no debugging symbols found)...done.  
Loaded symbols for /usr/lib64/libkrb5support.so.0  
Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libkeyutils.so.1  
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libresolv.so.2  
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libselinux.so.1  
Reading symbols from /lib64/libsepol.so.1...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libsepol.so.1  
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.  
Loaded symbols for /lib64/libnss_files.so.2  
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fff05bc9000  
0x000000345fad4ef5 in recv () from /lib64/libc.so.6  
(gdb) b fdatasync  
Breakpoint 1 at 0x345facda60  
(gdb) c  
Continuing.  

session 1 :

digoal=# begin;  
BEGIN  
digoal=# insert into digoal.t (info) values ('test');  
INSERT 0 1  
digoal=# end;  

session 2 :

Breakpoint 1, 0x000000345facda60 in fdatasync () from /lib64/libc.so.6  
(gdb) bt  
#0  0x000000345facda60 in fdatasync () from /lib64/libc.so.6  
#1  0x0000000000491a19 in issue_xlog_fsync (fd=35, segno=3) at xlog.c:8215  
#2  0x0000000000497507 in XLogWrite (WriteRqst=..., flexible=0 '\000', xlog_switch=0 '\000') at xlog.c:1697  
#3  0x0000000000497a38 in XLogFlush (record=50343800) at xlog.c:1976  
#4  0x0000000000488dfb in RecordTransactionCommit () at xact.c:1125  
#5  CommitTransaction () at xact.c:1876  
#6  0x000000000048a9b9 in CommitTransactionCommand () at xact.c:2622  
#7  0x000000000062fbb3 in finish_xact_command () at postgres.c:2421  
#8  0x0000000000631e65 in exec_simple_query (query_string=0x34651f0 "end;") at postgres.c:1085  
#9  0x0000000000632c2d in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>)  
    at postgres.c:3976  
#10 0x00000000005f5267 in ServerLoop () at postmaster.c:3671  
#11 0x00000000005f7e91 in PostmasterMain (argc=1, argv=0x33b2d10) at postmaster.c:1185  
#12 0x00000000005967b3 in main (argc=1, argv=<value optimized out>) at main.c:197  

5. 那么fsync xlog 的操作是不是只在commit 的时候发生呢?

当然不是, 上面没有观察到rollback时写XLOG是因为wal buffer设得较大, 没有被写满, 同时数据库中没有其他会话有commit操作.

真正发生fsync xlog有几种可能, 1. commit的时候. 2. wal buffer被写满的时候.

下面使用大批量插入进行测试就能看出, 此时wal writer进程将会发生写xlog的操作.

postgres=# begin;insert into t (info) select generate_series(1,10000000);  
BEGIN  
INSERT 0 10000000  

跟踪wal writer进程, 未提交时也发生了xlog write :

postgres@db-172-16-3-40-> strace -p 21951   
Process 21951 attached - interrupt to quit  
open("pg_xlog/000000010000000000000021", O_RDWR) = 5  
write(5, "f\320\2\0\1\0\0\0\0\0\0\0\0\0\0\204\355\377_O\305\277^O\0\0\0\4\0 \0\0"..., 303104) = 303104  
fdatasync(5)                            = 0  
....write..  
close(5)                                = 0  
open("pg_xlog/000000010000000000000022", O_RDWR) = 5  
write(5, "f\320\2\0\1\0\0\0\0\0\0\0\0\0\0\210\355\377_O\305\277^O\0\0\0\4\0 \0\0"..., 245760) = 245760  
fdatasync(5)                            = 0  
....write..  
close(5)                                = 0  
open("pg_xlog/000000010000000000000023", O_RDWR) = 5  
write(5, "f\320\3\0\1\0\0\0\0\0\0\0\0\0\0\214\355\377_O\305\277^O\0\0\0\4\0 \0\0"..., 344064) = 344064  
fdatasync(5)                            = 0  
....write..  
close(5)                                = 0  
open("pg_xlog/000000010000000000000024", O_RDWR) = -1 ENOENT (No such file or directory)  
unlink("pg_xlog/xlogtemp.21951")        = -1 ENOENT (No such file or directory)  
open("pg_xlog/xlogtemp.21951", O_RDWR|O_CREAT|O_EXCL, 0600) = 5  
write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192  
  
... 省略多行同样的初始化xlog文件的操作.  
  
fsync(5)                                = 0  
close(5)                                = 0  
stat("pg_xlog/000000010000000000000024", 0x7fff4b204c30) = -1 ENOENT (No such file or directory)  
link("pg_xlog/xlogtemp.21951", "pg_xlog/000000010000000000000024") = 0  
unlink("pg_xlog/xlogtemp.21951")        = 0  
open("pg_xlog/000000010000000000000024", O_RDWR) = 5  
write(5, "f\320\2\0\1\0\0\0\0\0\0\0\0\0\0\220\355\377_O\305\277^O\0\0\0\4\0 \0\0"..., 303104) = 303104  
fdatasync(5)                            = 0  
....write.....  
......   
close(5)                                = 0  
open("pg_xlog/000000010000000000000028", O_RDWR) = -1 ENOENT (No such file or directory)  
unlink("pg_xlog/xlogtemp.21951")        = -1 ENOENT (No such file or directory)  
open("pg_xlog/xlogtemp.21951", O_RDWR|O_CREAT|O_EXCL, 0600) = 5  
write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192  
............  
stat("pg_xlog/000000010000000000000028", 0x7fff4b204c30) = -1 ENOENT (No such file or directory)  
link("pg_xlog/xlogtemp.21951", "pg_xlog/000000010000000000000028") = 0  
unlink("pg_xlog/xlogtemp.21951")        = 0  
open("pg_xlog/000000010000000000000028", O_RDWR) = 5  
write(5, "f\320\2\0\1\0\0\0\0\0\0\0\0\0\0\240\355\377_O\305\277^O\0\0\0\4\0 \0\0"..., 311296) = 311296  
fdatasync(5)                            = 0  

跟踪backend process :

postgres@db-172-16-3-40-> strace -p 26479   
  
# 仅当commit时backend process进程会对xlog进行写操作  
  
  
Process 26479 attached - interrupt to quit  
recvfrom(7, "Q\0\0\0\fcommit;\0", 8192, 0, NULL, NULL) = 13  
fadvise64(38, 0, 0, POSIX_FADV_DONTNEED) = 0  
close(38)                               = 0  
open("pg_xlog/000000010000000000000028", O_RDWR) = 38  
lseek(38, 56131584, SEEK_SET)           = 56131584  
write(38, "f\320\1\0\1\0\0\0\0\0\0\0\0\200X\243\"\0\0\0\177\6\0\0\2331\0\0\"B\0\0"..., 8192) = 8192  
fdatasync(38)                           = 0  
sendto(6, "\2\0\0\0\210\2\0\0\2331\0\0\6\0\0\0\1\0\0\0\0\0\0\0\33B\0\0\0\0\0\0"..., 648, 0, NULL, 0) = 648  
sendto(7, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 18, 0, NULL, 0) = 18  
  
[root@db-172-16-3-40 fd]# cd /proc/26479/fd  
[root@db-172-16-3-40 fd]# ll 38  
lrwx------ 1 postgres postgres 64 Oct 16 09:09 38 -> /pgdata/digoal/1921/data01/pg_xlog/000000010000000000000028  

因此在使用流复制的primary-standby环境中, standby可以迅速的接收到主节点已经fsync到disk的xlog信息. 取决于wal buffer是否满, 或者系统中有commit. 那么主节点上fsync到磁盘的xlog信息将被sender process(每10ms或者每次commit后被唤醒)发送给standby 的receiver process.

参考

1. Use pageinspect EXTENSION view PostgreSQL Page’s raw infomation

http://blog.163.com/digoal@126/blog/static/16387704020114273265960/

2. http://www.postgresql.org/docs/9.2/static/pageinspect.html

3. http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS

4. USE GDB debug postgres

http://blog.163.com/digoal@126/blog/static/163877040201252605924116/

5. t_infomask标示 :

htup.h

00158 /*  
00159  * information stored in t_infomask:  
00160  */  
00161 #define HEAP_HASNULL            0x0001  /* has null attribute(s) */  
00162 #define HEAP_HASVARWIDTH        0x0002  /* has variable-width attribute(s) */  
00163 #define HEAP_HASEXTERNAL        0x0004  /* has external stored attribute(s) */  
00164 #define HEAP_HASOID             0x0008  /* has an object-id field */  
00165 /* bit 0x0010 is available */  
00166 #define HEAP_COMBOCID           0x0020  /* t_cid is a combo cid */  
00167 #define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is exclusive locker */  
00168 #define HEAP_XMAX_SHARED_LOCK   0x0080  /* xmax is shared locker */  
00169 /* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */  
00170 #define HEAP_IS_LOCKED  (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK)  
00171 #define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */  
00172 #define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */  
00173 #define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */  
00174 #define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */  
00175 #define HEAP_XMAX_IS_MULTI      0x1000  /* t_xmax is a MultiXactId */  
00176 #define HEAP_UPDATED            0x2000  /* this is UPDATEd version of row */  
00177 #define HEAP_MOVED_OFF          0x4000  /* moved to another place by pre-9.0  
00178                                          * VACUUM FULL; kept for binary  
00179                                          * upgrade support */  
00180 #define HEAP_MOVED_IN           0x8000  /* moved from another place by pre-9.0  
00181                                          * VACUUM FULL; kept for binary  
00182                                          * upgrade support */  
00183 #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)  
00184   
00185 #define HEAP_XACT_MASK          0xFFE0  /* visibility-related bits */  

6. wal buffer :

wal_buffers (integer)  
  
The amount of shared memory used for WAL data that has not yet been written to disk.   
  
The default setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers,   
but not less than 64kB nor more than the size of one WAL segment, typically 16MB.   
  
This value can be set manually if the automatic choice is too large or too small,   
but any positive value less than 32kB will be treated as 32kB. This parameter can only be set at server start.  
  
The contents of the WAL buffers are written out to disk at every transaction commit, so extremely large values are unlikely to provide a significant benefit.   
  
However, setting this value to at least a few megabytes can improve write performance on a busy server where many clients are committing at once.   
  
The auto-tuning selected by the default setting of -1 should give reasonable results in most cases.  
  
Increasing this parameter might cause PostgreSQL to request more System V shared memory than your operating system's default configuration allows.   
  
See Section 17.4.1 for information on how to adjust those parameters, if necessary.  

Flag Counter

digoal’s 大量PostgreSQL文章入口