PostgreSQL pending patch : fail-back without fresh backup (have bug?)

18 minute read

背景

补充说明

本文测试fail-back未成功的bug是-m immediate的一个bug, 已经修复 :

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=91c3613d3748d881706c3e60d8221ea92833ac1a

http://blog.163.com/digoal@126/blog/static/16387704020137894542352/

正文

PostgreSQL的流复制应用非常普遍, 同时也可以作为HA的关键技术, 当主节点出现故障时, 可以通过激活备节点的方法达到高可用到目的.

当备份节点激活后, 原来的主节点可以作为备节点来使用.

但是, 如果主节点在发生故障前的WAL没有全部同步给备节点的话, 备节点激活后, 实际上与主节点是存在一定差异的. 这种情况下主节点不能直接作为备节点参与利用流复制做的增量恢复. 遇到这样的情况怎么样把主节点变成备节点呢?

1. 使用pg_basebackup, 或者COPY 的方法重新做一个standby, 这样的话相当于要把整个数据库的数据文件全部传一遍. 数据量大或者网络交差的话是比较痛苦的事情.

2. 使用rsync, 只传输增量数据, 但是需要注意的是, 如果数据库很大的话, rsync的数据比对过程非常漫长, 并且要消耗大量的io资源.

除了以上两种方法, 还有其他的工具可以利用, 例如在此前写过一篇关于fail-back的工具pg_rewind的使用文章.

pg_rewind利用新的primary的history 文件, 取出切换时间线是的xlog位置信息, 用这个位置信息来匹配旧的primary节点的xlog文件中的信息, 如果旧的primary在这个点之后还产生了xlog信息, 那么说明standby在promote后有未同步至standby节点的xlog数据, 所以需要对这部分xlog做差异处理. 把这部分xlog对应的变更的块以及控制文件,clog等从新的primary拷贝到老的primary节点.

具体可参考 :

http://blog.163.com/digoal@126/blog/static/1638770402013758511857/

除了pg_rewind, PostgreSQL可能在内核中加入类似pg_rewind的功能, 当然原理完全不一样, 加入内核的功能原理如下 :

The proposal is based on the concept of master should not do any file  
system level change until corresponding WAL record is replicated to the  
standby.  
  
There are many places in the code which need to be handled to support the  
proposed solution.  Following cases explains the need of fresh backup at  
the time of failover, and how can we avoid this need by our approach.  
  
1. We must not write any heap pages to the disk before the WAL records  
corresponding to those changes are received by the standby. Otherwise if  
standby failed to receive WAL corresponding to those heap pages there will  
be inconsistency.  
  
2. When CHECKPOINT happens on the master, control file of master gets  
updated and last checkpoint record is written to it. Suppose failover  
happens and standby fails to receive the WAL record corresponding to  
CHECKPOINT, then master and standby has inconsistent copies of control file  
that leads to the mismatch in redo record and recovery will not start  
normally. To avoid this situation we must not update the control file of  
master before the corresponding checkpoint WAL record is received by the  
standby  
  
3. Also when we truncate any of the physical files on the master and  
suppose the standby failed to receive corresponding WAL, then that physical  
file is truncated on master but still available on standby causing  
inconsistency. To avoid this we must not truncate physical files on the  
master before the WAL record corresponding to that operation is received by  
the standby.  
  
4. Same case applies to CLOG pages. If CLOG page is written to the disk and  
corresponding WAL record is not replicated to the standby, leads to the  
inconsistency. So we must not write the CLOG pages (and may be other SLRU  
pages too) to the disk before the corresponding WAL records are received by  
standby.  
  
5. The same problem applies for the commit hint bits. But it is more  
complicated than the other problems, because no WAL records are generated  
for that, hence we cannot apply the same above method, that is wait for  
corresponding WAL record to be replicated on standby.  So we delay the  
processes of updating the commit hint bits, similar to what is done by  
asynchronous commits.  In other words we need to check if the WAL  
corresponding to the transaction commit is received by the failback safe  
standby and then only allow hint bit updates.  
  
*Patch explanation:*  
  
The initial work on this patch is done by Pavan Deolasee. I tested it and  
will make further enhancements based on the community feedback.  
  
This patch is not complete yet, but I plan to do so with the help of this  
community. At this point, the primary purpose is to understand the  
complexities and get some initial performance numbers to alleviate some of  
the concerns raised by the community.  
  
There are two GUC parameters which supports this failsafe standby  
  
1. failback_safe_standby_name  [ name of the failsafe standby ] It is the  
name of failsafe standby. Master will not do any file system level change  
before corresponding WAL is replicated on the this failsafe standby  
  
2. failback_safe_standby_mode  [ off/remote_write/remote_flush] This  
parameter specifies the behavior of master i.e. whether it should wait for  
WAL to be written on standby or WAL to be flushed on standby.  We should  
turn it off when we do not want the failsafe standby. This failsafe mode  
can be combined with synchronous as well as asynchronous streaming  
replication.  
  
Most of the changes are done in the syncrep.c. This is a slight misnomer  
because that file deals with synchronous standby and a failback standby  
could and most like be a async standby. But keeping the changes this way  
has ensured that the patch is easy to read. Once we have acceptance on the  
approach, the patch can be modified to reorganize the code in a more  
logical way.  
  
The patch adds a new state SYNC_REP_WAITING_FOR_FAILBACK_SAFETY to the sync  
standby states. A backend which is waiting for a failback safe standby to  
receive WAL records, will wait in this state.  Failback safe mechanism can  
work in two different modes, that is wait for WAL to be written or flushed  
on failsafe standby. That is represented by two new modes  
SYNC_REP_WAIT_FAILBACK_SAFE_WRITE and SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH  
respectively.  
  
Also the SyncRepWaitForLSN() is changed for conditional wait. So that we  
can delay hint bit updates on master instead of blocking the wait for the  
failback safe standby to receiver WAL's.  

到第四版的补丁, 新增的参数名更改为synchronous_transfer, 可配置3个值 :

synchronous_transfer = commit   # data page synchronization level  
                                # commit, data_flush or all  

synchronous_transfer参数的含义 :

-'commit'  
    'commit' means that master waits for corresponding WAL to flushed  
to disk of standby server on commits.  
    but master doesn't waits for replicated data pages.  
  
  -'data_flush'  
    'data_flush' means that master waits for replicated data page  
(e.g, CLOG, pg_control) before flush to disk of master server.  
    but if user set to 'data_flush' to this parameter,  
'synchronous_commit' values is ignored even if user set  
'synchronous_commit'.  
  
  -'all'  
    'all' means that master waits for replicated WAL and data page.  

commit类似同步流复制,记录master节点的commit状态前,等待该xlog已经在standby flushed to disk.

data_flush则表示master节点在flush data page前, 必须确保这些变更对应的xlog信息已经在standby 节点flush to disk了. (这个配置将忽略synchronous_commit的配置)

同步流复制增加1种等待模式(SYNC_REP_WAIT_DATA_FLUSH) :

+++ b/src/include/replication/syncrep.h  
@@ -19,23 +19,42 @@  
 #define SyncRepRequested() \  
        (max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)  
   
+#define SyncTransRequested() \  
+       (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)  
+  
+#define IsSyncRepSkipped() \  
+       (max_wal_senders > 0 && synchronous_transfer ==  SYNCHRONOUS_TRANSFER_DATA_FLUSH)  
+  
 /* SyncRepWaitMode */  
-#define SYNC_REP_NO_WAIT               -1  
-#define SYNC_REP_WAIT_WRITE            0  
-#define SYNC_REP_WAIT_FLUSH            1  
+#define SYNC_REP_NO_WAIT                                       -1  
+#define SYNC_REP_WAIT_WRITE                                    0  
+#define SYNC_REP_WAIT_FLUSH                                    1  
+#define SYNC_REP_WAIT_DATA_FLUSH       2  
   
-#define NUM_SYNC_REP_WAIT_MODE 2  
+#define NUM_SYNC_REP_WAIT_MODE                         3  
   
 /* syncRepState */  
-#define SYNC_REP_NOT_WAITING           0  
-#define SYNC_REP_WAITING                       1  
-#define SYNC_REP_WAIT_COMPLETE         2  
+#define SYNC_REP_NOT_WAITING                                   0  
+#define SYNC_REP_WAITING                                               1  
+#define SYNC_REP_WAIT_COMPLETE                                 2  
+  
+typedef enum  
+{  
+       SYNCHRONOUS_TRANSFER_COMMIT,            /* no wait for flush data page */  
+       SYNCHRONOUS_TRANSFER_DATA_FLUSH,        /* wait for data page flush only  
+                                                                                * no wait for WAL */  
+       SYNCHRONOUS_TRANSFER_ALL                /* wait for data page flush */  
+}      SynchronousTransferLevel;  
src/backend/replication/syncrep.c  
/*  
 * ===========================================================  
 * Synchronous Replication functions for normal user backends  
 * ===========================================================  
 */  
  
/*  
 * Wait for synchronous replication, if requested by user.  
 *  
 * Initially backends start in state SYNC_REP_NOT_WAITING and then  
 * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH  
 * before adding ourselves to the wait queue. During SyncRepWakeQueue() a  
 * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is  
 * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.  
 *  
 * ForDataFlush - if TRUE, we wait for the flushing data page.  
 * Otherwise wait for the sync standby  
 *  
 * Wait - if FALSE, we don't actually wait, but tell the caller whether or not  
 * the standby has already made progressed upto the given XactCommitLSN  
 *  
 * Return TRUE if either the sync standby is not  
 * configured/turned off OR the standby has made enough progress  
 */  
bool  
SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)  

测试

下载PostgreSQL src :

http://git.postgresql.org/gitweb/?p=postgresql.git;a=snapshot;h=e5592c61adb0766eaee53ec07d2f05783d1c6548;sf=tgz  
tar -zxvf postgresql-e5592c6.tar.gz  
cd postgresql-e5592c6  

补丁 :

[root@db-172-16-3-39 postgresql-e5592c6]# wget http://www.postgresql.org/message-id/attachment/29640/failback_safe_standby_v4.patch  
[root@db-172-16-3-39 postgresql-e5592c6]# patch -p1 < ./failback_safe_standby_v4.patch   
patching file src/backend/access/transam/clog.c  
patching file src/backend/access/transam/slru.c  
patching file src/backend/access/transam/twophase.c  
patching file src/backend/access/transam/xact.c  
patching file src/backend/access/transam/xlog.c  
patching file src/backend/catalog/storage.c  
patching file src/backend/replication/syncrep.c  
patching file src/backend/storage/buffer/bufmgr.c  
patching file src/backend/utils/cache/relmapper.c  
patching file src/backend/utils/misc/guc.c  
patching file src/backend/utils/misc/postgresql.conf.sample  
patching file src/backend/utils/time/tqual.c  
patching file src/include/access/xlog.h  
patching file src/include/replication/syncrep.h  

安装PostgreSQL

[root@db-172-16-3-39 postgresql-e5592c6]# ./configure --prefix=/home/pg94/pgsql9.4devel --with-pgport=2999 --with-perl --with-tcl --with-python --with-openssl --with-pam --without-ldap --with-libxml --with-libxslt --enable-thread-safety --with-wal-blocksize=16 && gmake && gmake install  
cd contrib  
gmake  
gmake install  

初始化数据库

initdb -D $PGDATA -E UTF8 --locale=C -W -U postgres  

配置master

cd $PGDATA  
vi pg_hba.conf   
host replication postgres 172.16.3.0/24 md5  
host all all 0.0.0.0/0 md5  
vi postgresql.conf   
listen_addresses = '0.0.0.0'            # what IP address(es) to listen on;  
port = 2999                             # (change requires restart)  
max_connections = 100                   # (change requires restart)  
superuser_reserved_connections = 13     # (change requires restart)  
unix_socket_directories = '.'   # comma-separated list of directories  
unix_socket_permissions = 0700          # begin with 0 to use octal notation  
tcp_keepalives_idle = 60                # TCP_KEEPIDLE, in seconds;  
tcp_keepalives_interval = 10            # TCP_KEEPINTVL, in seconds;  
tcp_keepalives_count = 10               # TCP_KEEPCNT;  
shared_buffers = 1024MB                 # min 128kB  
maintenance_work_mem = 512MB            # min 1MB  
shared_preload_libraries = 'pg_stat_statements'         # (change requires restart)  
vacuum_cost_delay = 10                  # 0-100 milliseconds  
vacuum_cost_limit = 10000               # 1-10000 credits  
bgwriter_delay = 10ms                   # 10-10000ms between rounds  
wal_level = hot_standby                 # minimal, archive, or hot_standby  
synchronous_commit = remote_write               # synchronization level;  
wal_sync_method = fdatasync             # the default is the first option  
full_page_writes = on                   # recover from partial page writes  
wal_buffers = 16384kB                   # min 32kB, -1 sets based on shared_buffers  
checkpoint_segments = 32                # in logfile segments, min 1, 16MB each  
archive_mode = on               # allows archiving to be done  
archive_command = '/bin/date'           # command to use to archive a logfile segment  
max_wal_senders = 32            # max number of walsender processes  
wal_keep_segments = 128         # in logfile segments, 16MB each; 0 disables  
synchronous_transfer = data_flush   # data page synchronization level  
hot_standby = on                        # "on" allows queries during recovery  
max_standby_archive_delay = 300s        # max delay before canceling queries  
max_standby_streaming_delay = 300s      # max delay before canceling queries  
wal_receiver_status_interval = 1s       # send replies at least this often  
hot_standby_feedback = on               # send info from standby to prevent  
random_page_cost = 1.5                  # same scale as above  
effective_cache_size = 128000MB  
log_destination = 'csvlog'              # Valid values are combinations of  
logging_collector = on          # Enable capturing of stderr and csvlog  
log_directory = 'pg_log'                # directory where log files are written,  
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,  
log_file_mode = 0600                    # creation mode for log files,  
log_truncate_on_rotation = on           # If on, an existing log file with the  
log_rotation_age = 1d                   # Automatic rotation of logfiles will  
log_rotation_size = 10MB                # Automatic rotation of logfiles will  
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose           # terse, default, or verbose messages  
log_timezone = 'PRC'  
autovacuum = on                 # Enable autovacuum subprocess?  'on'  
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and  
datestyle = 'iso, mdy'  
timezone = 'PRC'  
lc_messages = 'C'                       # locale for system error message  
lc_monetary = 'C'                       # locale for monetary formatting  
lc_numeric = 'C'                        # locale for number formatting  
lc_time = 'C'                           # locale for time formatting  
default_text_search_config = 'pg_catalog.english'  
pg_stat_statements.max = 1000  
pg_stat_statements.track = all  
vi ~/.pgpass  
172.16.3.33:2999:replication:postgres:postgres  
172.16.3.39:2999:replication:postgres:postgres  
chmod 400 ~/.pgpass  
vi $PGDATA/recovery.done  
recovery_target_timeline = 'latest'  
standby_mode = on  
primary_conninfo = 'host=172.16.3.33 port=2999 user=postgres keepalives_idle=60'                # e.g. 'host=localhost port=5432'  

创建standby

安装过程略

复制

pg94@db-172-16-3-33-> pg_basebackup -D $PGDATA -F p -P -v -h 172.16.3.39 -p 2999 -U postgres  
WARNING:  skipping special file "./.s.PGSQL.2999"  
WARNING:  skipping special file "./.s.PGSQL.2999"94/pg_root/pg_clog/0000)  
26943/26943 kB (100%), 1/1 tablespace                                           
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived  
pg_basebackup: base backup completed  

配置

cd $PGDATA  
mv recovery.done recovery.conf  
vi recovery.conf  
recovery_target_timeline = 'latest'  
standby_mode = on  
primary_conninfo = 'host=172.16.3.39 port=2999 user=postgres keepalives_idle=60'                # e.g. 'host=localhost port=5432'  

启动standby

pg_ctl start  

主节点, 查看复制状态 :

pg94@db-172-16-3-39-> psql  
psql (9.4devel)  
Type "help" for help.  
digoal=# select * from pg_stat_replication ;  
  pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |   st  
ate   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state   
-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-----  
------+---------------+----------------+----------------+-----------------+---------------+------------  
 12221 |       10 | postgres | walreceiver      | 172.16.3.33 |                 |       20441 | 2013-08-06 09:15:35.617838+08 | stre  
aming | 0/3000090     | 0/3000090      | 0/3000090      | 0/3000090       |             0 | async  
(1 row)  

创建测试表, 测试函数

digoal=# create table test (id int primary key, info text, crt_time timestamp);  
CREATE TABLE  
digoal=# create or replace function func() returns void as $$  
digoal$# declare   
digoal$#   v_id int;  
digoal$# begin  
digoal$#   v_id := round(5000000*random());  
digoal$#   update test set info=md5(random()::text),crt_time=clock_timestamp() where id=v_id;  
digoal$#   if found then  
digoal$#     return;  
digoal$#   else  
digoal$#     insert into test values (v_id,md5(random()::text),clock_timestamp());  
digoal$#   end if;  
digoal$#   return;  
digoal$# end;  
digoal$# $$ language plpgsql strict;  
CREATE FUNCTION  

测试

digoal=# select func();  
 func   
------  
   
(1 row)  
digoal=# select * from test;  
   id    |               info               |          crt_time            
---------+----------------------------------+----------------------------  
 1894289 | e759b24e9cb227c300f85666c4ac2f81 | 2013-08-06 09:21:30.910392  
(1 row)  

压力测试 :

pg94@db-172-16-3-39-> vi test.sql  
select func();  
pg94@db-172-16-3-39-> pgbench -M prepared -f ./test.sql -r -n -h $PGDATA -p 2999 -U postgres -c 16 -j 4 -T 10 digoal  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 10 s  
number of transactions actually processed: 211697  
tps = 21156.265771 (including connections establishing)  
tps = 21187.462509 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.752017        select func();  

测试failback 失败 :

pg94@db-172-16-3-39-> pgbench -M prepared -f ./test.sql -r -n -h $PGDATA -p 2999 -U postgres -c 16 -j 4 -T 100 digoal  
Client 2 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"  
DETAIL:  Key (id)=(521247) already exists.  
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"  
PL/pgSQL function func() line 10 at SQL statement  
Client 14 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"  
DETAIL:  Key (id)=(4957439) already exists.  
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"  
PL/pgSQL function func() line 10 at SQL statement  
Client 15 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"  
DETAIL:  Key (id)=(3050904) already exists.  
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"  
PL/pgSQL function func() line 10 at SQL statement  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 100 s  
number of transactions actually processed: 2063626  
tps = 20633.969217 (including connections establishing)  
tps = 20637.002726 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.712672        select func();  

以上测试过程中将standby数据库关闭

pg94@db-172-16-3-33-> pg_ctl stop -m fast  
waiting for server to shut down.... done  
server stopped  

测试完后, 在主节点执行checkpoint; 因为配置了synchronous_transfer = data_flush. 会出现等待状态!

postgres: checkpointer process    waiting for 0/2D0C99D8  

此时停主库也是无法停止的, 也有flush的动作, 所以会等待.

12198 pg94      15   0 1181m  47m  46m S  0.0  0.6   0:00.04 /home/pg94/pgsql9.4devel/bin/postgres                                    
12199 pg94      15   0  108m 1032  504 S  0.0  0.0   0:00.00 postgres: logger process                                                 
12201 pg94      15   0 1182m 198m 197m S  0.0  2.5   0:00.34 postgres: checkpointer process    waiting for 0/2D0C99D8                 
12205 pg94      15   0  110m 1024  460 S  0.0  0.0   0:00.01 postgres: archiver process   last was 00000001000000000000002C           
12206 pg94      15   0  110m 1228  576 S  0.0  0.0   0:00.04 postgres: stats collector process                                        
12687 pg94      15   0 66068 1552 1208 S  0.0  0.0   0:00.01 -bash                                                                    
12786 pg94      15   0  109m 1748 1408 S  0.0  0.0   0:00.00 pg_ctl stop -m fast  

使用-m immediate关闭后, 重启主库.

pg94@db-172-16-3-39-> psql  
psql (9.4devel)  
Type "help" for help.  
digoal=# select count(*),sum(hashtext(test.*::text)) from test;  
  count  |     sum        
---------+--------------  
 1828052 | 747103747132  
(1 row)  

开启standby, 数据一致

pg94@db-172-16-3-33-> psql  
psql (9.4devel)  
Type "help" for help.  
  
digoal=# select count(*),sum(hashtext(test.*::text)) from test;  
  count  |     sum        
---------+--------------  
 1828052 | 747103747132  
(1 row)  

测试fail-over.

主库测试过程中, promote备库.

pg94@db-172-16-3-39-> pgbench -M prepared -f ./test.sql -r -n -h $PGDATA -p 2999 -U postgres -c 16 -j 4 -T 100 digoal  
./test.sql: No such file or directory  
pg94@db-172-16-3-39-> cd  
pg94@db-172-16-3-39-> pgbench -M prepared -f ./test.sql -r -n -h $PGDATA -p 2999 -U postgres -c 16 -j 4 -T 100 digoal  
Client 4 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"  
DETAIL:  Key (id)=(2592452) already exists.  
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"  
PL/pgSQL function func() line 10 at SQL statement  
Client 7 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"  
DETAIL:  Key (id)=(3677118) already exists.  
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"  
PL/pgSQL function func() line 10 at SQL statement  

promote备库.

pg94@db-172-16-3-33-> pg_ctl promote  
server promoting  
2013-08-06 09:34:46.772 CST,,,4728,,52005220.1278,6,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"received promote request",,,,,,,,"CheckForStandbyTrigger, xlog.c:11127",""  
2013-08-06 09:34:46.772 CST,,,4745,,5200523e.1289,2,,2013-08-06 09:32:46 CST,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,"ProcessWalRcvInterrupts, walreceiver.c:167",""  
2013-08-06 09:34:46.871 CST,,,4728,,52005220.1278,7,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"record with zero length at 0/45AD5F80",,,,,,,,"ReadRecord, xlog.c:4420",""  
2013-08-06 09:34:46.871 CST,,,4728,,52005220.1278,8,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"redo done at 0/45AD5F50",,,,,,,,"StartupXLOG, xlog.c:6878",""  
2013-08-06 09:34:46.871 CST,,,4728,,52005220.1278,9,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"last completed transaction was at log time 2013-08-06 09:34:46.753686+08",,,,,,,,"StartupXLOG, xlog.c:6883",""  
2013-08-06 09:34:46.871 CST,,,4728,,52005220.1278,10,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"selected new timeline ID: 2",,,,,,,,"StartupXLOG, xlog.c:6989",""  
2013-08-06 09:34:47.080 CST,,,4728,,52005220.1278,11,,2013-08-06 09:32:16 CST,1/0,0,LOG,00000,"archive recovery complete",,,,,,,,"exitArchiveRecovery, xlog.c:5639",""  
2013-08-06 09:35:21.939 CST,,,4729,,52005220.1279,2,,2013-08-06 09:32:16 CST,,0,LOG,00000,"restartpoint complete: wrote 24656 buffers (18.8%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=149.076 s, sync=0.868 s, total=150.366 s; sync files=41, longest=0.527 s, average=0.021 s",,,,,,,,"LogCheckpointEnd, xlog.c:7898",""  
2013-08-06 09:35:21.939 CST,,,4729,,52005220.1279,3,,2013-08-06 09:32:16 CST,,0,LOG,00000,"recovery restart point at 0/2D0C99D8","last completed transaction was at log time 2013-08-06 09:34:46.753686+08",,,,,,,"CreateRestartPoint, xlog.c:8715",""  
2013-08-06 09:36:03.135 CST,,,4840,"",52005303.12e8,1,"",2013-08-06 09:36:03 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,"BackendInitialize, postmaster.c:3923",""  
2013-08-06 09:36:03.136 CST,"postgres","digoal",4840,"[local]",52005303.12e8,2,"authentication",2013-08-06 09:36:03 CST,2/3,0,LOG,00000,"connection authorized: user=postgres database=digoal",,,,,,,,"PerformAuthentication, postinit.c:239",""  
2013-08-06 09:36:16.409 CST,"postgres","digoal",4840,"[local]",52005303.12e8,3,"idle",2013-08-06 09:36:03 CST,,0,LOG,00000,"disconnection: session time: 0:00:13.274 user=postgres database=digoal host=[local]",,,,,,,,"log_disconnections, postgres.c:4427","psql"  

注意由于备库也配置了synchronous_transfer = data_flush, 所有的data flush操作都会等待, 所以要尽快把主库转成备库.

pg94@db-172-16-3-39-> cd $PGDATA  
pg94@db-172-16-3-39-> mv recovery.done recovery.conf  
pg94@db-172-16-3-39-> pg_ctl restart -m immediate  
waiting for server to shut down.... done  
server stopped  
server starting  
pg94@db-172-16-3-39-> LOG:  00000: loaded library "pg_stat_statements"  
LOCATION:  load_libraries, miscinit.c:1296  

主库 fail-back失败 :

2013-08-06 09:39:38.208 CST,,,13320,,520053da.3408,1,,2013-08-06 09:39:38 CST,,0,LOG,00000,"database system was interrupted; last known up at 2013-08-06 09:32:33 CST",,,,,,,,"StartupXLOG, xlog.c:6080",""  
2013-08-06 09:39:38.208 CST,,,13320,,520053da.3408,2,,2013-08-06 09:39:38 CST,,0,LOG,00000,"entering standby mode",,,,,,,,"StartupXLOG, xlog.c:6133",""  
2013-08-06 09:39:38.208 CST,,,13320,,520053da.3408,3,,2013-08-06 09:39:38 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,"StartupXLOG, xlog.c:6443",""  
2013-08-06 09:39:38.210 CST,,,13320,,520053da.3408,4,,2013-08-06 09:39:38 CST,1/0,0,LOG,00000,"redo starts at 0/36000090",,,,,,,,"StartupXLOG, xlog.c:6687",""  
2013-08-06 09:39:53.644 CST,,,13320,,520053da.3408,5,,2013-08-06 09:39:38 CST,1/0,0,LOG,00000,"record with zero length at 0/52438570",,,,,,,,"ReadRecord, xlog.c:4420",""  
2013-08-06 09:39:53.644 CST,,,13320,,520053da.3408,6,,2013-08-06 09:39:38 CST,1/0,0,LOG,00000,"consistent recovery state reached at 0/52438570",,,,,,,,"CheckRecoveryConsistency, xlog.c:7348",""  
2013-08-06 09:39:53.645 CST,,,13318,,520053da.3406,1,,2013-08-06 09:39:38 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,"sigusr1_handler, postmaster.c:4737",""  
2013-08-06 09:39:53.649 CST,,,13339,,520053e9.341b,1,,2013-08-06 09:39:53 CST,,0,LOG,00000,"started streaming WAL from primary at 0/52000000 on timeline 1",,,,,,,,"WalReceiverMain, walreceiver.c:366",""  
2013-08-06 09:39:53.650 CST,,,13339,,520053e9.341b,2,,2013-08-06 09:39:53 CST,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR:  requested starting point 0/52000000 is ahead of the WAL flush position of this server 0/45AD6088  
",,,,,,,,"libpqrcv_receive, libpqwalreceiver.c:513",""  
2013-08-06 09:40:08.653 CST,,,13346,,520053f8.3422,1,,2013-08-06 09:40:08 CST,,0,LOG,00000,"started streaming WAL from primary at 0/52000000 on timeline 1",,,,,,,,"WalReceiverMain, walreceiver.c:366",""  
2013-08-06 09:40:08.654 CST,,,13346,,520053f8.3422,2,,2013-08-06 09:40:08 CST,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR:  requested starting point 0/52000000 is ahead of the WAL flush position of this server 0/45AD6088  
",,,,,,,,"libpqrcv_receive, libpqwalreceiver.c:513",""  

将备库的synchronous_transfer改为commit. fail-back失败.

pg94@db-172-16-3-39-> pgbench -M prepared -f ./test.sql -r -n -h $PGDATA -p 2999 -U postgres -c 16 -j 4 -T 100 digoal  
Client 0 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"  
DETAIL:  Key (id)=(2242608) already exists.  
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"  
PL/pgSQL function func() line 10 at SQL statement  
Client 7 aborted in state 0: ERROR:  duplicate key value violates unique constraint "test_pkey"  
DETAIL:  Key (id)=(2615244) already exists.  
CONTEXT:  SQL statement "insert into test values (v_id,md5(random()::text),clock_timestamp())"  
PL/pgSQL function func() line 10 at SQL statement  
  
pg94@db-172-16-3-39-> psql  
psql (9.4devel)  
Type "help" for help.  
digoal=# checkpoint;  
CHECKPOINT  
digoal=# checkpoint;  
CHECKPOINT  
  
pg94@db-172-16-3-33-> pg_ctl promote  
server promoting  
pg94@db-172-16-3-33-> top -c -u pg94  
top - 09:52:29 up 74 days, 16:10,  1 user,  load average: 0.29, 0.13, 0.07  
Tasks: 209 total,   1 running, 208 sleeping,   0 stopped,   0 zombie  
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st  
Mem:  14369440k total,  9195600k used,  5173840k free,   404756k buffers  
Swap:  8385920k total,      184k used,  8385736k free,  8036496k cached  
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                  
 5409 pg94      15   0 12760 1204  844 R  0.7  0.0   0:00.12 top -c -u pg94                                                           
 4262 pg94      15   0 67272 2900 1328 S  0.0  0.0   0:00.26 -bash                                                                    
 5382 pg94      15   0 1181m  47m  46m S  0.0  0.3   0:00.04 /home/pg94/pgsql9.4devel/bin/postgres                                    
 5383 pg94      16   0  108m 1048  516 S  0.0  0.0   0:00.00 postgres: logger process                                                 
 5386 pg94      15   0 1182m 296m 295m S  0.0  2.1   0:00.55 postgres: checkpointer process                                           
 5387 pg94      15   0 1182m 6744 6172 S  0.0  0.0   0:00.00 postgres: writer process                                                 
 5388 pg94      15   0  110m 1100  524 S  0.0  0.0   0:00.00 postgres: stats collector process                                        
 5404 pg94      15   0 1182m 1192  620 S  0.0  0.0   0:00.00 postgres: wal writer process                                             
 5405 pg94      17   0 1183m 2248 1164 S  0.0  0.0   0:00.00 postgres: autovacuum launcher process                                    
 5406 pg94      18   0  110m 1044  472 S  0.0  0.0   0:00.00 postgres: archiver process   last was 00000002.history   
  
pg94@db-172-16-3-39-> pg_ctl stop -m immediate  
waiting for server to shut down.... done  
server stopped  
pg94@db-172-16-3-39-> cd $PGDATA  
pg94@db-172-16-3-39-> mv recovery.done recovery.conf  
pg94@db-172-16-3-39-> pg_ctl start  
failed-back 失败 :   
pg94@db-172-16-3-39-> cat postgresql-2013-08-06_095340.csv  
2013-08-06 09:53:40.585 CST,,,13969,,52005724.3691,1,,2013-08-06 09:53:40 CST,,0,LOG,00000,"database system was interrupted; last known up at 2013-08-06 09:51:32 CST",,,,,,,,"StartupXLOG, xlog.c:6080",""  
2013-08-06 09:53:40.585 CST,,,13969,,52005724.3691,2,,2013-08-06 09:53:40 CST,,0,LOG,00000,"entering standby mode",,,,,,,,"StartupXLOG, xlog.c:6133",""  
2013-08-06 09:53:40.585 CST,,,13969,,52005724.3691,3,,2013-08-06 09:53:40 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,"StartupXLOG, xlog.c:6443",""  
2013-08-06 09:53:40.587 CST,,,13969,,52005724.3691,4,,2013-08-06 09:53:40 CST,1/0,0,LOG,00000,"redo starts at 0/74BAF018",,,,,,,,"StartupXLOG, xlog.c:6687",""  
2013-08-06 09:53:45.958 CST,,,13969,,52005724.3691,5,,2013-08-06 09:53:40 CST,1/0,0,LOG,00000,"record with zero length at 0/89217048",,,,,,,,"ReadRecord, xlog.c:4420",""  
2013-08-06 09:53:45.959 CST,,,13969,,52005724.3691,6,,2013-08-06 09:53:40 CST,1/0,0,LOG,00000,"consistent recovery state reached at 0/89217048",,,,,,,,"CheckRecoveryConsistency, xlog.c:7348",""  
2013-08-06 09:53:45.960 CST,,,13967,,52005724.368f,1,,2013-08-06 09:53:40 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,"sigusr1_handler, postmaster.c:4737",""  
2013-08-06 09:53:45.965 CST,,,13976,,52005729.3698,1,,2013-08-06 09:53:45 CST,,0,LOG,00000,"fetching timeline history file for timeline 2 from primary server",,,,,,,,"WalRcvFetchTimeLineHistoryFiles, walreceiver.c:666",""  
2013-08-06 09:53:45.966 CST,,,13976,,52005729.3698,2,,2013-08-06 09:53:45 CST,,0,FATAL,XX000,"could not start WAL streaming: ERROR:  requested starting point 0/89000000 on timeline 1 is not in this server's history  
DETAIL:  This server's history forked from timeline 1 at 0/82603A30  
",,,,,,,,"libpqrcv_startstreaming, libpqwalreceiver.c:195",""  

其他

1. 加入data_flush后 , 需要特别注意等待事件, 如果standby正常的话, 影响不大, 如果standby 性能不行或者standby 挂了, 就如同sync replication一样会带来堵塞.

2. 如果配置了synchronous_transfer = data_flush all, 在没有已经连上来的其他standby节点的情况下, 使用pg_basebackup创建standby会处于等待状态, 因为pg_basebackup需要在master节点执行checkpoint.

3. 使用异步流复制的情况下, 测试synchronous_transfer = data_flush, all都不能完成主节点的failback.

4. 使用同步流复制的情况下, 测试synchronous_transfer = data_flush, all可以完成正常的主节点的failback.

参考

1. http://blog.163.com/digoal@126/blog/static/1638770402013758511857/

2. http://www.postgresql.org/message-id/flat/CAF8Q-Gy7xa60HwXc0MKajjkWFEbFDWTG=gGyu1KmT+s2xcQ-bw@mail.gmail.com#CAF8Q-Gy7xa60HwXc0MKajjkWFEbFDWTG=gGyu1KmT+s2xcQ-bw@mail.gmail.com

3. http://www.postgresql.org/message-id/flat/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com#CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

4. http://www.postgresql.org/message-id/519DF910.4020609@vmware.com

5. https://github.com/vmware/pg_rewind

6. http://blog.163.com/digoal@126/blog/static/16387704020136197354054/

7. 第4版补丁synchronous_transfer参数介绍

we support following 2 cases.  
1\. SYNC server and also make same failback safe standby server  
2\. ASYNC server and also make same failback safe standby server  
  
1.  changed name of parameter  
  give up 'failback_safe_standby_names' parameter from the first patch.  
  and changed name of parameter from 'failback_safe_mode ' to  
'synchronous_transfer'.  
  this parameter accepts 'all', 'data_flush' and 'commit'.  
  
  -'commit'  
    'commit' means that master waits for corresponding WAL to flushed  
to disk of standby server on commits.  
    but master doesn't waits for replicated data pages.  
  
  -'data_flush'  
    'data_flush' means that master waits for replicated data page  
(e.g, CLOG, pg_control) before flush to disk of master server.  
    but if user set to 'data_flush' to this parameter,  
'synchronous_commit' values is ignored even if user set  
'synchronous_commit'.  
  
  -'all'  
    'all' means that master waits for replicated WAL and data page.  
  
2. put SyncRepWaitForLSN() function into XLogFlush() function  
  we have put SyncRepWaitForLSN() function into XLogFlush() function,  
and change argument of XLogFlush().  
  
they are setup case and need to set parameters.  
  
- SYNC server and also make same failback safe standgy server (case 1)  
  synchronous_transfer = all  
  synchronous_commit = remote_write/on  
  synchronous_standby_names = <ServerName>  
  
- ASYNC server and also make same failback safe standgy server (case 2)  
  synchronous_transfer = data_flush  
  (synchronous_commit values is ignored)  
  
- default SYNC replication  
  synchronous_transfer = commit  
  synchronous_commit = on  
  synchronous_standby_names = <ServerName>  
  
- default ASYNC replication  
  synchronous_transfer = commit  
  
ToDo  
1. currently this patch supports synchronous transfer. so we can't set  
different synchronous transfer mode to each server.  
    we need to improve the patch for support following cases.  
   - SYNC standby and make separate ASYNC failback safe standby  
   - ASYNC standby and make separate ASYNC failback safe standby  
  
2. we have not measure performance yet. we need to measure perfomance.  
  
please give me your feedback.  
  
Regards,  
  
-------  
Sawada Masahiko  

Flag Counter

digoal’s 大量PostgreSQL文章入口