PostgreSQL 9.5 new feature - pg_rewind fast sync Split Brain Primary & Standby

10 minute read

背景

PostgreSQL 9.5 将pg_rewind纳入源码, pg_rewind是一个非常好的工具.

如果是9.3或者9.4,你可以从github下载对应的版本分支。例如:

git clone -b REL9_4_STABLE https://github.com/vmware/pg_rewind.git  
git clone -b REL9_3_STABLE https://github.com/vmware/pg_rewind.git  

当我们在使用PostgreSQL流复制实施主备复制时, 如果在某些情况下, 例如主机有问题时, 备机激活成为读写, 并对外提供服务.

但是如果主机那边还有读写的话, 主备就会出现脑裂的情况.

出现脑裂之后, 原来的主机要变成备机怎么办?

一般我们需要重建备机, 即把数据文件删掉, 重新同步.

另一种方法是用rsync , 同步修改的文件.

如果数据量很大, 以上两种方法都需要耗费大量的时间.

pg_rewind的原理

1. 首选获得备机激活的时间线

2. 根据备机激活的时间线, 在老的主机上找到这个时间线之前的最后一个checkpoint

3. 在老的主机根据这个checkpoint位置, 找到自此以后老的主机产生的所有的XLOG.

4. 从这些XLOG中解析出变更的数据块.

5. 从新的主机将这些数据块抓取过来, 并覆盖掉老的主机上的这些数据块. (同时老库上面新增的块被被删掉.)

6. 从新主机拷贝所有除数据文件以外的所有文件(如clog, etc等)到老的主机.

7. 现在老的主机回到了时间线的位置,

pg_rewind退出后只能到达以上状态, 以下步骤需要手工执行.

8. 修改老主机的配置文件, 例如 postgresql.conf, recovery.conf, pg_hba.conf 以成为新主机的standby.

9. 特别需要注意配置 restore_command, 因为新主机在发生promote后产生的XLOG可能已经归档了.

10. 启动老主机, 开始恢复.

接下来测试一下 :

测试

在同一台主机上测试, 主备配置不同的cluster_name.

cluster_name=db1  
cluster_name=db2  
  
db1  
$PGDATA    /data02/pgdata95/pg_root  
$ARCH         /data02/pgdata95/pg_arch  
$PGPORT    1922  
  
db2  
$PGDATA    /data03/pgdata95/pg_root  
$ARCH         /data03/pgdata95/pg_arch  
$PGPORT    1923  

配置primary

# grep "^[a-z]" postgresql.conf  
listen_addresses = '0.0.0.0'            # what IP address(es) to listen on;  
port = 1922                             # (change requires restart)  
max_connections = 199                   # (change requires restart)  
superuser_reserved_connections = 13     # (change requires restart)  
unix_socket_directories = '.'   # comma-separated list of directories  
unix_socket_permissions = 0700          # begin with 0 to use octal notation  
tcp_keepalives_idle = 60                # TCP_KEEPIDLE, in seconds;  
tcp_keepalives_interval = 10            # TCP_KEEPINTVL, in seconds;  
tcp_keepalives_count = 10               # TCP_KEEPCNT;  
shared_buffers = 1024MB                 # min 128kB  
huge_pages = try                        # on, off, or try  
dynamic_shared_memory_type = posix      # the default is the first option  
vacuum_cost_delay = 10                  # 0-100 milliseconds  
vacuum_cost_limit = 10000               # 1-10000 credits  
bgwriter_delay = 10ms                   # 10-10000ms between rounds  
wal_level = logical                     # minimal, archive, hot_standby, or logical  
synchronous_commit = off                # synchronization level;  
wal_sync_method = open_sync             # the default is the first option  
full_page_writes = on                   # recover from partial page writes  
wal_compression = on                    # enable compression of full-page writes  
wal_log_hints = on  
wal_writer_delay = 10ms         # 1-10000 milliseconds  
archive_mode = on               # allows archiving to be done  
archive_command = 'test ! -f /data02/pgdata95/pg_arch/%f && cp %p /data02/pgdata95/pg_arch/%f'          # command to use to archive a logfile segment  
max_wal_senders = 12            # max number of walsender processes  
max_replication_slots = 12      # max number of replication slots  
track_commit_timestamp = on     # collect timestamp of transaction commit  
hot_standby = on                        # "on" allows queries during recovery  
wal_receiver_status_interval = 1s       # send replies at least this often  
hot_standby_feedback = on               # send info from standby to prevent  
log_destination = 'csvlog'              # Valid values are combinations of  
logging_collector = on          # Enable capturing of stderr and csvlog  
log_directory = 'pg_log'                # directory where log files are written,  
log_truncate_on_rotation = on           # If on, an existing log file with the  
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose           # terse, default, or verbose messages  
log_replication_commands = on  
log_timezone = 'PRC'  
cluster_name = 'db1'                    # added to process titles if nonempty  
autovacuum = on                 # Enable autovacuum subprocess?  'on'  
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and  
datestyle = 'iso, mdy'  
timezone = 'PRC'  
lc_messages = 'C'                       # locale for system error message  
lc_monetary = 'C'                       # locale for monetary formatting  
lc_numeric = 'C'                        # locale for number formatting  
lc_time = 'C'                           # locale for time formatting  
default_text_search_config = 'pg_catalog.english'  

pg_hba.conf

# replication privilege.  
local   replication     postgres                                trust  
host    replication     postgres        127.0.0.1/32            trust  
host    replication     postgres        ::1/128                 trust  

recovery.conf

pg95@db-172-16-3-150-> cp /opt/pgsql9.5/share/recovery.conf.sample $PGDATA/  
pg95@db-172-16-3-150-> cd $PGDATA  
pg95@db-172-16-3-150-> mv recovery.conf.sample recovery.done  
pg95@db-172-16-3-150-> vi recovery.done   
restore_command = 'cp /data03/pgdata95/pg_arch/%f %p'           # e.g. 'cp /mnt/server/archivedir/%f %p'  
recovery_target_timeline = 'latest'  
standby_mode = on  
primary_conninfo = 'host=127.0.0.1 port=1923 user=postgres keepalives_idle=60'          # e.g. 'host=localhost port=5432'  

归档

[root@db-172-16-3-150 postgresql-7320681]# mkdir /data02/pgdata95/pg_arch  
[root@db-172-16-3-150 postgresql-7320681]# chown pg95:pg95 /data02/pgdata95/pg_arch  

启动数据库

pg_ctl restart -m fast  

创建primary-standby 环境

[root@db-172-16-3-150 postgresql-7320681]# mkdir -p /data03/pgdata95/pg_root  
[root@db-172-16-3-150 postgresql-7320681]# mkdir -p /data03/pgdata95/pg_arch  
[root@db-172-16-3-150 postgresql-7320681]# chown -R pg95:pg95 /data03/pgdata95  
[root@db-172-16-3-150 postgresql-7320681]# chmod 700 /data03/pgdata95/pg_root  
  
[root@db-172-16-3-150 postgresql-7320681]# su - pg95  
pg95@db-172-16-3-150-> pg_basebackup -D /data03/pgdata95/pg_root -F p -U postgres -h 127.0.0.1 -p 1922  
WARNING:  skipping special file "./.s.PGSQL.1922"  
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived  

修改配置文件 :

pg95@db-172-16-3-150-> cd /data03/pgdata95/pg_root/  
pg95@db-172-16-3-150-> mv recovery.done recovery.conf  
pg95@db-172-16-3-150-> vi recovery.conf    修改两处  
restore_command = 'cp /data02/pgdata95/pg_arch/%f %p'           # e.g. 'cp /mnt/server/archivedir/%f %p'  
primary_conninfo = 'host=127.0.0.1 port=1922 user=postgres keepalives_idle=60'          # e.g. 'host=localhost port=5432'  
  
pg95@db-172-16-3-150-> vi postgresql.conf  
  
修改  
port = 1923  
cluster_name = 'db2'  
archive_command = 'test ! -f /data03/pgdata95/pg_arch/%f && cp %p /data03/pgdata95/pg_arch/%f'  

启动standby

pg95@db-172-16-3-150-> pg_ctl start -D /data03/pgdata95/pg_root  
server starting  
pg95@db-172-16-3-150-> LOG:  00000: redirecting log output to logging collector process  
HINT:  Future log output will appear in directory "pg_log".  
LOCATION:  SysLogger_Start, syslogger.c:622  

可以看到已经启动了

[root@db-172-16-3-150 postgresql-7320681]# ps -ewf|grep pg95  
pg95     27194     1  0 15:49 pts/2    00:00:00 /opt/pgsql9.5/bin/postgres  
pg95     27196 27194  0 15:49 ?        00:00:00 postgres: db1: logger process     
pg95     27198 27194  0 15:49 ?        00:00:00 postgres: db1: checkpointer process     
pg95     27199 27194  0 15:49 ?        00:00:00 postgres: db1: writer process     
pg95     27200 27194  0 15:49 ?        00:00:00 postgres: db1: wal writer process     
pg95     27201 27194  0 15:49 ?        00:00:00 postgres: db1: autovacuum launcher process     
pg95     27202 27194  0 15:49 ?        00:00:00 postgres: db1: archiver process   last was 000000010000000000000002  
pg95     27203 27194  0 15:49 ?        00:00:00 postgres: db1: stats collector process     
pg95     27349     1  0 15:57 pts/2    00:00:00 /opt/pgsql9.5/bin/postgres -D /data03/pgdata95/pg_root  
pg95     27351 27349  0 15:57 ?        00:00:00 postgres: db2: logger process                           
pg95     27352 27349  0 15:57 ?        00:00:00 postgres: db2: startup process   recovering 000000010000000000000003  
pg95     27355 27349  0 15:57 ?        00:00:00 postgres: db2: checkpointer process                     
pg95     27356 27349  0 15:57 ?        00:00:00 postgres: db2: writer process                           
pg95     27357 27349  0 15:57 ?        00:00:00 postgres: db2: stats collector process                  
pg95     27359 27349  0 15:57 ?        00:00:00 postgres: db2: wal receiver process   streaming 0/3000140  
pg95     27360 27194  0 15:57 ?        00:00:00 postgres: db1: wal sender process postgres 127.0.0.1(16326) streaming 0/3000140  
root     27401  1264  0 15:58 pts/2    00:00:00 grep pg95  

创建一些测试数据, 执行checkpoint

pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922  
psql (9.5devel)  
Type "help" for help.  
postgres=# create table t1(id int, info text);  
CREATE TABLE  
postgres=# insert into t1 select generate_series(1,1000000), md5(random()::text);  
INSERT 0 1000000  
postgres=# select sum(hashtext(t1.*::text)) from t1;  
      sum        
---------------  
 -585064225655  
(1 row)  
postgres=# checkpoint;  
CHECKPOINT  

确保已同步

pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1923  
psql (9.5devel)  
Type "help" for help.  
postgres=# select sum(hashtext(t1.*::text)) from t1;  
      sum        
---------------  
 -585064225655  
(1 row)  

激活standby.

pg95@db-172-16-3-150-> pg_ctl promote -D /data03/pgdata95/pg_root  
server promoting  

此时主备已经没有同步关系了, 模拟脑裂

在老的primary执行.

pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922  
psql (9.5devel)  
Type "help" for help.  
postgres=# insert into t1 select generate_series(1,1000000), md5(random()::text);  
INSERT 0 1000000  
postgres=# insert into t1 select generate_series(1,5000000), md5(random()::text);  
INSERT 0 5000000  
postgres=# delete from t1;  
DELETE 7000000  
postgres=# insert into t1 select generate_series(1,5000000), md5(random()::text);  
INSERT 0 5000000  
postgres=# select sum(hashtext(t1.*::text)) from t1;  
     sum        
--------------  
 316362957898  
(1 row)  
postgres=# create table t2 as select * from t1;  
SELECT 5000000  
postgres=# create index idx_t1 on t1(id);  
CREATE INDEX  
postgres=# create function ft() returns void as $$  
postgres$# declare  
postgres$# begin  
postgres$# return;  
postgres$# end;  
postgres$# $$ language plpgsql;  
CREATE FUNCTION  
postgres=# alter table t1 add column crt_time timestamp;  
ALTER TABLE  
postgres=# alter table t1 add constraint pk primary key(id);  
ALTER TABLE  
postgres=# create view vt1 as select * from t1 where id<100;  
CREATE VIEW  

在新的primary执行 :

pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1923  
psql (9.5devel)  
Type "help" for help.  
postgres=# insert into t1 select generate_series(1,6000000), md5(random()::text);  
INSERT 0 6000000  
postgres=# delete from t1;  
DELETE 7000000  
postgres=# insert into t1 select generate_series(1,5000000), md5(random()::text);  
INSERT 0 5000000  
postgres=# select sum(hashtext(t1.*::text)) from t1;  
      sum        
---------------  
 1868065255940  
(1 row)  

现在的目标是老的primary要变成standby.

停止老的primary :

pg95@db-172-16-3-150-> pg_ctl stop -m fast -D /data02/pgdata95/pg_root  

执行测试, 但不执行恢复 :

pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P --debug -n  
connected to remote server  
fetched file "global/pg_control", length 8192  
fetched file "pg_xlog/00000002.history", length 41  
The servers diverged at WAL position 0/8BF34E0 on timeline 1.  
could not open file "/data02/pgdata95/pg_root/pg_xlog/000000010000000000000008": No such file or directory  
  
could not find previous WAL record at 0/8BF34E0  
Failure, exiting  

pg_rewind需要知道老的primary在timeline之后发生了哪些变更, 这些变更需要从pg_xlog取出. 这些变更的数据块需要从新库复制过来.

因为老的primary脑裂后经过了大量的数据变更, 很多xlog已经归档了, 所以需要从归档拷贝到pg_xlog目录.

拷贝 :

pg95@db-172-16-3-150-> cd /data02/pgdata95/pg_arch/  
pg95@db-172-16-3-150-> ll  
total 2.4G  
-rw------- 1 pg95 pg95 16M Apr  9 15:50 000000010000000000000001  
-rw------- 1 pg95 pg95 16M Apr  9 15:50 000000010000000000000002  
-rw------- 1 pg95 pg95 302 Apr  9 15:50 000000010000000000000002.00000060.backup  
-rw------- 1 pg95 pg95 16M Apr  9 15:59 000000010000000000000003  
-rw------- 1 pg95 pg95 16M Apr  9 15:59 000000010000000000000004  
-rw------- 1 pg95 pg95 16M Apr  9 15:59 000000010000000000000005  
-rw------- 1 pg95 pg95 16M Apr  9 15:59 000000010000000000000006  
-rw------- 1 pg95 pg95 16M Apr  9 15:59 000000010000000000000007  
-rw------- 1 pg95 pg95 16M Apr  9 16:02 000000010000000000000008  
-rw------- 1 pg95 pg95 16M Apr  9 16:02 000000010000000000000009  
.....  
  
pg95@db-172-16-3-150-> cp * /data02/pgdata95/pg_root/pg_xlog/  

再次测试 :

pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P -n  
connected to remote server  
The servers diverged at WAL position 0/8BF34E0 on timeline 1.  
Rewinding from last common checkpoint at 0/8BF3438 on timeline 1  
reading source file list  
reading target file list  
reading WAL in target  
Need to copy 1755 MB (total source directory size is 1786 MB)  
1797713/1797713 kB (100%) copied  
creating backup label and updating control file  
Done!  

测试没问题就可以直接执行了, 去掉-n参数即可.

pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P  
connected to remote server  
The servers diverged at WAL position 0/8BF34E0 on timeline 1.  
Rewinding from last common checkpoint at 0/8BF3438 on timeline 1  
reading source file list  
reading target file list  
reading WAL in target  
Need to copy 1755 MB (total source directory size is 1786 MB)  
1797714/1797714 kB (100%) copied  
creating backup label and updating control file  
Done!  

执行完后, 老的primary数据库需要手工启动,

启动前, 请先修改配置 :

recovery.conf

pg95@db-172-16-3-150-> cd /data02/pgdata95/pg_root/  
pg95@db-172-16-3-150-> ll  
total 148K  
-rw------- 1 pg95 pg95  175 Apr  9 16:21 backup_label  
-rw------- 1 pg95 pg95  206 Apr  9 16:21 backup_label.old  
drwx------ 5 pg95 pg95 4.0K Apr  9 16:21 base  
drwx------ 2 pg95 pg95 4.0K Apr  9 16:21 global  
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_clog  
drwx------ 2 pg95 pg95 4.0K Apr  9 11:09 pg_commit_ts  
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_dynshmem  
-rw------- 1 pg95 pg95 4.4K Apr  9 16:21 pg_hba.conf  
-rw------- 1 pg95 pg95 1.6K Apr  9 16:21 pg_ident.conf  
drwx------ 2 pg95 pg95 4.0K Apr  9 16:21 pg_log  
drwx------ 4 pg95 pg95 4.0K Apr  9 09:04 pg_logical  
drwx------ 4 pg95 pg95 4.0K Apr  9 09:04 pg_multixact  
drwx------ 2 pg95 pg95 4.0K Apr  9 16:14 pg_notify  
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_replslot  
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_serial  
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_snapshots  
drwx------ 2 pg95 pg95 4.0K Apr  9 16:21 pg_stat  
drwx------ 2 pg95 pg95 4.0K Apr  9 16:21 pg_stat_tmp  
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_subtrans  
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_tblspc  
drwx------ 2 pg95 pg95 4.0K Apr  9 09:04 pg_twophase  
-rw------- 1 pg95 pg95    4 Apr  9 09:04 PG_VERSION  
drwx------ 3 pg95 pg95  16K Apr  9 16:21 pg_xlog  
-rw------- 1 pg95 pg95   88 Apr  9 16:21 postgresql.auto.conf  
-rw------- 1 pg95 pg95  22K Apr  9 16:21 postgresql.conf  
-rw------- 1 pg95 pg95   27 Apr  9 16:14 postmaster.opts  
-rw-r--r-- 1 pg95 pg95 5.7K Apr  9 16:21 recovery.done  
pg95@db-172-16-3-150-> mv recovery.done recovery.conf  
  
pg95@db-172-16-3-150-> vi recovery.conf    修改两处  
restore_command = 'cp /data03/pgdata95/pg_arch/%f %p'           # e.g. 'cp /mnt/server/archivedir/%f %p'  
primary_conninfo = 'host=127.0.0.1 port=1923 user=postgres keepalives_idle=60'          # e.g. 'host=localhost port=5432'  
pg95@db-172-16-3-150-> vi postgresql.conf  
修改  
port = 1922  
cluster_name = 'db1'  
archive_command = 'test ! -f /data02/pgdata95/pg_arch/%f && cp %p /data02/pgdata95/pg_arch/%f'  

启动老的主库, 变成standby,

并检查是否解决了脑裂, 与新的primary完全一致, 例如脑裂过程创建的约束, 函数, 视图都会消失.

pg95@db-172-16-3-150-> pg_ctl start -D /data02/pgdata95/pg_root  
server starting  
pg95@db-172-16-3-150-> LOG:  00000: redirecting log output to logging collector process  
HINT:  Future log output will appear in directory "pg_log".  
LOCATION:  SysLogger_Start, syslogger.c:622  
  
pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922  
psql (9.5devel)  
Type "help" for help.  
  
postgres=# \dt  
        List of relations  
 Schema | Name | Type  |  Owner     
--------+------+-------+----------  
 public | t1   | table | postgres  
 public | tbl  | table | postgres  
 public | test | table | postgres  
(3 rows)  
  
postgres=# \dv  
No relations found.  
postgres=# \d+ t1  
                          Table "public.t1"  
 Column |  Type   | Modifiers | Storage  | Stats target | Description   
--------+---------+-----------+----------+--------------+-------------  
 id     | integer |           | plain    |              |   
 info   | text    |           | extended |              |   
postgres=# \df ft  
                       List of functions  
 Schema | Name | Result data type | Argument data types | Type   
--------+------+------------------+---------------------+------  
(0 rows)  
postgres=# select sum(hashtext(t1.*::text)) from t1;  
      sum        
---------------  
 1868065255940   数据和新的primary吻合  
(1 row)  

再检查一下老库上新增的表是否还在? 通过SIZE来区分

pg95@db-172-16-3-150-> psql  
psql (9.5devel)  
Type "help" for help.  
  
postgres=# select oid from pg_database where datname='postgres';  
  oid    
-------  
 13253  
(1 row)  
  
postgres=# select pg_relation_filepath('t1');  
 pg_relation_filepath   
----------------------  
 base/13253/16451  
(1 row)  

这是t1表

pg95@db-172-16-3-150-> cd /data02/pgdata95/pg_root/base/13253/  
pg95@db-172-16-3-150-> ll 16451*  
-rw------- 1 pg95 pg95 780M Apr  9 16:31 16451  
-rw------- 1 pg95 pg95 128K Apr  9 16:21 16451_fsm  
-rw------- 1 pg95 pg95  32K Apr  9 16:31 16451_vm  

可以看到, 除了t1, 老库上分裂时创建的t2显然不在了. 注意size

如果你要更加精确的区分, 在老库执行pg_rewind之前, 记录一下t2的pg_relation_filepath, 然后来这里比对看看文件还在不在.

pg95@db-172-16-3-150-> ll -S | head -n 10  
total 796M  
-rw------- 1 pg95 pg95 780M Apr  9 16:31 16451  
-rw------- 1 pg95 pg95 576K Apr  9 16:21 1255  
-rw------- 1 pg95 pg95 480K Apr  9 16:21 2618  
-rw------- 1 pg95 pg95 448K Apr  9 16:31 2608  
-rw------- 1 pg95 pg95 448K Apr  9 16:31 2674  
-rw------- 1 pg95 pg95 416K Apr  9 16:31 2673  
-rw------- 1 pg95 pg95 384K Apr  9 16:31 1249  
-rw------- 1 pg95 pg95 288K Apr  9 12:27 2609  
-rw------- 1 pg95 pg95 256K Apr  9 16:21 2691  

至此恢复结束.

注意

pg_rewind 可以实施的前提条件

1. 需要打开wal_log_hints 选项或者初始化数据库是开启了checksum.

否则会报错 :

pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P --debug -n  
connected to remote server  
fetched file "global/pg_control", length 8192  
target server need to use either data checksums or "wal_log_hints = on"  
Failure, exiting  

2. 需要开启full_page_writes

3. 执行pg_rewind的过程中, 老的主库需要发生failover前的最后一个checkpoint之后产生的所有xlog都放到pg_xlog目录.

pg_rewind不会帮你执行restore_command自动去获取archive file.

4. 老的主库必须在干净的关闭数据库后执行pg_rewind.

参考

1. http://www.postgresql.org/docs/devel/static/app-pgrewind.html

2. http://www.postgresql.org/docs/devel/static/libpq-envars.html

Flag Counter

digoal’s 大量PostgreSQL文章入口