PostgreSQL 9.5 new feature - pg_rewind fast sync Split Brain Primary & Standby
背景
PostgreSQL 9.5 将pg_rewind纳入源码, pg_rewind是一个非常好的工具.
如果是9.3或者9.4,你可以从github下载对应的版本分支。例如:
git clone -b REL9_4_STABLE https://github.com/vmware/pg_rewind.git
git clone -b REL9_3_STABLE https://github.com/vmware/pg_rewind.git
当我们在使用PostgreSQL流复制实施主备复制时, 如果在某些情况下, 例如主机有问题时, 备机激活成为读写, 并对外提供服务.
但是如果主机那边还有读写的话, 主备就会出现脑裂的情况.
出现脑裂之后, 原来的主机要变成备机怎么办?
一般我们需要重建备机, 即把数据文件删掉, 重新同步.
另一种方法是用rsync , 同步修改的文件.
如果数据量很大, 以上两种方法都需要耗费大量的时间.
pg_rewind的原理
1. 首选获得备机激活的时间线
2. 根据备机激活的时间线, 在老的主机上找到这个时间线之前的最后一个checkpoint
3. 在老的主机根据这个checkpoint位置, 找到自此以后老的主机产生的所有的XLOG.
4. 从这些XLOG中解析出变更的数据块.
5. 从新的主机将这些数据块抓取过来, 并覆盖掉老的主机上的这些数据块. (同时老库上面新增的块被被删掉.)
6. 从新主机拷贝所有除数据文件以外的所有文件(如clog, etc等)到老的主机.
7. 现在老的主机回到了时间线的位置,
pg_rewind退出后只能到达以上状态, 以下步骤需要手工执行.
8. 修改老主机的配置文件, 例如 postgresql.conf, recovery.conf, pg_hba.conf 以成为新主机的standby.
9. 特别需要注意配置 restore_command, 因为新主机在发生promote后产生的XLOG可能已经归档了.
10. 启动老主机, 开始恢复.
接下来测试一下 :
测试
在同一台主机上测试, 主备配置不同的cluster_name.
cluster_name=db1
cluster_name=db2
db1
$PGDATA /data02/pgdata95/pg_root
$ARCH /data02/pgdata95/pg_arch
$PGPORT 1922
db2
$PGDATA /data03/pgdata95/pg_root
$ARCH /data03/pgdata95/pg_arch
$PGPORT 1923
配置primary
# grep "^[a-z]" postgresql.conf
listen_addresses = '0.0.0.0' # what IP address(es) to listen on;
port = 1922 # (change requires restart)
max_connections = 199 # (change requires restart)
superuser_reserved_connections = 13 # (change requires restart)
unix_socket_directories = '.' # comma-separated list of directories
unix_socket_permissions = 0700 # begin with 0 to use octal notation
tcp_keepalives_idle = 60 # TCP_KEEPIDLE, in seconds;
tcp_keepalives_interval = 10 # TCP_KEEPINTVL, in seconds;
tcp_keepalives_count = 10 # TCP_KEEPCNT;
shared_buffers = 1024MB # min 128kB
huge_pages = try # on, off, or try
dynamic_shared_memory_type = posix # the default is the first option
vacuum_cost_delay = 10 # 0-100 milliseconds
vacuum_cost_limit = 10000 # 1-10000 credits
bgwriter_delay = 10ms # 10-10000ms between rounds
wal_level = logical # minimal, archive, hot_standby, or logical
synchronous_commit = off # synchronization level;
wal_sync_method = open_sync # the default is the first option
full_page_writes = on # recover from partial page writes
wal_compression = on # enable compression of full-page writes
wal_log_hints = on
wal_writer_delay = 10ms # 1-10000 milliseconds
archive_mode = on # allows archiving to be done
archive_command = 'test ! -f /data02/pgdata95/pg_arch/%f && cp %p /data02/pgdata95/pg_arch/%f' # command to use to archive a logfile segment
max_wal_senders = 12 # max number of walsender processes
max_replication_slots = 12 # max number of replication slots
track_commit_timestamp = on # collect timestamp of transaction commit
hot_standby = on # "on" allows queries during recovery
wal_receiver_status_interval = 1s # send replies at least this often
hot_standby_feedback = on # send info from standby to prevent
log_destination = 'csvlog' # Valid values are combinations of
logging_collector = on # Enable capturing of stderr and csvlog
log_directory = 'pg_log' # directory where log files are written,
log_truncate_on_rotation = on # If on, an existing log file with the
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose # terse, default, or verbose messages
log_replication_commands = on
log_timezone = 'PRC'
cluster_name = 'db1' # added to process titles if nonempty
autovacuum = on # Enable autovacuum subprocess? 'on'
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C' # locale for system error message
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
pg_hba.conf
# replication privilege.
local replication postgres trust
host replication postgres 127.0.0.1/32 trust
host replication postgres ::1/128 trust
recovery.conf
pg95@db-172-16-3-150-> cp /opt/pgsql9.5/share/recovery.conf.sample $PGDATA/
pg95@db-172-16-3-150-> cd $PGDATA
pg95@db-172-16-3-150-> mv recovery.conf.sample recovery.done
pg95@db-172-16-3-150-> vi recovery.done
restore_command = 'cp /data03/pgdata95/pg_arch/%f %p' # e.g. 'cp /mnt/server/archivedir/%f %p'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=127.0.0.1 port=1923 user=postgres keepalives_idle=60' # e.g. 'host=localhost port=5432'
归档
[root@db-172-16-3-150 postgresql-7320681]# mkdir /data02/pgdata95/pg_arch
[root@db-172-16-3-150 postgresql-7320681]# chown pg95:pg95 /data02/pgdata95/pg_arch
启动数据库
pg_ctl restart -m fast
创建primary-standby 环境
[root@db-172-16-3-150 postgresql-7320681]# mkdir -p /data03/pgdata95/pg_root
[root@db-172-16-3-150 postgresql-7320681]# mkdir -p /data03/pgdata95/pg_arch
[root@db-172-16-3-150 postgresql-7320681]# chown -R pg95:pg95 /data03/pgdata95
[root@db-172-16-3-150 postgresql-7320681]# chmod 700 /data03/pgdata95/pg_root
[root@db-172-16-3-150 postgresql-7320681]# su - pg95
pg95@db-172-16-3-150-> pg_basebackup -D /data03/pgdata95/pg_root -F p -U postgres -h 127.0.0.1 -p 1922
WARNING: skipping special file "./.s.PGSQL.1922"
NOTICE: pg_stop_backup complete, all required WAL segments have been archived
修改配置文件 :
pg95@db-172-16-3-150-> cd /data03/pgdata95/pg_root/
pg95@db-172-16-3-150-> mv recovery.done recovery.conf
pg95@db-172-16-3-150-> vi recovery.conf 修改两处
restore_command = 'cp /data02/pgdata95/pg_arch/%f %p' # e.g. 'cp /mnt/server/archivedir/%f %p'
primary_conninfo = 'host=127.0.0.1 port=1922 user=postgres keepalives_idle=60' # e.g. 'host=localhost port=5432'
pg95@db-172-16-3-150-> vi postgresql.conf
修改
port = 1923
cluster_name = 'db2'
archive_command = 'test ! -f /data03/pgdata95/pg_arch/%f && cp %p /data03/pgdata95/pg_arch/%f'
启动standby
pg95@db-172-16-3-150-> pg_ctl start -D /data03/pgdata95/pg_root
server starting
pg95@db-172-16-3-150-> LOG: 00000: redirecting log output to logging collector process
HINT: Future log output will appear in directory "pg_log".
LOCATION: SysLogger_Start, syslogger.c:622
可以看到已经启动了
[root@db-172-16-3-150 postgresql-7320681]# ps -ewf|grep pg95
pg95 27194 1 0 15:49 pts/2 00:00:00 /opt/pgsql9.5/bin/postgres
pg95 27196 27194 0 15:49 ? 00:00:00 postgres: db1: logger process
pg95 27198 27194 0 15:49 ? 00:00:00 postgres: db1: checkpointer process
pg95 27199 27194 0 15:49 ? 00:00:00 postgres: db1: writer process
pg95 27200 27194 0 15:49 ? 00:00:00 postgres: db1: wal writer process
pg95 27201 27194 0 15:49 ? 00:00:00 postgres: db1: autovacuum launcher process
pg95 27202 27194 0 15:49 ? 00:00:00 postgres: db1: archiver process last was 000000010000000000000002
pg95 27203 27194 0 15:49 ? 00:00:00 postgres: db1: stats collector process
pg95 27349 1 0 15:57 pts/2 00:00:00 /opt/pgsql9.5/bin/postgres -D /data03/pgdata95/pg_root
pg95 27351 27349 0 15:57 ? 00:00:00 postgres: db2: logger process
pg95 27352 27349 0 15:57 ? 00:00:00 postgres: db2: startup process recovering 000000010000000000000003
pg95 27355 27349 0 15:57 ? 00:00:00 postgres: db2: checkpointer process
pg95 27356 27349 0 15:57 ? 00:00:00 postgres: db2: writer process
pg95 27357 27349 0 15:57 ? 00:00:00 postgres: db2: stats collector process
pg95 27359 27349 0 15:57 ? 00:00:00 postgres: db2: wal receiver process streaming 0/3000140
pg95 27360 27194 0 15:57 ? 00:00:00 postgres: db1: wal sender process postgres 127.0.0.1(16326) streaming 0/3000140
root 27401 1264 0 15:58 pts/2 00:00:00 grep pg95
创建一些测试数据, 执行checkpoint
pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922
psql (9.5devel)
Type "help" for help.
postgres=# create table t1(id int, info text);
CREATE TABLE
postgres=# insert into t1 select generate_series(1,1000000), md5(random()::text);
INSERT 0 1000000
postgres=# select sum(hashtext(t1.*::text)) from t1;
sum
---------------
-585064225655
(1 row)
postgres=# checkpoint;
CHECKPOINT
确保已同步
pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1923
psql (9.5devel)
Type "help" for help.
postgres=# select sum(hashtext(t1.*::text)) from t1;
sum
---------------
-585064225655
(1 row)
激活standby.
pg95@db-172-16-3-150-> pg_ctl promote -D /data03/pgdata95/pg_root
server promoting
此时主备已经没有同步关系了, 模拟脑裂
在老的primary执行.
pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922
psql (9.5devel)
Type "help" for help.
postgres=# insert into t1 select generate_series(1,1000000), md5(random()::text);
INSERT 0 1000000
postgres=# insert into t1 select generate_series(1,5000000), md5(random()::text);
INSERT 0 5000000
postgres=# delete from t1;
DELETE 7000000
postgres=# insert into t1 select generate_series(1,5000000), md5(random()::text);
INSERT 0 5000000
postgres=# select sum(hashtext(t1.*::text)) from t1;
sum
--------------
316362957898
(1 row)
postgres=# create table t2 as select * from t1;
SELECT 5000000
postgres=# create index idx_t1 on t1(id);
CREATE INDEX
postgres=# create function ft() returns void as $$
postgres$# declare
postgres$# begin
postgres$# return;
postgres$# end;
postgres$# $$ language plpgsql;
CREATE FUNCTION
postgres=# alter table t1 add column crt_time timestamp;
ALTER TABLE
postgres=# alter table t1 add constraint pk primary key(id);
ALTER TABLE
postgres=# create view vt1 as select * from t1 where id<100;
CREATE VIEW
在新的primary执行 :
pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1923
psql (9.5devel)
Type "help" for help.
postgres=# insert into t1 select generate_series(1,6000000), md5(random()::text);
INSERT 0 6000000
postgres=# delete from t1;
DELETE 7000000
postgres=# insert into t1 select generate_series(1,5000000), md5(random()::text);
INSERT 0 5000000
postgres=# select sum(hashtext(t1.*::text)) from t1;
sum
---------------
1868065255940
(1 row)
现在的目标是老的primary要变成standby.
停止老的primary :
pg95@db-172-16-3-150-> pg_ctl stop -m fast -D /data02/pgdata95/pg_root
执行测试, 但不执行恢复 :
pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P --debug -n
connected to remote server
fetched file "global/pg_control", length 8192
fetched file "pg_xlog/00000002.history", length 41
The servers diverged at WAL position 0/8BF34E0 on timeline 1.
could not open file "/data02/pgdata95/pg_root/pg_xlog/000000010000000000000008": No such file or directory
could not find previous WAL record at 0/8BF34E0
Failure, exiting
pg_rewind需要知道老的primary在timeline之后发生了哪些变更, 这些变更需要从pg_xlog取出. 这些变更的数据块需要从新库复制过来.
因为老的primary脑裂后经过了大量的数据变更, 很多xlog已经归档了, 所以需要从归档拷贝到pg_xlog目录.
拷贝 :
pg95@db-172-16-3-150-> cd /data02/pgdata95/pg_arch/
pg95@db-172-16-3-150-> ll
total 2.4G
-rw------- 1 pg95 pg95 16M Apr 9 15:50 000000010000000000000001
-rw------- 1 pg95 pg95 16M Apr 9 15:50 000000010000000000000002
-rw------- 1 pg95 pg95 302 Apr 9 15:50 000000010000000000000002.00000060.backup
-rw------- 1 pg95 pg95 16M Apr 9 15:59 000000010000000000000003
-rw------- 1 pg95 pg95 16M Apr 9 15:59 000000010000000000000004
-rw------- 1 pg95 pg95 16M Apr 9 15:59 000000010000000000000005
-rw------- 1 pg95 pg95 16M Apr 9 15:59 000000010000000000000006
-rw------- 1 pg95 pg95 16M Apr 9 15:59 000000010000000000000007
-rw------- 1 pg95 pg95 16M Apr 9 16:02 000000010000000000000008
-rw------- 1 pg95 pg95 16M Apr 9 16:02 000000010000000000000009
.....
pg95@db-172-16-3-150-> cp * /data02/pgdata95/pg_root/pg_xlog/
再次测试 :
pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P -n
connected to remote server
The servers diverged at WAL position 0/8BF34E0 on timeline 1.
Rewinding from last common checkpoint at 0/8BF3438 on timeline 1
reading source file list
reading target file list
reading WAL in target
Need to copy 1755 MB (total source directory size is 1786 MB)
1797713/1797713 kB (100%) copied
creating backup label and updating control file
Done!
测试没问题就可以直接执行了, 去掉-n参数即可.
pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P
connected to remote server
The servers diverged at WAL position 0/8BF34E0 on timeline 1.
Rewinding from last common checkpoint at 0/8BF3438 on timeline 1
reading source file list
reading target file list
reading WAL in target
Need to copy 1755 MB (total source directory size is 1786 MB)
1797714/1797714 kB (100%) copied
creating backup label and updating control file
Done!
执行完后, 老的primary数据库需要手工启动,
启动前, 请先修改配置 :
recovery.conf
pg95@db-172-16-3-150-> cd /data02/pgdata95/pg_root/
pg95@db-172-16-3-150-> ll
total 148K
-rw------- 1 pg95 pg95 175 Apr 9 16:21 backup_label
-rw------- 1 pg95 pg95 206 Apr 9 16:21 backup_label.old
drwx------ 5 pg95 pg95 4.0K Apr 9 16:21 base
drwx------ 2 pg95 pg95 4.0K Apr 9 16:21 global
drwx------ 2 pg95 pg95 4.0K Apr 9 09:04 pg_clog
drwx------ 2 pg95 pg95 4.0K Apr 9 11:09 pg_commit_ts
drwx------ 2 pg95 pg95 4.0K Apr 9 09:04 pg_dynshmem
-rw------- 1 pg95 pg95 4.4K Apr 9 16:21 pg_hba.conf
-rw------- 1 pg95 pg95 1.6K Apr 9 16:21 pg_ident.conf
drwx------ 2 pg95 pg95 4.0K Apr 9 16:21 pg_log
drwx------ 4 pg95 pg95 4.0K Apr 9 09:04 pg_logical
drwx------ 4 pg95 pg95 4.0K Apr 9 09:04 pg_multixact
drwx------ 2 pg95 pg95 4.0K Apr 9 16:14 pg_notify
drwx------ 2 pg95 pg95 4.0K Apr 9 09:04 pg_replslot
drwx------ 2 pg95 pg95 4.0K Apr 9 09:04 pg_serial
drwx------ 2 pg95 pg95 4.0K Apr 9 09:04 pg_snapshots
drwx------ 2 pg95 pg95 4.0K Apr 9 16:21 pg_stat
drwx------ 2 pg95 pg95 4.0K Apr 9 16:21 pg_stat_tmp
drwx------ 2 pg95 pg95 4.0K Apr 9 09:04 pg_subtrans
drwx------ 2 pg95 pg95 4.0K Apr 9 09:04 pg_tblspc
drwx------ 2 pg95 pg95 4.0K Apr 9 09:04 pg_twophase
-rw------- 1 pg95 pg95 4 Apr 9 09:04 PG_VERSION
drwx------ 3 pg95 pg95 16K Apr 9 16:21 pg_xlog
-rw------- 1 pg95 pg95 88 Apr 9 16:21 postgresql.auto.conf
-rw------- 1 pg95 pg95 22K Apr 9 16:21 postgresql.conf
-rw------- 1 pg95 pg95 27 Apr 9 16:14 postmaster.opts
-rw-r--r-- 1 pg95 pg95 5.7K Apr 9 16:21 recovery.done
pg95@db-172-16-3-150-> mv recovery.done recovery.conf
pg95@db-172-16-3-150-> vi recovery.conf 修改两处
restore_command = 'cp /data03/pgdata95/pg_arch/%f %p' # e.g. 'cp /mnt/server/archivedir/%f %p'
primary_conninfo = 'host=127.0.0.1 port=1923 user=postgres keepalives_idle=60' # e.g. 'host=localhost port=5432'
pg95@db-172-16-3-150-> vi postgresql.conf
修改
port = 1922
cluster_name = 'db1'
archive_command = 'test ! -f /data02/pgdata95/pg_arch/%f && cp %p /data02/pgdata95/pg_arch/%f'
启动老的主库, 变成standby,
并检查是否解决了脑裂, 与新的primary完全一致, 例如脑裂过程创建的约束, 函数, 视图都会消失.
pg95@db-172-16-3-150-> pg_ctl start -D /data02/pgdata95/pg_root
server starting
pg95@db-172-16-3-150-> LOG: 00000: redirecting log output to logging collector process
HINT: Future log output will appear in directory "pg_log".
LOCATION: SysLogger_Start, syslogger.c:622
pg95@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922
psql (9.5devel)
Type "help" for help.
postgres=# \dt
List of relations
Schema | Name | Type | Owner
--------+------+-------+----------
public | t1 | table | postgres
public | tbl | table | postgres
public | test | table | postgres
(3 rows)
postgres=# \dv
No relations found.
postgres=# \d+ t1
Table "public.t1"
Column | Type | Modifiers | Storage | Stats target | Description
--------+---------+-----------+----------+--------------+-------------
id | integer | | plain | |
info | text | | extended | |
postgres=# \df ft
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+------+------------------+---------------------+------
(0 rows)
postgres=# select sum(hashtext(t1.*::text)) from t1;
sum
---------------
1868065255940 数据和新的primary吻合
(1 row)
再检查一下老库上新增的表是否还在? 通过SIZE来区分
pg95@db-172-16-3-150-> psql
psql (9.5devel)
Type "help" for help.
postgres=# select oid from pg_database where datname='postgres';
oid
-------
13253
(1 row)
postgres=# select pg_relation_filepath('t1');
pg_relation_filepath
----------------------
base/13253/16451
(1 row)
这是t1表
pg95@db-172-16-3-150-> cd /data02/pgdata95/pg_root/base/13253/
pg95@db-172-16-3-150-> ll 16451*
-rw------- 1 pg95 pg95 780M Apr 9 16:31 16451
-rw------- 1 pg95 pg95 128K Apr 9 16:21 16451_fsm
-rw------- 1 pg95 pg95 32K Apr 9 16:31 16451_vm
可以看到, 除了t1, 老库上分裂时创建的t2显然不在了. 注意size
如果你要更加精确的区分, 在老库执行pg_rewind之前, 记录一下t2的pg_relation_filepath, 然后来这里比对看看文件还在不在.
pg95@db-172-16-3-150-> ll -S | head -n 10
total 796M
-rw------- 1 pg95 pg95 780M Apr 9 16:31 16451
-rw------- 1 pg95 pg95 576K Apr 9 16:21 1255
-rw------- 1 pg95 pg95 480K Apr 9 16:21 2618
-rw------- 1 pg95 pg95 448K Apr 9 16:31 2608
-rw------- 1 pg95 pg95 448K Apr 9 16:31 2674
-rw------- 1 pg95 pg95 416K Apr 9 16:31 2673
-rw------- 1 pg95 pg95 384K Apr 9 16:31 1249
-rw------- 1 pg95 pg95 288K Apr 9 12:27 2609
-rw------- 1 pg95 pg95 256K Apr 9 16:21 2691
至此恢复结束.
注意
pg_rewind 可以实施的前提条件
1. 需要打开wal_log_hints 选项或者初始化数据库是开启了checksum.
否则会报错 :
pg95@db-172-16-3-150-> pg_rewind -D /data02/pgdata95/pg_root --source-server='hostaddr=127.0.0.1 port=1923 user=postgres dbname=postgres password=postgres' -P --debug -n
connected to remote server
fetched file "global/pg_control", length 8192
target server need to use either data checksums or "wal_log_hints = on"
Failure, exiting
2. 需要开启full_page_writes
3. 执行pg_rewind的过程中, 老的主库需要发生failover前的最后一个checkpoint之后产生的所有xlog都放到pg_xlog目录.
pg_rewind不会帮你执行restore_command自动去获取archive file.
4. 老的主库必须在干净的关闭数据库后执行pg_rewind.
参考
1. http://www.postgresql.org/docs/devel/static/app-pgrewind.html
2. http://www.postgresql.org/docs/devel/static/libpq-envars.html