PostgreSQL 一主多从(多副本,强同步)简明手册 - 配置、压测、监控、切换、防脑裂、修复、0丢失 - 珍藏级

34 minute read

背景

PostgreSQL物理流复制有几个特点

1、延迟及低,毫不畏惧大事务

2、支持断点续传

3、支持多副本

4、配置简单,看本文

5、备库与主库物理完全一致,并支持只读

所以大多数用户都会使用流复制来搭建只读备库,容灾,备份节点,HA等。

本文主要介绍一下PostgreSQL一主多从的配置,以及多副本强同步的配置。

1 准备硬件环境

假设我们需要构建一个1主2备的环境,那么需要3台主机。如果你需要用这个环境来实现高可用,则有几种方案。

1、PROXY或DNS方案

PROXY或DNS需要另外准备,这里不多说,目的就是让DNS连接到

2、VIP漂移方案

不管用哪套方案,只要做HA,就需要仲裁,即管理整个集群的节点。它主要负责切换主节点,漂移VIP或通知PROXY\DNS指向新主节点。

3、如果做到0丢失,防脑裂。(我们需要一个约定,只要客户端还没有收到事务结束的状态,这个事务是unknown的,也就是说回滚或者已提交都是可以接受的。)

只要控制了大于或等于(一会PG中参数指定“同步备库数 - 同步副本数 + 1”)个节点,再选择新的主节点(选出它们中WAL位点最新的作为新主节点),则绝对不可能出现脑裂的问题,也不会丢失数据。

《PG多节点(quorum based), 0丢失 HA(failover,switchover)方案》

pic

例如:

总共有5个从库,如果配置如下  
  
synchronous_standby_names = 'ANY 2 (*)'  
  
那么需要控制 5-2+1=4个或以上从库,如果需要切换,就可以认为绝对不会出现脑裂或丢数据的情况。  
  
因为冻结了4个从库后,主库剩下的从库只有1个,就无法满足2个副本,所以不可能再提交并返回用户事务状态。  
  
目前只能冻结从库的replay,无法冻结receiver  
 pg_catalog | pg_wal_replay_pause           | void                     |                     | normal  
 pg_catalog | pg_wal_replay_resume          | void                     |                     | normal  
如果可以直接冻结receiver就完美了,切换时很好用  

我们把重点放在如何构建一种多从,如何设置多副本强同步上。

硬件:

3台物理机,万兆互联,同一局域网最好,如果你要做跨机房容灾,就不要在乎同一局域网了(只要保证足够大的带宽可以实现主备流复制的延迟较低就可以)。

由于我这里写文档用的是测试环境,用单台虚拟机(56核,224GB MEM,3TB DISK)代替,读者请关注监听端口,不同的端口代表不同的库。(在三主机中,根据本文的配置,更换一下IP+PORT即可)

pic

主库     127.0.0.1 : 1921  
从库1    127.0.0.1 : 1922  
从库2    127.0.0.1 : 1923  

2 准备软件环境

CentOS 7.x x64

PostgreSQL 10

可以参考如下任意文档进行部署

《PostgreSQL 10 + PostGIS + Sharding(pg_pathman) + MySQL(fdw外部表) on ECS 部署指南(适合新用户)》

《PostgreSQL on Linux 最佳部署手册》

准备两个大目录,分别存放数据文件和WAL REDO文件.

/dev/vdc1       1.5T  554G  902G  39% /data02   # wal redo  
/dev/vdb1       1.5T  275G  1.2T  19% /data01   # datafile  
mkdir /data01/pg  
mkdir /data02/pg  
chown postgres:postgres /data01/pg  
chown postgres:postgres /data02/pg  

如果你是在三台物理机上部署,请确保三台物理机之间可以相互访问它们的数据库监听端口。(否则怎么基于网络做流复制呀.)

3 初始化主库

1、创建一个配置文件,方便我们切换主备的连接环境。

su - postgres  
  
vi env.sh  
  
export PS1="$USER@`/bin/hostname -s`-> "    
export PGPORT=$1  
# export PGPORT=1921  
export PGDATA=/data01/pg/pg_root$PGPORT    
export LANG=en_US.utf8    
export PGHOME=/usr/pgsql-10  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH    
export DATE=`date +"%Y%m%d%H%M"`    
export PATH=$PGHOME/bin:$PATH:.    
export MANPATH=$PGHOME/share/man:$MANPATH    
export PGHOST=$PGDATA    
export PGUSER=postgres    
export PGDATABASE=postgres    
alias rm='rm -i'    
alias ll='ls -lh'    
unalias vi   

2、连接到主库,初始化数据库

su - postgres  
. ./env.sh 1921  
  
initdb -D $PGDATA -U postgres -E UTF8 --locale=en_US.utf8 -X /data02/pg/pg_wal_$PGPORT  

4 配置主库

su - postgres  
. ./env.sh 1921  

1、配置postgresql.conf

vi $PGDATA/postgresql.conf  
  
修改如下配置  
  
listen_addresses = '0.0.0.0'  
port = 1921   ## 主库监听端口  
max_connections = 5000  
superuser_reserved_connections = 13  
unix_socket_directories = '.'  
tcp_keepalives_idle = 60  
tcp_keepalives_interval = 10  
tcp_keepalives_count = 10  
shared_buffers = 24GB  
huge_pages = try  
work_mem = 4MB  
maintenance_work_mem = 1GB  
dynamic_shared_memory_type = posix  
shared_preload_libraries = 'pg_stat_statements'  
vacuum_cost_delay = 0  
bgwriter_delay = 10ms  
bgwriter_lru_maxpages = 1000  
bgwriter_lru_multiplier = 5.0  
effective_io_concurrency = 0  
max_worker_processes = 128  
wal_level = replica  
synchronous_commit = remote_write  
full_page_writes = on  
wal_buffers = 64MB  
wal_writer_delay = 10ms  
checkpoint_timeout = 30min  
max_wal_size = 48GB  
min_wal_size = 12GB  
checkpoint_completion_target = 0.1  
archive_mode = on  
archive_command = '/bin/date'  
max_wal_senders = 8  
wal_keep_segments = 4096  
wal_sender_timeout = 15s  
hot_standby = on  
max_standby_archive_delay = 600s  
max_standby_streaming_delay = 600s  
wal_receiver_status_interval = 1s  
hot_standby_feedback = off  
wal_receiver_timeout = 30s  
wal_retrieve_retry_interval = 5s  
random_page_cost = 1.1  
log_destination = 'csvlog'  
logging_collector = on  
log_directory = 'log'  
log_filename = 'postgresql-%a.log'  
log_truncate_on_rotation = on  
log_rotation_age = 1d  
log_rotation_size = 0  
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose  
log_line_prefix = '%m [%p] '  
log_lock_waits = on  
log_statement = 'ddl'  
log_timezone = 'PRC'  
autovacuum = on  
log_autovacuum_min_duration = 0  
autovacuum_max_workers = 8  
autovacuum_freeze_max_age = 950000000  
autovacuum_multixact_freeze_max_age = 1100000000  
autovacuum_vacuum_cost_delay = 0  
datestyle = 'iso, mdy'  
timezone = 'PRC'  
lc_messages = 'en_US.utf8'  
lc_monetary = 'en_US.utf8'  
lc_numeric = 'en_US.utf8'  
lc_time = 'en_US.utf8'  
default_text_search_config = 'pg_catalog.english'  
# wal_log_hints = on    # 如果你需要用pg_rewind修复WAL的时间线差异, 需要开启它, 但是开启它会导致写wal变多, 请斟酌

2、启动主库

pg_ctl start  

3、创建流复制角色

psql  
psql (10.3)  
Type "help" for help.  
  
postgres=# create role rep login encrypted password 'REpliCa12343231_-1!' replication;  
CREATE ROLE  

4、配置数据库防火墙pg_hba.conf

vi $PGDATA/pg_hba.conf  
  
追加到配置文件末尾  
  
# 多主机应该这样配置, 如果你在可信任网络中,也可以配置为truse代替md5,那么就不需要密码认证了  
# host replication rep 主机1IP/32 md5  
# host replication rep 主机2IP/32 md5  
# host replication rep 主机3IP/32 md5  
  
# 我的测试环境配置  
host replication rep 127.0.0.1/32 md5  
host all all 0.0.0.0/0 md5  

5、使配置生效

pg_ctl reload  
server signaled  

6、生成流复制配置模板

cp $PGHOME/share/recovery.conf.sample $PGDATA/  
cd $PGDATA  
mv recovery.conf.sample recovery.done  
  
vi recovery.done  
  
recovery_target_timeline = 'latest'  
standby_mode = on  
primary_conninfo = 'host=127.0.0.1 port=1921 user=rep password=REpliCa12343231_-1!'  
# host 主库ip, port 主库监听端口, user 流复制用户名, password 流复制角色rep的密码  

5 创建流复制从库1

su - postgres  
. ./env.sh 1922  

1、使用流复制用户创建从库1

pg_basebackup -D /data01/pg/pg_root1922 -F p --waldir=/data02/pg/pg_wal_1922 -h 127.0.0.1 -p 1921 -U rep  
Password:   输入rep用户的密码  
WARNING:  skipping special file "./.s.PGSQL.1921"  

2、配置postgresql.conf(多主机环境可选项),因为我的测试环境是一台主机启动3个库,需要避免监听端口冲突

cd $PGDATA  
vi postgresql.conf  
port = 1922   

3、配置流复制配置文件recovery.conf

cd $PGDATA  
mv recovery.done recovery.conf  

4、启动从库1

pg_ctl start  

6 创建流复制从库2

su - postgres  
. ./env.sh 1923  

1、使用流复制用户创建从库2

pg_basebackup -D /data01/pg/pg_root1923 -F p --waldir=/data02/pg/pg_wal_1923 -h 127.0.0.1 -p 1921 -U rep  
Password:   输入rep用户的密码  
WARNING:  skipping special file "./.s.PGSQL.1921"  

2、配置postgresql.conf(多主机环境可选项),因为我的测试环境是一台主机启动3个库,需要避免监听端口冲突

cd $PGDATA  
vi postgresql.conf  
port = 1923  

3、配置流复制配置文件recovery.conf

cd $PGDATA  
mv recovery.done recovery.conf  

4、启动从库2

pg_ctl start  

7 配置多副本强同步

1、配置主节点

psql -h 127.0.0.1 -p 1921 -U postgres postgres  
psql (10.3)  
Type "help" for help.  
  
postgres=# alter system set synchronous_standby_names='ANY 1 (*)';  
ALTER SYSTEM  
postgres=# select pg_reload_conf();  
 pg_reload_conf   
----------------  
 t  
(1 row)  

2、配置从库1 (这样确保激活从库1后,它也要求多副本强同步)

psql -h 127.0.0.1 -p 1922 -U postgres postgres  
psql (10.3)  
Type "help" for help.  
  
postgres=# alter system set synchronous_standby_names='ANY 1 (*)';  
ALTER SYSTEM  
postgres=# select pg_reload_conf();  
 pg_reload_conf   
----------------  
 t  
(1 row)  

3、配置从库2 (这样确保激活从库2后,它也要求多副本强同步)

psql -h 127.0.0.1 -p 1923 -U postgres postgres  
psql (10.3)  
Type "help" for help.  
  
postgres=# alter system set synchronous_standby_names='ANY 1 (*)';  
ALTER SYSTEM  
postgres=# select pg_reload_conf();  
 pg_reload_conf   
----------------  
 t  
(1 row)  

7 压测,观察主从延迟

1 高并发小事务压测

创建1024个表,使用merge insert写入200亿数据。

连接到主节点执行如下:

1、建表

do language plpgsql $$      
declare      
begin      
  execute 'drop table if exists test';      
  execute 'create table test(id int8 primary key, info text, crt_time timestamp)';      
        
  for i in 0..1023 loop      
    execute format('drop table if exists test%s', i);      
    execute format('create table test%s (like test including all)', i);      
  end loop;      
end;      
$$;      

2、创建动态数据写入函数

create or replace function dyn_pre(int8) returns void as $$      
declare      
  suffix int8 := mod($1,1024);      
begin      
  execute format('execute p%s(%s)', suffix, $1);      
  exception when others then      
    execute format('prepare p%s(int8) as insert into test%s values($1, md5(random()::text), now()) on conflict(id) do update set info=excluded.info,crt_time=excluded.crt_time', suffix, suffix);      
    execute format('execute p%s(%s)', suffix, $1);      
end;      
$$ language plpgsql strict;      

3、创建压测脚本,使用merge insert写入200亿数据。

vi test.sql      
      
\set id random(1,20000000000)      
select dyn_pre(:id);      

4、写入压测

pgbench -M prepared -n -r -P 1 -f ./test.sql -h 127.0.0.1 -p 1921 -U postgres postgres -c 56 -j 56 -T 1200000      

性能如下,(在不使用多副本同步复制(synchronous_commit = local)时,性能约8.6万tps,同步复制因为需要等从库的FEED BACK,所以RT有一定的影响。)

progress: 166.0 s, 72845.6 tps, lat 0.769 ms stddev 1.235  
progress: 167.0 s, 71348.9 tps, lat 0.774 ms stddev 1.475  
progress: 168.0 s, 69103.7 tps, lat 0.822 ms stddev 1.912  
progress: 169.0 s, 71369.3 tps, lat 0.785 ms stddev 1.279  
progress: 170.0 s, 71994.7 tps, lat 0.778 ms stddev 1.475  
progress: 171.0 s, 72761.9 tps, lat 0.770 ms stddev 1.260  
progress: 172.0 s, 69511.8 tps, lat 0.806 ms stddev 2.059  
progress: 173.0 s, 71999.7 tps, lat 0.778 ms stddev 1.303  
progress: 174.0 s, 72752.9 tps, lat 0.770 ms stddev 1.392  
progress: 175.0 s, 71915.5 tps, lat 0.776 ms stddev 1.339  
progress: 176.0 s, 73192.6 tps, lat 0.768 ms stddev 1.371  
progress: 177.0 s, 68944.8 tps, lat 0.812 ms stddev 1.821  
progress: 178.0 s, 72611.7 tps, lat 0.771 ms stddev 1.250  
progress: 179.0 s, 72492.1 tps, lat 0.773 ms stddev 1.293  
progress: 180.0 s, 74471.1 tps, lat 0.752 ms stddev 1.216  
progress: 181.0 s, 67790.2 tps, lat 0.826 ms stddev 2.087  
progress: 182.0 s, 72486.8 tps, lat 0.773 ms stddev 1.298  
progress: 183.0 s, 74707.1 tps, lat 0.750 ms stddev 1.216  
progress: 184.0 s, 75784.9 tps, lat 0.739 ms stddev 1.267  
progress: 185.0 s, 71072.0 tps, lat 0.788 ms stddev 1.860  
progress: 186.0 s, 75719.3 tps, lat 0.739 ms stddev 1.277  
progress: 187.0 s, 75638.9 tps, lat 0.740 ms stddev 1.309  

5、观察主备延迟

psql -h 127.0.0.1 -p 1921 -U postgres postgres  
  
postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 5440 bytes  ## 从库1 wal的延迟  
pid              | 9107  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50174  
backend_start    | 2018-03-26 16:05:26.404767+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 4B/93790ED0  
write_lsn        | 4B/93790130  
flush_lsn        | 4B/93790130  
replay_lsn       | 4B/93789F60  
write_lag        | 00:00:00.000055  ## 计算的延迟时间,非常小  
flush_lag        | 00:00:00.000128  
replay_lag       | 00:00:00.001399  
sync_priority    | 1  
sync_state       | quorum  
-[ RECORD 2 ]----+------------------------------  
delay_wal_size   | 8328 bytes  ## 从库2 wal的延迟  
pid              | 9064  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50166  
backend_start    | 2018-03-26 16:02:51.03693+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 4B/9378FB98  
write_lsn        | 4B/9378F5E8  
flush_lsn        | 4B/9378F5E8  
replay_lsn       | 4B/93788590  
write_lag        | 00:00:00.000224   ## 计算的延迟时间,非常小  
flush_lag        | 00:00:00.000224  
replay_lag       | 00:00:00.001519  
sync_priority    | 1  
sync_state       | quorum  

6、观察iotop

# iotop  
  
Total DISK READ :       0.00 B/s | Total DISK WRITE :     236.99 M/s  
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     210.50 M/s  
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND 
 9106 be/4 postgres    0.00 B/s   50.95 M/s  0.00 % 27.36 % postgres: wal receiver process   streaming 4C/6BC37028  
 9063 be/4 postgres    0.00 B/s   50.69 M/s  0.00 % 26.05 % postgres: wal receiver process   streaming 4C/6BC39900  
 9632 be/4 postgres    0.00 B/s   46.15 K/s  0.00 %  4.17 % postgres: autovacuum worker process   postgres       
 9230 be/4 postgres    0.00 B/s 1692.27 K/s  0.00 %  3.94 % postgres: postgres postgres 127.0.0.1(50218) SELECT  
 9217 be/4 postgres    0.00 B/s 1784.58 K/s  0.00 %  3.79 % postgres: postgres postgres 127.0.0.1(50210) idle    
 9198 be/4 postgres    0.00 B/s 1884.58 K/s  0.00 %  3.70 % postgres: postgres postgres 127.0.0.1(50198) SELECT  
 9270 be/4 postgres    0.00 B/s 1746.12 K/s  0.00 %  3.69 % postgres: postgres postgres 127.0.0.1(50286) SELECT  
 9202 be/4 postgres    0.00 B/s 1723.04 K/s  0.00 %  3.67 % postgres: postgres postgres 127.0.0.1(50200) SELECT  
 9172 be/4 postgres    0.00 B/s 1592.27 K/s  0.00 %  3.52 % postgres: postgres postgres 127.0.0.1(50184) idle    
 9245 be/4 postgres    0.00 B/s 2046.11 K/s  0.00 %  3.51 % postgres: postgres postgres 127.0.0.1(50236) SELECT  
 9226 be/4 postgres    0.00 B/s 1861.50 K/s  0.00 %  3.45 % postgres: postgres postgres 127.0.0.1(50216) SELECT  
 9249 be/4 postgres    0.00 B/s 1399.97 K/s  0.00 %  3.39 % postgres: postgres postgres 127.0.0.1(50244) SELECT  
 9252 be/4 postgres    0.00 B/s 1784.58 K/s  0.00 %  3.35 % postgres: postgres postgres 127.0.0.1(50250) SELECT  
 9247 be/4 postgres    0.00 B/s 1938.42 K/s  0.00 %  3.35 % postgres: postgres postgres 127.0.0.1(50240) SELECT  
 9274 be/4 postgres    0.00 B/s 1853.81 K/s  0.00 %  3.34 % postgres: postgres postgres 127.0.0.1(50294) SELECT  
 9211 be/4 postgres    0.00 B/s 1907.65 K/s  0.00 %  3.32 % postgres: postgres postgres 127.0.0.1(50206) SELECT  
 9248 be/4 postgres    0.00 B/s 1753.81 K/s  0.00 %  3.32 % postgres: postgres postgres 127.0.0.1(50242) BIND    
 9273 be/4 postgres    0.00 B/s 1838.42 K/s  0.00 %  3.32 % postgres: postgres postgres 127.0.0.1(50292) SELECT  
 9244 be/4 postgres    0.00 B/s 1523.04 K/s  0.00 %  3.31 % postgres: postgres postgres 127.0.0.1(50234) SELECT  

2 高并发大事务压测

创建1024个表,大批量INSERT。

连接到主节点执行如下:

1、建表

do language plpgsql $$      
declare      
begin      
  execute 'drop table if exists test';      
  execute 'create table test(id int8, info text, crt_time timestamp)';      
        
  for i in 0..1023 loop      
    execute format('drop table if exists test%s', i);      
    execute format('create table test%s (like test including all)', i);      
  end loop;      
end;      
$$;      

2、创建动态数据写入函数

create or replace function dyn_pre(int8) returns void as $$      
declare      
  suffix int8 := mod($1,1024);      
begin      
  execute format('execute p%s(%s)', suffix, $1);      
  exception when others then      
    execute format('prepare p%s(int8) as insert into test%s select generate_series(1,10000), %L, now()', suffix, suffix, 'postgres: wal receiver process   streaming 4C/6BC37028');      
    execute format('execute p%s(%s)', suffix, $1);      
end;      
$$ language plpgsql strict;      

3、创建压测脚本,批量写入,每批写入1万条。

vi test.sql      
      
\set id random(1,20000000000)      
select dyn_pre(:id);      

4、写入压测,开启批量写之后,每秒写入峰值达到了200万行。

progress: 48.0 s, 168.0 tps, lat 329.783 ms stddev 36.013  
progress: 49.0 s, 168.0 tps, lat 334.085 ms stddev 31.487  
progress: 50.0 s, 143.0 tps, lat 363.151 ms stddev 43.855  
progress: 51.0 s, 168.0 tps, lat 345.992 ms stddev 42.808  
progress: 52.0 s, 168.0 tps, lat 356.376 ms stddev 31.336  
progress: 53.0 s, 117.0 tps, lat 387.883 ms stddev 61.924  
progress: 54.0 s, 203.1 tps, lat 333.567 ms stddev 97.568  
progress: 55.0 s, 176.0 tps, lat 324.117 ms stddev 36.841  
progress: 56.0 s, 137.0 tps, lat 356.819 ms stddev 34.533  
progress: 57.0 s, 174.0 tps, lat 331.817 ms stddev 46.634  
progress: 58.0 s, 172.0 tps, lat 329.905 ms stddev 32.809  
progress: 59.0 s, 168.0 tps, lat 315.655 ms stddev 27.721  
progress: 60.0 s, 149.0 tps, lat 411.454 ms stddev 91.748  
progress: 61.0 s, 159.0 tps, lat 338.382 ms stddev 49.510  
progress: 62.0 s, 126.0 tps, lat 404.776 ms stddev 92.685  

5、观察主备延迟,发送WAL有一定延迟,但是并不影响写入。

postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 57 MB  ## 延迟了57 MB,但是仅提交是需要等WAL的位点,所以并不影响  
pid              | 9107  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50174  
backend_start    | 2018-03-26 16:05:26.404767+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 4E/58040000  
write_lsn        | 4E/56000000  
flush_lsn        | 4E/56000000  
replay_lsn       | 4D/19D00ED0  
write_lag        | 00:00:00.251625  
flush_lag        | 00:00:00.251625  
replay_lag       | 00:00:28.018643  
sync_priority    | 1  
sync_state       | quorum  
-[ RECORD 2 ]----+------------------------------  
delay_wal_size   | 41 MB  
pid              | 9064  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50166  
backend_start    | 2018-03-26 16:02:51.03693+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 4E/58012000  
write_lsn        | 4E/57000000  
flush_lsn        | 4E/57000000  
replay_lsn       | 4D/1A47B998  
write_lag        | 00:00:00.069438  
flush_lag        | 00:00:00.069438  
replay_lag       | 00:00:27.856028  
sync_priority    | 1  
sync_state       | quorum  

6、观察IO使用情况,显然现在磁盘读写都达到了1GB/s多,同时两个备库的流复制接收进程,也达到了400多MB/s的写。同时观察到有一些进程在提交事务时,正在等待commit wal record的位点同步到至少一个从库。

iotop   
Total DISK READ :      23.10 K/s | Total DISK WRITE :    1415.20 M/s  
Actual DISK READ:      23.10 K/s | Actual DISK WRITE:    1342.53 M/s  
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND  
 9063 be/4 postgres   23.10 K/s  431.26 M/s  0.00 % 38.41 % postgres: wal receiver process   streaming 4F/A0000000           
 9106 be/4 postgres    0.00 B/s  415.86 M/s  0.00 % 35.86 % postgres: wal receiver process   streaming 4F/9F000000           
  941 be/3 root        0.00 B/s  311.89 K/s  0.00 % 26.99 % [jbd2/vdc1-8]  
10281 be/4 root        0.00 B/s    0.00 B/s  0.00 % 17.40 % [kworker/u112:0]              
 8918 be/4 postgres    0.00 B/s   95.19 M/s  0.00 % 10.89 % postgres: wal writer process  
10255 be/4 postgres    0.00 B/s   18.53 M/s  0.00 %  5.24 % postgres: postgres postgres 127.0.0.1(50610) SELECT  
10188 be/4 postgres    0.00 B/s   17.63 M/s  0.00 %  4.32 % postgres: postgres postgres 127.0.0.1(50540) SELECT waiting for 4F/A192D6F8  
10250 be/4 postgres    0.00 B/s   11.20 M/s  0.00 %  4.22 % postgres: postgres postgres 127.0.0.1(50600) SELECT  
10191 be/4 postgres    0.00 B/s   11.91 M/s  0.00 %  4.10 % postgres: postgres postgres 127.0.0.1(50542) SELECT waiting for 4F/A1881A48  
10230 be/4 postgres    0.00 B/s   31.88 M/s  0.00 %  3.42 % postgres: postgres postgres 127.0.0.1(50566) SELECT  
10254 be/4 postgres    0.00 B/s   15.70 M/s  0.00 %  3.27 % postgres: postgres postgres 127.0.0.1(50608) SELECT waiting for 4F/A1942CA8  
10236 be/4 postgres    0.00 B/s    3.20 M/s  0.00 %  2.61 % postgres: postgres postgres 127.0.0.1(50572) SELECT  
10234 be/4 postgres    0.00 B/s   14.05 M/s  0.00 %  2.58 % postgres: postgres postgres 127.0.0.1(50568) SELECT waiting for 4F/A0BAC028  
10259 be/4 postgres    0.00 B/s    3.17 M/s  0.00 %  2.48 % postgres: postgres postgres 127.0.0.1(50618) SELECT  
10269 be/4 postgres    0.00 B/s    3.19 M/s  0.00 %  2.48 % postgres: postgres postgres 127.0.0.1(50638) SELECT  
10194 be/4 postgres    0.00 B/s    5.54 M/s  0.00 %  2.46 % postgres: postgres postgres 127.0.0.1(50544) SELECT waiting for 4F/A0DA0EB8  
10186 be/4 postgres    0.00 B/s    3.22 M/s  0.00 %  2.45 % postgres: postgres postgres 127.0.0.1(50538) SELECT  
10260 be/4 postgres    0.00 B/s    3.14 M/s  0.00 %  2.45 % postgres: postgres postgres 127.0.0.1(50620) SELECT  
10169 be/4 postgres    0.00 B/s    3.17 M/s  0.00 %  2.42 % postgres: postgres postgres 127.0.0.1(50528) SELECT  
10213 be/4 postgres    0.00 B/s    3.14 M/s  0.00 %  2.39 % postgres: postgres postgres 127.0.0.1(50558) SELECT  
10176 be/4 postgres    0.00 B/s    3.15 M/s  0.00 %  2.38 % postgres: postgres postgres 127.0.0.1(50532) SELECT  
10262 be/4 postgres    0.00 B/s    3.18 M/s  0.00 %  2.36 % postgres: postgres postgres 127.0.0.1(50624) SELECT  
10241 be/4 postgres    0.00 B/s    3.40 M/s  0.00 %  2.35 % postgres: postgres postgres 127.0.0.1(50582) SELECT  
10265 be/4 postgres    0.00 B/s    3.12 M/s  0.00 %  2.33 % postgres: postgres postgres 127.0.0.1(50630) SELECT  
10172 be/4 postgres    0.00 B/s    3.12 M/s  0.00 %  2.33 % postgres: postgres postgres 127.0.0.1(50530) SELECT  
10246 be/4 postgres    0.00 B/s    3.14 M/s  0.00 %  2.33 % postgres: postgres postgres 127.0.0.1(50592) SELECT  
10243 be/4 postgres    0.00 B/s    3.14 M/s  0.00 %  2.33 % postgres: postgres postgres 127.0.0.1(50586) SELECT  

7、关闭压测,瞬间追平

postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 9107  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50174  
backend_start    | 2018-03-26 16:05:26.404767+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 51/70F46070  
write_lsn        | 51/70F46070  
flush_lsn        | 51/70F46070  
replay_lsn       | 4F/3420CF60  
write_lag        | 00:00:00.00009  
flush_lag        | 00:00:00.000194  
replay_lag       | 00:01:02.852602  
sync_priority    | 1  
sync_state       | quorum  
-[ RECORD 2 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 9064  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50166  
backend_start    | 2018-03-26 16:02:51.03693+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 51/70F46070  
write_lsn        | 51/70F46070  
flush_lsn        | 51/70F46070  
replay_lsn       | 4F/38846D10  
write_lag        | 00:00:00.000089  
flush_lag        | 00:00:00.000198  
replay_lag       | 00:01:02.58513  
sync_priority    | 1  
sync_state       | quorum  

8 模拟挂掉1台从库

pic

1、关闭从库1

pg_ctl stop -m fast -D /data01/pg/pg_root1922  

2、从主节点观察从库状态

postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 9107  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50174  
backend_start    | 2018-03-26 16:05:26.404767+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 51/7117F428  
write_lsn        | 51/7117F428  
flush_lsn        | 51/7117F428  
replay_lsn       | 51/7117F428  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 1  
sync_state       | quorum  

3、因为还有一个从库,可以继续读写主库

pgbench -M prepared -n -r -P 1 -f ./test.sql -h 127.0.0.1 -p 1921 -U postgres postgres -c 56 -j 56 -T 1200000      
progress: 1.0 s, 157.9 tps, lat 283.577 ms stddev 68.653  
progress: 2.0 s, 198.1 tps, lat 289.255 ms stddev 50.272  
progress: 3.0 s, 203.0 tps, lat 269.144 ms stddev 31.036  
progress: 4.0 s, 235.0 tps, lat 259.014 ms stddev 24.828  
progress: 5.0 s, 210.0 tps, lat 263.051 ms stddev 30.233  

6、观察剩余的1个从库的延迟

postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 53 MB  
pid              | 9107  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50174  
backend_start    | 2018-03-26 16:05:26.404767+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 52/78498000  
write_lsn        | 52/76000000  
flush_lsn        | 52/76000000  
replay_lsn       | 51/E2B88798  
write_lag        | 00:00:00.132428  
flush_lag        | 00:00:00.132428  
replay_lag       | 00:00:09.864225  
sync_priority    | 1  
sync_state       | quorum  

9 模拟挂掉2台从库

pic

1、关闭第二台从库

pg_ctl stop -m fast -D /data01/pg/pg_root1923  

2、可以看到批量写的压测进程TPS跌到0了,因为所有事务提交都处于等待状态

progress: 58.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 59.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 60.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 61.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 62.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 63.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 64.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 65.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 66.0 s, 0.0 tps, lat -nan ms stddev -nan  

3、此时观察主库的从库状态,两个从库已经下线了

postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
(0 rows)  

4、观察当前的会话状态,批量写的事务结束状态被冻结,等待同步提交结束才能返回客户端事务结束的状态。

postgres=# select usename,datname,wait_event_type,wait_event,query from pg_stat_activity ;  
 usename  | datname  | wait_event_type |     wait_event      |                                      query                                        
----------+----------+-----------------+---------------------+---------------------------------------------------------------------------------  
          |          | Activity        | AutoVacuumMain      |   
 postgres |          | Activity        | LogicalLauncherMain |   
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres |                 |                     | select usename,datname,wait_event_type,wait_event,query from pg_stat_activity ;  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
 postgres | postgres | IPC             | SyncRep             | select dyn_pre($1);  
......  

5、在top中也可以看到其状态,正在等待的WAL COMMIT RECORD的位点

11530 postgres  20   0 25.247g   8744   4024 S   0.0  0.0   0:05.85 postgres: postgres postgres 127.0.0.1(50650) SELECT waiting for 54/48223A48  
11531 postgres  20   0 25.247g   8748   4024 S   0.0  0.0   0:06.12 postgres: postgres postgres 127.0.0.1(50652) SELECT waiting for 54/4A8CB450  
11535 postgres  20   0 25.247g   8744   4024 S   0.0  0.0   0:06.15 postgres: postgres postgres 127.0.0.1(50654) SELECT waiting for 54/484E8280  
11538 postgres  20   0 25.247g   8704   4024 S   0.0  0.0   0:06.15 postgres: postgres postgres 127.0.0.1(50656) SELECT waiting for 54/4A8AE1D8  
11542 postgres  20   0 25.247g   8744   4024 S   0.0  0.0   0:06.26 postgres: postgres postgres 127.0.0.1(50658) SELECT waiting for 54/4828C590  
11545 postgres  20   0 25.247g   8664   4024 S   0.0  0.0   0:06.11 postgres: postgres postgres 127.0.0.1(50660) SELECT waiting for 54/4A92D238  
11546 postgres  20   0 25.247g   8748   4024 S   0.0  0.0   0:06.52 postgres: postgres postgres 127.0.0.1(50662) SELECT waiting for 54/48F44E18  
11549 postgres  20   0 25.247g   8584   4024 S   0.0  0.0   0:06.27 postgres: postgres postgres 127.0.0.1(50664) SELECT waiting for 54/47E12960  
11551 postgres  20   0 25.247g   8668   4024 S   0.0  0.0   0:06.38 postgres: postgres postgres 127.0.0.1(50666) SELECT waiting for 54/4A492EC8  
11556 postgres  20   0 25.247g   8724   4024 S   0.0  0.0   0:06.34 postgres: postgres postgres 127.0.0.1(50668) SELECT waiting for 54/47E09540  
11558 postgres  20   0 25.247g   8676   4024 S   0.0  0.0   0:06.07 postgres: postgres postgres 127.0.0.1(50670) SELECT waiting for 54/4A1F0CB8  
11561 postgres  20   0 25.247g   8648   4024 S   0.0  0.0   0:06.03 postgres: postgres postgres 127.0.0.1(50672) SELECT waiting for 54/48C08120  
11564 postgres  20   0 25.247g   8624   4024 S   0.0  0.0   0:05.98 postgres: postgres postgres 127.0.0.1(50674) SELECT waiting for 54/48AE6110  
11568 postgres  20   0 25.247g   8652   4024 S   0.0  0.0   0:05.83 postgres: postgres postgres 127.0.0.1(50676) SELECT waiting for 54/48B4FB18  
11571 postgres  20   0 25.247g   8756   4024 S   0.0  0.0   0:06.24 postgres: postgres postgres 127.0.0.1(50678) SELECT waiting for 54/4A905DD8  
11575 postgres  20   0 25.247g   8708   4024 S   0.0  0.0   0:06.40 postgres: postgres postgres 127.0.0.1(50680) SELECT waiting for 54/4A7F4228  
11578 postgres  20   0 25.247g   8664   4024 S   0.0  0.0   0:06.13 postgres: postgres postgres 127.0.0.1(50682) SELECT waiting for 54/4A487990  
......  

10 模拟从库恢复

1、启动从库

pg_ctl start -D /data01/pg/pg_root1922  
  
  
  
pg_ctl start -D /data01/pg/pg_root1923  

2、可以看到启动后,从库开始从上一个检查点开始恢复。

12388 postgres  20   0 25.238g   3228   1324 R  98.3  0.0   0:52.31 postgres: startup process   recovering 000000010000005200000014  
12401 postgres  20   0 25.238g   3228   1324 R  97.4  0.0   0:17.52 postgres: startup process   recovering 00000001000000510000000F  

3、从主节点观察从库状态

其中一个从库追平

pic

postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 12396  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50758  
backend_start    | 2018-03-26 16:52:19.321112+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 56/956AB438  
write_lsn        | 56/956AB438  
flush_lsn        | 56/956AB438  
replay_lsn       | 54/A36CDE38  
write_lag        | 00:00:00.000083  
flush_lag        | 00:00:00.000146  
replay_lag       | 00:01:17.75422  
sync_priority    | 1  
sync_state       | quorum  

有一个从库追平后,由于我们配置的同步提交只需要1个副本即可,所以1个从库追平后,等待就解除了,开始继续批量写入

progress: 267.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 268.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 269.0 s, 137.1 tps, lat 90422.766 ms stddev 108453.563  
progress: 270.0 s, 210.0 tps, lat 270.240 ms stddev 26.268  
progress: 271.0 s, 205.0 tps, lat 270.547 ms stddev 26.334  
progress: 272.0 s, 210.0 tps, lat 279.882 ms stddev 29.748  
progress: 273.0 s, 208.0 tps, lat 262.075 ms stddev 25.902  
progress: 274.0 s, 208.0 tps, lat 270.541 ms stddev 28.560  
progress: 275.0 s, 167.9 tps, lat 275.042 ms stddev 27.207  
progress: 276.0 s, 57.0 tps, lat 1029.869 ms stddev 154.074  
progress: 277.0 s, 209.0 tps, lat 291.212 ms stddev 67.857  
progress: 278.0 s, 201.0 tps, lat 282.762 ms stddev 31.040  
progress: 279.0 s, 195.0 tps, lat 277.304 ms stddev 47.745  
progress: 280.0 s, 230.0 tps, lat 273.271 ms stddev 32.877  

当第二个从库也进入stream状态后,开始追

postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 7992 MB  
pid              | 13012  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50760  
backend_start    | 2018-03-26 16:54:39.246404+08  
backend_xmin     |   
state            | catchup    ## 状态为追  
sent_lsn         | 54/A31A0000  
write_lsn        | 54/A2000000  
flush_lsn        | 54/A2000000  
replay_lsn       | 54/6D037C78  
write_lag        | 00:00:05.308718  
flush_lag        | 00:00:05.308718  
replay_lag       | 00:00:05.308718  
sync_priority    | 1  
sync_state       | potential    ## 这个节点还没有追平,所以是待定状态  
-[ RECORD 2 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 12396  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50758  
backend_start    | 2018-03-26 16:52:19.321112+08  
backend_xmin     |   
state            | streaming   ## 状态为streaming  
sent_lsn         | 56/957C9D80  
write_lsn        | 56/957C9D80  
flush_lsn        | 56/957C9D80  
replay_lsn       | 55/30982DC0  
write_lag        | 00:00:00.00009  
flush_lag        | 00:00:00.000147  
replay_lag       | 00:01:28.262611  
sync_priority    | 1  
sync_state       | quorum   ## 同步状态为quorum  

第二个从库很快就追平了

pic

postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 13012  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50760  
backend_start    | 2018-03-26 16:54:39.246404+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 56/957E21E0  
write_lsn        | 56/957E21E0  
flush_lsn        | 56/957E21E0  
replay_lsn       | 55/501B2D28  
write_lag        | 00:00:00.000071  
flush_lag        | 00:00:00.000126  
replay_lag       | 00:00:38.590875  
sync_priority    | 1  
sync_state       | quorum  ## 追平后,就是quorum状态了  
-[ RECORD 2 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 12396  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 50758  
backend_start    | 2018-03-26 16:52:19.321112+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 56/957E21E0  
write_lsn        | 56/957E21E0  
flush_lsn        | 56/957E21E0  
replay_lsn       | 56/F12A2D0  
write_lag        | 00:00:00.00018  
flush_lag        | 00:00:00.000227  
replay_lag       | 00:01:46.434071  
sync_priority    | 1  
sync_state       | quorum  

11 正常情况下的主备切换

pic

1、继续批量写入压测

pgbench -M prepared -n -r -P 1 -f ./test.sql -h 127.0.0.1 -p 1921 -U postgres postgres -c 56 -j 56 -T 1200000  

2、正常关闭主库

pg_ctl stop -m fast -D /data01/pg/pg_root1921  

3、检查从库1、2哪个位点最新

从库1接收到的WAL位点以及回放到的位点

psql -h 127.0.0.1 -p 1922  
psql (10.3)  
Type "help" for help.  
  
postgres=# select pg_last_wal_receive_lsn();  
 pg_last_wal_receive_lsn   
-------------------------  
 57/60000098  
(1 row)  
  
postgres=# select pg_last_wal_replay_lsn();  
 pg_last_wal_replay_lsn   
------------------------  
 57/60000098  
(1 row)  

从库2接收到的WAL位点以及回放到的位点

psql -h 127.0.0.1 -p 1923  
psql (10.3)  
Type "help" for help.  
  
postgres=# select pg_last_wal_receive_lsn();  
 pg_last_wal_receive_lsn   
-------------------------  
 57/60000098  
(1 row)  
  
postgres=# select pg_last_wal_replay_lsn();  
 pg_last_wal_replay_lsn   
------------------------  
 57/60000098  
(1 row)  

使用pg_waldump也能观察到对应wal文件已经到达这个位置

pg_waldump /data02/pg/pg_wal_1923/000000010000005700000060  
rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 57/60000028, prev 57/5F0D6468, desc: CHECKPOINT_SHUTDOWN redo 57/60000028; tli 1; prev tli 1; fpw true; xid 0:2205203730; oid 217962; multi 1; offset 0; oldest xid 2136558886 in DB 1; oldest multi 1 in DB 13805; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown  
pg_waldump: FATAL:  error in WAL record at 57/60000028: invalid record length at 57/60000098: wanted 24, got 0  

4、激活最新从库作为新主库

激活pg_last_wal_receive_lsn最大的,如果有多个,则选择其中pg_last_wal_replay_lsn最大的。

例如激活从库1(可以使用promote,或者写trigger file。使用promote很方便)

pg_ctl promote -D /data01/pg/pg_root1922  
waiting for server to promote.... done  
server promoted  

5、另一个从库作为新主库的从库

修改recovery.conf的新主库连接信息

vi /data01/pg/pg_root1923/recovery.conf   
  
primary_conninfo = 'host=127.0.0.1 port=1922 user=rep password=REpliCa12343231_-1!'  

重启这个从库

pg_ctl stop -m fast -D /data01/pg/pg_root1923  
  
pg_ctl start -D /data01/pg/pg_root1923  

6、老的主库,作为新的主库的从库

修改recovery.conf的新主库连接信息

mv /data01/pg/pg_root1921/recovery.done /data01/pg/pg_root1921/recovery.conf   
vi /data01/pg/pg_root1921/recovery.conf   
  
primary_conninfo = 'host=127.0.0.1 port=1922 user=rep password=REpliCa12343231_-1!'  

启动老主库

pg_ctl start -D /data01/pg/pg_root1921  

7、观察新的主库的两个从库是否正常

psql -h 127.0.0.1 -p 1922  
  
postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 13925  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 47660  
backend_start    | 2018-03-26 17:26:49.093754+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 57/600018B0  
write_lsn        | 57/600018B0  
flush_lsn        | 57/600018B0  
replay_lsn       | 57/600018B0  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 1  
sync_state       | quorum  
-[ RECORD 2 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 13906  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 47658  
backend_start    | 2018-03-26 17:22:08.461081+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 57/600018B0  
write_lsn        | 57/600018B0  
flush_lsn        | 57/600018B0  
replay_lsn       | 57/600018B0  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 1  
sync_state       | quorum  

8、连接到新的主库进行批量写入压测

pgbench -M prepared -n -r -P 1 -f ./test.sql -h 127.0.0.1 -p 1922 -U postgres postgres -c 56 -j 56 -T 1200000   

9、观察新的主库压测时的从库延迟

postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 18 MB  
pid              | 13925  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 47660  
backend_start    | 2018-03-26 17:26:49.093754+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 57/E219E0C0  
write_lsn        | 57/E1000000  
flush_lsn        | 57/E1000000  
replay_lsn       | 57/BBFD57A0  
write_lag        | 00:00:00.23305  
flush_lag        | 00:00:00.23305  
replay_lag       | 00:00:03.917973  
sync_priority    | 1  
sync_state       | quorum  
-[ RECORD 2 ]----+------------------------------  
delay_wal_size   | 66 MB  
pid              | 13906  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 47658  
backend_start    | 2018-03-26 17:22:08.461081+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 57/E042E000  
write_lsn        | 57/DE000000  
flush_lsn        | 57/DE000000  
replay_lsn       | 57/B75E03D0  
write_lag        | 00:00:00.602368  
flush_lag        | 00:00:00.602368  
replay_lag       | 00:00:04.278079  
sync_priority    | 1  
sync_state       | quorum  

12 异常情况下的主备切换

pic

异常情况,指不知道主库在什么状况下的切换。本例要防止数据丢失,防止脑裂的话需要控制2个从库

如果总共有5个从库,如果配置如下  
  
synchronous_standby_names = 'ANY 2 (*)'  
  
那么需要控制 5-2+1=4个或以上从库,如果需要切换,就可以认为绝对不会出现脑裂或丢数据的情况。  
  
因为冻结了4个从库后,主库剩下的从库只有1个,就无法满足2个副本,所以不可能再提交并返回用户事务状态。  

1、继续批量写入压测

pgbench -M prepared -n -r -P 1 -f ./test.sql -h 127.0.0.1 -p 1922 -U postgres postgres -c 56 -j 56 -T 1200000  

2、控制2个从库(控制足够数量的从库(关闭),目前没有好的方法能够冻结它接收新的WAL,只能冻结replay,所以我们使用关闭的手段先。)

关闭从库1

pg_ctl stop -m fast -D /data01/pg/pg_root1921  

关闭从库2

pg_ctl stop -m fast -D /data01/pg/pg_root1923  

观察到压测HANG住,事务正在等待至少1个从库响应已接收到至少一份REDO副本的FEEDBACK。

progress: 74.0 s, 22.0 tps, lat 282.421 ms stddev 26.183  
progress: 75.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 76.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 77.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 78.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 79.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 80.0 s, 0.0 tps, lat -nan ms stddev -nan  

3、检查从库1、2哪个位点最新

因为数据库已经关闭了,可以通过pg_waldump来观察位点。

从库1位点5A/52D7DF60

ll -rth /data02/pg/pg_wal_1921|tail -n 3  
drwx------ 2 postgres postgres 232K Mar 26 17:41 archive_status  
-rw------- 1 postgres postgres  16M Mar 26 17:41 000000020000005A00000052  
-rw------- 1 postgres postgres  16M Mar 26 17:41 000000020000005A00000053  
  
pg_waldump /data02/pg/pg_wal_1921/000000020000005A00000053|tail -n 3  
pg_waldump: FATAL:  could not find a valid record after 5A/53000000  
  
  
pg_waldump /data02/pg/pg_wal_1921/000000020000005A00000052|tail -n 3  
pg_waldump: FATAL:  error in WAL record at 5A/52D7DF60: invalid magic number 0000 in log segment 000000000000005A00000052, offset 14147584  
rmgr: Heap        len (rec/tot):    127/   127, tx: 2205214610, lsn: 5A/52D7DE60, prev 5A/52D7DDE0, desc: INSERT off 63, blkref #0: rel 1663/13806/217026 blk 5741  
rmgr: Heap        len (rec/tot):    127/   127, tx: 2205214606, lsn: 5A/52D7DEE0, prev 5A/52D7DE60, desc: INSERT off 68, blkref #0: rel 1663/13806/217350 blk 4813  
rmgr: Heap        len (rec/tot):    127/   127, tx: 2205214617, lsn: 5A/52D7DF60, prev 5A/52D7DEE0, desc: INSERT off 59, blkref #0: rel 1663/13806/217074 blk 5101  

从库2位点5B/11000000

ll -rth /data02/pg/pg_wal_1923|tail -n 3  
-rw------- 1 postgres postgres  16M Mar 26 17:41 000000020000005B00000010  
drwx------ 2 postgres postgres 248K Mar 26 17:41 archive_status  
-rw------- 1 postgres postgres  16M Mar 26 17:41 000000020000005B00000011  
  
pg_waldump /data02/pg/pg_wal_1923/000000020000005B00000011|tail -n 3  
pg_waldump: FATAL:  could not find a valid record after 5B/11000000  
  
  
pg_waldump /data02/pg/pg_wal_1923/000000020000005B00000010|tail -n 3  
rmgr: Heap        len (rec/tot):    127/   127, tx: 2205217476, lsn: 5B/10FFFE18, prev 5B/10FFFD98, desc: INSERT off 7, blkref #0: rel 1663/13806/213270 blk 5397  
rmgr: Heap        len (rec/tot):    127/   127, tx: 2205217465, lsn: 5B/10FFFE98, prev 5B/10FFFE18, desc: INSERT off 35, blkref #0: rel 1663/13806/215400 blk 5142  
rmgr: Heap        len (rec/tot):    127/   127, tx: 2205217482, lsn: 5B/10FFFF18, prev 5B/10FFFE98, desc: INSERT off 55, blkref #0: rel 1663/13806/216612 blk 5638  

选择从库2作为新的主库

4、激活最新从库作为新主库

mv /data01/pg/pg_root1923/recovery.conf /data01/pg/pg_root1923/recovery.done  
  
pg_ctl start -D /data01/pg/pg_root1923  

5、另一个从库作为新主库的从库

vi /data01/pg/pg_root1921/recovery.conf  
  
primary_conninfo = 'host=127.0.0.1 port=1923 user=rep password=REpliCa12343231_-1!'  
  
pg_ctl start -D /data01/pg/pg_root1921  

6、关闭老的主库,可以使用pg_rewind连接到新主库修正它(但是我们一开始需要设置参数wal_log_hints = on,同时需要使用超级用户连接新主库进行修复)

详见

《PostgreSQL primary-standby failback tools : pg_rewind》

《PostgreSQL 9.5 new feature - pg_rewind fast sync Split Brain Primary & Standby》

《PostgreSQL 9.5 add pg_rewind for Fast align for PostgreSQL unaligned primary & standby》

首先确保新主库已启动正常。

 psql -h 127.0.0.1 -p 1923  
psql (10.3)  
Type "help" for help.  
  
postgres=# \x  
Expanded display is on.  
postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
(0 rows)  

使用pg_rewind修复老主库

pg_rewind -D /data01/pg/pg_root1922 --source-server='host=127.0.0.1 port=1923 user=postgres password=pwd'  

7、配置老主库变从库,将新主库作为它的主库

cd /data01/pg/pg_root1922  
  
mv recovery.done recovery.conf  
  
vi recovery.conf  
  
primary_conninfo = 'host=127.0.0.1 port=1923 user=rep password=REpliCa12343231_-1!'  
  
pg_ctl stop -m fast -D /data01/pg/pg_root1922  
pg_ctl start -D /data01/pg/pg_root1922  

8、如果无法使用pg_rewind修复老的主库,那么需要重建它。

rm -rf /data01/pg/pg_root1922  
rm -rf /data02/pg/pg_wal_1922  
pg_basebackup -D /data01/pg/pg_root1922 -F p --waldir=/data02/pg/pg_wal_1922 -h 127.0.0.1 -p 1923 -U rep  
could not identify current directory: No such file or directory  
Password:  输入rep用户的密码  
WARNING:  skipping special file "./.s.PGSQL.1923"  
拷贝中  
cd /data01/pg/pg_root1922  
  
mv recovery.done recovery.conf  
  
vi recovery.conf  
  
primary_conninfo = 'host=127.0.0.1 port=1923 user=rep password=REpliCa12343231_-1!'  
  
vi postgresql.conf  
port=1922  
  
pg_ctl start -D /data01/pg/pg_root1922  

13 新增从库

pic

新增一个从库,假设我这里使用端口为1924作为新的一个从库。

1、修改主库的pg_hba.conf,允许新增从库的主机访问它

vi /data01/pg/pg_root1923/pg_hba.conf  
  
  
追加到配置文件末尾  
  
# 多主机应该这样配置, 如果你在可信任网络中,也可以配置为truse代替md5,那么就不需要密码认证了  
# host replication rep 主机1IP/32 md5  
# host replication rep 主机2IP/32 md5  
# host replication rep 主机3IP/32 md5  
# host replication rep 主机4IP/32 md5  
  
生效配置  
  
pg_ctl reload /data01/pg/pg_root1923  

2、(假设PostgreSQL软件、目录都已部署好)

注意操作系统启动PostgreSQL的用户为普通用户,你的pgdata, pg_wal目录必须为空目录,并且有写权限。或者你可以不建目录,但是OS的这个用户需要有目标父目录的写权限。

pg_basebackup -D /data01/pg/pg_root1924 -F p --waldir=/data02/pg/pg_wal_1924 -h 127.0.0.1 -p 1923 -U rep  
could not identify current directory: No such file or directory  
Password:  输入rep用户的密码  
WARNING:  skipping special file "./.s.PGSQL.1923"  
拷贝中  
cd /data01/pg/pg_root1924  
  
mv recovery.done recovery.conf  
  
vi recovery.conf  
  
primary_conninfo = 'host=127.0.0.1 port=1923 user=rep password=REpliCa12343231_-1!'  
  
vi postgresql.conf  
port=1924  
  
pg_ctl start -D /data01/pg/pg_root1924  

在主库上查看,可以看到已经有3个备库了。

postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_flush_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 15463  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 45880  
backend_start    | 2018-03-26 18:56:50.825897+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 5B/16000060  
write_lsn        | 5B/16000060  
flush_lsn        | 5B/16000060  
replay_lsn       | 5B/16000060  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 1  
sync_state       | quorum  
-[ RECORD 2 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 15437  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 45872  
backend_start    | 2018-03-26 18:52:41.027181+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 5B/16000060  
write_lsn        | 5B/16000060  
flush_lsn        | 5B/16000060  
replay_lsn       | 5B/16000060  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 1  
sync_state       | quorum  
-[ RECORD 3 ]----+------------------------------  
delay_wal_size   | 0 bytes  
pid              | 15270  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 45840  
backend_start    | 2018-03-26 18:00:24.558369+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 5B/16000060  
write_lsn        | 5B/16000060  
flush_lsn        | 5B/16000060  
replay_lsn       | 5B/16000060  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 1  
sync_state       | quorum  

14 重搭从库

参考 12章的第8个步骤。

15 新增级联从库

pic

以现在的从库为上游节点,建立级联的从库

比如以从库1 (1921这个) 为上游,构建它的级联从库。

1、修改作为上游节点的从库的pg_hba.conf,允许新增级联从库的主机访问它

vi /data01/pg/pg_root1921/pg_hba.conf  
  
  
追加到配置文件末尾  
  
# 多主机应该这样配置, 如果你在可信任网络中,也可以配置为truse代替md5,那么就不需要密码认证了  
# host replication rep 主机5IP/32 md5  
  
生效配置  
  
pg_ctl reload /data01/pg/pg_root1921  

2、新建从库(假设PostgreSQL软件、目录都已部署好)

pg_basebackup -D /data01/pg/pg_root1925 -F p --waldir=/data02/pg/pg_wal_1925 -h 127.0.0.1 -p 1921 -U rep  
could not identify current directory: No such file or directory  
Password:  输入rep用户的密码  
WARNING:  skipping special file "./.s.PGSQL.1921"  
拷贝中  
cd /data01/pg/pg_root1925  
  
vi recovery.conf  
  
primary_conninfo = 'host=127.0.0.1 port=1921 user=rep password=REpliCa12343231_-1!'  
# 这里连接的是上游的从库,不是主库哦,注意。  
  
vi postgresql.conf  
port=1925  
  
pg_ctl start -D /data01/pg/pg_root1925  

在上游从库上查看,可以看到已经有1个级联从库了。

psql -h 127.0.0.1 -p 1921 -U postgres postgres  
  
postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_last_wal_receive_lsn(),write_lsn)) delay_wal_size,* from pg_stat_replication ;  
-[ RECORD 1 ]----+-----------------------------  
delay_wal_size   | 0 bytes  
pid              | 15513  
usesysid         | 206200  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 51988  
backend_start    | 2018-03-26 19:04:20.89889+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 5B/16000060  
write_lsn        | 5B/16000060  
flush_lsn        | 5B/16000060  
replay_lsn       | 5B/16000060  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 0  
sync_state       | async  

3个直接从库,一个级联从库。总共4个从库。现在压测很有喜感,但是写入依旧有80万行/s左右。

pgbench -M prepared -n -r -P 1 -f ./test.sql -h 127.0.0.1 -p 1923 -U postgres postgres -c 56 -j 56 -T 1200000    
  
  
progress: 74.0 s, 84.0 tps, lat 820.490 ms stddev 81.186  
progress: 75.0 s, 28.0 tps, lat 1006.275 ms stddev 63.322  
progress: 76.0 s, 64.0 tps, lat 965.826 ms stddev 212.833  
progress: 77.0 s, 83.0 tps, lat 903.694 ms stddev 146.248  
progress: 78.0 s, 56.0 tps, lat 803.509 ms stddev 98.570  
progress: 79.0 s, 63.0 tps, lat 937.920 ms stddev 129.824  
progress: 80.0 s, 93.0 tps, lat 825.556 ms stddev 183.147  
iotop    
  
Total DISK READ :       4.31 M/s | Total DISK WRITE :    1109.59 M/s  
Actual DISK READ:       4.31 M/s | Actual DISK WRITE:    1110.52 M/s  
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                
15436 be/4 postgres    0.00 B/s  153.79 M/s  0.00 % 62.86 % postgres: wal receiver process   streaming 5C/60000000  
15462 be/4 postgres    3.84 K/s  169.17 M/s  0.00 % 60.02 % postgres: wal receiver process   streaming 5C/60000000  
15269 be/4 postgres    7.69 K/s  169.17 M/s  0.00 % 57.47 % postgres: wal receiver process   streaming 5C/61000000  
15839 be/4 postgres    3.89 M/s   73.82 M/s  0.00 % 43.58 % postgres: autovacuum worker process   postgres          
15512 be/4 postgres    0.00 B/s  107.65 M/s  0.00 % 41.94 % postgres: wal receiver process   streaming 5C/5E000000  
  941 be/3 root        0.00 B/s  376.79 K/s  0.00 % 33.17 % [jbd2/vdc1-8]  
15498 be/4 root        0.00 B/s    0.00 B/s  0.00 % 28.72 % [kworker/u112:1]  
15567 be/4 postgres    0.00 B/s   19.97 M/s  0.00 % 15.37 % postgres: postgres postgres 127.0.0.1(45910) SELECT  
15432 be/4 postgres   73.05 K/s   74.87 M/s  0.00 % 14.88 % postgres: startup process   recovering 000000020000005C0000005B              
15458 be/4 postgres   19.22 K/s   83.71 M/s  0.00 %  9.03 % postgres: startup process   recovering 000000020000005C00000058  
15645 be/4 postgres    0.00 B/s   30.42 M/s  0.00 %  7.32 % postgres: postgres postgres 127.0.0.1(46004) SELECT              
15622 be/4 postgres   11.53 K/s 1907.00 K/s  0.00 %  6.81 % postgres: postgres postgres 127.0.0.1(45958) SELECT              
15639 be/4 postgres    3.84 K/s   18.34 M/s  0.00 %  6.70 % postgres: postgres postgres 127.0.0.1(45992) SELECT  
15631 be/4 postgres    3.84 K/s   10.09 M/s  0.00 %  6.47 % postgres: postgres postgres 127.0.0.1(45976) SELECT  
15577 be/4 postgres    7.69 K/s 1907.00 K/s  0.00 %  5.35 % postgres: postgres postgres 127.0.0.1(45920) SELECT              
15642 be/4 postgres    7.69 K/s 1899.32 K/s  0.00 %  5.20 % postgres: postgres postgres 127.0.0.1(45998) SELECT                          
15615 be/4 postgres    7.69 K/s 1514.84 K/s  0.00 %  5.11 % postgres: postgres postgres 127.0.0.1(45946) SELECT              
15207 be/4 postgres    3.84 K/s   66.49 M/s  0.00 %  4.48 % postgres: startup process   recovering 000000020000005C00000060  
15634 be/4 postgres    3.84 K/s 1891.63 K/s  0.00 %  1.39 % postgres: postgres postgres 127.0.0.1(45982) SELECT  
15632 be/4 postgres    0.00 B/s   13.79 M/s  0.00 %  1.38 % postgres: postgres postgres 127.0.0.1(45978) SELECT waiting for 5C/61E789D8  
15629 be/4 postgres    3.84 K/s  945.81 K/s  0.00 %  0.82 % postgres: postgres postgres 127.0.0.1(45972) SELECT  

16 特殊场景隔离性说明

在多副本模式下,客户端提交事务,收到COMMIT成功消息之前,其他会话能查到本地已提交的数据吗?

答案是,当然不可以(保证事务隔离性),在客户端收到COMMIT成功消息之前,其他会话是看不到这个会话在本地已提交的数据的。(即本地已提交,backend process在等待sync wal sender等待多个副本的反馈的接收WAL位点超过提交COMMIT WAL RECORD OFFSET的状态)

同时,对于本地已提交状态明确的(客户端收到COMMIT或CACNEL状态),其他会话就能看到这个事务本地提交的数据了。

1、会话A

postgres=# create table tt(id int);
CREATE TABLE
postgres=# insert into tt values (1);
INSERT 0 1
postgres=# insert into tt values (2);
^CCancel request sent
WARNING:  canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
INSERT 0 1
postgres=# insert into tt values (3);
^CCancel request sent
WARNING:  canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
INSERT 0 1
postgres=# \set VERBOSITY verbose
postgres=# insert into tt values (3);
^CCancel request sent
WARNING:  01000: canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
LOCATION:  SyncRepWaitForLSN, syncrep.c:266
INSERT 0 1
postgres=# begin;
BEGIN
postgres=# insert into tt values (4);
INSERT 0 1
postgres=# end;
^CCancel request sent
WARNING:  01000: canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
LOCATION:  SyncRepWaitForLSN, syncrep.c:266
COMMIT
postgres=# select pg_backend_pid;
ERROR:  42703: column "pg_backend_pid" does not exist
LINE 1: select pg_backend_pid;
               ^
LOCATION:  errorMissingColumn, parse_relation.c:3293
postgres=# select pg_backend_pid();
 pg_backend_pid 
----------------
          43467
(1 row)

postgres=# insert into tt values (100);
WARNING:  57P01: canceling the wait for synchronous replication and terminating connection due to administrator command
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
LOCATION:  SyncRepWaitForLSN, syncrep.c:249
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

postgres=# select pg_backend_pid();
 pg_backend_pid 
----------------
          43781
(1 row)

postgres=# insert into tt values (101);
WARNING:  01000: canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
LOCATION:  SyncRepWaitForLSN, syncrep.c:266
INSERT 0 1

2、会话B

postgres=# select * from txid_current_snapshot();
 txid_current_snapshot  
------------------------
 2482094559:2482094559:
(1 row)

postgres=# select pg_terminate_backend(43467);
 pg_terminate_backend 
----------------------
 t
(1 row)

postgres=# select * from tt;
 id  
-----
   1
   2
   3
   3
   4
 100
(6 rows)

postgres=# select pg_cancel_backend(43781);
 pg_cancel_backend 
-------------------
 t
(1 row)

postgres=# select * from tt;
 id  
-----
   1
   2
   3
   3
   4
 100
 101
(7 rows)

17 同步模式 VS 异步模式 性能对比

1、生成1亿TPC-B测试数据

pgbench -i -s 1000  

2、多副本同步模式

postgres=# show synchronous_commit ;  
 synchronous_commit   
--------------------  
 remote_write  
(1 row)  
  
postgres=# show synchronous_standby_names ;  
 synchronous_standby_names   
---------------------------  
 ANY 1 (*)  
(1 row)  

3、压测

pgbench -M prepared -n -r -P 1 -c 56 -j 56 -T 120  
  
transaction type: <builtin: TPC-B (sort of)>  
scaling factor: 1000  
query mode: prepared  
number of clients: 56  
number of threads: 56  
duration: 120 s  
number of transactions actually processed: 5381146  
latency average = 1.248 ms  
latency stddev = 4.118 ms  
tps = 44840.272218 (including connections establishing)  
tps = 44853.145859 (excluding connections establishing)  
script statistics:  
 - statement latencies in milliseconds:  
         0.002  \set aid random(1, 100000 * :scale)  
         0.001  \set bid random(1, 1 * :scale)  
         0.001  \set tid random(1, 10 * :scale)  
         0.001  \set delta random(-5000, 5000)  
         0.063  BEGIN;  
         0.171  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
         0.096  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
         0.109  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
         0.118  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
         0.090  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
         0.597  END;  

4、本地持久化,单副本模式

postgres@-> psql  
psql (10.3)  
Type "help" for help.  
  
postgres=# alter role postgres set synchronous_commit = local;  
ALTER ROLE  
postgres=# \q  
postgres@-> psql  
psql (10.3)  
Type "help" for help.  
  
postgres=# show synchronous_commit ;  
 synchronous_commit   
--------------------  
 local  
(1 row)  

5、压测

pgbench -M prepared -n -r -P 1 -c 56 -j 56 -T 120  
  
transaction type: <builtin: TPC-B (sort of)>  
scaling factor: 1000  
query mode: prepared  
number of clients: 56  
number of threads: 56  
duration: 120 s  
number of transactions actually processed: 6684833  
latency average = 1.005 ms  
latency stddev = 1.494 ms  
tps = 55695.800213 (including connections establishing)  
tps = 55700.524316 (excluding connections establishing)  
script statistics:  
 - statement latencies in milliseconds:  
         0.002  \set aid random(1, 100000 * :scale)  
         0.001  \set bid random(1, 1 * :scale)  
         0.001  \set tid random(1, 10 * :scale)  
         0.001  \set delta random(-5000, 5000)  
         0.061  BEGIN;  
         0.133  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
         0.096  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
         0.108  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
         0.114  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
         0.091  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
         0.400  END;  

OLTP写场景本地持久化 VS 多副本性能对比:

1、多副本模式,RT 提高了0.24毫秒,约20%

2、多副本模式,TPS 下降了10855,约20%

小结

PostgreSQL的多副本复制非常的简单并且健壮。

流复制有几个特点

1、延迟及低,毫不畏惧大事务

2、支持断点续传

3、支持多副本

4、配置简单,看本文

5、备库与主库物理完全一致,并支持只读

6、支持多级级联从库

许多用户会使用流复制来搭建只读备库,容灾,备份节点,HA等。

本文主要介绍了PostgreSQL一主多从的配置,以及多副本强同步的配置。

如果想使用多副本实现“0数据丢失、高可用”方案,可以先实现内核层冻结receiver进程,这样可以更加容易的实现新主的选举。(否则需要使用pg_waldump来观察位点是谁的最新)。然后就可以愉快的切换到最新位点的从库,确保切换的0丢失。

同时为了保证不出现脑裂,需要控制住了大于或等于(一会PG中参数指定“同步备库数 - 同步副本数 + 1”)个节点,再选择新的主节点(选出它们中WAL位点最新的作为新主节点),则绝对不可能出现脑裂的问题,也不会丢失数据。

至于使用智能DNS还是PROXY,又或者是VIP来实现业务端透明,根据你的数据库所在的环境来决定,哪个方便用哪个。VIP或者DNS的话性能应该是最好的,因为应用程序到达数据库的跳数最少。

参考

你可能还对如下文档感兴趣

《PostgreSQL 10 on ECS 实施 流复制备库镜像+自动快照备份+自动备份验证+自动清理备份与归档》

(它也可以代替pg_rewind来修复时间线问题,对于特别庞大的数据库非常有效。)

《PostgreSQL 10 + PostGIS + Sharding(pg_pathman) + MySQL(fdw外部表) on ECS 部署指南(适合新用户)》

《PostgreSQL on Linux 最佳部署手册》

《PostgreSQL - 鱼与熊掌可兼得 - 多副本0丢失与高性能兼得 - 事务级异步、同步开关》

《PG多节点(quorum based), 0丢失 HA(failover,switchover)方案》

《PostgreSQL 同步流复制(高并发写入)锁瓶颈分析》

《PostgreSQL 9.6 同步多副本 与 remote_apply事务同步级别》

《PostgreSQL 同步流复制原理和代码浅析》

《异步流复制模式如何保证不丢数据?》

《PostgreSQL 流复制延迟的测试方法》

《PostgreSQL 小改动,解决流复制遇到的pg_xlog已删除的问题(主库wal sender读归档目录文件发送给wal sender)》

Flag Counter

digoal’s 大量PostgreSQL文章入口