PostgreSQL 10 流式物理、逻辑主从 最佳实践

6 minute read

背景

流复制起源

PostgreSQL 自从2010年推出的9.0版本开始,支持流式物理复制,用户可以通过流式复制,构建只读备库(主备物理复制,块级别一致)。流式物理复制可以做到极低的延迟(通常在1毫秒以内)。

同步流复制

2011年推出的9.1版本,支持同步复制,当时只支持一个同步流复制备节点(例如配置了3个备,只有一个是同步模式的,其他都是异步模式)。

在同步复制模式下,当用户提交事务时,需要等待这笔事务的WAL日志复制到同步流复制备节点,才会返回提交成功的ACK给客户端。

同步模式下,可以确保数据的0丢失。(只要客户端收到了事务提交成功的ACK,这笔事务的WAL就有两份。)

级联流复制

2012年推出的9.2版本,支持级联流复制。意思是备库还可以再连备库。

级联复制特别适合跨机房的使用,例如主库在A机房,备库在B机房,但是B机房需要建立多个备库时,那么B机房只需要建立一个直连主库的备库,其他的备库可以通过B机房的这个备库级联产生。从而减少网络开销。

流式虚拟备库

2012年推出的9.2版本,除了支持级联复制,还支持虚拟备库,什么是虚拟备库呢?就是只有WAL,没有数据文件的备库。

通过虚拟备库,可以流式的接收WAL,进行实时的流式WAL归档。提高备份或归档的实时性。

逻辑复制的基础

2014年推出的9.4版本,在WAL中增加了逻辑复制需要的基础信息,通过插件,可以实现逻辑复制。

逻辑复制可以做到对主库的部分复制,例如表级复制,而不是整个集群的块级一致复制。

逻辑复制的备库不仅仅是只读的,也可以执行写操作。

增加几种同步级别

2016年推出的9.6版本,PG的流式复制,通过复制WAL达到同步的目的,因此同步级别也和WAL有关。通过synchronous_commit参数,可以配置事务的同步级别。

1、on, 表示本地WAL fsync,同步standby WAL fsync。即两份持久化的WAL。

2、remote_apply, 表示本地WAL fsync,同步standby WAL 已恢复。这个带来的RT最高。

3、remote_write, 表示本地WAL fsync,同步standby WAL 异步write完成。一份持久化,备库的WAL可能还在OS CACHE中。

4、local, 表示本地WAL fsync。

5、off, 表示本地WAL写到wal buffer中即返回客户端事务提交成功的ACK,为异步提交(数据库CRASH可能导致事务丢失,但不会导致数据库不一致)。

RT影响,从低到高如下:

off, local, remote_write, on, remote_apply。

流式备份压缩

2017年推出的10版本,pg_basebackup, pg_receivewal支持流式压缩备份WAL。

quorum based 同步流复制

2017年推出的10版本,支持quorum based的同步流复制,例如有3个备节点,你可以告诉主库,这个事务需要3份WAL副本,那么主库需要等待至少2个备节点已将WAL同步过去的反馈,才会将事务提交成功的ACK返回给客户端。

quorum based同步流复制,结合raft协议,可以实现零数据丢失的高可用、高可靠架构。

内置逻辑订阅、复制

2017年推出的10版本,内置了逻辑订阅的功能。

多master

2017年推出的10版本,通过逻辑订阅的功能,可以实现多Master架构。

一、流式 物理主从 最佳实践

以一主两从,quorum based 同步(一副本)为例,介绍PG 10的主从最佳实践。

环境

三台机器(假设主机有64G内存),同一局域网,相互网络互通(至少数据库监听端口应该互通)。

Linux CentOS 7.x x64

HOSTA : 监听端口1921

HOSTB : 监听端口1921

HOSTC : 监听端口1921

(本文使用一台物理机来模拟,IP为127.0.0.1,端口分别为2921,2922,2923,读者请根据实际环境修改)

软件安装略

请参考

《PostgreSQL on Linux 最佳部署手册》

初始化主库

mkdir /disk1/digoal/pgdata2921  
  
chown digoal /disk1/digoal/pgdata2921  
  
initdb -D /disk1/digoal/pgdata2921 -E UTF8 --locale=C -U postgres  

配置postgresql.conf

cd /disk1/digoal/pgdata2921  
  
vi postgresql.conf  
  
listen_addresses = '0.0.0.0'  
port = 2921  
max_connections = 1000  
unix_socket_directories = '.'  
tcp_keepalives_idle = 60  
tcp_keepalives_interval = 10  
tcp_keepalives_count = 10  
shared_buffers = 8GB  
maintenance_work_mem = 1GB  
dynamic_shared_memory_type = posix  
vacuum_cost_delay = 0  
bgwriter_delay = 10ms  
bgwriter_lru_maxpages = 500  
bgwriter_lru_multiplier = 5.0  
bgwriter_flush_after = 0  
effective_io_concurrency = 0  
max_worker_processes = 16  
backend_flush_after = 0  
wal_level = replica  
fsync = on  
synchronous_commit = remote_write  
full_page_writes = on  
wal_buffers = 128MB  
wal_writer_delay = 10ms  
wal_writer_flush_after = 0  
checkpoint_timeout = 30min  
max_wal_size = 16GB  
min_wal_size = 8GB  
checkpoint_completion_target = 0.5  
checkpoint_flush_after = 0  
max_wal_senders = 10  
wal_keep_segments = 1024  
synchronous_standby_names = 'ANY 1 (*)'  
hot_standby = on  
max_standby_archive_delay = 300s  
max_standby_streaming_delay = 300s  
wal_receiver_status_interval = 1s  
hot_standby_feedback = off  
log_destination = 'csvlog'  
logging_collector = on  
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose      
log_timezone = 'PRC'  
log_autovacuum_min_duration = 0  
autovacuum_vacuum_scale_factor = 0.1  
autovacuum_analyze_scale_factor = 0.1  
autovacuum_freeze_max_age = 1000000000  
autovacuum_multixact_freeze_max_age = 1200000000  
autovacuum_vacuum_cost_delay = 0  
autovacuum_vacuum_cost_limit = 0  
vacuum_freeze_table_age = 800000000  
vacuum_multixact_freeze_table_age = 800000000  
datestyle = 'iso, mdy'  
timezone = 'PRC'  
lc_messages = 'C'  
lc_monetary = 'C'  
lc_numeric = 'C'  
lc_time = 'C'  
default_text_search_config = 'pg_catalog.english'  

配置pg_hba.conf

cd /disk1/digoal/pgdata2921  
  
vi pg_hba.conf  
  
# "local" is for Unix domain socket connections only  
local   all             all                                     trust  
# IPv4 local connections:  
host    all             all             127.0.0.1/32            trust  
# IPv6 local connections:  
host    all             all             ::1/128                 trust  
# Allow replication connections from localhost, by a user with the  
# replication privilege.  
local   replication     all                                     trust  
host    replication     all             127.0.0.1/32            trust  
host    replication     all             ::1/128                 trust  
host replication all 0.0.0.0/0 md5  

配置recovery.done

cd /disk1/digoal/pgdata2921  
  
vi recovery.done  
  
recovery_target_timeline = 'latest'  
standby_mode = on  
primary_conninfo = 'host=127.0.0.1 port=2921 user=rep password=pwd'  
# recovery_min_apply_delay = 0   #延迟多少分钟应用,用户可以配置延迟的备库,例如给一点误操作的缓冲时间。在备库不会这么早被应用。  

启动主库

pg_ctl start -D /disk1/digoal/pgdata2921  

创建流复制角色

psql -h 127.0.0.1 -p 2921  
  
psql (10beta1)  
Type "help" for help.  
  
postgres=# set synchronous_commit =off;  
SET  
postgres=# create role rep login replication encrypted password 'pwd';  
CREATE ROLE  

生成备库1

mkdir /disk1/digoal/pgdata2922  
chown digoal /disk1/digoal/pgdata2922  
chmod 700 /disk1/digoal/pgdata2922  
  
export PGPASSWORD="pwd"  
pg_basebackup -D /disk1/digoal/pgdata2922 -F p -X stream -h 127.0.0.1 -p 2921 -U rep  

配置备库1 postgresql.conf

cd /disk1/digoal/pgdata2922  
  
vi postgresql.conf  
  
port = 2922  

配置备库1 recovery.conf

cd /disk1/digoal/pgdata2922  
  
mv recovery.done recovery.conf  

启动备库1

pg_ctl start -D /disk1/digoal/pgdata2922  

生成备库2

mkdir /disk1/digoal/pgdata2923  
chown digoal /disk1/digoal/pgdata2923  
chmod 700 /disk1/digoal/pgdata2923  
  
export PGPASSWORD="pwd"  
pg_basebackup -D /disk1/digoal/pgdata2923 -F p -X stream -h 127.0.0.1 -p 2921 -U rep  

配置备库2 postgresql.conf

cd /disk1/digoal/pgdata2923  
  
vi postgresql.conf  
  
port = 2923  

配置备库2 recovery.conf

cd /disk1/digoal/pgdata2923  
  
mv recovery.done recovery.conf  

启动备库2

pg_ctl start -D /disk1/digoal/pgdata2923  

流复制节点的状态监控

主库查询

psql -h 127.0.0.1 -p 2921  
psql (10beta1)  
Type "help" for help.  
  
postgres=# \x  
Expanded display is on.  
  
postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) as sent_delay,   
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn)) as write_delay,   
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) as flush_delay,   
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as replay_delay,   
  *  
from pg_stat_replication;  
  
-[ RECORD 1 ]----+------------------------------  
sent_delay       | 0 bytes  
write_delay      | 0 bytes  
flush_delay      | 0 bytes  
replay_delay     | 0 bytes  
pid              | 11962  
usesysid         | 16384  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 63083  
backend_start    | 2017-07-11 17:15:31.231492+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 1/C0000060  
write_lsn        | 1/C0000060  
flush_lsn        | 1/C0000060  
replay_lsn       | 1/C0000060  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 1  
sync_state       | quorum  
-[ RECORD 2 ]----+------------------------------  
sent_delay       | 0 bytes  
write_delay      | 0 bytes  
flush_delay      | 0 bytes  
replay_delay     | 0 bytes  
pid              | 11350  
usesysid         | 16384  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 63077  
backend_start    | 2017-07-11 17:15:13.818043+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 1/C0000060  
write_lsn        | 1/C0000060  
flush_lsn        | 1/C0000060  
replay_lsn       | 1/C0000060  
write_lag        |   
flush_lag        |   
replay_lag       |   
sync_priority    | 1  
sync_state       | quorum  

备库查询

psql -h 127.0.0.1 -p 2922  
  
-- 查看当前WAL应用是否暂停了  
postgres=# select pg_is_wal_replay_paused();  
 pg_is_wal_replay_paused   
-------------------------  
 f  
(1 row)  
  
-- 查看WAL接收到的位点  
postgres=# select pg_last_wal_receive_lsn();  
 pg_last_wal_receive_lsn   
-------------------------  
 1/C0000060  
(1 row)  
  
-- 查看WAL的应用位点  
postgres=# select pg_last_wal_replay_lsn();  
 pg_last_wal_replay_lsn   
------------------------  
 1/C0000060  
(1 row)  
  
-- 查看wal receiver的统计信息  
postgres=# \x  
Expanded display is on.  
postgres=# select * from pg_stat_get_wal_receiver();  
-[ RECORD 1 ]---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------  
pid                   | 11349  
status                | streaming  
receive_start_lsn     | 1/C0000000  
receive_start_tli     | 1  
received_lsn          | 1/C0000060  
received_tli          | 1  
last_msg_send_time    | 2017-07-11 17:23:14.372327+08  
last_msg_receipt_time | 2017-07-11 17:23:14.372361+08  
latest_end_lsn        | 1/C0000060  
latest_end_time       | 2017-07-11 17:15:13.819553+08  
slot_name             |   
conninfo              | user=rep password=******** dbname=replication host=127.0.0.1 port=2921 fallback_application_name=walreceiver sslmode=disable sslcompression=1 target_session_attrs=any  
  
  
postgres=# select pg_wal_replay_pause();  
-[ RECORD 1 ]-------+-  
pg_wal_replay_pause |   
  
-- 暂停WAL的应用,例如要做一些排错时  
postgres=# select pg_is_wal_replay_paused();  
-[ RECORD 1 ]-----------+--  
pg_is_wal_replay_paused | t  
  
postgres=# select pg_wal_replay_resume();  
-[ RECORD 1 ]--------+-  
pg_wal_replay_resume |   
  
-- 继续应用WAL  
postgres=# select pg_is_wal_replay_paused();  
-[ RECORD 1 ]-----------+--  
pg_is_wal_replay_paused | f  

注意事项

1、如果要防止主库删除备库还没有接收的WAL文件,有两种方法。

使用slot,或者配置足够大的wal keep。

但是这两种方法都有一定的风险或问题,例如当备库挂了,或者备库不再使用了,而用户忘记删除对应的SLOT时。可能导致主库WAL无限膨胀。

而wal keep则会导致主库的WAL预留足够的个数,占用一定空间。

相关参数

主 postgresql.conf  
# max_replication_slots = 10  
# wal_keep_segments = 1024  
  
备 recovery.conf  
# primary_slot_name = ''  

2、如果不想通过以上方法预防备库需要的WAL已被删除,那么可以配置主库的归档,同时备库需要能获取到已归档的WAL文件。

相关参数

主 postgresql.conf  
#archive_mode = off             # enables archiving; off, on, or always  
                                # (change requires restart)  
#archive_command = ''           # command to use to archive a logfile segment  
                                # placeholders: %p = path of file to archive  
                                #               %f = file name only  
                                # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'  
  
备 recovery.conf  
# restore_command = ''           # e.g. 'cp /mnt/server/archivedir/%f %p'  

3、保护好recovery.conf文件中的密码,因为配置的是明文。

4、主备之间的带宽请足够大,否则可能导致主备延迟。

压测

连接主库进行TPC-B的压测

pgbench -i -s 100 -h 127.0.0.1 -p 2921 -U postgres  
  
pgbench -n -r -P 1 -h 127.0.0.1 -p 2921 -U postgres -c 32 -j 32 -T 120  

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 1326066
latency average = 2.896 ms
latency stddev = 2.030 ms
tps = 11050.199659 (including connections establishing)
tps = 11051.140876 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.002  \set aid random(1, 100000 * :scale)
         0.001  \set bid random(1, 1 * :scale)
         0.001  \set tid random(1, 10 * :scale)
         0.000  \set delta random(-5000, 5000)
         0.043  BEGIN;
         0.154  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
         0.112  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
         0.159  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
         0.423  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
         0.092  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
         1.910  END;

观察主备的延迟

psql -h 127.0.0.1 -p 2921  
psql (10beta1)  
Type "help" for help.  
  
postgres=# \x  
Expanded display is on.  
postgres=# select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) as sent_delay,   
postgres-#   pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn)) as write_delay,   
postgres-#   pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) as flush_delay,   
postgres-#   pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as replay_delay,   
postgres-#   *  
postgres-# from pg_stat_replication;  
-[ RECORD 1 ]----+------------------------------  
sent_delay       | 4024 bytes  
write_delay      | 4024 bytes  
flush_delay      | 9080 bytes  
replay_delay     | 13 kB  
pid              | 11962  
usesysid         | 16384  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 63083  
backend_start    | 2017-07-11 17:15:31.231492+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 2/1C61E398  
write_lsn        | 2/1C61E398  
flush_lsn        | 2/1C61CFD8  
replay_lsn       | 2/1C61BEF8  
write_lag        | 00:00:00.000129  
flush_lag        | 00:00:00.001106  
replay_lag       | 00:00:00.001626  
sync_priority    | 1  
sync_state       | quorum  
-[ RECORD 2 ]----+------------------------------  
sent_delay       | 0 bytes  
write_delay      | 4024 bytes  
flush_delay      | 9080 bytes  
replay_delay     | 12 kB  
pid              | 11350  
usesysid         | 16384  
usename          | rep  
application_name | walreceiver  
client_addr      | 127.0.0.1  
client_hostname  |   
client_port      | 63077  
backend_start    | 2017-07-11 17:15:13.818043+08  
backend_xmin     |   
state            | streaming  
sent_lsn         | 2/1C61F350  
write_lsn        | 2/1C61E398  
flush_lsn        | 2/1C61CFD8  
replay_lsn       | 2/1C61C388  
write_lag        | 00:00:00.000542  
flush_lag        | 00:00:00.001582  
replay_lag       | 00:00:00.001952  
sync_priority    | 1  
sync_state       | quorum  
  
postgres=# \watch 1  

二、流式 逻辑订阅、逻辑主从 最佳实践

《PostgreSQL 逻辑订阅 - 给业务架构带来了什么希望?》

《PostgreSQL 10.0 preview 逻辑复制 - 原理与最佳实践》

《使用PostgreSQL逻辑订阅实现multi-master》

Flag Counter

digoal’s 大量PostgreSQL文章入口