为什么PostgreSQL流复制备库在启动时不会立即启动wal receiver - startup慢导致需要的WAL在上游已删除

1 minute read

背景

PostgreSQL在启动时，如果处于恢复状态，可能不会立即启动wal receiver去问上游主库要WAL日志。

原因是什么呢？

standby获取WAL的流程

《PostgreSQL 物理恢复时查找wal record的顺序 - loop(pg_wal, restore_command, stream)》

src/backend/access/transam/xlog.c

/*  
 * Codes indicating where we got a WAL file from during recovery, or where  
 * to attempt to get one.  
 */  
typedef enum  
{  
        XLOG_FROM_ANY = 0,                      /* request to read WAL from any source */  
        XLOG_FROM_ARCHIVE,                      /* restored using restore_command */  
        XLOG_FROM_PG_WAL,                       /* existing file in pg_wal */  
        XLOG_FROM_STREAM                        /* streamed from master */  
} XLogSource;  
  
..............................  
  
        /*-------  
         * Standby mode is implemented by a state machine:  
         *  
         * 1. Read from either archive or pg_wal (XLOG_FROM_ARCHIVE), or just  
         *        pg_wal (XLOG_FROM_PG_WAL)  
         * 2. Check trigger file  
         * 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)  
         * 4. Rescan timelines  
         * 5. Sleep wal_retrieve_retry_interval milliseconds, and loop back to 1.  
         *  
         * Failure to read from the current source advances the state machine to  
         * the next state.  
         *  
         * 'currentSource' indicates the current state. There are no currentSource  
         * values for "check trigger", "rescan timelines", and "sleep" states,  
         * those actions are taken when reading from the previous source fails, as  
         * part of advancing to the next state.  
         *-------  
         */  

首先从archive拉取，或者直接读pg_wal目录。

实际上只要本地有没有APPLY完的WAL，apply源头就不会跳到XLOG_FROM_STREAM这个模块中。

问题

如果数据库启动时, startup进程(恢复过程)耗时很久，那么wal receiver就一直不会启动，那么如果这个时间区间，主库产生了很多WAL，老的WAL可能会被ROTATE（清理）掉，导致备库wal receiver启动时，已经无法从主库获取到需要的wal .

为了解决这个问题，可以选择以下任意方法：

1、内核改进，wal receiver尽快启动，从上游拉取需要的日志，而不是等待startup把本地所有wal apply完才启动。

2、配置restore_command，从归档拉取。

3、在部署系统时，另外开启一个pg_receivewal进程，持续拉取wal，同时配置restore_command从pg_receivewal的写入目标路径拉取。

4、上游配置wal_keep_segments，配置足够大，防止这样的问题产生。

参考

src/backend/access/transam/xlog.c

《PostgreSQL 物理恢复时查找wal record的顺序 - loop(pg_wal, restore_command, stream)》

digoal’s 大量PostgreSQL文章入口

Twitter Facebook Google+ LinkedIn

Digoal.zhou

为什么PostgreSQL流复制备库在启动时不会立即启动wal receiver - startup慢导致需要的WAL在上游已删除

背景

standby获取WAL的流程

问题

参考

digoal’s 大量PostgreSQL文章入口

You May Also Enjoy

PostgreSQL(PPAS 兼容Oracle) 从零开始入门手册 - 珍藏版

PostgreSQL pipelinedb 流计算插件 - IoT应用 - 实时轨迹聚合

PostgreSQL plpgsql 存储过程、函数 - 状态、异常变量打印、异常捕获… - GET [STACKED] DIAGNOSTICS

PostgreSQL datediff 日期间隔（单位转换）兼容SQL用法