为什么PostgreSQL流复制备库在启动时不会立即启动wal receiver - startup慢导致需要的WAL在上游已删除
背景
PostgreSQL在启动时,如果处于恢复状态,可能不会立即启动wal receiver去问上游主库要WAL日志。
原因是什么呢?
standby获取WAL的流程
《PostgreSQL 物理恢复时查找wal record的顺序 - loop(pg_wal, restore_command, stream)》
src/backend/access/transam/xlog.c
/*
* Codes indicating where we got a WAL file from during recovery, or where
* to attempt to get one.
*/
typedef enum
{
XLOG_FROM_ANY = 0, /* request to read WAL from any source */
XLOG_FROM_ARCHIVE, /* restored using restore_command */
XLOG_FROM_PG_WAL, /* existing file in pg_wal */
XLOG_FROM_STREAM /* streamed from master */
} XLogSource;
..............................
/*-------
* Standby mode is implemented by a state machine:
*
* 1. Read from either archive or pg_wal (XLOG_FROM_ARCHIVE), or just
* pg_wal (XLOG_FROM_PG_WAL)
* 2. Check trigger file
* 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)
* 4. Rescan timelines
* 5. Sleep wal_retrieve_retry_interval milliseconds, and loop back to 1.
*
* Failure to read from the current source advances the state machine to
* the next state.
*
* 'currentSource' indicates the current state. There are no currentSource
* values for "check trigger", "rescan timelines", and "sleep" states,
* those actions are taken when reading from the previous source fails, as
* part of advancing to the next state.
*-------
*/
首先从archive拉取,或者直接读pg_wal目录。
实际上只要本地有没有APPLY完的WAL,apply源头就不会跳到XLOG_FROM_STREAM这个模块中。
问题
如果数据库启动时, startup进程(恢复过程)耗时很久,那么wal receiver就一直不会启动,那么如果这个时间区间,主库产生了很多WAL,老的WAL可能会被ROTATE(清理)掉,导致备库wal receiver启动时,已经无法从主库获取到需要的wal .
为了解决这个问题,可以选择以下任意方法:
1、内核改进,wal receiver尽快启动,从上游拉取需要的日志,而不是等待startup把本地所有wal apply完才启动。
2、配置restore_command,从归档拉取。
3、在部署系统时,另外开启一个pg_receivewal进程,持续拉取wal,同时配置restore_command从pg_receivewal的写入目标路径拉取。
4、上游配置wal_keep_segments,配置足够大,防止这样的问题产生。
参考
src/backend/access/transam/xlog.c
《PostgreSQL 物理恢复时查找wal record的顺序 - loop(pg_wal, restore_command, stream)》