PostgreSQL 小改动,解决流复制遇到的pg_xlog已删除的问题(主库wal sender读归档目录文件发送给wal sender)
背景
PostgreSQL 通过流复制搭建STANDBY,在某些特定的场景可能会遭遇standby需要的xlog文件已经从master删除的情况。
例如
1. 备库写入XLOG的速度比主库产生XLOG的速度慢。
2. 网络传输速度比产生XLOG的速度慢。
3. 备库故障,中断时间过长。
4. 网络故障,中断时间过长。
目前的解决办法
1. 归档
主库对xlog文件进行归档,当备库需要的文件已经不在pg_xlog目录中时,备库可以通过配置restore_command,从主库的归档来获取需要的xlog文件。这种方法比较通用。
缺点
备库需要有获取归档的通道,例如能访问归档存储,或者能通过ftp, http服务获取到归档文件。如果归档是在磁带库的,要有从磁带库获取归档的路径。
2. slot
slot会记录下备库已经同步到XLOG的什么位置的信息,所以主库不会删除备库还需要的xlog文件。
缺点
因为主库不能删除未发送的XLOG,那么如果备库落后的太多,可能导致主库的空间爆满。
3. wal keep segments
设置主库最多保留的XLOG文件个数,不能彻底解决问题。
另类的解决办法,其实通过归档来解决是比较好的办法,但是有一个空缺,就是备库可能无法访问归档存储或归档服务。
这种情况下,有什么好的方法来解决呢?
首先我们要了解一个问题,主库一定是能访问归档的,或者说概率非常大。
我们以主库能访问归档为前提,当备库需要的XLOG文件已经被删除时,主库主动去归档获取归档文件,并通过流复制发送给备库。
已有的代码:
当BasicOpenFile失败时,表示$PGDATA/pg_xlog中不存在备库请求的xlog信息,返回错误信息requested WAL segment %s has already been removed”。
src/backend/replication/walsender.c
static void
XLogRead(char *buf, XLogRecPtr startptr, Size count)
{
....
XLogFilePath(path, curFileTimeLine, sendSegNo);
sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
if (sendFile < 0)
{
/*
* If the file is not found, assume it's because the standby
* asked for a too old WAL segment that has already been
* removed or recycled.
*/
if (errno == ENOENT)
ereport(ERROR,
(errcode_for_file_access(),
errmsg("requested WAL segment %s has already been removed",
XLogFileNameP(curFileTimeLine, sendSegNo))));
else
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not open file \"%s\": %m",
path)));
}
sendOff = 0;
我们假设归档目录是一个分布式文件系统例如lustre, moosefs或者其他。
归档命令是
archive_command = 'test ! -f /home/digoal/arch/%f && cp %p /home/digoal/arch/%f'
我们需要新增一个参数, 这个参数是主库的参数。
xlog_search_dir
通过它告诉主库归档目录在哪里。本文的归档目录是/home/digoal/arch
代码如下:
src/include/replication/walsender.h
extern PGDLLIMPORT char *xlog_search_dir;
src/backend/utils/misc/guc.c
static struct config_string ConfigureNamesString[] =
{
{
{"archive_command", PGC_SIGHUP, WAL_ARCHIVING,
gettext_noop("Sets the shell command that will be called to archive a WAL file."),
NULL
},
&XLogArchiveCommand,
"",
NULL, NULL, show_archive_command
},
{
{"xlog_search_dir", PGC_SIGHUP, WAL_ARCHIVING,
gettext_noop("Sets the dir by walsender search XLOG from archive dir, when xlog cleaned in pg_xlog file which walreceiver requesting."),
NULL
},
&xlog_search_dir,
"",
NULL, NULL, NULL
},
另外还需要调整walsender.c的代码:
src/include/replication/walsender.h
extern PGDLLIMPORT char *xlog_search_dir;
src/backend/replication/walsender.c
char * xlog_search_dir = NULL;
static void
XLogRead(char *buf, XLogRecPtr startptr, Size count)
{
...
if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
{
char path[MAXPGPATH];
...
sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
if (sendFile < 0)
{
if (xlog_search_dir != NULL && xlog_search_dir[0] != '\0')
{
snprintf(path, MAXPGPATH, "%s/%08X%08X%08X", xlog_search_dir, curFileTimeLine,
(uint32) ((sendSegNo) / XLogSegmentsPerXLogId),
(uint32) ((sendSegNo) % XLogSegmentsPerXLogId));
sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
if (sendFile <0)
{
/*
* If the file is not found, assume it's because the standby
* asked for a too old WAL segment that has already been
* removed or recycled.
*/
if (errno == ENOENT)
ereport(ERROR,
(errcode_for_file_access(),
errmsg("requested WAL segment %s has already been removed",
XLogFileNameP(curFileTimeLine, sendSegNo))));
else
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not open file \"%s\": %m",
path)));
}
}
else
{
/*
* If the file is not found, assume it's because the standby
* asked for a too old WAL segment that has already been
* removed or recycled.
*/
if (errno == ENOENT)
ereport(ERROR,
(errcode_for_file_access(),
errmsg("requested WAL segment %s has already been removed",
XLogFileNameP(curFileTimeLine, sendSegNo))));
else
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not open file \"%s\": %m",
path)));
}
}
sendOff = 0;
}
...
重新编译,并在postgresql.conf中配置
xlog_search_dir='/home/digoal/arch'
现在,把备库需要的已归档的XLOG从主库的pg_xlog目录删除,也不会报requested WAL segment %s has already been removed”了。