异版本pg_resetxlog后导致的控制文件差异问题处理
背景
数据库的redo日志损坏时,或者控制文件损坏时,可能导致数据库无法启动。
如果存放pg_xlog或者pg_control文件的块设备遇到问题,可能引发这种情况。
遇到xlog或者控制文件损坏的时候,怎么处理呢?
数据库正常关闭时会写控制文件,redo是在数据库crash后需要用来恢复数据库的,如果数据库正常的关闭,实际上不需要从redo恢复。
PostgreSQL 提供了一个工具,用来生成或改写控制文件,抹除指定的pg_xlog。
在数据库因为控制文件损坏,或者pg_xlog损坏,导致数据库不能正常启动时使用。
使用后,数据库起来之后,请务必逻辑导出后再导入一个新的集群。
导出时最好设置跳过错误的块(设置zero_damaged_pages=true),因为这种情况下十有八九块会损坏。
同时,由于pg_resetxlog会改写控制文件,如果你使用的pg_resetxlog和数据库的版本不一致,会导致生成的控制文件版本跟随pg_resetxlog的版本,导致后面一系列的问题。
控制文件的内容
查看控制文件的头文件,可以了解到控制文件的内容定义
src/include/catalog/pg_control.h
控制文件的版本
/* Version identifier for this pg_control format */
#define PG_CONTROL_VERSION 942
控制文件的检查点信息
/*
* Body of CheckPoint XLOG records. This is declared here because we keep
* a copy of the latest one in pg_control for possible disaster recovery.
* Changing this struct requires a PG_CONTROL_VERSION bump.
*/
typedef struct CheckPoint
{
XLogRecPtr redo; /* next RecPtr available when we began to
* create CheckPoint (i.e. REDO start point) */
TimeLineID ThisTimeLineID; /* current TLI */
TimeLineID PrevTimeLineID; /* previous TLI, if this record begins a new
* timeline (equals ThisTimeLineID otherwise) */
bool fullPageWrites; /* current full_page_writes */
uint32 nextXidEpoch; /* higher-order bits of nextXid */
TransactionId nextXid; /* next free XID */
Oid nextOid; /* next free OID */
MultiXactId nextMulti; /* next free MultiXactId */
MultiXactOffset nextMultiOffset; /* next free MultiXact offset */
TransactionId oldestXid; /* cluster-wide minimum datfrozenxid */
Oid oldestXidDB; /* database with minimum datfrozenxid */
MultiXactId oldestMulti; /* cluster-wide minimum datminmxid */
Oid oldestMultiDB; /* database with minimum datminmxid */
pg_time_t time; /* time stamp of checkpoint */
TransactionId oldestCommitTsXid; /* oldest Xid with valid commit
* timestamp */
TransactionId newestCommitTsXid; /* newest Xid with valid commit
* timestamp */
/*
* Oldest XID still running. This is only needed to initialize hot standby
* mode from an online checkpoint, so we only bother calculating this for
* online checkpoints and only when wal_level is hot_standby. Otherwise
* it's set to InvalidTransactionId.
*/
TransactionId oldestActiveXid;
} CheckPoint;
控制文件数据
/*
* Contents of pg_control.
*
* NOTE: try to keep this under 512 bytes so that it will fit on one physical
* sector of typical disk drives. This reduces the odds of corruption due to
* power failure midway through a write.
*/
typedef struct ControlFileData
{
...
} ControlFileData;
通过pg_controldata命令则可以输出控制文件的内容,源码如下。
src/bin/pg_controldata/pg_controldata.c
例子
$ export PGDATA=/home/digoal/pgdata
$ pg_controldata
pg_control version number: 942
Catalog version number: 201510051
Database system identifier: 6318621837015461309
Database cluster state: in production
pg_control last modified: Sun 14 Aug 2016 06:54:51 PM CST
Latest checkpoint location: 42/E20000E0
Prior checkpoint location: 42/E2000028
Latest checkpoint's REDO location: 42/E20000E0
Latest checkpoint's REDO WAL file: 0000000100000042000000E2
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/3896508417
Latest checkpoint's NextOID: 10000
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 3800764117
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 13294
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Sun 14 Aug 2016 06:54:51 PM CST
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: minimal
wal_log_hints setting: off
max_connections setting: 100
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
pg_resetxlog 与数据库集群版本不一致时会怎样
如果数据库集群时9.5,而你使用其他版本的pg_resetxlog修改其控制文件,结果会怎样呢?
例如使用9.2的pg_resetxlog设置9.5集群的pg_control文件。
/data/temp/pgdata/pgsql9.2/bin/pg_resetxlog -D /data/temp/pgdata/main
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
Guessed pg_control values:
如果你使用-f选项强制刷了这个pg_control文件,则版本号会变成9.2的,这个时候,你再使用9.5去启动数据库是会失败的。
那么使用9.2就能启动成功了吗?
当然也不行,因为数据库还有其他地方记录了版本号,那就是在$PGDATA以及数据库的数据文件目录对应的PG_VERSION文件。
这个文件是不会被pg_resetxlog纂改的,因为可以用来追溯真正的版本。
cat PG_VERSION
9.5
被不同版本的pg_resetxlog强制重建控制文件后,数据库将无法启动。
怎么办呢?
方法很简单,使用PG_VERSION对应版本的pg_resetxlog重新生成一遍控制文件即可。
pg_resetxlog 的参数计算方法
pg_resetxlog参数有几个,分别用来设置控制文件中对应的next xid , next oid, next multi-xact xid, ……
这些值有固定的算法,而且有安全范围,例如,不能设置比实际已创建事务更小的事务号,否则会造成对应事务产生的数据在事务号未消耗前数据不可见。
安全值的计算方法,可以参考pg_resetxlog的说明。
其实就是使用xlog中的文件名推算next xid。使用pg_multixact中的members , offsets推断oldest multixact id, next multi-xact。等等。
The -o, -x, -e, -m, -O, -c and -l options allow the next OID, next transaction ID, next transaction ID's epoch, next and oldest multitransaction ID, next multitransaction offset, oldest and newest transaction IDs for which
the commit time can be retrieved, and WAL starting address values to be set manually. These are only needed when pg_resetxlog is unable to determine appropriate values by reading pg_control. Safe values can be determined as
follows:
· A safe value for the next transaction ID (-x) can be determined by looking for the numerically largest file name in the directory pg_clog under the data directory, adding one, and then multiplying by 1048576. Note that
the file names are in hexadecimal. It is usually easiest to specify the option value in hexadecimal too. For example, if 0011 is the largest entry in pg_clog, -x 0x1200000 will work (five trailing zeroes provide the
proper multiplier).
· A safe value for the next multitransaction ID (first part of -m) can be determined by looking for the numerically largest file name in the directory pg_multixact/offsets under the data directory, adding one, and then
multiplying by 65536. Conversely, a safe value for the oldest multitransaction ID (second part of -m) can be determined by looking for the numerically smallest file name in the same directory and multiplying by 65536. As
above, the file names are in hexadecimal, so the easiest way to do this is to specify the option value in hexadecimal and append four zeroes.
· A safe value for the next multitransaction offset (-O) can be determined by looking for the numerically largest file name in the directory pg_multixact/members under the data directory, adding one, and then multiplying
by 52352. As above, the file names are in hexadecimal. There is no simple recipe such as the ones above of appending zeroes.
· A safe value for the oldest transaction ID for which the commit time can be retrieved (first part of -c) can be determined by looking for the numerically smallest file name in the directory pg_commit_ts under the data
directory. Conversely, a safe value for the newest transaction ID for which the commit time can be retrieved (second part of -c) can be determined by looking for the numerically greatest file name in the same directory.
As above, the file names are in hexadecimal.
· The WAL starting address (-l) should be larger than any WAL segment file name currently existing in the directory pg_xlog under the data directory. These names are also in hexadecimal and have three parts. The first part
is the “timeline ID” and should usually be kept the same. For example, if 00000001000000320000004A is the largest entry in pg_xlog, use -l 00000001000000320000004B or higher.
Note
pg_resetxlog itself looks at the files in pg_xlog and chooses a default -l setting beyond the last existing file name. Therefore, manual adjustment of -l should only be needed if you are aware of WAL segment files
that are not currently present in pg_xlog, such as entries in an offline archive; or if the contents of pg_xlog have been lost entirely.
· There is no comparably easy way to determine a next OID that's beyond the largest one in the database, but fortunately it is not critical to get the next-OID setting right.
· The transaction ID epoch is not actually stored anywhere in the database except in the field that is set by pg_resetxlog, so any value will work so far as the database itself is concerned. You might need to adjust this
value to ensure that replication systems such as Slony-I and Skytools work correctly — if so, an appropriate value should be obtainable from the state of the downstream replicated database.
The -n (no operation) option instructs pg_resetxlog to print the values reconstructed from pg_control and values about to be changed, and then exit without modifying anything. This is mainly a debugging tool, but can be
useful as a sanity check before allowing pg_resetxlog to proceed for real.
推断出正确的值后,就可以开始通过pg_resetxlog设置控制文件了。
如何从$PGDATA查看正确的数据库版本
除了pg_controldata以外,还可以查看文件PG_VERSION得到。
控制文件修复例子
http://blog.163.com/digoal@126/blog/static/163877040201171233710582/
http://blog.163.com/digoal@126/blog/static/16387704020130109400557/
小结
使用与数据文件异版本的pg_resetxlog,将导致数据文件对应到 控制文件损坏。 无法启动数据库。
解决版本,使用与数据文件版本一致的pg_resetxlog,重新生成控制文件,需要加-f强制执行。
pg_resetxlog参数的安全值计算方法,见pg_resetxlog的参考手册。
祝大家玩得开心,欢迎随时来 阿里云促膝长谈业务需求 ,恭候光临。
阿里云的小伙伴们加油,努力 做好内核与服务,打造最贴地气的云数据库 。