PostgreSQL 数据文件灾难恢复 - 解析与数据dump
背景
俗话说常在河边站哪有不湿鞋,作为一名战斗在一线的DBA或者开发者,可能有遇到过磁盘损坏,磁盘阵列损坏,如果有备份或者备库的话,还好。
如果没有备份,或者没有备库(通常有一些小型或者创业型的企业),那么遇到磁盘损坏或者其他原因(比如掉电文件系统损坏),导致数据库的数据文件并不完整时,如何从有限的资料中找出数据呢?
比如PostgreSQL,如果读到坏块,会报块不可读的错误,这种情况下通过设置zero_damaged_pages=on可以跳过损坏的数据块。
如果连元数据都损坏了,又或者坏了一些磁盘,只有某些表空间被幸免于难,这些情况下你的数据库都已经无法启动时,如何能从有限的数据文件中找回数据呢?
数据文件解析pg_filedump
pg_filedump是PostgreSQL社区托管的一个项目,类似于pg_xlogdump,不需要开启数据库,可以直接从数据文件中将数据dump出来。
pg_filedump实际上可以DUMP 堆表、索引数据文件,控制文件的内容。(从pg_filedump引用的头文件也能看出端倪)
安装很简单
git clone git://git.postgresql.org/git/pg_filedump.git
cd pg_filedump
export PATH=/home/digoal/pgsql9.6/bin:$PATH
make ; make install
命令帮助如下,通常来说,你只需要指定需要DUMP的文件即可。
如果文件的块头损坏了,那么你可以手工指定一些信息,包括块大小,段大小,解析哪个块,根据什么格式解析(字段类型列表)等。
pg_filedump [-abcdfhikxy] [-R startblock [endblock]] [-D attrlist] [-S blocksize] [-s segsize] [-n segnumber] file
Defaults are: relative addressing, range of the entire file, block size
as listed on block 0 in the file
The following options are valid for heap and index files:
-a Display absolute addresses when formatting (Block header
information is always block relative)
-b Display binary block images within a range (Option will turn
off all formatting options)
-d Display formatted block content dump (Option will turn off
all other formatting options)
-D Try to decode tuples using given comma separated list of types.
List of supported types:
* bigint
* bigserial
* bool
* char
* charN -- char(n)
* date
* float
* float4
* float8
* int
* json
* macaddr
* name
* oid
* real
* serial
* smallint
* smallserial
* text
* time
* timestamp
* timetz
* uuid
* varchar
* varcharN -- varchar(n)
* xid
* xml
* ~ -- ignores are attributes left in a tuple
-f Display formatted block content dump along with interpretation
-h Display this information
-i Display interpreted item details
-k Verify block checksums
-R Display specific block ranges within the file (Blocks are
indexed from 0)
[startblock]: block to start at
[endblock]: block to end at
A startblock without an endblock will format the single block
-s Force segment size to [segsize]
-n Force segment number to [segnumber]
-S Force block size to [blocksize]
-x Force interpreted formatting of block items as index items
-y Force interpreted formatting of block items as heap items
The following options are valid for control files:
-c Interpret the file listed as a control file
-f Display formatted content dump along with interpretation
-S Force block size to [blocksize]
In most cases it's recommended to use the -i and -f options to get
the most useful dump output.
pg_filedump使用举例
1. 创建测试表
postgres=# create table digoal(id int, info text, crt_time timestamp);
CREATE TABLE
2. 插入测试数据
postgres=# insert into digoal select generate_series(1,1000000),md5(random()::text), clock_timestamp();
INSERT 0 1000000
3. 找出表对应的数据文件
postgres=# select pg_relation_filepath('digoal');
pg_relation_filepath
----------------------
base/13269/173369
(1 row)
4. 调用checkpoint,把数据刷盘,便于我们接下来的观察
checkpoint;
5. 使用pg_filedump直接读取数据文件,导出数据
-> cd $PGDATA/base/13269
-> pg_filedump -i -f ./173369
输出截取
*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 9.6.0
*
* File: ./173369
* Options used: -i -f
*
* Dump created on: Sun Mar 12 00:28:56 2017
*******************************************************************
首先是块的头部内容
Block 0 ********************************************************
<Header> -----
Block Offset: 0x00000000 Offsets: Lower 452 (0x01c4)
Block: Size 8192 Version 4 Upper 488 (0x01e8)
LSN: logid 61 recoff 0xe69d6490 Special 8192 (0x2000)
Items: 107 Free Space: 36
Checksum: 0x0000 Prune XID: 0x00000000 Flags: 0x0000 ()
Length (including item array): 452
...
然后是DATA部分,
<Data> ------
Item 1 -- Length: 72 Offset: 8120 (0x1fb8) Flags: NORMAL
XMIN: 88072212 XMAX: 88072214 CID|XVAC: 0
Block Id: 9345 linp Index: 86 Attributes: 3 Size: 24
infomask: 0x0102 (HASVARWIDTH|XMIN_COMMITTED)
1fb8: 14e03f05 16e03f05 00000000 00008124 ..?...?........$
1fc8: 56000300 02011800 01000000 43363134 V...........C614
1fd8: 62653439 31616339 65356636 64633136 be491ac9e5f6dc16
1fe8: 35653065 31323162 36316563 33000000 5e0e121b61ec3...
1ff8: 791cce69 7ced0100 y..i|...
COPY: 1 614be491ac9e5f6dc165e0e121b61ec3 2017-03-12 00:26:23.553657
Item 2 -- Length: 72 Offset: 8048 (0x1f70) Flags: NORMAL
XMIN: 88072212 XMAX: 0 CID|XVAC: 0
Block Id: 0 linp Index: 2 Attributes: 3 Size: 24
infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)
1f70: 14e03f05 00000000 00000000 00000000 ..?.............
1f80: 02000300 02091800 02000000 43383335 ............C835
1f90: 39653064 31623462 61323261 64336139 9e0d1b4ba22ad3a9
1fa0: 65386634 38316231 61633336 31000000 e8f481b1ac361...
1fb0: df1cce69 7ced0100 ...i|...
COPY: 2 8359e0d1b4ba22ad3a9e8f481b1ac361 2017-03-12 00:26:23.553759
......
为了得到记录,需要提供一下字段类型LIST,必须保证与表结构一致
-> pg_filedump -D int,text,timestamp ./173369|less
*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 9.6.0
*
* File: ./173369
* Options used: -D int,text,timestamp
*
* Dump created on: Sun Mar 12 00:31:25 2017
*******************************************************************
Block 0 ********************************************************
<Header> -----
Block Offset: 0x00000000 Offsets: Lower 452 (0x01c4)
Block: Size 8192 Version 4 Upper 488 (0x01e8)
LSN: logid 61 recoff 0xe69d6490 Special 8192 (0x2000)
Items: 107 Free Space: 36
Checksum: 0x0000 Prune XID: 0x00000000 Flags: 0x0000 ()
Length (including item array): 452
<Data> ------
Item 1 -- Length: 72 Offset: 8120 (0x1fb8) Flags: NORMAL
COPY: 1 614be491ac9e5f6dc165e0e121b61ec3 2017-03-12 00:26:23.553657
Item 2 -- Length: 72 Offset: 8048 (0x1f70) Flags: NORMAL
COPY: 2 8359e0d1b4ba22ad3a9e8f481b1ac361 2017-03-12 00:26:23.553759
Item 3 -- Length: 72 Offset: 7976 (0x1f28) Flags: NORMAL
COPY: 3 0dc8c441e91217897f994ae163510653 2017-03-12 00:26:23.553764
..........
COPY得到的就是使用-D提供的类型列表decode拼装的记录。
是不是可以从文件中DUMP数据了呢?莫急,还要看看掩码哦,否则你不知道这条记录是否为你需要的,因为它可能是DEAD TUPLE。
例子
-> pg_filedump -D int,text,timestamp -i -f ./173369|less
COPY: 1 614be491ac9e5f6dc165e0e121b61ec3 2017-03-12 00:26:23.553657
Item 2 -- Length: 72 Offset: 8048 (0x1f70) Flags: NORMAL
XMIN: 88072212 XMAX: 0 CID|XVAC: 0
Block Id: 0 linp Index: 2 Attributes: 3 Size: 24
infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)
记录头的infomask解释
每条记录,头部都有infomask, infomask2掩码,掩码表示的意思,可以参考头文件
比如什么是DEAD TUPLE呢?
src/include/access/htup_details.h
/*
* information stored in t_infomask:
*/
#define HEAP_HASNULL 0x0001 /* has null attribute(s) */
#define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */
#define HEAP_HASOID 0x0008 /* has an object-id field */
#define HEAP_XMAX_KEYSHR_LOCK 0x0010 /* xmax is a key-shared locker */
#define HEAP_COMBOCID 0x0020 /* t_cid is a combo cid */
#define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */
#define HEAP_XMAX_LOCK_ONLY 0x0080 /* xmax, if valid, is only a locker */
/* xmax is a shared locker */
#define HEAP_XMAX_SHR_LOCK (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
#define HEAP_LOCK_MASK (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
HEAP_XMAX_KEYSHR_LOCK)
#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
#define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */
#define HEAP_XMAX_IS_MULTI 0x1000 /* t_xmax is a MultiXactId */
#define HEAP_UPDATED 0x2000 /* this is UPDATEd version of row */
#define HEAP_MOVED_OFF 0x4000 /* moved to another place by pre-9.0
* VACUUM FULL; kept for binary
* upgrade support */
#define HEAP_MOVED_IN 0x8000 /* moved from another place by pre-9.0
* VACUUM FULL; kept for binary
* upgrade support */
#define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
#define HEAP_XACT_MASK 0xFFF0 /* visibility-related bits */
观察deadtuple
postgres=# update digoal set info='new' where id=1;
UPDATE 1
postgres=# checkpoint;
CHECKPOINT
观察
pg_filedump -D int,text,timestamp -i -f ./173369|less
Item 1 -- Length: 72 Offset: 8120 (0x1fb8) Flags: NORMAL
包含了xmax,说明是更新过的记录
XMIN: 88072212 XMAX: 88072214 CID|XVAC: 0
blockid表示记录指向,即新版本记录在哪个新数据块,Linp指新数据块的第几条记录。
Block Id: 9345 linp Index: 86 Attributes: 3 Size: 24
infomask 参考前面的头文件,解读
infomask: 0x0102 (HASVARWIDTH|XMIN_COMMITTED)
1fb8: 14e03f05 16e03f05 00000000 00008124 ..?...?........$
1fc8: 56000300 02011800 01000000 43363134 V...........C614
1fd8: 62653439 31616339 65356636 64633136 be491ac9e5f6dc16
1fe8: 35653065 31323162 36316563 33000000 5e0e121b61ec3...
1ff8: 791cce69 7ced0100 y..i|...
COPY: 1 614be491ac9e5f6dc165e0e121b61ec3 2017-03-12 00:26:23.553657
查看新版本(编号为9345数据块,第86条记录)
pg_filedump -D int,text,timestamp -i -f -R 9345 ./173369|less
Item 86 -- Length: 40 Offset: 2032 (0x07f0) Flags: NORMAL
XMIN: 88072214 XMAX: 0 CID|XVAC: 0
Block Id: 9345 linp Index: 86 Attributes: 3 Size: 24
infomask: 0x2802 (HASVARWIDTH|XMAX_INVALID|UPDATED)
07f0: 16e03f05 00000000 00000000 00008124 ..?............$
0800: 56000300 02281800 01000000 096e6577 V....(.......new
0810: 791cce69 7ced0100 y..i|...
COPY: 1 new 2017-03-12 00:26:23.553657
PostgreSQL数据块的简介
对于数据文件的组织形式,可以参考头文件
src/include/storage
/*
* BlockNumber:
*
* each data file (heap or index) is divided into postgres disk blocks
* (which may be thought of as the unit of i/o -- a postgres buffer
* contains exactly one disk block). the blocks are numbered
* sequentially, 0 to 0xFFFFFFFE.
*
* InvalidBlockNumber is the same thing as P_NEW in buf.h.
*
* the access methods, the buffer manager and the storage manager are
* more or less the only pieces of code that should be accessing disk
* blocks directly.
*/
typedef uint32 BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define MaxBlockNumber ((BlockNumber) 0xFFFFFFFE)
每个块内的组织,与对象类型有关,比如堆表,B-TREE,HASH等索引,TOAST, FSM等。
可以参考数据layout介绍
https://www.postgresql.org/docs/9.6/static/storage.html
也可以参考对应类型的头文件
阅读pg_filedump的源码,也有助于你对PostgreSQL存储构造的理解
不妨仔细阅读以下头文件
#include "access/gin_private.h"
#include "access/gist.h"
#include "access/hash.h"
#include "access/htup.h"
#include "access/htup_details.h"
#include "access/itup.h"
#include "access/nbtree.h"
#include "access/spgist_private.h"
#include "catalog/pg_control.h"
#include "storage/bufpage.h"
防止脱裤
我们已经看到,使用pg_filedump可直接decode数据文件的内容,因此泄露数据文件其实是比较危险的。
那么如何防止脱裤呢?TDE是一个很好的手段,即数据文件透明加密。你可以参考我末尾的文章。
另外还有加密方法,比如对敏感数据,使用加密字段存储。加解密交给程序完成。彻底杜绝因泄露文件导致的数据泄露。
参考
《PostgreSQL 透明加密(TDE,FDE) - 块级加密》