从文件系统恢复遗失的UNLOGGED table’s datafile(文件系统恢复实践)

5 minute read

背景

注意PostgreSQL的unlogged table是不记录xlog的,所以在备库上没有unlogged table的数据记录。

同时,数据库在进入恢复状态时,为了保证数据的一致性,postgresql会自动清除unlogged table的数据文件。

那么问题来了,万一你不小心误创建了unlogged table,你可能一开始会没有感知,但是一旦发生以下情况,你会发现数据不见了。

1. 数据库crash掉之后,重启。

2. 主备切换,备库变成主库,主库变成备库。

以上两种情况,都是数据库在启动并进入recovery状态后,原来主库下的unlogged table的datafile都会被清空。

代码见:

src/backend/access/transam/xlog.c

/*    
 * This must be called ONCE during postmaster or standalone-backend startup    
 */    
void    
StartupXLOG(void)    
{    
......    
    
        /* REDO */    
        if (InRecovery)    
        {    
	......    
                /*    
                 * We're in recovery, so unlogged relations may be trashed and must be    
                 * reset.  This should be done BEFORE allowing Hot Standby    
                 * connections, so that read-only backends don't try to read whatever    
                 * garbage is left over from before.    
                 */    
                ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);    
......    
        /*    
         * Reset unlogged relations to the contents of their INIT fork. This is    
         * done AFTER recovery is complete so as to include any unlogged relations    
         * created during recovery, but BEFORE recovery is marked as having    
         * completed successfully. Otherwise we'd not retry if any of the post    
         * end-of-recovery steps fail.    
         */    
        if (InRecovery)    
                ResetUnloggedRelations(UNLOGGED_RELATION_INIT);    
......    

backend/storage/file/reinit.c

/*    
 * Reset unlogged relations from before the last restart.    
 *    
 * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any    
 * relation with an "init" fork, except for the "init" fork itself.    
 *    
 * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main    
 * fork.    
 */    
void    
ResetUnloggedRelations(int op)    
{    
        char            temp_path[MAXPGPATH];    
        DIR                *spc_dir;    
        struct dirent *spc_de;    
        MemoryContext tmpctx,    
                                oldctx;    
    
        /* Log it. */    
        elog(DEBUG1, "resetting unlogged relations: cleanup %d init %d",    
                 (op & UNLOGGED_RELATION_CLEANUP) != 0,    
                 (op & UNLOGGED_RELATION_INIT) != 0);    
    
        /*    
         * Just to be sure we don't leak any memory, let's create a temporary    
         * memory context for this operation.    
         */    
        tmpctx = AllocSetContextCreate(CurrentMemoryContext,    
                                                                   "ResetUnloggedRelations",    
                                                                   ALLOCSET_DEFAULT_MINSIZE,    
                                                                   ALLOCSET_DEFAULT_INITSIZE,    
                                                                   ALLOCSET_DEFAULT_MAXSIZE);    
        oldctx = MemoryContextSwitchTo(tmpctx);    
    
        /*    
         * First process unlogged files in pg_default ($PGDATA/base)    
         */    
        ResetUnloggedRelationsInTablespaceDir("base", op);    
    
        /*    
         * Cycle through directories for all non-default tablespaces.    
         */    
        spc_dir = AllocateDir("pg_tblspc");    
    
        while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)    
        {    
                if (strcmp(spc_de->d_name, ".") == 0 ||    
                        strcmp(spc_de->d_name, "..") == 0)    
                        continue;    
    
                snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",    
                                 spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);    
                ResetUnloggedRelationsInTablespaceDir(temp_path, op);    
        }    
        FreeDir(spc_dir);    
    
        /*    
         * Restore memory context.    
         */    
        MemoryContextSwitchTo(oldctx);    
        MemoryContextDelete(tmpctx);    
}    

src/include/common/relpath.h

/*    
 * Stuff for fork names.    
 *    
 * The physical storage of a relation consists of one or more forks.    
 * The main fork is always created, but in addition to that there can be    
 * additional forks for storing various metadata. ForkNumber is used when    
 * we need to refer to a specific fork in a relation.    
 */    
typedef enum ForkNumber    
{    
        InvalidForkNumber = -1,    
        MAIN_FORKNUM = 0,    
        FSM_FORKNUM,    
        VISIBILITYMAP_FORKNUM,    
        INIT_FORKNUM    
    
        /*    
         * NOTE: if you add a new fork, change MAX_FORKNUM and possibly    
         * FORKNAMECHARS below, and update the forkNames array in    
         * src/common/relpath.c    
         */    
} ForkNumber;    

那么问题来了,如果真的这样了,有办法恢复吗?

1. 首先,如果你在原来的主库上有基础备份,你可以从基础备份恢复。为什么需要主库的备份呢,因为备库上没有unlogged table的数据文件内容,所以在备库备份是备不到unlogged table的datafile的。

但是这种恢复方法也务必要小心,你需要在启动数据库前,先把_init的文件都删掉,这样启动数据库时数据文件就不会被清除。

2. 从主库的逻辑备份中恢复。

3. 从审计日志中回放SQL恢复。

4. 如果你没有主库的基础备份,那么可有从文件系统中去恢复删掉的数据文件。例如ext4文件系统的恢复方式如下:

http://blog.163.com/digoal@126/blog/static/16387704020142124032866/

例子:

创建一个unlogged table,并记录它的filenode

postgres=# create unlogged table utbl1(id int);    
CREATE TABLE    
postgres=# insert into utbl1 select generate_series(1,1000);    
INSERT 0 1000    
postgres=# select pg_relation_filepath('utbl1'::regclass);    
 pg_relation_filepath     
----------------------    
 base/151898/33822    
(1 row)    

查看到PG对unlogged table做了_init的后缀标记

http://www.postgresql.org/docs/9.4/static/storage-init.html

postgres@digoal-> cd $PGDATA    
postgres@digoal-> ll base/151898/33822*    
-rw------- 1 postgres postgres 40K Sep 26 11:39 base/151898/33822    
-rw------- 1 postgres postgres 24K Sep 26 11:39 base/151898/33822_fsm    
-rw------- 1 postgres postgres   0 Sep 26 11:38 base/151898/33822_init    

现在,我们进入恢复模式

postgres@digoal-> mv recovery.done recovery.conf    
postgres@digoal-> pg_ctl start    

数据库启动后,其实已经将unlogged table的数据文件清理掉了,见前面的源码。

postgres=# select count(*) from utbl1 ;    
ERROR:  cannot access temporary or unlogged relations during recovery    

现在再起来,unlogged table就没有数据了。

postgres@digoal-> mv recovery.conf recovery.done    
postgres@digoal-> pg_ctl start    
postgres=# select count(*) from utbl1 ;    
 count     
-------    
     0    
(1 row)    

恢复过程,先停库,然后umount数据文件所在的文件系统。

pg_ctl stop -m fast    
root@digoal-> umount /data01    

假设我已经安装了extundelete

http://sourceforge.net/projects/extundelete/files/extundelete/

http://blog.163.com/digoal@126/blog/static/16387704020142124032866/

root@digoal-> cd /opt/extundelete/    
root@digoal-> ll    
total 4.0K    
drwxr-xr-x 2 root root 4.0K Sep 26 13:43 bin    
root@digoal-> cd bin/    
root@digoal-> ll    
total 1.2M    
-rwxr-xr-x 1 root root 1.2M Sep 26 13:43 extundelete    

查看数据文件所在的文件系统的块设备的inode信息。

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01    
./extundelete: unrecognized option '--ls'    
NOTICE: Extended attributes are not restored.    
Loading filesystem metadata ... 80 groups loaded.    
Group: 0    
Contents of inode 2:    
0000 | ed 41 00 00 00 10 00 00 6d 08 fa 55 bc 08 fa 55 | .A......m..U...U    
0010 | bc 08 fa 55 00 00 00 00 00 00 05 00 08 00 00 00 | ...U............    
0020 | 00 00 00 00 02 00 00 00 a1 22 00 00 00 00 00 00 | ........."......    
0030 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
0040 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
0050 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
0060 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
0070 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
0080 | 1c 00 00 00 88 90 cc 84 88 90 cc 84 00 00 00 00 | ................    
0090 | 6d 08 fa 55 00 00 00 00 00 00 00 00 00 00 00 00 | m..U............    
00a0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
00b0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
00c0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
00d0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
00e0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
00f0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................    
    
Inode is Allocated    
File mode: 16877    
Low 16 bits of Owner Uid: 0    
Size in bytes: 4096    
Access time: 1442449517    
Creation time: 1442449596    
Modification time: 1442449596    
Deletion Time: 0    
Low 16 bits of Group Id: 0    
Links count: 5    
Blocks count: 8    
File flags: 0    
File version (for NFS): 0    
File ACL: 0    
Directory ACL: 0    
Fragment address: 0    
Direct blocks: 8865, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0    
Indirect block: 0    
Double indirect block: 0    
Triple indirect block: 0    
    
File name                                       | Inode number | Deleted status    
.                                                 2    
..                                                2    
lost+found                                        11    
pg_root_1921                                      131073    
pg_root_1922                                      393217    

我们的数据文件在pg_root_1921目录下,对应的inode=131073,根据这个inode继续查询下一级目录的inode

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131073    

找到了base的inode

base                                              131077    
    

继续找,找到了unlogged table所在的database的inode

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131077    
151898        131078    

继续找,找对应的_init文件,找到了它的前缀,根据前缀过滤

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131078|grep _init    
33822_init                                        131152    

找到了被删除的文件

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131078|grep 33822    
./extundelete: unrecognized option '--ls'    
33822                                             131116         Deleted    
33822_fsm                                         131147         Deleted    
33822_init                                        131152    

使用inode恢复,恢复删除的datafile

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --restore-inode 131116    

恢复后放在这里

root@digoal->  ll RECOVERED_FILES/     
total 40K    
-rw-r--r-- 1 root root 40K Sep 26 13:50 file.131116    

将文件拷贝到原来的位置

root@digoal-> mount /data01    
root@digoal-> cp file.131116 /data01/pg_root_1921/base/151898/33822    
root@digoal-> chown -R postgres:postgres /data01/    
root@digoal-> su - postgres    

启动数据库前,切记,删除_init后缀的文件,否则数据文件可能又会被清理掉。

cd $PGDATA    
postgres@digoal-> rm -f base/151898/33822_init     

启动数据库,数据已经回来了。

postgres@digoal-> pg_ctl start    
postgres=# select count(*) from utbl1 ;    
 count     
-------    
  1000    
(1 row)    

大家千万要切记,不要滥用unlogged table,除非你可以为你的行为负责。

hash index也是这样,因为不写XLOG,所以如果你使用流复制来实施HA的话,在切换到备库后,如果走hash index scan,你会发现数据凭空”消失”,实际上是因为hash index的变更没有复制过去的原因。所以就不要使用hash index了吧。

参考

1. http://sourceforge.net/projects/extundelete/files/extundelete/

2. http://blog.163.com/digoal@126/blog/static/16387704020142124032866/

Flag Counter

digoal’s 大量PostgreSQL文章入口