ZFS pool self-healing, scrub, and pre-emptive replacement of bad disks

Background

Another strength of ZFS is self-healing of bad blocks: when the pool has redundancy (raidz1, raidz2, raidz3, mirrors, ...), a block that fails its checksum can be reconstructed from the redundant copies or parity.  
ZFS also checksums every block, which plays a role similar to ECC on DIMMs; fletcher4 is the default algorithm, and stronger ones such as SHA-256 can be selected.  
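The checksum is a per-dataset property; a quick way to inspect or change it (a sketch, with zp/test standing in for any dataset):  
zfs get checksum zp              # shows the algorithm in effect (fletcher4 unless changed)  
zfs set checksum=sha256 zp/test  # only blocks written after the change use SHA-256  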
Use scrub to verify that the block devices underlying a zpool are still healthy. For SAS or FC disks, a scrub once a month is usually sufficient; for lower-end SATA or SCSI devices, once a week is better.  
Scrubs can be scheduled with cron, for example starting one at 00:01 every day (weekly and monthly variants are sketched after the example).  
crontab -e  
1 0 * * * /opt/zfs0.6.2/sbin/zpool scrub zptest  
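For the monthly or weekly cadence recommended above, the crontab entries could look like this (a sketch; the binary path and pool name are copied from the daily example):  
# weekly, for consumer SATA/SCSI pools: 01:00 every Sunday  
0 1 * * 0 /opt/zfs0.6.2/sbin/zpool scrub zptest  
# monthly, for SAS/FC pools: 01:00 on the 1st of each month  
0 1 1 * * /opt/zfs0.6.2/sbin/zpool scrub zptest  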
Disks whose error counters start to climb can be replaced pre-emptively with zpool replace.  
The relevant metrics:  
The rows in the "zpool status" command give you vital information about the pool, most of which are self-explanatory. They are defined as follows:  
  
pool- The name of the pool.  
state- The current health of the pool. This information refers only to the ability of the pool to provide the necessary replication level.  
status- A description of what is wrong with the pool. This field is omitted if no problems are found.  
action- A recommended action for repairing the errors. This field is an abbreviated form directing the user to one of the following sections. This field is omitted if no problems are found.  
see- A reference to a knowledge article containing detailed repair information. Online articles are updated more often than this guide can be updated, and should always be referenced for the most up-to-date repair procedures. This field is omitted if no problems are found.  
scrub- Identifies the current status of a scrub operation, which might include the date and time that the last scrub was completed, a scrub in progress, or if no scrubbing was requested.  
errors- Identifies known data errors or the absence of known data errors.  
config- Describes the configuration layout of the devices comprising the pool, as well as their state and any errors generated from the devices. The state can be one of the following: ONLINE, FAULTED, DEGRADED, UNAVAILABLE, or OFFLINE. If the state is anything but ONLINE, the fault tolerance of the pool has been compromised.  
The columns in the status output, "READ", "WRITE" and "CKSUM", are defined as follows (a small counter-check sketch follows the list):  
  
NAME- The name of each VDEV in the pool, presented in a nested order.  
STATE- The state of each VDEV in the pool. The state can be any of the states found in "config" above.  
READ- I/O errors occurred while issuing a read request.  
WRITE- I/O errors occurred while issuing a write request.  
CKSUM- Checksum errors. The device returned corrupted data as the result of a read request.  
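
These counters are what a monitoring job would watch. A minimal counter-check sketch (zptest and the mail recipient are placeholders; large counters may be abbreviated by zpool with K/M suffixes, which this simple version does not parse):  

#!/bin/sh
# flag any vdev line of pool "zptest" whose READ/WRITE/CKSUM counters are nonzero
POOL=zptest
BAD=$(zpool status "$POOL" | awk '$3 ~ /^[0-9]+$/ && ($3 + $4 + $5) > 0')
if [ -n "$BAD" ]; then
    echo "$BAD" | mail -s "zpool $POOL reports I/O or checksum errors" root
fi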
  
Scrubbing ZFS storage pools is not something that happens automatically. You need to do it manually, and it's highly recommended that you do it on a regularly scheduled interval. The recommended frequency at which you should scrub the data depends on the quality of the underlying disks. If you have SAS or FC disks, then once per month should be sufficient. If you have consumer-grade SATA or SCSI, you should do it once per week. You can start a scrub easily with the following command:  
# zpool scrub tank  
# zpool status tank  
  pool: tank  
 state: ONLINE  
 scan: scrub in progress since Sat Dec  8 08:06:36 2012  
    32.0M scanned out of 48.5M at 16.0M/s, 0h0m to go  
    0 repaired, 65.99% done  
config:  
  
        NAME        STATE     READ WRITE CKSUM  
        tank        ONLINE       0     0     0  
          mirror-0  ONLINE       0     0     0  
            sde     ONLINE       0     0     0  
            sdf     ONLINE       0     0     0  
          mirror-1  ONLINE       0     0     0  
            sdg     ONLINE       0     0     0  
            sdh     ONLINE       0     0     0  
          mirror-2  ONLINE       0     0     0  
            sdi     ONLINE       0     0     0  
            sdj     ONLINE       0     0     0  
  
errors: No known data errors  
  
For example, create a pool named zp with raidz1 redundancy and a mirrored log, backed by plain files under /home/digoal.  
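The backing files do not exist up front; they can be pre-created with dd, roughly like this (a sketch matching the sizes visible in the directory listing further down):  
cd /home/digoal  
for i in 1 2 3 4; do dd if=/dev/zero of=./zfs.disk$i bs=1024k count=2048; done   # 2 GB data disks  
for i in 1 2; do dd if=/dev/zero of=./zfs.log$i bs=1024k count=1000; done        # ~1 GB log devices  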
[root@spark01 ~]# zpool create zp raidz1 /home/digoal/zfs.disk1 /home/digoal/zfs.disk2 /home/digoal/zfs.disk3 /home/digoal/zfs.disk4 log mirror /home/digoal/zfs.log1 /home/digoal/zfs.log2  
  
[root@spark01 ~]# zpool status  
  pool: zp  
 state: ONLINE  
  scan: none requested  
config:  
  
        NAME                        STATE     READ WRITE CKSUM  
        zp                          ONLINE       0     0     0  
          raidz1-0                  ONLINE       0     0     0  
            /home/digoal/zfs.disk1  ONLINE       0     0     0  
            /home/digoal/zfs.disk2  ONLINE       0     0     0  
            /home/digoal/zfs.disk3  ONLINE       0     0     0  
            /home/digoal/zfs.disk4  ONLINE       0     0     0  
        logs  
          mirror-1                  ONLINE       0     0     0  
            /home/digoal/zfs.log1   ONLINE       0     0     0  
            /home/digoal/zfs.log2   ONLINE       0     0     0  
  
errors: No known data errors  
Copy some files into a dataset.  
[root@spark01 ~]# cd /home/digoal  
[root@spark01 digoal]# ll  
total 10575000  
drwxr-xr-x.  9 digoal digoal       4096 Mar 31 17:15 hadoop-2.4.0  
-rw-rw-r--.  1 digoal digoal  138943699 Mar 31 17:16 hadoop-2.4.0.tar.gz  
drwxr-xr-x. 10   7900   7900       4096 May 19 01:24 spl-0.6.2  
-rw-r--r--.  1 root   root       565277 Aug 24  2013 spl-0.6.2.tar.gz  
drwxr-xr-x. 13   7900   7900       4096 May 19 01:28 zfs-0.6.2  
-rw-r--r--.  1 root   root      2158948 Aug 24  2013 zfs-0.6.2.tar.gz  
-rw-r--r--.  1 root   root   2147483648 May 19 05:54 zfs.disk1  
-rw-r--r--.  1 root   root   2147483648 May 19 05:54 zfs.disk2  
-rw-r--r--.  1 root   root   2147483648 May 19 05:54 zfs.disk3  
-rw-r--r--.  1 root   root   2147483648 May 19 05:54 zfs.disk4  
-rw-r--r--.  1 root   root   1048576000 May 19 05:54 zfs.log1  
-rw-r--r--.  1 root   root   1048576000 May 19 05:54 zfs.log2  
[root@spark01 digoal]# zfs create zp/test  
[root@spark01 digoal]# cp -r spl-0.6.2* zfs-0.6.2* hadoop-2.4.0* /zp/test/  
  
[root@spark01 digoal]# df -h  
Filesystem      Size  Used Avail Use% Mounted on  
/dev/sda1        31G  1.2G   29G   5% /  
tmpfs            12G     0   12G   0% /dev/shm  
/dev/sda3        89G   11G   74G  13% /home  
zp              5.4G     0  5.4G   0% /zp  
zp/test         5.9G  535M  5.4G   9% /zp/test  
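
df only shows the filesystem view; ZFS's own accounting is available through the following (commands only, output omitted here):  
zfs list -r zp      # USED/AVAIL/REFER and mountpoint for the pool and each dataset  
zpool list zp       # raw pool size, allocated and free space  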
  
Check the pool with zpool scrub.  
[root@spark01 digoal]# zpool scrub zp  
[root@spark01 digoal]# zpool status  
  pool: zp  
 state: ONLINE  
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 19 05:56:17 2014  
config:  
  
        NAME                        STATE     READ WRITE CKSUM  
        zp                          ONLINE       0     0     0  
          raidz1-0                  ONLINE       0     0     0  
            /home/digoal/zfs.disk1  ONLINE       0     0     0  
            /home/digoal/zfs.disk2  ONLINE       0     0     0  
            /home/digoal/zfs.disk3  ONLINE       0     0     0  
            /home/digoal/zfs.disk4  ONLINE       0     0     0  
        logs  
          mirror-1                  ONLINE       0     0     0  
            /home/digoal/zfs.log1   ONLINE       0     0     0  
            /home/digoal/zfs.log2   ONLINE       0     0     0  
errors: No known data errors  
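
The self-healing described in the background can be exercised directly at this point: deliberately overwrite part of one backing file and scrub again, and ZFS rebuilds the damaged blocks from parity while the CKSUM counter of that vdev increments. This was not run in the original test; a sketch (the seek skips ahead so the vdev labels at the start of the file are left intact):  
dd if=/dev/urandom of=/home/digoal/zfs.disk2 bs=1024k seek=100 count=16 conv=notrunc  
zpool scrub zp  
zpool status -v zp    # typically shows "scrub repaired ..." and a nonzero CKSUM count on zfs.disk2  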
  
Stopping a scrub that is currently running:  
[root@spark01 test]# zpool scrub -s zp  
cannot cancel scrubbing zp: there is no active scrub  
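The message appears because the scrub started above had already finished (it only takes seconds on a pool this small). Against a scrub that is still running, the pair of commands would simply be:  
zpool scrub zp       # start a scrub  
zpool scrub -s zp    # stop it while it is still in progress  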
Next, test online replacement of a block device that scrub has flagged as bad; here I delete one of the zfs.disk files to simulate a failed disk.  
[root@spark01 digoal]# rm -f zfs.disk1  
[root@spark01 digoal]# zpool scrub zp    # scrub does not detect the deleted backing file (the pool still holds it open)  
[root@spark01 digoal]# zpool status  
  pool: zp  
 state: ONLINE  
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 19 05:56:44 2014  
config:  
  
        NAME                        STATE     READ WRITE CKSUM  
        zp                          ONLINE       0     0     0  
          raidz1-0                  ONLINE       0     0     0  
            /home/digoal/zfs.disk1  ONLINE       0     0     0  
            /home/digoal/zfs.disk2  ONLINE       0     0     0  
            /home/digoal/zfs.disk3  ONLINE       0     0     0  
            /home/digoal/zfs.disk4  ONLINE       0     0     0  
        logs  
          mirror-1                  ONLINE       0     0     0  
            /home/digoal/zfs.log1   ONLINE       0     0     0  
            /home/digoal/zfs.log2   ONLINE       0     0     0  
  
errors: No known data errors  
And because raidz1 is used, the data remains readable even with disk1 gone (the missing data is reconstructed from parity; raidz1 tolerates the loss of one disk).  
[root@spark01 digoal]# cd /zp/test  
[root@spark01 test]# ll  
total 138651  
drwxr-xr-x.  9 root root        12 May 19 05:55 hadoop-2.4.0  
-rw-r--r--.  1 root root 138943699 May 19 05:56 hadoop-2.4.0.tar.gz  
drwxr-xr-x. 10 root root        30 May 19 05:55 spl-0.6.2  
-rw-r--r--.  1 root root    565277 May 19 05:55 spl-0.6.2.tar.gz  
drwxr-xr-x. 13 root root        37 May 19 05:55 zfs-0.6.2  
-rw-r--r--.  1 root root   2158948 May 19 05:55 zfs-0.6.2.tar.gz  
[root@spark01 test]# du -sh *  
250M    hadoop-2.4.0  
133M    hadoop-2.4.0.tar.gz  
39M     spl-0.6.2  
643K    spl-0.6.2.tar.gz  
193M    zfs-0.6.2  
2.2M    zfs-0.6.2.tar.gz  
  
Create a new file to stand in for the deleted zfs.disk1; the new file may reuse the name zfs.disk1 or use a different name (a different-name variant is shown after the replace example).  
[root@spark01 test]# cd /home/digoal/  
[root@spark01 digoal]# dd if=/dev/zero of=./zfs.disk1 bs=1024k count=2048  
2048+0 records in  
2048+0 records out  
2147483648 bytes (2.1 GB) copied, 1.29587 s, 1.7 GB/s  
Replace the failed disk with zpool replace:  
[root@spark01 digoal]# zpool replace -h  
usage:  
        replace [-f] <pool> <device> [new-device]  
  
[root@spark01 digoal]# zpool replace zp /home/digoal/zfs.disk1 /home/digoal/zfs.disk1  
[root@spark01 digoal]# zpool scrub zp  
[root@spark01 digoal]# zpool status zp  
  pool: zp  
 state: ONLINE  
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 19 06:01:19 2014  
config:  
  
        NAME                        STATE     READ WRITE CKSUM  
        zp                          ONLINE       0     0     0  
          raidz1-0                  ONLINE       0     0     0  
            /home/digoal/zfs.disk1  ONLINE       0     0     0  
            /home/digoal/zfs.disk2  ONLINE       0     0     0  
            /home/digoal/zfs.disk3  ONLINE       0     0     0  
            /home/digoal/zfs.disk4  ONLINE       0     0     0  
        logs  
          mirror-1                  ONLINE       0     0     0  
            /home/digoal/zfs.log1   ONLINE       0     0     0  
            /home/digoal/zfs.log2   ONLINE       0     0     0  
  
errors: No known data errors  
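Had the replacement file been created under a different name, the only change would be the new-device argument (zfs.disk1.new is a made-up name):  
zpool replace zp /home/digoal/zfs.disk1 /home/digoal/zfs.disk1.new  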
Use the -x option of zpool status to check the pool's health.  
[root@spark01 digoal]# zpool status zp -x  
pool 'zp' is healthy  
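
Because status -x prints a single line when nothing is wrong, it lends itself to a cron-driven health check; a minimal sketch (the mail recipient is a placeholder):  

#!/bin/sh
# mail the full status if any pool is unhealthy; with no pool argument,
# 'zpool status -x' prints "all pools are healthy" when everything is fine
STATUS=$(/opt/zfs0.6.2/sbin/zpool status -x)
if [ "$STATUS" != "all pools are healthy" ]; then
    /opt/zfs0.6.2/sbin/zpool status -v | mail -s "zpool problem on $(hostname)" root
fi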
  
Note that with real hardware, a hot-swappable disk can simply be pulled and swapped, after which zpool replace brings the new disk into the pool.  
For disks that cannot be hot-swapped, shut the machine down, swap the disk, and then run zpool replace against the failed device.  
Before pulling a disk, look up its device identifier and serial number, so the serial printed on the physical disk can be compared on the spot and the wrong disk is not removed.  
hdparm -I shows the serial number; compare against the device names listed in zpool status.  
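A sketch of that lookup (sdX stands for whatever device name zpool status reports):  
hdparm -I /dev/sdX | grep -i 'serial number'  
ls -l /dev/disk/by-id/ | grep sdX    # the by-id symlink names also embed model and serial  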

References

1. http://docs.oracle.com/cd/E26502_01/pdf/E29007.pdf

2. http://www.root.cz/clanky/suborovy-system-zfs-konzistentnost-dat/

3. https://pthree.org/2012/12/11/zfs-administration-part-vi-scrub-and-resilver/

4. https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/
