ZFS pool self-healing, scrub, and pre-replacing bad disks
Background
Another strength of ZFS is self-healing of bad blocks (provided the pool has redundancy, such as raidz1, raidz2, raidz3, ..., so that the correct block can be reconstructed from the redundant copies or parity).
ZFS also checksums every block, giving protection similar in spirit to ECC DIMMs; by default the fletcher4 checksum is used (SHA-256 can be selected via the checksum property).
Use scrub to check whether the block devices underneath a zpool are healthy. For SAS or FC disks, scrubbing once a month is enough; for low-end SATA or SCSI devices it is better to scrub once a week.
These scrubs can be run from cron, for example starting a scrub at 00:01 every day:
crontab -e
1 0 * * * /opt/zfs0.6.2/sbin/zpool scrub zptest
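A weekly or monthly schedule, matching the recommendation above, is just as easy to express; a sketch (same hypothetical zpool path and pool name as above):
# weekly scrub, every Sunday at 00:01, for consumer SATA/SCSI disks
1 0 * * 0 /opt/zfs0.6.2/sbin/zpool scrub zptest
# monthly scrub, 1st of each month at 00:01, would be enough for SAS/FC disks
# 1 0 1 * * /opt/zfs0.6.2/sbin/zpool scrub zptest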
Disks whose metrics look unhealthy can be replaced proactively (using zpool replace).
The metrics to watch:
The rows in the "zpool status" command give you vital information about the pool, most of which are self-explanatory. They are defined as follows:
pool- The name of the pool.
state- The current health of the pool. This information refers only to the ability of the pool to provide the necessary replication level.
status- A description of what is wrong with the pool. This field is omitted if no problems are found.
action- A recommended action for repairing the errors. This field is an abbreviated form directing the user to one of the following sections. This field is omitted if no problems are found.
see- A reference to a knowledge article containing detailed repair information. Online articles are updated more often than this guide can be updated, and should always be referenced for the most up-to-date repair procedures. This field is omitted if no problems are found.
scrub- Identifies the current status of a scrub operation, which might include the date and time that the last scrub was completed, a scrub in progress, or if no scrubbing was requested.
errors- Identifies known data errors or the absence of known data errors.
config- Describes the configuration layout of the devices comprising the pool, as well as their state and any errors generated from the devices. The state can be one of the following: ONLINE, FAULTED, DEGRADED, UNAVAILABLE, or OFFLINE. If the state is anything but ONLINE, the fault tolerance of the pool has been compromised.
The columns in the status output, "READ", "WRITE" and "CKSUM", are defined as follows:
NAME- The name of each VDEV in the pool, presented in a nested order.
STATE- The state of each VDEV in the pool. The state can be any of the states found in "config" above.
READ- I/O errors occurred while issuing a read request.
WRITE- I/O errors occurred while issuing a write request.
CKSUM- Checksum errors. The device returned corrupted data as the result of a read request.
Scrubbing ZFS storage pools is not something that happens automatically. You need to do it manually, and it's highly recommended that you do it on a regularly scheduled interval. The recommended frequency at which you should scrub the data depends on the quality of the underlying disks. If you have SAS or FC disks, then once per month should be sufficient. If you have consumer grade SATA or SCSI, you should do once per week. You can start a scrub easily with the following command:
# zpool scrub tank
# zpool status tank
pool: tank
state: ONLINE
scan: scrub in progress since Sat Dec 8 08:06:36 2012
32.0M scanned out of 48.5M at 16.0M/s, 0h0m to go
0 repaired, 65.99% done
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sdi     ONLINE       0     0     0
            sdj     ONLINE       0     0     0
errors: No known data errors
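These READ/WRITE/CKSUM counters are the metrics mentioned earlier: a device that keeps accumulating errors is a candidate for pre-emptive replacement. A minimal monitoring sketch that could also go into cron (the zpool path and mail address are placeholders):
#!/bin/bash
# mail a warning whenever 'zpool status -x' reports anything other than healthy pools
status=$(/opt/zfs0.6.2/sbin/zpool status -x)
if [ "$status" != "all pools are healthy" ]; then
    echo "$status" | mail -s "zpool warning on $(hostname)" admin@example.com
fi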
For example, create a pool named zp with raidz1 redundancy.
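The file-backed vdevs passed to zpool create below have to exist first; a minimal sketch of preparing them (2 GB data files and 1 GB log files, matching the sizes shown in the ls output further down):
cd /home/digoal
for i in 1 2 3 4; do
    dd if=/dev/zero of=./zfs.disk$i bs=1024k count=2048    # 2 GB backing file per "disk"
done
dd if=/dev/zero of=./zfs.log1 bs=1024k count=1000          # 1 GB log devices
dd if=/dev/zero of=./zfs.log2 bs=1024k count=1000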
[root@spark01 ~]# zpool create zp raidz1 /home/digoal/zfs.disk1 /home/digoal/zfs.disk2 /home/digoal/zfs.disk3 /home/digoal/zfs.disk4 log mirror /home/digoal/zfs.log1 /home/digoal/zfs.log2
[root@spark01 ~]# zpool status
pool: zp
state: ONLINE
scan: none requested
config:
        NAME                        STATE     READ WRITE CKSUM
        zp                          ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            /home/digoal/zfs.disk1  ONLINE       0     0     0
            /home/digoal/zfs.disk2  ONLINE       0     0     0
            /home/digoal/zfs.disk3  ONLINE       0     0     0
            /home/digoal/zfs.disk4  ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            /home/digoal/zfs.log1   ONLINE       0     0     0
            /home/digoal/zfs.log2   ONLINE       0     0     0
errors: No known data errors
Copy some files into a dataset.
[root@spark01 ~]# cd /home/digoal
[root@spark01 digoal]# ll
total 10575000
drwxr-xr-x. 9 digoal digoal 4096 Mar 31 17:15 hadoop-2.4.0
-rw-rw-r--. 1 digoal digoal 138943699 Mar 31 17:16 hadoop-2.4.0.tar.gz
drwxr-xr-x. 10 7900 7900 4096 May 19 01:24 spl-0.6.2
-rw-r--r--. 1 root root 565277 Aug 24 2013 spl-0.6.2.tar.gz
drwxr-xr-x. 13 7900 7900 4096 May 19 01:28 zfs-0.6.2
-rw-r--r--. 1 root root 2158948 Aug 24 2013 zfs-0.6.2.tar.gz
-rw-r--r--. 1 root root 2147483648 May 19 05:54 zfs.disk1
-rw-r--r--. 1 root root 2147483648 May 19 05:54 zfs.disk2
-rw-r--r--. 1 root root 2147483648 May 19 05:54 zfs.disk3
-rw-r--r--. 1 root root 2147483648 May 19 05:54 zfs.disk4
-rw-r--r--. 1 root root 1048576000 May 19 05:54 zfs.log1
-rw-r--r--. 1 root root 1048576000 May 19 05:54 zfs.log2
[root@spark01 digoal]# zfs create zp/test
[root@spark01 digoal]# cp -r spl-0.6.2* zfs-0.6.2* hadoop-2.4.0* /zp/test/
[root@spark01 digoal]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 31G 1.2G 29G 5% /
tmpfs 12G 0 12G 0% /dev/shm
/dev/sda3 89G 11G 74G 13% /home
zp 5.4G 0 5.4G 0% /zp
zp/test 5.9G 535M 5.4G 9% /zp/test
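To make it easy to verify later that the data survives the simulated disk failure, a checksum manifest of the copied files can be recorded now; a sketch (the manifest path /tmp/zp_test.md5 is arbitrary):
cd /zp/test
find . -type f -exec md5sum {} \; > /tmp/zp_test.md5    # record one checksum per file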
Check the pool with zpool scrub.
[root@spark01 digoal]# zpool scrub zp
[root@spark01 digoal]# zpool status
pool: zp
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 19 05:56:17 2014
config:
        NAME                        STATE     READ WRITE CKSUM
        zp                          ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            /home/digoal/zfs.disk1  ONLINE       0     0     0
            /home/digoal/zfs.disk2  ONLINE       0     0     0
            /home/digoal/zfs.disk3  ONLINE       0     0     0
            /home/digoal/zfs.disk4  ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            /home/digoal/zfs.log1   ONLINE       0     0     0
            /home/digoal/zfs.log2   ONLINE       0     0     0
errors: No known data errors
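The scrub above had nothing to repair. To see self-healing in action, one of the backing files can be deliberately corrupted and scrubbed again; a sketch (the offsets are arbitrary, chosen only to avoid the vdev labels at the beginning and end of the file):
dd if=/dev/urandom of=/home/digoal/zfs.disk2 bs=1024k count=64 seek=512 conv=notrunc    # corrupt ~64 MB in the middle of one vdev
zpool scrub zp
zpool status zp    # CKSUM errors appear on zfs.disk2 and the scan line reports how much was repaired
zpool clear zp     # reset the error counters afterwards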
To cancel a scrub that is in progress, use zpool scrub -s (here the scrub had already completed, so there was nothing to cancel):
[root@spark01 test]# zpool scrub -s zp
cannot cancel scrubbing zp: there is no active scrub
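A scrub can only be cancelled while one is running; a sketch of testing for that first:
if zpool status zp | grep -q "scrub in progress"; then
    zpool scrub -s zp    # cancel the running scrub
else
    echo "no active scrub on zp"
fi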
Next, test replacing online a block device that scrub has flagged as bad. Here I delete one of the zfs.disk files to simulate a failed disk.
[root@spark01 digoal]# rm -f zfs.disk1
[root@spark01 digoal]# zpool scrub zp    # scrub does not detect the deleted disk file: the pool still holds an open file descriptor for zfs.disk1, so reads through it keep succeeding
[root@spark01 digoal]# zpool status
pool: zp
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 19 05:56:44 2014
config:
        NAME                        STATE     READ WRITE CKSUM
        zp                          ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            /home/digoal/zfs.disk1  ONLINE       0     0     0
            /home/digoal/zfs.disk2  ONLINE       0     0     0
            /home/digoal/zfs.disk3  ONLINE       0     0     0
            /home/digoal/zfs.disk4  ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            /home/digoal/zfs.log1   ONLINE       0     0     0
            /home/digoal/zfs.log2   ONLINE       0     0     0
errors: No known data errors
But because raidz1 is used, the data can still be queried after disk1 is deleted (the original data can be reconstructed from the parity; raidz1 tolerates the loss of one disk).
[root@spark01 digoal]# cd /zp/test
[root@spark01 test]# ll
total 138651
drwxr-xr-x. 9 root root 12 May 19 05:55 hadoop-2.4.0
-rw-r--r--. 1 root root 138943699 May 19 05:56 hadoop-2.4.0.tar.gz
drwxr-xr-x. 10 root root 30 May 19 05:55 spl-0.6.2
-rw-r--r--. 1 root root 565277 May 19 05:55 spl-0.6.2.tar.gz
drwxr-xr-x. 13 root root 37 May 19 05:55 zfs-0.6.2
-rw-r--r--. 1 root root 2158948 May 19 05:55 zfs-0.6.2.tar.gz
[root@spark01 test]# du -sh *
250M hadoop-2.4.0
133M hadoop-2.4.0.tar.gz
39M spl-0.6.2
643K spl-0.6.2.tar.gz
193M zfs-0.6.2
2.2M zfs-0.6.2.tar.gz
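If a checksum manifest was recorded earlier (see the sketch after the df output above), the surviving data can also be verified explicitly:
cd /zp/test
md5sum -c /tmp/zp_test.md5 | grep -v ': OK$'    # prints nothing if every file still matches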
Create a new file to stand in for the deleted zfs.disk1. The new file may have the same name as zfs.disk1 or a different one, but it must be at least as large as the device it replaces.
[root@spark01 test]# cd /home/digoal/
[root@spark01 digoal]# dd if=/dev/zero of=./zfs.disk1 bs=1024k count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 1.29587 s, 1.7 GB/s
Replace the bad disk with zpool replace:
[root@spark01 digoal]# zpool replace -h
usage:
replace [-f] <pool> <device> [new-device]
[root@spark01 digoal]# zpool replace zp /home/digoal/zfs.disk1 /home/digoal/zfs.disk1
[root@spark01 digoal]# zpool scrub zp
[root@spark01 digoal]# zpool status zp
pool: zp
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 19 06:01:19 2014
config:
        NAME                        STATE     READ WRITE CKSUM
        zp                          ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            /home/digoal/zfs.disk1  ONLINE       0     0     0
            /home/digoal/zfs.disk2  ONLINE       0     0     0
            /home/digoal/zfs.disk3  ONLINE       0     0     0
            /home/digoal/zfs.disk4  ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            /home/digoal/zfs.log1   ONLINE       0     0     0
            /home/digoal/zfs.log2   ONLINE       0     0     0
errors: No known data errors
Use the -x option of zpool status to check the pool's health:
[root@spark01 digoal]# zpool status zp -x
pool 'zp' is healthy
Note that when replacing disks in a real environment, hot-swappable drives can be swapped directly and then brought back into the pool with zpool replace.
For drives that are not hot-swappable, shut the machine down, swap the drive, and then use zpool replace to replace the bad device.
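A typical hot-swap sequence could look like the following sketch (the pool name tank and the device names sdc/sdk are placeholders; the failing device name comes from zpool status):
zpool offline tank sdc             # take the failing disk out of service before pulling it
# ... physically swap the drive ...
zpool replace tank sdc /dev/sdk    # resilver the data onto the new disk
zpool status tank                  # the scan: line shows resilver progress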
Before pulling a drive, look up the device identifier (or serial number) of the bad disk, because when swapping you need to compare the serial number on the pulled drive on the spot to avoid removing the wrong one.
Use hdparm -I and compare against the device names shown in zpool status.
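A sketch of the serial-number lookup (sdc is again a placeholder for the device name reported by zpool status):
hdparm -I /dev/sdc | grep 'Serial Number'    # serial number printed on the drive label
ls -l /dev/disk/by-id/ | grep sdc            # by-id symlinks usually embed the model and serial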
References
1. http://docs.oracle.com/cd/E26502_01/pdf/E29007.pdf
2. http://www.root.cz/clanky/suborovy-system-zfs-konzistentnost-dat/
3. https://pthree.org/2012/12/11/zfs-administration-part-vi-scrub-and-resilver/
4. https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/