ZPOOL health check and repair use scrub

3 minute read

背景

zpool健康检查(scrub)主要用于通过checksum来检查zpool数据块的数据是否正常, 如果vdev是mirror或raidz的, 可以自动从其他设备来修复异常的数据块. 由于健康检查是IO开销很大的动作, 所以建议在不繁忙的时候操作(scrub只检查分配出去的数据块, 不会检查空闲的数据块, 所以只和使用率有关, 对于一个很大的zpool, 如果使用率很低的话, scrub也是很快完成的).

用法 :

# zpool scrub zp1  

查看zpool状态, 如下, 正在做scrub

# zpool status zp1  
  pool: zp1  
 state: ONLINE  
  scan: scrub in progress since Tue Jun 17 15:17:01 2014  
    19.7G scanned out of 1.23T at 56.7M/s, 6h11m to go  
    0 repaired, 1.57% done  
config:  
  
        NAME         STATE     READ WRITE CKSUM  
        zp1          ONLINE       0     0     0  
          vd01vol01  ONLINE       0     0     0  
          vd02vol01  ONLINE       0     0     0  
          vd03vol01  ONLINE       0     0     0  
          vd04vol01  ONLINE       0     0     0  
  
errors: No known data errors  
  
# zpool get all zp1  
NAME  PROPERTY               VALUE                  SOURCE  
zp1   size                   9.75T                  -  
zp1   capacity               12%                    -  
zp1   altroot                -                      default  
zp1   health                 ONLINE                 -  
zp1   guid                   5877722976139588848    default  
zp1   version                -                      default  
zp1   bootfs                 -                      default  
zp1   delegation             on                     default  
zp1   autoreplace            off                    default  
zp1   cachefile              -                      default  
zp1   failmode               wait                   default  
zp1   listsnapshots          off                    default  
zp1   autoexpand             off                    default  
zp1   dedupditto             0                      default  
zp1   dedupratio             1.01x                  -  
zp1   free                   8.52T                  -  
zp1   allocated              1.23T                  -  
zp1   readonly               off                    -  
zp1   ashift                 0                      default  
zp1   comment                -                      default  
zp1   expandsize             0                      -  
zp1   freeing                0                      default  
zp1   feature@async_destroy  enabled                local  
zp1   feature@empty_bpobj    active                 local  
zp1   feature@lz4_compress   active                 local  

在做scrub时, 设备的io几乎消耗殆尽.

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util  
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00  
sdb               0.00     0.00  160.00    0.00   982.00     0.00     6.14     0.61    3.81   3.80  60.80  
sdc               0.00     0.00  164.00    0.00   864.00     0.00     5.27     0.59    3.57   3.57  58.50  
sdd               0.00     0.00  168.00    0.00  1100.00     0.00     6.55     0.61    3.63   3.64  61.20  
sde               0.00     0.00  172.00    0.00  1116.00     0.00     6.49     0.61    3.52   3.52  60.60  
sdf               0.00     0.00  160.00    0.00  1018.00     0.00     6.36     0.61    3.80   3.80  60.80  
sdg               0.00     0.00  164.00    0.00   857.00     0.00     5.23     0.57    3.49   3.50  57.40  
sdh               0.00     0.00  169.00    0.00   995.00     0.00     5.89     0.61    3.63   3.60  60.90  
sdi               0.00     0.00  173.00    0.00  1066.00     0.00     6.16     0.60    3.49   3.46  59.90  
dm-0              0.00     0.00  320.00    0.00  2000.00     0.00     6.25     1.99    6.21   3.12 100.00  
dm-1              0.00     0.00  328.00    0.00  1721.00     0.00     5.25     1.91    5.83   3.01  98.60  
dm-2              0.00     0.00  337.00    0.00  2095.00     0.00     6.22     1.99    5.92   2.97 100.00  
dm-3              0.00     0.00  345.00    0.00  2182.00     0.00     6.32     1.99    5.77   2.90 100.00  

scrub结束后, 可以通过-xv来查看是否有异常.

# zpool status zp1 -xv  
pool 'zp1' is healthy  

参考

  1. man zpool
       zpool scrub [-s] pool ...  
  
           Begins  a  scrub. The scrub examines all data in the specified pools to verify that it checksums correctly.  
           For replicated (mirror or raidz) devices, ZFS automatically repairs any damage discovered during the scrub.  
           The  "zpool  status" command reports the progress of the scrub and summarizes the results of the scrub upon  
           completion.  
  
           Scrubbing and resilvering are very similar operations. The difference is  that  resilvering  only  examines  
           data that ZFS knows to be out of date (for example, when attaching a new device to a mirror or replacing an  
           existing device), whereas scrubbing examines all data to discover silent errors due to hardware  faults  or  
           disk failure.  
  
           Because  scrubbing  and resilvering are I/O-intensive operations, ZFS only allows one at a time. If a scrub  
           is already in progress, the "zpool scrub" command terminates it and starts a new scrub. If a resilver is in  
           progress, ZFS does not allow a scrub to be started until the resilver completes.  
  
           -s    Stop scrubbing.  
  
       zpool status [-xvD] [-T d | u] [pool] ... [interval [count]]  
  
           Displays  the  detailed health status for the given pools. If no pool is specified, then the status of each  
           pool in the system is displayed. For more information on pool and device health, see  the  "Device  Failure  
           and Recovery" section.  
  
           If  a  scrub or resilver is in progress, this command reports the percentage done and the estimated time to  
           completion. Both of these are only approximate, because the amount of data in the pool and the other  work-  
           loads on the system can change.  
  
           -x          Only display status for pools that are exhibiting errors or are otherwise unavailable. Warnings  
                       about pools not using the latest on-disk format will not be included.  
  
           -v          Displays verbose data error information, printing out a complete list of all data errors  since  
                       the last complete pool scrub.  
  
           -D          Display  a  histogram of deduplication statistics, showing the allocated (physically present on  
                       disk) and referenced (logically referenced in the pool) block counts  and  sizes  by  reference  
                       count.  
  
           -T d | u    Display a time stamp.  
  
                       Specify  u  for  a  printed representation of the internal representation of time. See time(2).  
                       Specify d for standard date format. See date(1).  

Flag Counter

digoal’s 大量PostgreSQL文章入口