ZFS deduplicate


Background

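The previous post introduced the compression feature of ZFS. This post covers another ZFS feature, deduplication (dedup), which has much the same goal as compression: saving storage space.  
However, the side effects of dedup are much more noticeable, and deduplicated data is not written as an atomic transaction, so data corruption is possible. Enabling dedup is generally not recommended. As the first reference puts it:  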
Further, deduplicated data is not flushed to disk as an atomic transaction. Instead, the blocks are written to disk serially, one block at a time. Thus, this does open you up for corruption in the event of a power failure before the blocks have been written.  
  
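Deduplication comes in three granularities: file, block, and byte.  
File level is the coarsest: a single copy is stored only when every byte of two files is identical, so changing any one byte of a file means it can no longer be deduplicated.  
Block-level dedup is clearly more useful than file-level dedup, but it needs a region of memory to track the shared blocks (that is, the unique blocks: if 10 blocks are identical, only 1 shared block is stored). Tracking each shared block costs 320 bytes; the number of blocks can be checked with zdb.  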
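The idea of tracking shared blocks can be illustrated with a minimal sketch. This is not how ZFS implements dedup internally; it only shows the principle of keeping one copy per unique block plus a reference count. The function name `dedup_blocks`, the tiny 4-byte block size, and the use of SHA-256 as the block key are assumptions chosen for the illustration.

```python
import hashlib

def dedup_blocks(data, block_size=4):
    """Split data into fixed-size blocks and keep one copy per unique block."""
    table = {}  # checksum -> [block bytes, reference count]
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        key = hashlib.sha256(block).hexdigest()
        if key in table:
            table[key][1] += 1          # duplicate: only bump the refcount
        else:
            table[key] = [block, 1]     # first occurrence: store the block
    return table

# 10 identical blocks plus 1 different block -> 2 stored blocks
table = dedup_blocks(b"AAAA" * 10 + b"BBBB")
print(len(table))                                 # 2
print(sorted(ref for _, ref in table.values()))   # [1, 10]
```

Every entry in such a table has to be kept somewhere quick to look up, which is why the tracking cost per block matters.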
If my total storage was 1 TB in size, then 1 TB divided by 100 KB per block is about 10737418 blocks. Multiplied by 320 bytes per block, leaves us with 3.2 GB of RAM, which is close to the previous number we got.  
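The arithmetic in the quote can be reproduced with a short sketch. It assumes binary units (1 TB taken as 2^40 bytes and 100 KB as 100 × 1024 bytes), which is what makes the quoted block count come out; the function name `ddt_ram_bytes` is hypothetical, and the 320 bytes per block is the figure given above.

```python
# Rough dedup-table RAM estimate.
# Assumptions: binary units, 320 bytes of RAM per tracked block.
DDT_ENTRY_BYTES = 320

def ddt_ram_bytes(pool_bytes, avg_block_bytes, entry_bytes=DDT_ENTRY_BYTES):
    """Return (block count, RAM in bytes), assuming one entry per block."""
    blocks = pool_bytes // avg_block_bytes
    return blocks, blocks * entry_bytes

blocks, ram = ddt_ram_bytes(1 * 2**40, 100 * 1024)
print(blocks)                  # 10737418
print(round(ram / 2**30, 2))   # 3.2 (GiB)
```

Since the estimate is inversely proportional to the average block size, a pool with smaller blocks needs proportionally more RAM for the same capacity.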
  
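The amount of memory required can be calculated. It depends on the number of unique blocks actually stored, but the upper bound can be computed directly from the number of blocks the zpool occupies. Because ZFS also consumes a large amount of memory for the ARC, the memory that can be given to dedup tracking is what remains after that necessary memory is subtracted:  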
ZFS stores more than just the deduplication table in RAM. It also stores the ARC as well as other ZFS metadata. And, guess what? The deduplication table is capped at 25% the size of the ARC. This means, you don't need 60 GB of RAM for a 12 TB storage array. You need 240 GB of RAM to ensure that your deduplication table fits. In other words, if you plan on doing deduplication, make sure you quadruple your RAM footprint, or you'll be hurting.  
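The sizing rule in the quote amounts to dividing the table size by the 25% cap. A minimal sketch follows, with the hypothetical helper `ram_for_ddt`; the 25% figure and the 60 GB example are taken from the quote, not measured here.

```python
# The quote says the dedup table may use at most 25% of the ARC,
# so the RAM budget must be about 4x the table size.
DDT_FRACTION_OF_ARC = 0.25  # figure taken from the quoted article

def ram_for_ddt(ddt_gb, fraction=DDT_FRACTION_OF_ARC):
    """RAM (GB) needed so that a table of ddt_gb fits under the cap."""
    return ddt_gb / fraction

print(ram_for_ddt(60))    # 240.0, the 12 TB example from the quote
print(ram_for_ddt(3.2))   # 12.8, for the 1 TB estimate above
```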
  
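With a second-level cache (L2ARC), block-level dedup can perform better, because when memory is insufficient the performance drop caused by dedup is very noticeable.  
The following is a dedup test:  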
[root@spark01 digoal]# df -h  
Filesystem      Size  Used Avail Use% Mounted on  
/dev/sda1        31G  1.2G   29G   5% /  
tmpfs            12G     0   12G   0% /dev/shm  
/dev/sda3        89G   11G   74G  13% /home  
zp              5.3G     0  5.3G   0% /zp  
zp/test         5.9G  615M  5.3G  11% /zp/test  
[root@spark01 ~]# cd /home/digoal  
[root@spark01 digoal]# zfs set dedup=on zp/test  
[root@spark01 digoal]# rm -rf /zp/test/*  
[root@spark01 digoal]# date +%F%T; cp -r hadoop-2.4.0* spl-0.6.2* zfs-0.6.2* /zp/test/ ; date +%F%T;  
2014-05-1917:48:06  
2014-05-1917:48:21  (15 seconds elapsed)  
[root@spark01 digoal]# df -h  
Filesystem      Size  Used Avail Use% Mounted on  
/dev/sda1        31G  1.2G   29G   5% /  
tmpfs            12G     0   12G   0% /dev/shm  
/dev/sda3        89G   11G   74G  13% /home  
zp              5.4G     0  5.4G   0% /zp  
zp/test         6.0G  615M  5.4G  11% /zp/test  
  
[root@spark01 digoal]# zpool get dedupratio zp  
NAME  PROPERTY    VALUE  SOURCE  
zp    dedupratio  1.24x  -  
  
Copying the same files a second time raises the dedup ratio to 2.49:  
[root@spark01 digoal]# cd /zp/test  
[root@spark01 test]# mkdir new  
[root@spark01 test]# cp -r hadoop-2.4.0* spl-0.6.2* zfs-0.6.2* new/  
[root@spark01 test]# zpool get dedupratio zp  
NAME  PROPERTY    VALUE  SOURCE  
zp    dedupratio  2.49x  -  
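Note that AVAIL has not changed, which shows that deduplication is working. USED shows the space consumed from the dataset's point of view, which is not the space physically allocated, because the reported size of the pool has been "inflated" as well.  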
[root@spark01 test]# zfs list  
NAME      USED  AVAIL  REFER  MOUNTPOINT  
zp       1.20G  5.34G  43.4K  /zp  
zp/test  1.20G  5.34G  1.20G  /zp/test  
Possibly because of my local configuration, zdb does not work:  
[root@spark01 test]# zdb  
cannot open '/etc/zfs/zpool.cache': No such file or directory  
[root@spark01 test]# zdb -b zp  
zdb: can't open 'zp': No such file or directory  
Generate the default cache file (it can of course be written to a different location); after that, zdb works normally.  
[root@spark01 test]# mkdir /etc/zfs  
[root@spark01 test]# zpool set cachefile=/etc/zfs/zpool.cache zp  
[root@spark01 test]# zpool get cachefile zp  
NAME  PROPERTY   VALUE      SOURCE  
zp    cachefile  -          default  
  
[root@spark01 test]# zdb -b zp  
  
Traversing all blocks to verify nothing leaked ...  
  
        No leaks (block sum matches space maps exactly)  
  
        bp count:           26452  
        bp logical:    1306227200      avg:  49381  
        bp physical:   1280552960      avg:  48410     compression:   1.02  
        bp allocated:  1727905792      avg:  65322     compression:   0.76  
        bp deduped:    1022530560    ref>1:   8794   deduplication:   1.59  
        SPA allocated:  705375232     used:  8.28%  
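The figures in this zdb output can be cross-checked with a little arithmetic. The sketch below only hard-codes the numbers printed above. The relationships in the comments (SPA allocated equals bp allocated minus bp deduped, and the printed deduplication figure equals bp deduped / bp allocated + 1) are inferred from these numbers rather than taken from zdb documentation, so treat them as assumptions.

```python
# Values copied from the `zdb -b zp` output above.
bp_count      = 26452
bp_allocated  = 1727905792
bp_deduped    = 1022530560
spa_allocated = 705375232

# Space allocated in the pool = referenced space minus deduplicated space.
print(bp_allocated - bp_deduped == spa_allocated)   # True
# Reproduces the "deduplication" figure on the bp deduped line.
print(round(bp_deduped / bp_allocated + 1, 2))      # 1.59
# Referenced-to-allocated ratio.
print(round(bp_allocated / spa_allocated, 2))       # 2.45
# Tracking RAM at 320 bytes per block, in MiB (upper bound per the text's rule).
print(round(bp_count * 320 / 2**20, 2))             # 8.07
```

The 2.45 ratio is close to, though not identical to, the 2.49x reported by `zpool get dedupratio` earlier.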
  
[root@spark01 test]# cp -r new new1  
[root@spark01 test]# df -h  
Filesystem      Size  Used Avail Use% Mounted on  
/dev/sda1        31G  1.2G   29G   5% /  
tmpfs            12G     0   12G   0% /dev/shm  
/dev/sda3        89G   11G   74G  13% /home  
zp              5.4G     0  5.4G   0% /zp  
zp/test         7.0G  1.7G  5.4G  24% /zp/test  
Note that the size reported for zp/test has been inflated again and is now 7.0 GB, while the actual pool is only 5.4 GB.  
[root@spark01 test]# zpool get cachefile zp  
NAME  PROPERTY   VALUE      SOURCE  
zp    cachefile  -          default  
[root@spark01 test]# zdb -b zp  
  
Traversing all blocks to verify nothing leaked ...  
  
        No leaks (block sum matches space maps exactly)  
  
        bp count:           39633  
        bp logical:    1958756352      avg:  49422  
        bp physical:   1921044480      avg:  48470     compression:   1.02  
        bp allocated:  2592529408      avg:  65413     compression:   0.76  
        bp deduped:    1876494336    ref>1:   8794   deduplication:   1.72  
        SPA allocated:  716035072     used:  8.40%  
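Comparing the two zdb runs shows how little physical space the extra copy cost. This is a sketch using only the numbers printed in the two outputs; the dictionary keys and the helper `mib` are hypothetical names for the illustration.

```python
# Values copied from the two `zdb -b zp` runs
# (before and after `cp -r new new1`).
before = {"logical": 1306227200, "allocated": 1727905792, "spa": 705375232}
after  = {"logical": 1958756352, "allocated": 2592529408, "spa": 716035072}

def mib(n):
    """Convert bytes to MiB, rounded to one decimal place."""
    return round(n / 2**20, 1)

logical_growth = after["logical"] - before["logical"]
spa_growth = after["spa"] - before["spa"]

print(mib(logical_growth))                          # 622.3
print(mib(spa_growth))                              # 10.2
print(round(after["allocated"] / after["spa"], 2))  # 3.62
```

About 622 MiB of logical data was added while the space allocated in the pool grew by only about 10 MiB, and the block count with more than one reference stayed at 8794 because the new copy consists of blocks that were already shared.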

References

1. https://pthree.org/2012/12/18/zfs-administration-part-xi-compression-and-deduplication/

2. http://blog.163.com/digoal@126/blog/static/16387704020144197501438/
