ZFS Deduplication
Background
The previous blog post covered ZFS's compression feature. This post introduces another ZFS feature, deduplication, whose goal is much the same as compression's: saving storage space.
However, deduplication has more noticeable side effects, and deduplicated data is not written to disk as an atomic transaction, which can lead to data corruption. In general, enabling dedup is not recommended.
Further, deduplicated data is not flushed to disk as an atomic transaction. Instead, the blocks are written to disk serially, one block at a time. Thus, this does open you up for corruption in the event of a power failure before the blocks have been written.
Deduplication comes in three granularities: file, block, and byte.
File-level is the coarsest: only when every byte of two files is identical can a single copy be stored, because changing any single byte of a file defeats file-level deduplication entirely.
Block-level deduplication is clearly more practical than file-level, but it requires an in-memory table to track shared blocks (i.e., the unique blocks: if 10 blocks are identical, only 1 shared block needs to be stored). Each shared block costs about 320 bytes to track; the actual number of blocks in a pool can be inspected with zdb.
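A minimal sketch of that inspection, assuming the pool is named zp as in the test further below (zdb -b traverses every block, so it can take a while on a large pool):

# "bp count" in the zdb -b output is the number of block pointers in the
# pool; each unique block costs ~320 bytes in the dedup table.
zdb -b zp | grep 'bp count'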
If my total storage was 1 TB in size, then 1 TB divided by 100 KB per block is about 10737418 blocks. Multiplied by 320 bytes per block, leaves us with 3.2 GB of RAM, which is close to the previous number we got.
The memory required can be calculated (it depends on the number of unique blocks actually stored; for a worst-case estimate, compute from the total number of blocks the zpool can hold). Because ZFS also needs a large amount of memory for the ARC, the memory available for dedup tracking must leave room for those other needs:
ZFS stores more than just the deduplication table in RAM. It also stores the ARC as well as other ZFS metadata. And, guess what? The deduplication table is capped at 25% the size of the ARC. This means, you don't need 60 GB of RAM for a 12 TB storage array. You need 240 GB of RAM to ensure that your deduplication table fits. In other words, if you plan on doing deduplication, make sure you quadruple your RAM footprint, or you'll be hurting.
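The arithmetic above can be reproduced in the shell as a sanity check, assuming the same 100 KB average block size and 320 bytes per DDT entry; the 4x factor comes from the 25%-of-ARC cap quoted above:

# Worst-case dedup table size for a 1 TB pool at 100 KB per block
echo $(( 1024 ** 4 / (100 * 1024) * 320 ))       # 3435973760 bytes, ~3.2 GB
# The DDT is capped at 25% of the ARC, so quadruple it for the RAM footprint
echo $(( 1024 ** 4 / (100 * 1024) * 320 * 4 ))   # 13743895040 bytes, ~12.8 GB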
With a second-level cache (L2ARC), block-level dedup performs considerably better, because when memory is insufficient, the performance degradation caused by dedup is very noticeable; see the example below.
Here is a dedup test (note that dedup=on only affects data written after the property is set):
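For example, a spare SSD can be attached to the pool as L2ARC, giving the dedup table somewhere cheaper than spinning disk to spill to (a hypothetical example; /dev/sdb is an assumed device name):

# Add an SSD as a cache (L2ARC) device to pool zp
zpool add zp cache /dev/sdb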
[root@spark01 digoal]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 31G 1.2G 29G 5% /
tmpfs 12G 0 12G 0% /dev/shm
/dev/sda3 89G 11G 74G 13% /home
zp 5.3G 0 5.3G 0% /zp
zp/test 5.9G 615M 5.3G 11% /zp/test
[root@spark01 ~]# cd /home/digoal
[root@spark01 digoal]# zfs set dedup=on zp/test
[root@spark01 digoal]# rm -rf /zp/test/*
[root@spark01 digoal]# date +%F%T; cp -r hadoop-2.4.0* spl-0.6.2* zfs-0.6.2* /zp/test/ ; date +%F%T;
2014-05-1917:48:06
2014-05-1917:48:21 (15 seconds)
[root@spark01 digoal]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 31G 1.2G 29G 5% /
tmpfs 12G 0 12G 0% /dev/shm
/dev/sda3 89G 11G 74G 13% /home
zp 5.4G 0 5.4G 0% /zp
zp/test 6.0G 615M 5.4G 11% /zp/test
[root@spark01 digoal]# zpool get dedupratio zp
NAME PROPERTY VALUE SOURCE
zp dedupratio 1.24x -
Make another copy of the same files, and the dedup ratio rises to 2.49x:
[root@spark01 digoal]# cd /zp/test
[root@spark01 test]# mkdir new
[root@spark01 test]# cp -r hadoop-2.4.0* spl-0.6.2* zfs-0.6.2* new/
[root@spark01 test]# zpool get dedupratio zp
NAME PROPERTY VALUE SOURCE
zp dedupratio 2.49x -
Note that Avail has not changed, which shows deduplication is working. The Used column shows the apparent (logical) space consumed, not the real on-disk usage, because the pool's reported size has been "inflated" by the deduplicated savings.
[root@spark01 test]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
zp 1.20G 5.34G 43.4K /zp
zp/test 1.20G 5.34G 1.20G /zp/test
Possibly due to a configuration issue on my machine, zdb does not work out of the box:
[root@spark01 test]# zdb
cannot open '/etc/zfs/zpool.cache': No such file or directory
[root@spark01 test]# zdb -b zp
zdb: can't open 'zp': No such file or directory
Generate the default cache file (it can of course be placed elsewhere); after that, zdb works normally.
[root@spark01 test]# mkdir /etc/zfs
[root@spark01 test]# zpool set cachefile=/etc/zfs/zpool.cache zp
[root@spark01 test]# zpool get cachefile zp
NAME PROPERTY VALUE SOURCE
zp cachefile - default
[root@spark01 test]# zdb -b zp
Traversing all blocks to verify nothing leaked ...
No leaks (block sum matches space maps exactly)
bp count: 26452
bp logical: 1306227200 avg: 49381
bp physical: 1280552960 avg: 48410 compression: 1.02
bp allocated: 1727905792 avg: 65322 compression: 0.76
bp deduped: 1022530560 ref>1: 8794 deduplication: 1.59
SPA allocated: 705375232 used: 8.28%
[root@spark01 test]# cp -r new new1
[root@spark01 test]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 31G 1.2G 29G 5% /
tmpfs 12G 0 12G 0% /dev/shm
/dev/sda3 89G 11G 74G 13% /home
zp 5.4G 0 5.4G 0% /zp
zp/test 7.0G 1.7G 5.4G 24% /zp/test
Note that zp/test's reported size has been "inflated" again, now showing 7.0 GB, while the actual pool is only 5.4 GB.
[root@spark01 test]# zpool get cachefile zp
NAME PROPERTY VALUE SOURCE
zp cachefile - default
[root@spark01 test]# zdb -b zp
Traversing all blocks to verify nothing leaked ...
No leaks (block sum matches space maps exactly)
bp count: 39633
bp logical: 1958756352 avg: 49422
bp physical: 1921044480 avg: 48470 compression: 1.02
bp allocated: 2592529408 avg: 65413 compression: 0.76
bp deduped: 1876494336 ref>1: 8794 deduplication: 1.72
SPA allocated: 716035072 used: 8.40%
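Besides zdb, the size of the dedup table itself can be checked with zpool status; the -D flag is standard, though the exact output format varies by ZFS version:

# Print DDT statistics: entry count, bytes per entry on disk and in core
zpool status -D zp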
References
1. https://pthree.org/2012/12/18/zfs-administration-part-xi-compression-and-deduplication/
2. http://blog.163.com/digoal@126/blog/static/16387704020144197501438/