ZFS (sync, async) R/W IOPS / throughput performance tuning
Background
This article discusses techniques for tuning ZFS read/write IOPS and throughput, treating synchronous and asynchronous I/O separately.
Factors that affect performance
1. The performance of the underlying devices directly determines synchronous read/write IOPS and throughput. Asynchronous read/write performance depends mainly on the cache (ARC, L2ARC) devices and their configuration.
2. The redundancy layout chosen for the vdevs affects IOPS and throughput.
Because ZPOOL I/O is striped across all vdevs, the more vdevs, the higher the IOPS and throughput, as illustrated below.
Within a single vdev, write performance: mirror > raidz1 > raidz2 > raidz3.
Read performance is determined by the number of disks that actually hold data: raidz1 (3 disks) = raidz2 (4 disks) = raidz3 (5 disks) > an n-way mirror.
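As a small illustration of the vdev-count point (the pool and device names here are placeholders, not the pool used in the tests below), a pool built from two raidz1 vdevs stripes I/O across both vdevs, giving roughly twice the aggregate IOPS of a single raidz1 vdev:
# zpool create tank raidz1 sda sdb sdc raidz1 sdd sde sdf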
3. I/O alignment with the underlying devices affects IOPS.
ashift must be specified when the zpool (vdev) is created and can never be changed afterwards.
All devices in one vdev should have the same sector size. If they differ, either use the largest sector size as the ashift, or place the mismatched block devices into separate vdevs.
For example, if sda and sdb have 512-byte sectors while sdc and sdd have 4 KB sectors:
zpool create -o ashift=9 zp1 mirror sda sdb
zpool add -o ashift=12 zp1 mirror sdc sdd
ashift
Pool sector size exponent, to the power of 2 (internally referred to as "ashift"). I/O operations will be
aligned to the specified size boundaries. Additionally, the minimum (disk) write size will be set to the
specified size, so this represents a space vs. performance trade-off. The typical case for setting this
property is when performance is important and the underlying disks use 4KiB sectors but report 512B sectors
to the OS (for compatibility reasons); in that case, set ashift=12 (which is 1<<12 = 4096).
For optimal performance, the pool sector size should be greater than or equal to the sector size of the
underlying disks. Since the property cannot be changed after pool creation, if in a given pool, you ever
want to use drives that report 4KiB sectors, you must set ashift=12 at pool creation time.
Keep in mind that the ashift is vdev specific and is not a pool global. This means that when adding new
vdevs to an existing pool you may need to specify the ashift.
This source file lists the sector sizes of some common devices:
https://github.com/zfsonlinux/zfs/blob/master/cmd/zpool/zpool_vdev.c#L108
If you are not sure about the sector size of the underlying devices, you can set ashift to 13 (8 KB) to stay aligned.
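If you prefer to check what the disks themselves report before picking an ashift, the standard Linux tools below can be used (a quick sketch; /dev/sda is a placeholder for your device). A physical sector of 4096 combined with a logical sector of 512 is exactly the 512-emulation case described in the man page above, where ashift=12 is appropriate.
# lsblk -o NAME,PHY-SEC,LOG-SEC
# blockdev --getpbsz --getss /dev/sda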
For example:
# zpool create -o ashift=13 zp1 scsi-36c81f660eb17fb001b2c5fec6553ff5e
# zpool create -o ashift=9 zp2 scsi-36c81f660eb17fb001b2c5ff465cff3ed
# zfs create -o mountpoint=/data01 zp1/data01
# zfs create -o mountpoint=/data02 zp2/data02
# date +%F%T; dd if=/dev/zero of=/data01/test.img bs=1024K count=8192 oflag=sync,noatime,nonblock; date +%F%T;
2014-06-2609:57:35
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 46.4277 s, 185 MB/s
2014-06-2609:58:22
# date +%F%T; dd if=/dev/zero of=/data02/test.img bs=1024K count=8192 oflag=sync,noatime,nonblock; date +%F%T;
2014-06-2609:58:32
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 43.9984 s, 195 MB/s
2014-06-2609:59:16
# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
zp1 3.62T 8.01G 3.62T 0% 1.00x ONLINE -
zp2 3.62T 8.00G 3.62T 0% 1.00x ONLINE -
Large files show no visible difference. For small files, if a file is smaller than the block size implied by ashift, space is wasted, small-file write efficiency drops, and cache usage goes up.
4. The mode of the underlying devices: JBOD or passthrough is recommended, bypassing the RAID controller's own logic.
5. ZFS properties and parameters directly affect IOPS and throughput.
5.1
For database-type workloads (large files accessed as small, scattered records), choose a recordsize greater than or equal to the database block size. For example, PostgreSQL uses an 8K block_size, so a ZFS recordsize of at least 8 KB is recommended. In general there is no need to change recordsize; the default 128K satisfies most workloads.
recordsize=size
Specifies a suggested block size for files in the file system. This property is designed solely for use
with database workloads that access files in fixed-size records. ZFS automatically tunes block sizes
according to internal algorithms optimized for typical access patterns.
For databases that create very large files but access them in small random chunks, these algorithms may be
suboptimal. Specifying a recordsize greater than or equal to the record size of the database can result in
significant performance gains. Use of this property for general purpose file systems is strongly discouraged, and may adversely affect performance.
The size specified must be a power of two greater than or equal to 512 and less than or equal to 128 Kbytes.
Changing the file system’s recordsize affects only files created afterward; existing files are unaffected.
This property can also be referred to by its shortened column name, recsize.
Test:
# zpool create -o ashift=12 zp1 scsi-36c81f660eb17fb001b2c5fec6553ff5e
# zfs create -o mountpoint=/data01 -o recordsize=8K -o atime=off zp1/data01
# zfs create -o mountpoint=/data02 -o recordsize=128K -o atime=off zp1/data02
# zfs create -o mountpoint=/data03 -o recordsize=512 -o atime=off zp1/data03
Disable data caching so the cache does not skew the results.
# zfs set primarycache=metadata zp1/data01
# zfs set primarycache=metadata zp1/data02
# zfs set primarycache=metadata zp1/data03
# mkdir -p /data01/pgdata
# mkdir -p /data02/pgdata
# mkdir -p /data03/pgdata
# chown postgres:postgres /data0*/pgdata
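The pg_test_fsync runs below were presumably invoked against each mount point, along these lines (the exact file paths are assumed from the setup above):
> pg_test_fsync -f /data01/pgdata/1
> pg_test_fsync -f /data02/pgdata/1
> pg_test_fsync -f /data03/pgdata/1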
pg_test_fsync results: recordsize=512 is the worst; 8K and 128K are about the same.
recordsize=512
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
fdatasync 252.052 ops/sec 3967 usecs/op
fsync 248.701 ops/sec 4021 usecs/op
Non-Sync'ed 8kB writes:
write 7615.510 ops/sec 131 usecs/op
recordsize=8K
fdatasync 329.874 ops/sec 3031 usecs/op
fsync 329.008 ops/sec 3039 usecs/op
Non-Sync'ed 8kB writes:
write 83849.214 ops/sec 12 usecs/op
recordsize=128K
fdatasync 329.207 ops/sec 3038 usecs/op
fsync 328.739 ops/sec 3042 usecs/op
Non-Sync'ed 8kB writes:
write 76100.311 ops/sec 13 usecs/op
5.2
Compression speed and compression ratio cannot both be maximized; LZ4 is generally recommended as a good compromise between the two.
compression=on | off | lzjb | gzip | gzip-N | zle | lz4
Controls the compression algorithm used for this dataset. The lzjb compression algorithm is optimized for
performance while providing decent data compression. Setting compression to on uses the lzjb compression
algorithm.
The gzip compression algorithm uses the same compression as the gzip(1) command. You can specify the gzip
level by using the value gzip-N where N is an integer from 1 (fastest) to 9 (best compression ratio). Currently, gzip is equivalent to gzip-6 (which is also the default for gzip(1)).
The zle (zero-length encoding) compression algorithm is a fast and simple algorithm to eliminate runs of
zeroes.
The lz4 compression algorithm is a high-performance replacement for the lzjb algorithm. It features significantly faster compression and decompression, as well as a moderately higher compression ratio than lzjb,
but can only be used on pools with the lz4_compress feature set to enabled. See zpool-features(5) for
details on ZFS feature flags and the lz4_compress feature.
This property can also be referred to by its shortened column name compress. Changing this property affects
only newly-written data.
Test: with compression enabled and with it disabled, throughput is about the same.
# zfs set compression=lz4 zp1/data02
# date +%F%T; dd if=/dev/zero of=/data02/test.img ibs=1024K obs=8K count=100 oflag=nonblock,sync,noatime; date +%F%T
2014-06-2610:59:16
100+0 records in
12800+0 records out
104857600 bytes (105 MB) copied, 38.9054 s, 2.7 MB/s
2014-06-2610:59:55
# zfs set compression=off zp1/data02
# date +%F%T; dd if=/dev/zero of=/data02/test.img ibs=1024K obs=8K count=100 oflag=nonblock,sync,noatime; date +%F%T
2014-06-2611:00:08
100+0 records in
12800+0 records out
104857600 bytes (105 MB) copied, 38.8295 s, 2.7 MB/s
2014-06-2611:00:46
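To confirm whether compression is actually having any effect on a dataset, the standard properties can be inspected (a quick verification step, not part of the original test run):
# zfs get compression,compressratio zp1/data02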
After enabling compression, pay attention to a few ZFS kernel module parameters; depending on the settings, the L2ARC may not cache compressed buffers.
# modinfo zfs|grep compre
parm: zfs_sync_pass_dont_compress:Don't compress starting in this pass (int)
parm: zfs_mdcomp_disable:Disable meta data compression (int)
parm: l2arc_nocompress:Skip compressing L2ARC buffers (int)
In zio.c:
int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */
module_param(zfs_sync_pass_dont_compress, int, 0644);
MODULE_PARM_DESC(zfs_sync_pass_dont_compress, ...
static int
zio_write_bp_init(zio_t *zio)
{
        ...
        if (pass >= zfs_sync_pass_dont_compress)
                compress = ZIO_COMPRESS_OFF;
In arc.c:
int l2arc_nocompress = B_FALSE; /* don't compress bufs */
5.3
The number of copies stored per block of file data. Changing this is generally not recommended, unless neither your vdevs nor the underlying block devices have any redundancy at all. It also reduces write IOPS.
copies=1 | 2 | 3
Controls the number of copies of data stored for this dataset. These copies are in addition to any redundancy provided by the pool, for example, mirroring or RAID-Z. The copies are stored on different disks, if possible. The space used by multiple copies is charged to the associated file and dataset, changing the used property and counting against quotas and reservations.
Changing this property only affects newly-written data. Therefore, set this property at file system creation time by using the -o copies=N option.
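A minimal example of setting it at creation time (the dataset name and mount point here are made up purely for illustration):
# zfs create -o copies=2 -o mountpoint=/data04 zp1/data04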
5.4
Block checksumming. It has some impact on IOPS, but disabling it is strongly discouraged.
checksum=on | off | fletcher2 | fletcher4 | sha256
Controls the checksum used to verify data integrity. The default value is on, which automatically selects
an appropriate algorithm (currently, fletcher4, but this may change in future releases). The value off disables integrity checking on user data. Disabling checksums is NOT a recommended practice.
Changing this property affects only newly-written data.
5.5 Whether to update file access timestamps when files are read. Turning this off is generally recommended, unless the application relies on file access times.
atime=on | off
Controls whether the access time for files is updated when they are read. Turning this property off avoids
producing write traffic when reading files and can result in significant performance gains, though it might
confuse mailers and other similar utilities. The default value is on. See also relatime below.
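For example, on the dataset used elsewhere in this article, and verifying that the property took effect:
# zfs set atime=off zp1/data01
# zfs get atime zp1/data01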
5.6
Primary cache (ARC) configuration.
all means both user data and metadata are cached; none means nothing is cached, which is effectively no cache at all; metadata means only metadata is cached.
Enabling the cache greatly improves read performance; write performance drops somewhat (the difference is not large).
The main impact is on reads: if the ARC is disabled, read performance becomes very poor.
primarycache=all | none | metadata
Controls what is cached in the primary cache (ARC). If this property is set to all, then both user data and
metadata is cached. If this property is set to none, then neither user data nor metadata is cached. If this
property is set to metadata, then only metadata is cached. The default value is all.
ARC usage limits can be adjusted via the ZFS kernel module parameters listed below; an example of setting them follows the list.
/sys/module/zfs/parameters/zfs_arc_grow_retry:5
/sys/module/zfs/parameters/zfs_arc_max:0
/sys/module/zfs/parameters/zfs_arc_memory_throttle_disable:1
/sys/module/zfs/parameters/zfs_arc_meta_limit:0
/sys/module/zfs/parameters/zfs_arc_meta_prune:1048576
/sys/module/zfs/parameters/zfs_arc_min:0
/sys/module/zfs/parameters/zfs_arc_min_prefetch_lifespan:1000
/sys/module/zfs/parameters/zfs_arc_p_aggressive_disable:1
/sys/module/zfs/parameters/zfs_arc_p_dampener_disable:1
/sys/module/zfs/parameters/zfs_arc_shrink_shift:5
parm: zfs_arc_min:Min arc size (ulong)
parm: zfs_arc_max:Max arc size (ulong)
parm: zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm: zfs_arc_meta_prune:Bytes of meta data to prune (int)
parm: zfs_arc_grow_retry:Seconds before growing arc size (int)
parm: zfs_arc_p_aggressive_disable:disable aggressive arc_p grow (int)
parm: zfs_arc_p_dampener_disable:disable arc_p adapt dampener (int)
parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm: zfs_arc_memory_throttle_disable:disable memory throttle (int)
parm: zfs_arc_min_prefetch_lifespan:Min life of prefetch block (int)
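For example, to cap the ARC size persistently, a common ZFS on Linux approach is a modprobe option (the 16 GB value below is purely illustrative, not a recommendation; it takes effect after the module is reloaded or at the next boot). The same parameters can also be changed at runtime by writing to /sys/module/zfs/parameters/.
# echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf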
Kernel parameters that limit how much memory dirty data may occupy (an example of adjusting one at runtime follows the listing):
# modinfo zfs|grep dirty
parm: zfs_vdev_async_write_active_max_dirty_percent:Async write concurrency max threshold (int)
parm: zfs_vdev_async_write_active_min_dirty_percent:Async write concurrency min threshold (int)
parm: zfs_dirty_data_max_percent:percent of ram can be dirty (int)
parm: zfs_dirty_data_max_max_percent:zfs_dirty_data_max upper bound as % of RAM (int)
parm: zfs_delay_min_dirty_percent:transaction delay threshold (int)
parm: zfs_dirty_data_max:determines the dirty space limit (ulong)
parm: zfs_dirty_data_max_max:zfs_dirty_data_max upper bound in bytes (ulong)
parm: zfs_dirty_data_sync:sync txg when this much dirty data (ulong)
# grep ".*" /sys/module/zfs/parameters/*|grep dirty
/sys/module/zfs/parameters/zfs_delay_min_dirty_percent:60
/sys/module/zfs/parameters/zfs_dirty_data_max:3361508147
/sys/module/zfs/parameters/zfs_dirty_data_max_max:8403770368
/sys/module/zfs/parameters/zfs_dirty_data_max_max_percent:25
/sys/module/zfs/parameters/zfs_dirty_data_max_percent:10
/sys/module/zfs/parameters/zfs_dirty_data_sync:67108864
/sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent:60
/sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent:30
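As an illustration only (the value is arbitrary), on versions where the parameter is writable the dirty-data ceiling can be adjusted at runtime through sysfs:
# echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max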
Test: for asynchronous writes, primarycache=metadata is slightly faster at first; once the cache fills up, cache=metadata and cache=all converge to the same speed.
The more block devices in the zpool, the more visible the difference. Observe it with zpool iostat -v 1.
# zpool create -o ashift=12 -o autoreplace=on zp1 scsi-36c81f660eb17fb001b2c5fec6553ff5e scsi-36c81f660eb17fb001b2c5ff465cff3ed scsi-36c81f660eb17fb001b2c5ffa662f3df2 scsi-36c81f660eb17fb001b2c5fff66848a6c scsi-36c81f660eb17fb001b2c600466cb5810 scsi-36c81f660eb17fb001b2c60096714bcf2 scsi-36c81f660eb17fb001b2c600e6761a9bd scsi-36c81f660eb17fb001b2c601267a63fcc scsi-36c81f660eb17fb001b2c601867f2c341 scsi-36c81f660eb17fb001b2c601e685414b5 scsi-36c81f660eb17fb001b2c602368a21621 scsi-36c81f660eb17fb001b2c602a690a4ed8
# zfs create -o mountpoint=/data01 -o atime=off -o primarycache=metadata zp1/data01
# dd if=/dev/zero of=/data01/test.img bs=1024K count=819200
^C185116+0 records in
185116+0 records out
194108194816 bytes (194 GB) copied, 113.589 s, 1.7 GB/s
# zfs destroy zp1/data01
# zfs create -o mountpoint=/data01 -o atime=off -o primarycache=all zp1/data01
# dd if=/dev/zero of=/data01/test.img bs=1024K count=819200
^C147262+0 records in
147262+0 records out
154415398912 bytes (154 GB) copied, 90.1703 s, 1.7 GB/s
Read test: with the ARC disabled for data, performance is very poor. It is not yet clear whether ZFS kernel parameters can be tuned to improve read performance straight from the block devices.
# zfs set primarycache=metadata zp1/data01
# cp /data01/test.img /data01/test.img1
# zpool iostat -v 1
capacity operations bandwidth
pool alloc free read write read write
---------------------------------------- ----- ----- ----- ----- ----- -----
zp1 80.5G 43.4T 289 592 35.9M 64.5M
scsi-36c81f660eb17fb001b2c5fec6553ff5e 6.72G 3.62T 23 44 3.00M 5.49M
scsi-36c81f660eb17fb001b2c5ff465cff3ed 6.69G 3.62T 24 44 3.12M 5.49M
scsi-36c81f660eb17fb001b2c5ffa662f3df2 6.71G 3.62T 24 49 3.00M 5.76M
scsi-36c81f660eb17fb001b2c5fff66848a6c 6.72G 3.62T 23 44 3.00M 5.01M
scsi-36c81f660eb17fb001b2c600466cb5810 6.70G 3.62T 24 62 3.12M 5.54M
scsi-36c81f660eb17fb001b2c60096714bcf2 6.69G 3.62T 21 54 2.75M 5.15M
scsi-36c81f660eb17fb001b2c600e6761a9bd 6.71G 3.62T 27 53 3.37M 5.35M
scsi-36c81f660eb17fb001b2c601267a63fcc 6.71G 3.62T 21 46 2.75M 4.90M
scsi-36c81f660eb17fb001b2c601867f2c341 6.68G 3.62T 22 46 2.87M 5.02M
scsi-36c81f660eb17fb001b2c601e685414b5 6.74G 3.62T 25 54 3.24M 5.90M
scsi-36c81f660eb17fb001b2c602368a21621 6.71G 3.62T 23 43 3.00M 5.49M
scsi-36c81f660eb17fb001b2c602a690a4ed8 6.69G 3.62T 21 42 2.75M 5.37M
cache - - - - - -
pcie-shannon-6819246149b014-part1 5.14M 800G 0 1 0 68.9K
---------------------------------------- ----- ----- ----- ----- ----- -----
With the ARC enabled, read performance improves: note that per-disk read IOPS rise to 300+, versus only about 20+ before enabling the ARC.
# zfs set primarycache=all zp1/data01
# cp /data01/test.img /data01/test.img1
cp: overwrite `/data01/test.img1'? y
capacity operations bandwidth
pool alloc free read write read write
---------------------------------------- ----- ----- ----- ----- ----- -----
zp1 82.8G 43.4T 3.54K 4.01K 449M 476M
scsi-36c81f660eb17fb001b2c5fec6553ff5e 6.91G 3.62T 318 318 39.6M 39.6M
scsi-36c81f660eb17fb001b2c5ff465cff3ed 6.89G 3.62T 286 328 35.6M 40.2M
scsi-36c81f660eb17fb001b2c5ffa662f3df2 6.91G 3.62T 304 335 37.9M 39.6M
scsi-36c81f660eb17fb001b2c5fff66848a6c 6.92G 3.62T 299 335 37.3M 40.3M
scsi-36c81f660eb17fb001b2c600466cb5810 6.89G 3.62T 288 322 35.5M 37.1M
scsi-36c81f660eb17fb001b2c60096714bcf2 6.89G 3.62T 300 337 37.3M 39.4M
scsi-36c81f660eb17fb001b2c600e6761a9bd 6.90G 3.62T 305 330 37.9M 39.0M
scsi-36c81f660eb17fb001b2c601267a63fcc 6.90G 3.62T 294 343 36.8M 40.1M
scsi-36c81f660eb17fb001b2c601867f2c341 6.88G 3.62T 300 373 36.8M 39.5M
scsi-36c81f660eb17fb001b2c601e685414b5 6.94G 3.62T 321 374 39.7M 40.4M
scsi-36c81f660eb17fb001b2c602368a21621 6.90G 3.62T 292 365 36.4M 39.6M
scsi-36c81f660eb17fb001b2c602a690a4ed8 6.89G 3.62T 308 339 38.2M 41.2M
cache - - - - - -
pcie-shannon-6819246149b014-part1 454M 800G 0 649 0 79.5M
---------------------------------------- ----- ----- ----- ----- ----- -----
5.7
Secondary cache (L2ARC) configuration, i.e. the cache device(s) in the zpool.
If you use an L2ARC, an SSD is recommended for it.
secondarycache=all | none | metadata
Controls what is cached in the secondary cache (L2ARC). If this property is set to all, then both user data
and metadata is cached. If this property is set to none, then neither user data nor metadata is cached. If
this property is set to metadata, then only metadata is cached. The default value is all.
L2ARC data is fed from the ARC's MRU and MFU lists, so if the ARC is disabled, the L2ARC will not hold cached data either.
Therefore, if you want to use the L2ARC, make sure both the ARC and the L2ARC are enabled.
The L2ARC does not store dirty data, so for workloads whose active data changes frequently, the L2ARC is of little use.
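For reference, this is how a cache device is attached to a pool and how caching of user data is kept enabled on a dataset (the device path matches the SSD partition used elsewhere in this article):
# zpool add zp1 cache pcie-shannon-6819246149b014-part1
# zfs set secondarycache=all zp1/data01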
5.8
Block deduplication. For most workloads it brings little benefit, and with a large dataset it consumes a great deal of memory. It also hurts both IOPS and throughput.
Generally not recommended.
dedup=on | off | verify | sha256[,verify]
Controls whether deduplication is in effect for a dataset. The default value is off. The default checksum
used for deduplication is sha256 (subject to change). When dedup is enabled, the dedup checksum algorithm
overrides the checksum property. Setting the value to verify is equivalent to specifying sha256,verify.
If the property is set to verify, then, whenever two blocks have the same signature, ZFS will do a byte-
for-byte comparison with the existing block to ensure that the contents are identical.
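If you want to estimate whether dedup would pay off before enabling it, zdb can simulate deduplication on an existing pool and print a histogram with the projected ratio (read-only, but it can take a long time and use a lot of memory on large pools):
# zdb -S zp1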
5.9
ZIL usage policy. For synchronous writes, latency means use the ZIL (log) devices; throughput means do not use the log devices (strongly not recommended).
If you run PostgreSQL with asynchronous commit, it matters little whether a ZIL device is used.
To deliver good synchronous write IOPS, the ZIL needs a device with very good IOPS.
logbias = latency | throughput
Provide a hint to ZFS about handling of synchronous requests in this dataset. If logbias is set to latency (the default), ZFS will use pool log devices (if configured) to handle the requests at low latency. If logbias is set to throughput, ZFS will not use configured pool log devices. ZFS will instead optimize synchronous operations for global pool throughput and efficient use of resources.
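For reference, a dedicated log (ZIL) device is added to a pool like this (the partition name matches the SSD log device shown in the zpool iostat output below):
# zpool add zp1 log pcie-shannon-6819246149b014-part2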
First, we test fsync performance on a zpool built from 12 ordinary mechanical disks plus an SSD ZIL device.
# zfs get all|grep logbias
zp1 logbias latency default
zp1/data01 logbias latency default
> pg_test_fsync -f /data01/pgdata/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a*
fdatasync 7285.416 ops/sec 137 usecs/op
fsync 7359.841 ops/sec 136 usecs/op
fsync_writethrough n/a
open_sync n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a*
fdatasync 5396.851 ops/sec 185 usecs/op
fsync 4323.672 ops/sec 231 usecs/op
fsync_writethrough n/a
open_sync n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write n/a*
2 * 8kB open_sync writes n/a*
4 * 4kB open_sync writes n/a*
8 * 2kB open_sync writes n/a*
16 * 1kB open_sync writes n/a*
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 5859.650 ops/sec 171 usecs/op
write, close, fsync 6626.115 ops/sec 151 usecs/op
Non-Sync'ed 8kB writes:
write 82388.939 ops/sec 12 usecs/op
Note that the SSD hosting the ZIL is nowhere near 100% utilization; it stays at a fairly low level.
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 12.39 5.75 0.00 81.86
dfa 0.00 0.00 0.00 7401.00 0.00 177624.00 24.00 0.24 0.03 0.03 24.10
zpool iostat shows that the fsync calls go through the ZIL device.
# zpool iostat -v 1
capacity operations bandwidth
pool alloc free read write read write
---------------------------------------- ----- ----- ----- ----- ----- -----
zp1 160G 43.3T 0 7.23K 0 86.7M
scsi-36c81f660eb17fb001b2c5fec6553ff5e 13.4G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c5ff465cff3ed 13.4G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c5ffa662f3df2 13.3G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c5fff66848a6c 13.4G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c600466cb5810 13.3G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c60096714bcf2 13.3G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c600e6761a9bd 13.3G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c601267a63fcc 13.3G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c601867f2c341 13.3G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c601e685414b5 13.4G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c602368a21621 13.3G 3.61T 0 0 0 0
scsi-36c81f660eb17fb001b2c602a690a4ed8 13.3G 3.61T 0 0 0 0
logs - - - - - -
pcie-shannon-6819246149b014-part2 976M 1.03G 0 7.23K 0 86.7M
cache - - - - - -
pcie-shannon-6819246149b014-part1 2.03M 800G 0 0 0 0
---------------------------------------- ----- ----- ----- ----- ----- -----
Next, change this dataset's logbias to throughput, i.e. stop using the ZIL device; fsync performance drops immediately.
At this point the vdev block devices are at less than 20% of their IOPS capacity. FreeBSD does not show this problem; it is a ZFS on Linux issue. I reported it to Brian and received the following reply.
Thanks,
I've opened a new issue so we can track this.
https://github.com/zfsonlinux/zfs/issues/2431
The next step is somebody is going to have to profile the Linux case to
see what's going on. It seems like we're blocking somewhere in the
stack unnecessarily. Unfortunately, all the developers are swamped so
I'm not sure when someone will get a chance to look at this. If your
interested in getting some additional profiling data I'd suggest
starting with getting a call graph of fsync() using ftrace. That should
show us where the time is going.
http://lwn.net/Articles/370423/
Thanks,
Brian
# zfs set logbias=throughput zp1/data01
> pg_test_fsync -f /data01/pgdata/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a*
fdatasync 330.846 ops/sec 3023 usecs/op
fsync 329.942 ops/sec 3031 usecs/op
fsync_writethrough n/a
open_sync n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a*
fdatasync 329.407 ops/sec 3036 usecs/op
fsync 329.606 ops/sec 3034 usecs/op
fsync_writethrough n/a
open_sync n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write n/a*
2 * 8kB open_sync writes n/a*
4 * 4kB open_sync writes n/a*
8 * 2kB open_sync writes n/a*
16 * 1kB open_sync writes n/a*
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 324.344 ops/sec 3083 usecs/op
write, close, fsync 329.272 ops/sec 3037 usecs/op
Non-Sync'ed 8kB writes:
write 84914.324 ops/sec 12 usecs/op
What about fsync performance if the zpool is created directly on the SSD? It is basically the same as the previous configuration of mechanical-disk vdevs plus an SSD ZIL.
# zpool destroy zp1
# zpool create -o ashift=12 zp1 pcie-shannon-6819246149b014-part1
# zfs create -o mountpoint=/data01 zp1/data01
# mkdir /data01/pgdata
# chown postgres:postgres /data01/pgdata
> pg_test_fsync -f /data01/pgdata/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a*
fdatasync 6604.779 ops/sec 151 usecs/op
fsync 7086.614 ops/sec 141 usecs/op
fsync_writethrough n/a
open_sync n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a*
fdatasync 5760.927 ops/sec 174 usecs/op
fsync 5677.560 ops/sec 176 usecs/op
fsync_writethrough n/a
open_sync n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write n/a*
2 * 8kB open_sync writes n/a*
4 * 4kB open_sync writes n/a*
8 * 2kB open_sync writes n/a*
16 * 1kB open_sync writes n/a*
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 6561.159 ops/sec 152 usecs/op
write, close, fsync 6530.990 ops/sec 153 usecs/op
Non-Sync'ed 8kB writes:
write 81261.194 ops/sec 12 usecs/op
What if we skip ZFS and use ext4 directly on the block device?
In this case the utilization of the underlying block device rises noticeably.
# mkfs.ext4 /dev/disk/by-id/pcie-shannon-6819246149b014-part2
# mount /dev/disk/by-id/pcie-shannon-6819246149b014-part2 /mnt
# chmod 777 /mnt
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 38533.583 ops/sec 26 usecs/op
fdatasync 29027.342 ops/sec 34 usecs/op
fsync 26695.490 ops/sec 37 usecs/op
fsync_writethrough n/a
open_sync 43047.350 ops/sec 23 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 23826.738 ops/sec 42 usecs/op
fdatasync 31193.925 ops/sec 32 usecs/op
fsync 29445.494 ops/sec 34 usecs/op
fsync_writethrough n/a
open_sync 22241.529 ops/sec 45 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 34597.675 ops/sec 29 usecs/op
2 * 8kB open_sync writes 22051.151 ops/sec 45 usecs/op
4 * 4kB open_sync writes 11751.948 ops/sec 85 usecs/op
8 * 2kB open_sync writes 804.951 ops/sec 1242 usecs/op
16 * 1kB open_sync writes 403.788 ops/sec 2477 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 18227.669 ops/sec 55 usecs/op
write, close, fsync 18158.735 ops/sec 55 usecs/op
Non-Sync'ed 8kB writes:
write 288696.375 ops/sec 3 usecs/op
iostat shows the SSD utilization is much higher now.
dfa 0.00 0.00 0.00 55244.00 0.00 441952.00 8.00 1.30 0.02 0.01 78.10
Performance of ZVOL + ext4
# zfs create -V 10G zp1/data02
# mkfs.ext4 /dev/zd0
# mount /dev/zd0 /tmp
# chmod 777 /tmp
The result is not great either.
> pg_test_fsync -f /tmp/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 5221.004 ops/sec 192 usecs/op
fdatasync 4770.779 ops/sec 210 usecs/op
fsync 2523.113 ops/sec 396 usecs/op
fsync_writethrough n/a
open_sync 5527.120 ops/sec 181 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 2740.871 ops/sec 365 usecs/op
fdatasync 3774.486 ops/sec 265 usecs/op
fsync 1927.523 ops/sec 519 usecs/op
fsync_writethrough n/a
open_sync 2747.225 ops/sec 364 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 4751.333 ops/sec 210 usecs/op
2 * 8kB open_sync writes 2729.912 ops/sec 366 usecs/op
4 * 4kB open_sync writes 1387.512 ops/sec 721 usecs/op
8 * 2kB open_sync writes 734.417 ops/sec 1362 usecs/op
16 * 1kB open_sync writes 364.665 ops/sec 2742 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 3134.067 ops/sec 319 usecs/op
write, close, fsync 3486.530 ops/sec 287 usecs/op
Non-Sync'ed 8kB writes:
write 293944.412 ops/sec 3 usecs/op
Comparing all of the cases above: ZFS does not make full use of the underlying device's fsync capability, while the raw block device with ext4 is clearly better. It is unclear whether this is an efficiency issue of ZFS on Linux or whether some ZFS kernel parameters need tuning; I later ran the same test on FreeBSD to see whether it behaves the same way.
On FreeBSD the performance is very good, essentially reaching the limit of the block device. See:
http://blog.163.com/digoal@126/blog/static/16387704020145264116819/
5.10 Behavior of synchronous calls. Disabling sync is not recommended; it can lose data after a crash. Applications such as databases expect that once fsync returns, the data really has been written to non-volatile storage; disabling sync clearly breaks that expectation.
sync=standard | always | disabled
Controls the behavior of synchronous requests (e.g. fsync, O_DSYNC). standard is the POSIX specified
behavior of ensuring all synchronous requests are written to stable storage and all devices are flushed to
ensure data is not cached by device controllers (this is the default). always causes every file system
transaction to be written and flushed before its system call returns. This has a large performance penalty.
disabled disables synchronous requests. File system transactions are only committed to stable storage periodically. This option will give the highest performance. However, it is very dangerous as ZFS would be
ignoring the synchronous transaction demands of applications such as databases or NFS. Administrators
should only use this option when the risks are understood.
Below is a test with sync disabled. We strongly advise against doing this, but the results are provided for reference.
# zfs set sync=disabled zp1/data01
# zfs get all|grep cache
zp1 primarycache all default
zp1 secondarycache all default
zp1/data01 primarycache all default
zp1/data01 secondarycache all default
> pg_test_fsync -f /data01/pgdata/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a*
fdatasync 109380.512 ops/sec 9 usecs/op
fsync 115186.570 ops/sec 9 usecs/op
fsync_writethrough n/a
open_sync n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a*
fdatasync 60158.540 ops/sec 17 usecs/op
fsync 60352.231 ops/sec 17 usecs/op
fsync_writethrough n/a
open_sync n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write n/a*
2 * 8kB open_sync writes n/a*
4 * 4kB open_sync writes n/a*
8 * 2kB open_sync writes n/a*
16 * 1kB open_sync writes n/a*
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 75829.757 ops/sec 13 usecs/op
write, close, fsync 75501.094 ops/sec 13 usecs/op
Non-Sync'ed 8kB writes:
write 94328.592 ops/sec 11 usecs/op
With sync disabled, the cache setting hardly matters; even with the primary cache disabled as well, performance remains extremely high.
# zfs set primarycache=none zp1/data01
> pg_test_fsync -f /data01/pgdata/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a*
fdatasync 115321.769 ops/sec 9 usecs/op
fsync 115119.262 ops/sec 9 usecs/op
fsync_writethrough n/a
open_sync n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync n/a*
fdatasync 60296.171 ops/sec 17 usecs/op
fsync 60201.468 ops/sec 17 usecs/op
fsync_writethrough n/a
open_sync n/a*
* This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write n/a*
2 * 8kB open_sync writes n/a*
4 * 4kB open_sync writes n/a*
8 * 2kB open_sync writes n/a*
16 * 1kB open_sync writes n/a*
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 75542.879 ops/sec 13 usecs/op
write, close, fsync 75654.249 ops/sec 13 usecs/op
Non-Sync'ed 8kB writes:
write 95557.532 ops/sec 10 usecs/op
6. ZFS kernel module parameters can also have a major impact on performance.
See
http://blog.163.com/digoal@126/blog/static/16387704020145253599111/
References
1. zfs source
2. man zpool
3. man zfs
4. man zdb
5. http://blog.163.com/digoal@126/blog/static/1638770402014525103556357/
6. http://blog.163.com/digoal@126/blog/static/1638770402014525111238683/
7. http://blog.163.com/digoal@126/blog/static/16387704020145253599111/
8. http://fixunix.com/solaris-rss/579853-choosing-stripsize-lun-recordsize-zfs-postgresql.html
9. https://github.com/zfsonlinux/zfs/blob/master/cmd/zpool/zpool_vdev.c#L108
10. http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/
11. http://open-zfs.org/wiki/Performance_tuning
12. https://pthree.org/2013/01/03/zfs-administration-part-xvii-best-practices-and-caveats/
13. http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Memory_and_Dynamic_Reconfiguration_Recommendations
14. http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Tuning_ZFS_for_Database_Performance
15. http://www.solarisinternals.com/wiki/index.php/ZFS_for_Databases
16. https://blogs.oracle.com/roch/entry/dedup_performance_considerations1
17. https://wiki.freebsd.org/ZFSTuningGuide
18. https://blogs.oracle.com/roch/entry/proper_alignment_for_extra_performance
19. http://blog.163.com/digoal@126/blog/static/16387704020145264116819/