flashcache usage guide
Background
A few days ago I wrote an article about using flashcache to improve PostgreSQL IOPS performance:
http://blog.163.com/digoal@126/blog/static/1638770402014528115551323/
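This article covers what to watch out for when using flashcache, so that you can use it more effectively.
1. Kernel compatibility. flashcache has so far been tested on Linux kernels from 2.6.18 to 2.6.38 and can be used there. Using it on other kernels is not recommended.
2. Choosing a cache mode. flashcache currently supports 3 modes :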
Writethrough - safest, all writes are cached to ssd but also written to disk
immediately. If your ssd has slower write performance than your disk (likely
for early generation SSDs purchased in 2008-2010), this may limit your system
write performance. All disk reads are cached (tunable).
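Writes: written to the SSD (the flashcache device) and to the disk (the data device) at the same time.
Reads: all data that is read gets loaded into the SSD (the flashcache device), but this is tunable through sysctl: dev.flashcache.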
Writearound - again, very safe, writes are not written to ssd but directly to
disk. Disk blocks will only be cached after they are read. All disk reads
are cached (tunable).
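Writes: written directly to the disk (the data device), not to the SSD (the flashcache device).
Reads: all data that is read gets loaded into the SSD (the flashcache device), but this is tunable through sysctl: dev.flashcache.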
Writeback - fastest but less safe. Writes only go to the ssd initially, and
based on various policies are written to disk later. All disk reads are
cached (tunable).
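Writes: written to the SSD (the flashcache device) first, then written to the disk (the data device) asynchronously.
Reads: all data that is read gets loaded into the SSD (the flashcache device), but this is tunable through sysctl: dev.flashcache.
For sequential writes, a typical SSD and an ordinary 15K RPM disk do not differ much in performance. If the ordinary disk is the faster of the two, writearound is the better deal. In typical scenarios the three modes perform about the same.
For random writes, an SSD performs far better than an ordinary disk, so writeback is a good fit.
Optimizing for sequential writes is covered later.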
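3. Cache persistence.
Only writeback persists the cache on the ssd (the flashcache device). Because it writes to disk asynchronously, the cached data has to be persistent and must not be lost.
With writethrough and writearound, the cached data is gone after a reboot or a device remove, and losing it does not affect data consistency.
4. Known bugs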
https://github.com/facebook/flashcache/issues
5. Managing the cachedev block device: use the dmsetup command, or the 3 wrapper commands that ship with flashcache.
5.1 Creating a cache dev device
flashcache_create, flashcache_load and flashcache_destroy.
These utilities use dmsetup internally, presenting a simpler interface to create,
load and destroy flashcache volumes.
It is expected that the majority of users can use these utilities instead of using dmsetup.
flashcache_create : Create a new flashcache volume.
# flashcache_create
Usage: flashcache_create [-v] [-p back|thru|around] [-b block size] [-m md block size] [-s cache size] [-a associativity] cachedev ssd_devname disk_devname
Usage : flashcache_create Cache Mode back|thru|around is required argument
Usage : flashcache_create Default units for -b, -m, -s are sectors, or specify in k/M/G. Default associativity is 512.
-v : verbose.
-p : cache mode (writeback/writethrough/writearound).
-s : cache size. Optional. If this is not specified, the entire ssd device
is used as cache. The default units is sectors. But you can specify
k/m/g as units as well.
-b : block size. Optional. Defaults to 4KB. Must be a power of 2. Matching the sector size of the SSD device (the flashcache device) is recommended.
The default units is sectors. But you can specify k as units as well.
(A 4KB blocksize is the correct choice for the vast majority of
applications. But see the section "Cache Blocksize selection" below).
-f : force create. by pass checks (eg for ssd sectorsize).
Examples :
flashcache_create -p back -s 1g -b 4k cachedev /dev/sdc /dev/sdb
Creates a 1GB writeback cache volume with a 4KB block size on ssd
device /dev/sdc to cache the disk volume /dev/sdb. The name of the device
created is "cachedev".
flashcache_create -p thru -s 2097152 -b 8 cachedev /dev/sdc /dev/sdb
Same as above but creates a write through cache with units specified in
sectors instead. The name of the device created is "cachedev".
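The two examples above describe the same sizes in different units. As a quick check of that arithmetic, here is a small shell sketch that converts a k/m/g size into sectors. The helper name to_sectors is made up for this illustration, and it assumes 512-byte sectors and 1024-based units, which is what makes 1g equal 2097152 sectors and 4k equal 8 sectors.

```shell
#!/bin/sh
# to_sectors is a hypothetical helper, not a flashcache tool.
# Assumption: one sector is 512 bytes and k/m/g are powers of 1024.
to_sectors() {
    case "$1" in
        *[kK]) echo $(( ${1%?} * 1024 / 512 )) ;;
        *[mM]) echo $(( ${1%?} * 1024 * 1024 / 512 )) ;;
        *[gG]) echo $(( ${1%?} * 1024 * 1024 * 1024 / 512 )) ;;
        *)     echo "$1" ;;   # no suffix: the value is already in sectors
    esac
}

to_sectors 1g   # prints 2097152
to_sectors 4k   # prints 8
```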
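Remember to specify -s (cache size); otherwise the entire ssd or ssd partition is used.
-b is the cache dev block size and -m is the cache dev metadata block size.
Guidelines for choosing the cache data block size and the metadata block size :
Cache Blocksize selection : matching the underlying SSD device is recommended.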
=========================
Cache blocksize selection is critical for good cache utilization and performance.
A 4KB cache blocksize is the right choice for the vast majority of workloads (and filesystems).
Cache Metadata Blocksize selection : matching the underlying SSD device is recommended.
==================================
This section only applies to the writeback cache mode.
Writethrough and writearound modes store no cache metadata at all.
In Flashcache version 1, the metadata blocksize was fixed at 1 (512b) sector.
Flashcache version 2 removes this limitation. In version 2, we can configure
a larger flashcache metadata blocksize.
Version 2 maintains backwards compatibility for caches created with Version 1.
For these cases, a metadata blocksize of 512 will continue to be used.
flashcache_create -m can be used to optionally configure the metadata blocksize.
Defaults to 4KB.
Ideal choices for the metadata blocksize are 4KB (default) or 8KB. There is
little benefit to choosing a metadata blocksize greater than 8KB. The choice
of metadata blocksize is subject to the following rules :
1) Metadata blocksize must be a power of 2.
2) Metadata blocksize cannot be smaller than sector size configured on the
ssd device.
3) A single metadata block cannot contain metadata for 2 cache sets.
In other words,
with the default associativity of 512 (with each cache metadata slot sizing at 16 bytes),
the entire metadata for a given set fits in 8KB (512*16b).
For an associativity of 512, we cannot configure a metadata blocksize greater than 8KB.
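The three rules above are easy to check mechanically. Below is a minimal shell sketch that does so. check_md_blocksize is a hypothetical helper written for this article, not something flashcache provides; it takes the sizes in bytes and uses the 16-byte metadata slot size quoted above.

```shell
#!/bin/sh
# check_md_blocksize is a hypothetical helper, not part of flashcache.
# Arguments: metadata blocksize (bytes), ssd sector size (bytes), associativity.
# The 16-byte metadata slot size comes from the guide text quoted above.
check_md_blocksize() {
    md=$1; sector=$2; assoc=$3
    slot=16
    # Rule 1: must be a power of 2 (exactly one bit set).
    if [ "$md" -le 0 ] || [ $(( md & (md - 1) )) -ne 0 ]; then
        echo "invalid: not a power of 2"; return 1
    fi
    # Rule 2: must not be smaller than the ssd sector size.
    if [ "$md" -lt "$sector" ]; then
        echo "invalid: smaller than the ssd sector size"; return 1
    fi
    # Rule 3: one metadata block must not hold metadata for 2 cache sets.
    if [ "$md" -gt $(( assoc * slot )) ]; then
        echo "invalid: would hold metadata for more than one cache set"; return 1
    fi
    echo "ok"
}

check_md_blocksize 4096 512 512            # prints ok
check_md_blocksize 16384 512 512 || true   # prints the rule 3 message
```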
Advantages of choosing a larger (than 512b) metadata blocksize :
- Allows the ssd to be configured to larger sectors. For example, some ssds
allow choosing a 4KB sector, often a more performant choice.
- Allows flashcache to do better batching of metadata updates, potentially
reducing metadata updates, small ssd writes, reducing write amplification
and higher ssd lifetimes.
Thanks due to Earle Philhower of Virident for this feature !
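5.2 Loading an existing writeback cache dev device.
Use flashcache_load to load an existing writeback flashcache device.
After a reboot the cache has to be loaded again, either by hand or automatically through an init script managed with chkconfig.
writearound and writethrough caches do not need loading (as noted earlier, these two kinds of cache are not persisted on the ssd and are gone after a reboot).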
flashcache_load : Load an existing writeback cache volume.
flashcache_load ssd_devname [cachedev_name]
Example :
flashcache_load /dev/sd
Load the existing writeback cache on /dev/sdc, using the virtual cachedev_name from when the device was created.
If you're upgrading from an older flashcache device format that didn't store the cachedev name internally, or you want to change the cachedev name used, you can specify it as an optional second argument to flashcache_load.
For writethrough and writearound caches flashcache_load is not needed; flashcache_create
should be used each time.
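5.3 Destroying a flashcache dev.
Destroying the flashcache device of a writeback cache is dangerous: all data in the flashcache is deleted (the documentation does not say whether dirty data is written out first).
Doing this to a writeback flashcache device is not recommended. If you have to remove one, use dmsetup to remove the cache dev, because a dmsetup remove writes the dirty data to disk automatically.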
flashcache_destroy : Destroy an existing writeback flashcache. All data will be lost !!!
flashcache_destroy ssd_devname
Example :
flashcache_destroy /dev/sdc
Destroy the existing cache on /dev/sdc. All data is lost !!!
For writethrough and writearound caches this is not necessary.
6. Removing a flashcache cache dev device (that is, the device mapper device).
For a writeback cache dev, dirty data is written to disk automatically before the device is removed.
Removing a flashcache volume :
============================
Use dmsetup remove to remove a flashcache volume. For writeback
cache mode, the default behavior on a remove is to clean all dirty
cache blocks to disk. The remove will not return until all blocks
are cleaned. Progress on disk cleaning is reported on the console
(also see the "fast_remove" flashcache sysctl).
A reboot of the node will also result in all dirty cache blocks being
cleaned synchronously
(again see the note about "fast_remove" in the sysctls section).
For writethrough and writearound caches, the device removal or reboot
results in the cache being destroyed. However, there is no harm in
doing a 'dmsetup remove' to tidy up before boot, and indeed
this will be needed if you ever need to unload the flashcache kernel
module (for example to load a new version into a running system).
Example:
dmsetup remove cachedev
This removes the flashcache volume named cachedev, cleaning
all blocks prior to removal.
If the fast remove option is set to 1, dirty data is not synced to disk on remove. This is very dangerous and not recommended.
dev.flashcache.<cachedev>.fast_remove = 0
Don't sync dirty blocks when removing cache. On a reload
both DIRTY and CLEAN blocks persist in the cache. This
option can be used to do a quick cache remove.
CAUTION: The cache still has uncommitted (to disk) dirty
blocks after a fast_remove.
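Before removing a writeback cache it can be reassuring to see how many dirty blocks are still waiting to be written. The sketch below pulls that number out of 'dmsetup table' output. dirty_blocks is a hypothetical helper, and the "dirty blocks(N)" format it parses is assumed from the sample output in section 7 of this article; other flashcache versions may print it differently.

```shell
#!/bin/sh
# dirty_blocks is a hypothetical helper: it reads "dmsetup table" output on
# stdin and prints the number found inside "dirty blocks(N)".
# Assumption: the field is formatted as in the sample output in section 7.
dirty_blocks() {
    sed -n 's/.*dirty blocks(\([0-9][0-9]*\)).*/\1/p'
}

# Sample line copied from the dmsetup table output in section 7.
echo "dirty blocks(0), dirty percent(0)" | dirty_blocks   # prints 0

# Intended use on a real system (not run here):
#   dmsetup table cachedev | dirty_blocks
```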
7. Viewing statistics for a flashcache cache dev device, using dmsetup status or dmsetup table.
Cache Stats :
===========
Use 'dmsetup status' for cache statistics.
'dmsetup table' also dumps a number of cache related statistics.
Examples :
dmsetup status cachedev
dmsetup table cachedev
Alternatively, read the status files of the device directly :
Flashcache errors are reported in
/proc/flashcache/<cache name>/flashcache_errors
Flashcache stats are also reported in
/proc/flashcache/<cache name>/flashcache_stats
for easier parseability.
For example :
[root@db-172-16-3-150 sda1+sdc3]# dmsetup table cachedev1
0 207254565 flashcache conf:
ssd dev (/dev/sda1), disk dev (/dev/sdc3) cache mode(WRITE_BACK)
capacity(10216M), associativity(512), data block size(8K) metadata block size(4096b)
disk assoc(256K)
skip sequential thresh(0K)
total blocks(1307648), cached blocks(0), cache percent(0)
dirty blocks(0), dirty percent(0)
nr_queued(0)
Size Hist: 512:2660 1024:851 2048:832 4096:8159776 8192:317
[root@db-172-16-3-150 sda1+sdc3]# dmsetup status cachedev1
0 207254565 flashcache stats:
reads(2477), writes(6)
read hits(0), read hit percent(0)
write hits(0) write hit percent(0)
dirty write hits(0) dirty write hit percent(0)
replacement(0), write replacement(0)
write invalidates(0), read invalidates(0)
pending enqueues(0), pending inval(0)
metadata dirties(0), metadata cleans(0)
metadata batch(0) metadata ssd writes(0)
cleanings(0) fallow cleanings(0)
no room(0) front merge(0) back merge(0)
force_clean_block(0)
disk reads(2477), disk writes(6) ssd reads(0) ssd writes(0)
uncached reads(2477), uncached writes(6), uncached IO requeue(0)
disk read errors(0), disk write errors(0) ssd read errors(0) ssd write errors(0)
uncached sequential reads(0), uncached sequential writes(0)
pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)
lru hot blocks(653824), lru warm blocks(653824)
lru promotions(0), lru demotions(0)
Reading the device status files directly :
[root@db-172-16-3-150 sda1+sdc3]# cat /proc/flashcache/sda1+sdc3/flashcache_
flashcache_errors flashcache_iosize_hist flashcache_pidlists flashcache_stats
Error statistics :
[root@db-172-16-3-150 sda1+sdc3]# cat /proc/flashcache/sda1+sdc3/flashcache_errors
disk_read_errors=0 disk_write_errors=0 ssd_read_errors=0 ssd_write_errors=0 memory_alloc_errors=0
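Process whitelist and blacklist: the lists of PIDs whose IO on the flashcache device is or is not cached. The lists are used together with the cache_all sysctl and are modified through the ioctls described in section 11.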
[root@db-172-16-3-150 sda1+sdc3]# cat /proc/flashcache/sda1+sdc3/flashcache_pidlists
Blacklist:
Whitelist:
IO size histogram :
[root@db-172-16-3-150 sda1+sdc3]# cat /proc/flashcache/sda1+sdc3/flashcache_iosize_hist
512:2660 1024:851 1536:0 2048:832 2560:0 3072:0 3584:0 4096:8159776 4608:0 5120:0 5632:0 6144:0 6656:0 7168:0 7680:0 8192:317 8704:0 9216:0 9728:0 10240:0 10752:0 11264:0 11776:0 12288:0 12800:0 13312:0 13824:0 14336:0 14848:0 15360:0 15872:0 16384:0
Statistics :
[root@db-172-16-3-150 sda1+sdc3]# cat /proc/flashcache/sda1+sdc3/flashcache_stats
reads=2477 writes=6
read_hits=0 read_hit_percent=0 write_hits=0 write_hit_percent=0 dirty_write_hits=0 dirty_write_hit_percent=0 replacement=0 write_replacement=0 write_invalidates=0 read_invalidates=0 pending_enqueues=0 pending_inval=0 metadata_dirties=0 metadata_cleans=0 metadata_batch=0 metadata_ssd_writes=0 cleanings=0 fallow_cleanings=0 no_room=0 front_merge=0 back_merge=0 disk_reads=2477 disk_writes=6 ssd_reads=0 ssd_writes=0 uncached_reads=2477 uncached_writes=6 uncached_IO_requeue=0 uncached_sequential_reads=0 uncached_sequential_writes=0 pid_adds=0 pid_dels=0 pid_drops=0 pid_expiry=0
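Since flashcache_stats is a flat list of key=value pairs, it is easy to script against. The sketch below shows one way to read a single counter and to compute a read hit percentage. fc_stat and fc_read_hit_pct are hypothetical helper names, and the input format is assumed to be exactly what the output above shows.

```shell
#!/bin/sh
# Hypothetical helpers for the key=value format of flashcache_stats.

# fc_stat KEY: print the value of KEY from stats text on stdin.
fc_stat() {
    tr ' ' '\n' | awk -F= -v k="$1" '$1 == k { print $2 }'
}

# fc_read_hit_pct: print read_hits as an integer percentage of reads.
fc_read_hit_pct() {
    tr ' ' '\n' | awk -F= '
        $1 == "reads"     { r = $2 }
        $1 == "read_hits" { h = $2 }
        END { if (r > 0) printf "%d\n", h * 100 / r; else print 0 }'
}

# Values taken from the output above.
echo "reads=2477 writes=6 read_hits=0 disk_reads=2477" | fc_stat disk_reads   # prints 2477

# Made-up numbers, only to show the percentage calculation.
echo "reads=200 read_hits=50" | fc_read_hit_pct   # prints 25

# Intended use on a real system (not run here):
#   fc_read_hit_pct < /proc/flashcache/sda1+sdc3/flashcache_stats
```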
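8. Init script for Red Hat or CentOS. For the script itself see https://github.com/facebook/flashcache/blob/master/utils/flashcache
At boot it loads the flashcache module, creates the cache dev, and mounts it automatically.
At shutdown it removes the device mapper block dev automatically. (Note: if the cache dev is not removed at shutdown, the shutdown may fail.)
Several variables have to be set in the script: SSD_DISK, BACKEND_DISK, CACHEDEV_NAME, MOUNTPOINT, FLASHCACHE_NAME
However, it currently supports automatic loading and unloading of only 1 cachedev.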
Using Flashcache sysVinit script (Redhat based systems):
=======================================================
Kindly note that this section only applies to the Redhat based systems. Use
'utils/flashcache' from the repository as the sysvinit script.
This script is to load, unload and get statistics of an existing flashcache
writeback cache volume. It helps in loading the already created cachedev during
system boot and removes the flashcache volume before system halt happens.
This script is necessary, because, when a flashcache volume is not removed
before the system halt, kernel panic occurs.
Configuring the script using chkconfig:
1. Copy 'utils/flashcache' from the repo to '/etc/init.d/flashcache'
2. Make sure this file has execute permissions,
'sudo chmod +x /etc/init.d/flashcache'.
3. Edit this file and specify the values for the following variables
SSD_DISK, BACKEND_DISK, CACHEDEV_NAME, MOUNTPOINT, FLASHCACHE_NAME
4. Modify the headers in the file if necessary.
By default, it starts in runlevel 3, with start-stop priority 90-10
5. Register this file using chkconfig
'chkconfig --add /etc/init.d/flashcache'
For example :
[root@db-172-16-3-150 ~]# cp /opt/soft_bak/flashcache/flashcache-master/utils/flashcache /etc/init.d/
[root@db-172-16-3-150 ~]# chmod 755 /etc/init.d/flashcache
[root@db-172-16-3-150 ~]# vi /etc/init.d/flashcache
SSD_DISK=/dev/sda1
BACKEND_DISK=/dev/sdc3
CACHEDEV_NAME=cachedev1
MOUNTPOINT=/opt
FLASHCACHE_NAME=sda1+sdc3
[root@db-172-16-3-150 ~]# service flashcache start
Starting Flashcache...
[root@db-172-16-3-150 ~]# df -h
/dev/mapper/cachedev1
98G 51G 42G 55% /opt
[root@db-172-16-3-150 ~]# service flashcache status
Flashcache status: loaded
0 207254565 flashcache stats:
reads(1598), writes(1)
read hits(0), read hit percent(0)
write hits(0) write hit percent(0)
dirty write hits(0) dirty write hit percent(0)
replacement(0), write replacement(0)
write invalidates(0), read invalidates(0)
pending enqueues(0), pending inval(0)
metadata dirties(0), metadata cleans(0)
metadata batch(0) metadata ssd writes(0)
cleanings(0) fallow cleanings(0)
no room(0) front merge(0) back merge(0)
force_clean_block(0)
disk reads(1598), disk writes(1) ssd reads(0) ssd writes(0)
uncached reads(1598), uncached writes(1), uncached IO requeue(0)
disk read errors(0), disk write errors(0) ssd read errors(0) ssd write errors(0)
uncached sequential reads(0), uncached sequential writes(0)
pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)
lru hot blocks(6144), lru warm blocks(6144)
lru promotions(0), lru demotions(0)
[root@db-172-16-3-150 ~]# service flashcache stop
dev.flashcache.sda1+sdc3.fast_remove = 0
Flushing flashcache: Flushes to /dev/sdc3
9. flashcache module parameters. They are configured per ssd devname+disk devname pair :
FlashCache Sysctls :
==================
Flashcache sysctls operate on a per-cache device basis. A couple of examples
first.
Sysctls for a writearound or writethrough mode cache :
cache device /dev/ram3, disk device /dev/ram4
dev.flashcache.ram3+ram4.cache_all = 1
dev.flashcache.ram3+ram4.zero_stats = 0
dev.flashcache.ram3+ram4.reclaim_policy = 0
dev.flashcache.ram3+ram4.pid_expiry_secs = 60
dev.flashcache.ram3+ram4.max_pids = 100
dev.flashcache.ram3+ram4.do_pid_expiry = 0
dev.flashcache.ram3+ram4.io_latency_hist = 0
dev.flashcache.ram3+ram4.skip_seq_thresh_kb = 0
Sysctls for a writeback mode cache :
cache device /dev/sdb, disk device /dev/cciss/c0d2
dev.flashcache.sdb+c0d2.fallow_delay = 900
dev.flashcache.sdb+c0d2.fallow_clean_speed = 2
dev.flashcache.sdb+c0d2.cache_all = 1
dev.flashcache.sdb+c0d2.fast_remove = 0
dev.flashcache.sdb+c0d2.zero_stats = 0
dev.flashcache.sdb+c0d2.reclaim_policy = 0
dev.flashcache.sdb+c0d2.pid_expiry_secs = 60
dev.flashcache.sdb+c0d2.max_pids = 100
dev.flashcache.sdb+c0d2.do_pid_expiry = 0
dev.flashcache.sdb+c0d2.max_clean_ios_set = 2
dev.flashcache.sdb+c0d2.max_clean_ios_total = 4
dev.flashcache.sdb+c0d2.dirty_thresh_pct = 20
dev.flashcache.sdb+c0d2.stop_sync = 0
dev.flashcache.sdb+c0d2.do_sync = 0
dev.flashcache.sdb+c0d2.io_latency_hist = 0
dev.flashcache.sdb+c0d2.skip_seq_thresh_kb = 0
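Judging from the two examples above (and from FLASHCACHE_NAME=sda1+sdc3 in section 8), the sysctl prefix appears to be built from the last path component of the ssd device and of the disk device. The sketch below encodes that observation. fc_sysctl_prefix is a hypothetical helper, and the naming rule is inferred from these examples only, so verify it against 'sysctl -a' on your own system.

```shell
#!/bin/sh
# fc_sysctl_prefix is a hypothetical helper.
# Assumption (inferred from the examples above): the per-cache sysctl prefix
# is dev.flashcache.<basename of ssd dev>+<basename of disk dev>.
fc_sysctl_prefix() {
    echo "dev.flashcache.$(basename "$1")+$(basename "$2")"
}

fc_sysctl_prefix /dev/sdb /dev/cciss/c0d2   # prints dev.flashcache.sdb+c0d2
fc_sysctl_prefix /dev/ram3 /dev/ram4        # prints dev.flashcache.ram3+ram4

# Intended use on a real system (not run here):
#   sysctl -w "$(fc_sysctl_prefix /dev/sda1 /dev/sdc3).skip_seq_thresh_kb=1024"
```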
Sysctls common to all cache modes :
dev.flashcache.<cachedev>.cache_all:
Global caching mode to cache everything or cache nothing.
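See section on Caching Controls. Defaults to "cache everything". This controls whether everything or nothing is cached (caching can additionally be controlled with the PID whitelist and blacklist): to use the whitelist, set cache_all=0; to use the blacklist, set cache_all=1.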
dev.flashcache.<cachedev>.zero_stats:
Zero stats (once).
dev.flashcache.<cachedev>.reclaim_policy:
FIFO (0) vs LRU (1). Defaults to FIFO. Can be switched at
runtime.
dev.flashcache.<cachedev>.io_latency_hist:
Compute IO latencies and plot these out on a histogram.
The scale is 250 usecs. This is disabled by default since
internally flashcache uses gettimeofday() to compute latency
and this can get expensive depending on the clocksource used.
Setting this to 1 enables computation of IO latencies.
The IO latency histogram is appended to 'dmsetup status'.
Tuning the following is not recommended
(There is little reason to tune these)
dev.flashcache.<cachedev>.max_pids:
Maximum number of pids in the white/black lists.
dev.flashcache.<cachedev>.do_pid_expiry:
Enable expiry on the list of pids in the white/black lists.
dev.flashcache.<cachedev>.pid_expiry_secs:
Set the expiry on the pid white/black lists.
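dev.flashcache.<cachedev>.skip_seq_thresh_kb: Somewhat like the ARC design in ZFS, this keeps sequential scans out of the cache. For example, a full table scan of a large database table probably should not be loaded into the cache. Because the skip is triggered after the fact, the IO has to reach this size before further caching is turned off, so the beginning of a sequential IO (before the skip triggers) is still written to the SSD. Choose the value based on the sequential IO capability of the mechanical disk behind the cache dev, for example 100MB.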
Skip (don't cache) sequential IO larger than this number (in kb).
0 (default) means cache all IO, both sequential and random.
Sequential IO can only be determined 'after the fact', so
this much of each sequential I/O will be cached before we skip
the rest. Does not affect searching for IO in an existing cache.
The following settings are available in writeback mode only :
Sysctls for writeback mode only :
dev.flashcache.<cachedev>.fallow_delay = 900
In seconds. Clean dirty blocks that have been "idle" (not
read or written) for fallow_delay seconds. Default is 15
minutes.
Setting this to 0 disables idle cleaning completely.
dev.flashcache.<cachedev>.fallow_clean_speed = 2
The maximum number of "fallow clean" disk writes per set
per second. Defaults to 2.
dev.flashcache.<cachedev>.fast_remove = 0
Don't sync dirty blocks when removing cache. On a reload
both DIRTY and CLEAN blocks persist in the cache. This
option can be used to do a quick cache remove.
CAUTION: The cache still has uncommitted (to disk) dirty
blocks after a fast_remove.
dev.flashcache.<cachedev>.dirty_thresh_pct = 20
Flashcache will attempt to keep the dirty blocks in each set
under this %. A lower dirty threshold increases disk writes,
and reduces block overwrites, but increases the blocks
available for read caching.
dev.flashcache.<cachedev>.stop_sync = 0
Stop the sync in progress.
dev.flashcache.<cachedev>.do_sync = 0
Schedule cleaning of all dirty blocks in the cache.
Tuning the following is not recommended :
(There is little reason to tune these)
dev.flashcache.<cachedev>.max_clean_ios_set = 2
Maximum writes that can be issued per set when cleaning
blocks.
dev.flashcache.<cachedev>.max_clean_ios_total = 4
Maximum writes that can be issued when syncing all blocks.
10. Managing the cache device directly with dmsetup. The flashcache_xxx wrapper commands can be used instead, so using dmsetup directly is not required.
Using dmsetup to create and load flashcache volumes :
===================================================
Few users will need to use dmsetup natively to create and load
flashcache volumes. This section covers that.
dmsetup create device_name table_file
where
device_name: name of the flashcache device being created or loaded.
table_file : other cache args (format below). If this is omitted, dmsetup
attempts to read this from stdin.
table_file format :
0 <disk dev sz in sectors> flashcache <disk dev> <ssd dev> <dm virtual name> <cache mode> <flashcache cmd> <blksize in sectors> [size of cache in sectors] [cache set size]
cache mode:
1: Write Back
2: Write Through
3: Write Around
flashcache cmd:
1: load existing cache
2: create cache
3: force create cache (overwriting existing cache). USE WITH CAUTION
blksize in sectors:
4KB (8 sectors, PAGE_SIZE) is the right choice for most applications.
See note on block size selection below.
Unused (can be omitted) for cache loads.
size of cache in sectors:
Optional. if size is not specified, the entire ssd device is used as
cache. Needs to be a power of 2.
Unused (can be omitted) for cache loads.
cache set size:
Optional. The default set size is 512, which works well for most
applications. Little reason to change this. Needs to be a
power of 2.
Unused (can be omitted) for cache loads.
Example :
echo 0 `blockdev --getsize /dev/cciss/c0d1p2` flashcache /dev/cciss/c0d1p2 /dev/fioa2 cachedev 1 2 8 522000000 | dmsetup create cachedev
This creates a writeback cache device called "cachedev" (/dev/mapper/cachedev)
with a 4KB blocksize to cache /dev/cciss/c0d1p2 on /dev/fioa2.
The size of the cache is 522000000 sectors.
(TODO : Change loading of the cache happen via "dmsetup load" instead
of "dmsetup create").
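To make the table_file format easier to follow, here is a shell sketch that assembles a "create cache" table line from named pieces. fc_table_line is a hypothetical helper; it only prints the line, and it reuses the back/thru/around mode names from flashcache_create for readability. The field order and the numeric codes are the ones listed above.

```shell
#!/bin/sh
# fc_table_line is a hypothetical helper that prints a flashcache table line.
# Arguments: disk size in sectors, disk dev, ssd dev, dm virtual name,
#            cache mode (back|thru|around), blksize in sectors,
#            optional cache size in sectors.
fc_table_line() {
    disk_sectors=$1; disk_dev=$2; ssd_dev=$3; name=$4; mode=$5; blk=$6
    cache_sectors=$7
    case "$mode" in
        back)   m=1 ;;   # Write Back
        thru)   m=2 ;;   # Write Through
        around) m=3 ;;   # Write Around
        *) echo "unknown cache mode: $mode" >&2; return 1 ;;
    esac
    # flashcache cmd 2 means "create cache".
    line="0 $disk_sectors flashcache $disk_dev $ssd_dev $name $m 2 $blk"
    if [ -n "$cache_sectors" ]; then
        line="$line $cache_sectors"
    fi
    echo "$line"
}

# The disk size 1000 is a placeholder; on a real system it would come from
# blockdev --getsize as in the example above.
fc_table_line 1000 /dev/cciss/c0d1p2 /dev/fioa2 cachedev back 8 522000000

# Intended use on a real system (not run here):
#   fc_table_line "$(blockdev --getsize /dev/cciss/c0d1p2)" \
#       /dev/cciss/c0d1p2 /dev/fioa2 cachedev back 8 522000000 | dmsetup create cachedev
```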
11. Caching controls: whitelist and blacklist.
Caching Controls
================
Flashcache can be put in one of 2 modes - Cache Everything or
Cache Nothing (dev.flashcache.cache_all). The default is to "cache
everything".
These 2 modes have a blacklist and a whitelist.
The tgid (thread group id) for a group of pthreads can be used as a
shorthand to tag all threads in an application. The tgid for a pthread
is returned by getpid() and the pid of the individual thread is
returned by gettid().
The tgid is obtained with getpid() and the pid of an individual thread with gettid(); you can try this out with systemtap. See :
https://sourceware.org/systemtap/documentation.html
http://blog.163.com/digoal@126/blog/#m=0&t=1&c=fks_084068084086080075085082085095085080082075083081086071084
The algorithm works as follows :
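In "cache everything" mode (in short: blacklist entries are not cached, whitelist entries are exceptions that are cached, and sequential IO beyond the configured threshold is skipped),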
1) If the pid of the process issuing the IO is in the blacklist, do
not cache the IO. ELSE,
2) If the tgid is in the blacklist, don't cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
whitelist, which makes the IO cacheable).
4) Finally, even if IO is cacheable up to this point, skip sequential IO
if configured by the sysctl.
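Conversely, in "cache nothing" mode (in short: whitelist entries are cached, blacklist entries are exceptions that are not cached, and whitelisted IO is cached whether it is sequential or random),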
1) If the pid of the process issuing the IO is in the whitelist,
cache the IO. ELSE,
2) If the tgid is in the whitelist, cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
blacklist, which makes the IO non-cacheable).
4) Anything whitelisted is cached, regardless of sequential or random
IO.
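The two decision procedures above can be written down as a small model. The shell sketch below follows steps 1 to 3 of each mode and leaves out the sequential IO check from step 4. It is only a reading of the documented rules, not flashcache's kernel code; in_list and is_cacheable are hypothetical names, and the lists are plain space-separated strings.

```shell
#!/bin/sh
# A model of the documented whitelist/blacklist rules (steps 1-3 only).
# in_list and is_cacheable are hypothetical helpers, not flashcache code.

# in_list ITEM "a b c": succeed if ITEM is one of the space-separated words.
in_list() {
    for x in $2; do
        if [ "$x" = "$1" ]; then return 0; fi
    done
    return 1
}

# is_cacheable CACHE_ALL PID TGID BLACKLIST WHITELIST: print yes or no.
is_cacheable() {
    cache_all=$1; pid=$2; tgid=$3; black=$4; white=$5
    if [ "$cache_all" -eq 1 ]; then
        # "cache everything": blacklisted pid, or blacklisted tgid without
        # a whitelist exception for this pid, is not cached.
        if in_list "$pid" "$black"; then echo no; return 0; fi
        if in_list "$tgid" "$black" && ! in_list "$pid" "$white"; then
            echo no; return 0
        fi
        echo yes
    else
        # "cache nothing": whitelisted pid, or whitelisted tgid without
        # a blacklist exception for this pid, is cached.
        if in_list "$pid" "$white"; then echo yes; return 0; fi
        if in_list "$tgid" "$white" && ! in_list "$pid" "$black"; then
            echo yes; return 0
        fi
        echo no
    fi
}

# tgid 100 is blacklisted, thread 101 is whitelisted as an exception.
is_cacheable 1 101 100 "100" "101"   # prints yes
is_cacheable 1 102 100 "100" "101"   # prints no
```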
Examples :
--------
1) You can make the global cache setting "cache nothing", and add the
tgid of your pthreaded application to the whitelist. Which makes only
IOs issued by your application cacheable by Flashcache.
2) You can make the global cache setting "cache everything" and add
tgids (or pids) of other applications that may issue IOs on this
volume to the blacklist, which will make those un-interesting IOs not
cacheable.
Note that this only works for O_DIRECT IOs. For buffered IOs, pdflush,
kswapd would also do the writes, with flashcache caching those.
The following cacheability ioctls are supported on /dev/mapper/<cachedev>
FLASHCACHEADDBLACKLIST: add the pid (or tgid) to the blacklist.
FLASHCACHEDELBLACKLIST: Remove the pid (or tgid) from the blacklist.
FLASHCACHEDELALLBLACKLIST: Clear the blacklist. This can be used to
cleanup if a process dies.
FLASHCACHEADDWHITELIST: add the pid (or tgid) to the whitelist.
FLASHCACHEDELWHITELIST: Remove the pid (or tgid) from the whitelist.
FLASHCACHEDELALLWHITELIST: Clear the whitelist. This can be used to
cleanup if a process dies.
/proc/flashcache/<cache name>/flashcache_pidlists shows the list of pids on the whitelist and the blacklist.
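12. Cache security. A user process that has only read permission may still be able to corrupt data on the cache device.
The current workaround is to tighten permissions: do not give other users even read permission.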
Security Note :
=============
With Flashcache, it is possible for a malicious user process to
corrupt data in files with only read access. In a future revision
of flashcache, this will be addressed (with an extra data copy).
Not documenting the mechanics of how a malicious process could
corrupt data here.
You can work around this by setting file permissions on files in
the flashcache volume appropriately.
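13. Low SSD utilization.
SSD sets and HDD sets have a one-to-many relationship, which means that multiple HDD blocks may compete for the same SSD cache region.
If the blocks that compete for the same SSD cache region are all blocks that need caching, while the blocks that do not compete do not need caching, you get the worst case and utilization is extremely low. Here is an example :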
Why is my cache only (<< 100%) utilized ?
=======================================
(Answer contributed by Will Smith)
- There is essentially a 1:many mapping between SSD blocks and HDD blocks.
- In more detail, a HDD block gets hashed to a set on SSD which contains by
default 512 blocks. It can only be stored in that set on SSD, nowhere else.
So with a simplified SSD containing only 3 sets:
SSD = 1 2 3 , and a HDD with 9 sets worth of data, the HDD sets would map to the SSD
sets like this:
HDD: 1 2 3 4 5 6 7 8 9
SSD: 1 2 3 1 2 3 1 2 3
So if your data only happens to live in HDD sets 1 and 4, they will compete for
SSD set 1 and your SSD will at most become 33% utilized.
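HDD sets 1 and 4 are both stored in SSD set 1. If HDD sets 1 and 4 both need caching while the other HDD sets (2, 3, 5, 6, 7, 8, 9) hold inactive data that does not need caching, then in the worst case only 33% of the SSD is in use. On an XFS filesystem, agsize and agcount can be tuned to improve utilization.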
If you use XFS you can tune the XFS agsize/agcount to try and mitigate this
(described next section).
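The simplified picture above (3 SSD sets, 9 HDD sets) amounts to a round-robin mapping, which a few lines of shell can reproduce. ssd_set_for is a hypothetical helper that models only this simplified example; the guide says real flashcache hashes each HDD block to a set, and this sketch does not attempt to reproduce that hash.

```shell
#!/bin/sh
# ssd_set_for HDD_SET NUM_SSD_SETS: print the SSD set (numbered from 1) that
# an HDD set lands on in the simplified example above.
# This is a model of the illustration only, not flashcache's real hash.
ssd_set_for() {
    echo $(( ($1 - 1) % $2 + 1 ))
}

# Reproduces the SSD row of the table: 1 2 3 1 2 3 1 2 3
for h in 1 2 3 4 5 6 7 8 9; do
    printf '%s ' "$(ssd_set_for "$h" 3)"
done
echo

# HDD sets 1 and 4 collide on the same SSD set.
ssd_set_for 1 3   # prints 1
ssd_set_for 4 3   # prints 1
```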
14. XFS filesystem tuning to deal with low cache utilization.
Tune the XFS allocation group parameters agsize and agcount to make better use of the SSD.
Tuning XFS for better flashcache performance :
============================================
If you run XFS/Flashcache, it is worth tuning XFS' allocation group
parameters (agsize/agcount) to achieve better flashcache performance.
XFS allocates blocks for files in a given directory in a new
allocation group. By tuning agsize and agcount (mkfs.xfs parameters),
we can achieve much better distribution of blocks across
flashcache. Better distribution of blocks across flashcache will
decrease collisions on flashcache sets considerably, increase cache
hit rates significantly and result in lower IO latencies.
(In other words, XFS can be used to spread files over multiple allocation groups, which spreads out where their data blocks are stored. Well-spread blocks reduce flashcache set collisions, for the reason given in section 13.)
We can achieve this by computing agsize (and implicitly agcount) using
these equations,
Formulas :
C = Cache size,
V = Size of filesystem Volume.
agsize % C = (1/agcount)*C
agsize * agcount ~= V
where agsize <= 1000g (XFS limits on agsize).
A couple of examples that illustrate the formula,
For agcount = 4, let's divide up the cache into 4 equal parts (each
part is size C/agcount). Let's call the parts C1, C2, C3, C4. One
ideal way to map the allocation groups onto the cache is as follows.
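The ideal mapping of HDD stripes onto the cache: each stripe is shifted by one part, which distributes data well across the cache and reduces contention for cache sets.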
Ag1 Ag2 Ag3 Ag4
-- -- -- --
C1 C2 C3 C4 (stripe 1)
C2 C3 C4 C1 (stripe 2)
C3 C4 C1 C2 (stripe 3)
C4 C1 C2 C3 (stripe 4)
C1 C2 C3 C4 (stripe 5)
In this simple example, note that each "stripe" has 2 properties
1) Each element of the stripe is a unique part of the cache.
2) The union of all the parts for a stripe gives us the entire cache.
Clearly, this is an ideal mapping, from a distribution across the
cache point of view.
Another example, this time with agcount = 5, the cache is divided into
5 equal parts C1, .. C5.
Ag1 Ag2 Ag3 Ag4 Ag5
-- -- -- -- --
C1 C2 C3 C4 C5 (stripe 1)
C2 C3 C4 C5 C1 (stripe 2)
C3 C4 C5 C1 C2 (stripe 3)
C4 C5 C1 C2 C3 (stripe 4)
C5 C1 C2 C3 C4 (stripe 5)
C1 C2 C3 C4 C5 (stripe 6)
A couple of examples that compute the optimal agsize for a given
Cachesize and Filesystem volume size.
a) C = 600g, V = 3.5TB
Consider agcount = 5
agsize % 600 = (1/5)*600
agsize % 600 = 120
So an agsize of 720g would work well, and 720*5 = 3.6TB (~ 3.5TB)
b) C = 150g, V = 3.5TB
Consider agcount=4
agsize % 150 = (1/4)*150
agsize % 150 = 37.5
So an agsize of 937g would work well, and 937*4 = 3.7TB (~ 3.5TB)
As an alternative,
agsize % C = (1 - (1/agcount))*C
agsize * agcount ~= V
Works just as well as the formula above.
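The two worked examples can be reproduced with a short calculation. The sketch below applies the first formula for a given agcount: it tries every agsize of the form C/agcount + k*C up to 1000g and keeps the one whose agsize*agcount is closest to V. fc_agsize is a hypothetical helper written for this article, not the utils/get_agsize utility; it works in gigabytes and, like the examples, treats 3.5TB as 3500g.

```shell
#!/bin/sh
# fc_agsize C V AGCOUNT: print a candidate agsize in gigabytes.
# Hypothetical helper; C (cache size) and V (volume size) are in gigabytes.
# Candidates satisfy agsize % C = C/agcount and agsize <= 1000.
# Prints -1 if no candidate fits under the 1000g limit.
fc_agsize() {
    awk -v c="$1" -v v="$2" -v n="$3" 'BEGIN {
        if (c <= 0 || n <= 0) { print "bad input"; exit 1 }
        rem = c / n
        best = -1
        for (k = 0; rem + k * c <= 1000; k++) {
            ag = rem + k * c
            d = ag * n - v
            if (d < 0) d = -d
            if (best < 0 || d < bestd) { best = ag; bestd = d }
        }
        printf "%d\n", best   # truncated to whole gigabytes, as in example b
    }'
}

fc_agsize 600 3500 5   # prints 720 (example a)
fc_agsize 150 3500 4   # prints 937 (example b)
```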
If you would rather not do the calculation yourself, try the get_agsize utility that ships with flashcache.
This computation has been implemented in the utils/get_agsize utility.
Specify agsize or agcount when creating the filesystem with mkfs.xfs :
man mkfs.xfs
-d data_section_options
These options specify the location, size, and other parameters of the data section of the filesystem.
The valid data_section_options are:
agcount=value
This is used to specify the number of allocation groups. The data section of the filesystem
is divided into allocation groups to improve the performance of XFS. More allocation groups
imply that more parallelism can be achieved when allocating blocks and inodes. The minimum
allocation group size is 16 MiB; the maximum size is just under 1 TiB. The data section of
the filesystem is divided into value allocation groups (default value is scaled automatically
based on the underlying device size).
agsize=value
This is an alternative to using the agcount suboption. The value is the desired size of the
allocation group expressed in bytes (usually using the m or g suffixes). This value must be
a multiple of the filesystem block size, and must be at least 16MiB, and no more than 1TiB,
and may be automatically adjusted to properly align with the stripe geometry. The agcount
and agsize suboptions are mutually exclusive.
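15. Whether loading sequential IO into the SSD hurts performance, and if so, how to tune performance by skipping the caching of sequential IO.
With cache all IO enabled, sequential IO loaded into the SSD can cause problems.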
Tuning Sequential IO Skipping for better flashcache performance
===============================================================
Skipping sequential IO makes sense in two cases:
1) The sequential write speed of your SSD is slower than
the sequential write speed or read speed of your disk. In
particular, for implementations with RAID disks (especially
modes 0, 10 or 5) sequential reads may be very fast. If
'cache_all' mode is used, every disk read miss must also be
written to SSD. If you notice slower sequential reads and writes
after enabling flashcache, this is likely your problem.
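If the data device is a RAID array, and the RAID group is large or has a RAID cache, sequential read and write performance can be very good and may even exceed that of the SSD (although a PCI-E SSD is very hard to beat).
In that case, loading sequential IO into the SSD has a negative effect: it takes up a lot of SSD space and brings no real performance gain (this applies only when the sequential IO performance of the SSD is lower than that of the RAID group).
If you run into this, adjust the flashcache module parameters so that sequential IO is not loaded into the SSD cache.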
2) Your 'resident set' of disk blocks that you want cached, i.e.
those that you would hope to keep in cache, is smaller
than the size of your SSD. You can check this by monitoring
how quick your cache fills up ('dmsetup table'). If this
is the case, it makes sense to prioritize caching of random IO,
since SSD performance vastly exceeds disk performance for
random IO, but is typically not much better for sequential IO.
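If the SSD fills up quickly, sequential reads are probably being loaded into the SSD. If that has a negative effect, it is again a sign that the flashcache module parameters should be adjusted to skip caching sequential IO.
If there already is a negative effect (for example, performance dropped after adding the SSD), and the observations above confirm that sequential IO being loaded into the cache is the cause, adjust as follows.
Set it through sysctl: dev.flashcache.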
In the above cases, start with a high value (say 1024k) for
sysctl dev.flashcache.<device>.skip_seq_thresh_kb, so only the
largest sequential IOs are skipped, and gradually reduce
if benchmarks show it's helping. Don't leave it set to a very
high value, return it to 0 (the default), since there is some
overhead in categorizing IO as random or sequential.
If neither of the above hold, continue to cache all IO,
(the default) you will likely benefit from it.
References
1. https://raw.githubusercontent.com/facebook/flashcache/master/doc/flashcache-sa-guide.txt
2. https://github.com/facebook/flashcache/issues
3. http://blog.163.com/digoal@126/blog/static/1638770402014528115551323/
4. https://github.com/facebook/flashcache/blob/master/utils/flashcache