PostgreSQL IOPS performance tuning with flashcache
Background
One drawback of flashcache: an SSD region can only be bound to a single block device, logical volume, or PV. It cannot be shared the way a ZPOOL shares a single SSD across the whole pool.
Other cache software worth considering: bcache, dm-cache.
Note: recommended EXT4 mount options:
nobarrier,discard
https://github.com/facebook/flashcache/blob/master/doc/flashcache-doc.txt
https://github.com/facebook/flashcache/issues/163
discard/nodiscard
Controls whether ext4 should issue discard/TRIM commands to the underlying block device when blocks are freed. This is useful for SSD devices and sparse/thinly-provisioned LUNs, but it is off by default until sufficient testing has been done.
barrier=none / barrier=flush
This enables/disables the use of write barriers in the journaling code. barrier=none disables it, barrier=flush enables it. Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. The reiserfs filesystem does not enable write barriers by default. Be sure to enable barriers unless your disks are battery-backed one way or another. Otherwise you risk filesystem corruption in case of power failure.
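For example, a persistent /etc/fstab entry with these options might look like this (a sketch; the device name /dev/mapper/cachedev1 and mount point /ssd1 are the ones used in the examples further down):
# /etc/fstab (illustrative entry for a flashcache-backed ext4 filesystem)
/dev/mapper/cachedev1   /ssd1   ext4   defaults,nobarrier,discard   0 0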
wget https://github.com/facebook/flashcache/archive/master.zip
# uname -r
2.6.32-358.el6.x86_64
Install following the README-CentOS6 instructions:
yum localinstall --nogpgcheck http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
vi /etc/yum.repos.d/epel.repo
enabled=1
yum install -y dkms gcc make yum-utils kernel-devel-`uname -r`
yumdownloader --source kernel-`uname -r`
If the source RPM for your kernel is not available from the configured repositories, add the CentOS-Vault yum repository (CentOS-Vault.repo).
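A minimal sketch of such a repo file, assuming CentOS 6.4 (the section name is illustrative, and whether yum metadata exists at this baseurl should be verified; downloading the SRPM directly, as shown below, also works):
# /etc/yum.repos.d/CentOS-Vault.repo (illustrative source-RPM entry)
[C6.4-source]
name=CentOS-6.4 - Sources
baseurl=http://vault.centos.org/6.4/os/Source/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6
enabled=1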
Alternatively, download the matching version directly from the website:
http://vault.centos.org/6.4/os/Source/SPackages/
For example, for CentOS 6.4:
# uname -r
2.6.32-358.el6.x86_64
Download and install it:
wget http://vault.centos.org/6.4/os/Source/SPackages/kernel-2.6.32-358.el6.src.rpm
rpm -ivh kernel-2.6.32-358.el6.src.rpm
unzip master.zip
cd flashcache-master
Install dracut-flashcache for CentOS 6 boot support; see doc/dracut-flashcache.txt:
# cd utils/
# rpm -ivh dracut-flashcache-0.3-1.el6.noarch.rpm
# rpm -ql dracut-flashcache
/lib/udev/rules.d/10-flashcache.rules
/sbin/fc_scan
/usr/share/doc/dracut-flashcache-0.3
/usr/share/doc/dracut-flashcache-0.3/COPYING
/usr/share/doc/dracut-flashcache-0.3/README
/usr/share/dracut/modules.d
/usr/share/dracut/modules.d/90flashcache
/usr/share/dracut/modules.d/90flashcache/63-flashcache.rules
/usr/share/dracut/modules.d/90flashcache/fc_scan
/usr/share/dracut/modules.d/90flashcache/install
/usr/share/dracut/modules.d/90flashcache/installkernel
/usr/share/dracut/modules.d/90flashcache/parse-flashcache.sh
cd flashcache-master
make
make install
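A quick sanity check that the module built and installed against the running kernel (a sketch; output varies):
# verify the module is registered for the running kernel (run depmod -a first if modinfo cannot find it)
modinfo flashcache
# load it manually; flashcache_create will also load it on demand
modprobe flashcache
lsmod | grep flashcache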
Optional: configure dracut-flashcache for CentOS 6 boot support; see doc/dracut-flashcache.txt.
flashcache configuration; see flashcache-sa-guide.txt.
1. Choose an SSD. Using the whole disk works, and so does a partition.
2. If you partition, align it to 4K/8K first (depending on the disk): start on a 2048-sector (1 MiB) boundary and make the length a multiple of 2048 sectors, i.e. last sector = start + n*2048 - 1. In the fdisk session below, start = 2048 and n*2048 = 204800000, so the last sector is 2048 + 204800000 - 1 = 204802047.
# fdisk -c -u /dev/sda
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x0fab9b9b.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): p
Disk /dev/sda: 240.1 GB, 240068197888 bytes
255 heads, 63 sectors/track, 29186 cylinders, total 468883199 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0fab9b9b
Device Boot Start End Blocks Id System
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First sector (2048-468883198, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-468883198, default 468883198): +(204800000-1)
Command (m for help): p
Disk /dev/sda: 240.1 GB, 240068197888 bytes
255 heads, 63 sectors/track, 29186 cylinders, total 468883199 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0fab9b9b
Device Boot Start End Blocks Id System
/dev/sda1 2048 204802047 102400000 83 Linux
# fdisk -l -c -u /dev/sda
Disk /dev/sda: 240.1 GB, 240068197888 bytes
87 heads, 11 sectors/track, 489951 cylinders, total 468883199 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0fab9b9b
Device Boot Start End Blocks Id System
/dev/sda1 2048 204802047 102400000 83 Linux
Examples :
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 29G 14G 14G 51% /
tmpfs 48G 8.0K 48G 1% /dev/shm
/dev/sdc3 98G 40G 53G 43% /opt
[root@db-172-16-3-150 flashcache-master]# flashcache_create --help
flashcache_create: invalid option -- '-'
Usage: flashcache_create [-v] [-p back|thru|around] [-b block size] [-m md block size] [-s cache size] [-a associativity] cachedev ssd_devname disk_devname
Usage : flashcache_create Cache Mode back|thru|around is required argument
Usage : flashcache_create Default units for -b, -m, -s are sectors, or specify in k/M/G. Default associativity is 512.
flashcache_create -v -p back -s 20G -b 4k cachedev1 /dev/sda1 /dev/sdc
Creates a 20GB writeback cache volume with a 4KB block size on SSD
device /dev/sda1 to cache the disk volume /dev/sdc. The name of the device
created is "cachedev1".
If the block device has a filesystem mounted on it or is otherwise in use, the cachedev cannot be created:
# flashcache_create -v -p back -s 20G -b 4k cachedev1 /dev/sda1 /dev/sdc
cachedev cachedev1, ssd_devname /dev/sda1, disk_devname /dev/sdc cache mode WRITE_BACK
block_size 8, md_block_size 8, cache_size 41943040
Flashcache metadata will use 110MB of your 96733MB main memory
Loading Flashcache Module
version string "git commit:
"
Creating FlashCache Volume : "echo 0 285474816 flashcache /dev/sdc /dev/sda1 cachedev1 1 2 8 41943040 512 140733193388544 8 | dmsetup create cachedev1"
device-mapper: reload ioctl on cachedev1 failed: Device or resource busy
Command failed
echo 0 285474816 flashcache /dev/sdc /dev/sda1 cachedev1 1 2 8 41943040 512 140733193388544 8 | dmsetup create cachedev1 failed
After unmounting the device, the cache can be created:
[root@db-172-16-3-150 ~]# flashcache_create -v -p back -s 20G -b 4k cachedev1 /dev/sda1 /dev/sdd1
cachedev cachedev1, ssd_devname /dev/sda1, disk_devname /dev/sdd1 cache mode WRITE_BACK
block_size 8, md_block_size 8, cache_size 41943040
Flashcache metadata will use 110MB of your 96733MB main memory
Flashcache Module already loaded
version string "git commit:
"
Creating FlashCache Volume : "echo 0 389543936 flashcache /dev/sdd1 /dev/sda1 cachedev1 1 2 8 41943040 512 140733193388544 8 | dmsetup create cachedev1"
Inspect the newly created DM device:
[root@db-172-16-3-150 ~]# dmsetup status
cachedev1: 0 389543936 flashcache stats:
reads(84), writes(0)
read hits(1), read hit percent(1)
write hits(0) write hit percent(0)
dirty write hits(0) dirty write hit percent(0)
replacement(0), write replacement(0)
write invalidates(0), read invalidates(0)
pending enqueues(0), pending inval(0)
metadata dirties(0), metadata cleans(0)
metadata batch(0) metadata ssd writes(0)
cleanings(0) fallow cleanings(0)
no room(0) front merge(0) back merge(0)
force_clean_block(0)
disk reads(83), disk writes(0) ssd reads(1) ssd writes(83)
uncached reads(0), uncached writes(0), uncached IO requeue(0)
disk read errors(0), disk write errors(0) ssd read errors(0) ssd write errors(0)
uncached sequential reads(0), uncached sequential writes(0)
pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)
lru hot blocks(2610944), lru warm blocks(2610944)
lru promotions(0), lru demotions(0)
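Besides dmsetup status, flashcache exposes runtime counters and tunables through procfs and sysctl (a sketch; the exact /proc path depends on the flashcache version and cache name, see flashcache-sa-guide.txt):
# list caches known to flashcache
ls /proc/flashcache/
# per-cache counters (directory name assumed)
cat /proc/flashcache/cachedev1/flashcache_stats
# runtime tunables such as dirty thresholds and sequential-IO handling
sysctl -a 2>/dev/null | grep flashcache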
Mount it:
[root@db-172-16-3-150 ~]# mount /dev/mapper/cachedev1 /ssd1
[root@db-172-16-3-150 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 29G 14G 14G 51% /
tmpfs 48G 8.0K 48G 1% /dev/shm
/dev/sdc3 98G 40G 53G 43% /opt
/dev/sdb1 221G 72G 138G 35% /ssd4
/dev/mapper/cachedev1
183G 49G 126G 28% /ssd1
Recommended EXT4 mount options:
nobarrier,discard
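For example, remounting the cache device with these options (a sketch, using the /ssd1 mount point from above):
umount /ssd1
mount -o nobarrier,discard /dev/mapper/cachedev1 /ssd1
# confirm the active mount options
mount | grep cachedev1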
Remove the DM device:
[root@db-172-16-3-150 ~]# umount /ssd1
[root@db-172-16-3-150 ~]# dmsetup remove cachedev1
Destroy the flashcache device:
[root@db-172-16-3-150 ~]# flashcache_destroy /dev/sda1
flashcache_destroy: Destroying Flashcache found on /dev/sda1. Any data will be lost !!
Now take a mechanical hard disk, put flashcache in front of it, and see how it performs.
[root@db-172-16-3-150 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 29G 14G 14G 51% /
tmpfs 48G 8.0K 48G 1% /dev/shm
/dev/sdc3 98G 40G 53G 43% /opt
/dev/sdb1 221G 72G 138G 35% /ssd4
[root@db-172-16-3-150 ~]# umount /opt
[root@db-172-16-3-150 ~]# flashcache_create -v -p back -s 40G -b 4k cachedev1 /dev/sda1 /dev/sdc3
cachedev cachedev1, ssd_devname /dev/sda1, disk_devname /dev/sdc3 cache mode WRITE_BACK
block_size 8, md_block_size 8, cache_size 83886080
Flashcache metadata will use 220MB of your 96733MB main memory
Flashcache Module already loaded
version string "git commit:
"
Creating FlashCache Volume : "echo 0 207254565 flashcache /dev/sdc3 /dev/sda1 cachedev1 1 2 8 83886080 512 140733193388544 8 | dmsetup create cachedev1"
[root@db-172-16-3-150 ~]# mount /dev/mapper/cachedev1 /opt
Test fsync performance:
[root@db-172-16-3-150 ~]# /home/bdr/pgsql/bin/pg_test_fsync -f /opt/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 12963.806 ops/sec 77 usecs/op
fdatasync 11115.933 ops/sec 90 usecs/op
fsync 412.602 ops/sec 2424 usecs/op
fsync_writethrough n/a
open_sync 12989.584 ops/sec 77 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 6513.141 ops/sec 154 usecs/op
fdatasync 8324.517 ops/sec 120 usecs/op
fsync 405.985 ops/sec 2463 usecs/op
fsync_writethrough n/a
open_sync 6530.344 ops/sec 153 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 9822.855 ops/sec 102 usecs/op
2 * 8kB open_sync writes 6519.366 ops/sec 153 usecs/op
4 * 4kB open_sync writes 3918.786 ops/sec 255 usecs/op
8 * 2kB open_sync writes 20.625 ops/sec 48486 usecs/op
16 * 1kB open_sync writes 10.415 ops/sec 96012 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 678.063 ops/sec 1475 usecs/op
write, close, fsync 2085.175 ops/sec 480 usecs/op
Non-Sync'ed 8kB writes:
write 188286.273 ops/sec 5 usecs/op
Compare with the performance of the raw mechanical disk:
[root@db-172-16-3-150 ~]# /home/bdr/pgsql/bin/pg_test_fsync -f /1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 163.266 ops/sec 6125 usecs/op
fdatasync 165.646 ops/sec 6037 usecs/op
fsync 53.012 ops/sec 18864 usecs/op
fsync_writethrough n/a
open_sync 164.367 ops/sec 6084 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 83.180 ops/sec 12022 usecs/op
fdatasync 166.243 ops/sec 6015 usecs/op
fsync 53.661 ops/sec 18636 usecs/op
fsync_writethrough n/a
open_sync 82.807 ops/sec 12076 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 165.158 ops/sec 6055 usecs/op
2 * 8kB open_sync writes 82.624 ops/sec 12103 usecs/op
4 * 4kB open_sync writes 41.285 ops/sec 24222 usecs/op
8 * 2kB open_sync writes 20.781 ops/sec 48122 usecs/op
16 * 1kB open_sync writes 10.390 ops/sec 96242 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 52.233 ops/sec 19145 usecs/op
write, close, fsync 54.324 ops/sec 18408 usecs/op
Non-Sync'ed 8kB writes:
write 203661.070 ops/sec 5 usecs/op
Results for a PostgreSQL "update if exists, else insert" test model:
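A minimal sketch of such a test model, consistent with the pgbench script statements reported below (the table definition, function body, and key range are assumptions):
# hypothetical target table and "update if exists, else insert" function f(id)
psql <<'SQL'
create table if not exists test (id int primary key, info text, crt_time timestamp);
create or replace function f(v_id int) returns void as $$
begin
  update test set info = md5(random()::text), crt_time = now() where id = v_id;
  if not found then
    insert into test (id, info, crt_time) values (v_id, md5(random()::text), now());
  end if;
exception when unique_violation then
  -- a concurrent session inserted the same id first; fall back to an update
  update test set info = md5(random()::text), crt_time = now() where id = v_id;
end;
$$ language plpgsql;
SQL

# pgbench script matching the statement latencies reported below
cat > ./test.sql <<'EOF'
\setrandom id 1 50000000
select f(:id);
EOF

pgbench -M prepared -n -r -f ./test.sql -c 16 -j 4 -T 60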
flashcache device (SSD + ordinary mechanical disk):
pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 16 -j 4 -T 60
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 60 s
number of transactions actually processed: 465274
tps = 7754.017036 (including connections establishing)
tps = 7756.166925 (excluding connections establishing)
statement latencies in milliseconds:
0.003762 \setrandom id 1 50000000
2.056537 select f(:id);
Ordinary mechanical disk + RAID controller read/write cache:
pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 16 -j 4 -T 60
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 60 s
number of transactions actually processed: 71206
tps = 1186.007977 (including connections establishing)
tps = 1186.820771 (excluding connections establishing)
statement latencies in milliseconds:
0.004485 \setrandom id 1 50000000
13.459944 select f(:id);
References
1. http://ftp.sjtu.edu.cn/fedora/epel/6/x86_64/
2. https://github.com/facebook/flashcache/
3. http://blog.163.com/digoal@126/blog/static/163877040201463101652528/