PostgreSQL IOPS performance tuning by flashcache

8 minute read

背景

flashcache缺点之一, 一个SSD区域只能绑定到一个块设备或逻辑卷,PV. 不能像ZPOOL那样共享一个SSD区域.

其他可选cache软件, bcache, dmcache.

注意, 建议EXT4挂载项:

nobarrier,discard  
https://github.com/facebook/flashcache/blob/master/doc/flashcache-doc.txt  
https://github.com/facebook/flashcache/issues/163  
       discard/nodiscard  
              Controls whether ext4 should issue discard/TRIM commands to the underlying block device when blocks  are  
              freed.   This  is  useful  for  SSD devices and sparse/thinly-provisioned LUNs, but it is off by default  
              until sufficient testing has been done.  
       barrier=none / barrier=flush  
              This enables/disables the use of write barriers in the journaling code.  barrier=none disables it,  bar-  
              rier=flush  enables  it.  Write  barriers  enforce  proper  on-disk  ordering of journal commits, making  
              volatile disk write caches safe to use, at some performance penalty. The reiserfs  filesystem  does  not  
              enable  write  barriers  by default. Be sure to enable barriers unless your disks are battery-backed one  
              way or another. Otherwise you risk filesystem corruption in case of power failure.  
  
  
wget https://github.com/facebook/flashcache/archive/master.zip  
  
# uname -r  
2.6.32-358.el6.x86_64  

参照README-CentOS6安装

yum localinstall --nogpgcheck http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm  
  
vi /etc/yum.repos.d/epel.repo  
enabled=1  
  
yum install -y dkms gcc make yum-utils kernel-devel-`uname -r`  
  
yumdownloader --source kernel-`uname -r`  

如果没有的话, 需要添加到yum仓库 CentOS-Vault.repo.

或者直接去网站下载对应的版本

http://vault.centos.org/6.4/os/Source/SPackages/

例如 CentOS6.4

# uname -r  
2.6.32-358.el6.x86_64  

下载并安装

wget http://vault.centos.org/6.4/os/Source/SPackages/kernel-2.6.32-358.el6.src.rpm  
kernel-2.6.32-358.el6.src.rpm  
  
unzip master.zip  
cd flashcache-master  

安装dracut-flashcache, centos 6 boot支持, 参照doc/dracut-flashcache.txt

# cd utils/  
# rpm -ivh dracut-flashcache-0.3-1.el6.noarch.rpm   
# rpm -ql dracut-flashcache  
/lib/udev/rules.d/10-flashcache.rules  
/sbin/fc_scan  
/usr/share/doc/dracut-flashcache-0.3  
/usr/share/doc/dracut-flashcache-0.3/COPYING  
/usr/share/doc/dracut-flashcache-0.3/README  
/usr/share/dracut/modules.d  
/usr/share/dracut/modules.d/90flashcache  
/usr/share/dracut/modules.d/90flashcache/63-flashcache.rules  
/usr/share/dracut/modules.d/90flashcache/fc_scan  
/usr/share/dracut/modules.d/90flashcache/install  
/usr/share/dracut/modules.d/90flashcache/installkernel  
/usr/share/dracut/modules.d/90flashcache/parse-flashcache.sh  
  
cd flashcache-master  
make  
make install  

可选, 配置dracut-flashcache, centos 6 boot支持, 参照doc/dracut-flashcache.txt

flashcache配置, 参照flashcache-sa-guide.txt

1. 选择SSD, 做好整块盘, 当然分区也行

2. 分区的话, 先按4K/8K对齐分好(视盘的情况而定). +(n*2048-1)

# fdisk -c -u /dev/sda  
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel  
Building a new DOS disklabel with disk identifier 0x0fab9b9b.  
Changes will remain in memory only, until you decide to write them.  
After that, of course, the previous content won't be recoverable.  
  
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)  
  
Command (m for help): p  
  
Disk /dev/sda: 240.1 GB, 240068197888 bytes  
255 heads, 63 sectors/track, 29186 cylinders, total 468883199 sectors  
Units = sectors of 1 * 512 = 512 bytes  
Sector size (logical/physical): 512 bytes / 512 bytes  
I/O size (minimum/optimal): 512 bytes / 512 bytes  
Disk identifier: 0x0fab9b9b  
  
   Device Boot      Start         End      Blocks   Id  System  
  
Command (m for help): n  
Command action  
   e   extended  
   p   primary partition (1-4)  
p  
Partition number (1-4): 1  
First sector (2048-468883198, default 2048):   
Using default value 2048  
Last sector, +sectors or +size{K,M,G} (2048-468883198, default 468883198): +(204800000-1)  
  
Command (m for help): p  
  
Disk /dev/sda: 240.1 GB, 240068197888 bytes  
255 heads, 63 sectors/track, 29186 cylinders, total 468883199 sectors  
Units = sectors of 1 * 512 = 512 bytes  
Sector size (logical/physical): 512 bytes / 512 bytes  
I/O size (minimum/optimal): 512 bytes / 512 bytes  
Disk identifier: 0x0fab9b9b  
  
   Device Boot      Start         End      Blocks   Id  System  
/dev/sda1            2048   204802047   102400000  83  Linux  
  
# fdisk -l -c -u /dev/sda  
  
Disk /dev/sda: 240.1 GB, 240068197888 bytes  
87 heads, 11 sectors/track, 489951 cylinders, total 468883199 sectors  
Units = sectors of 1 * 512 = 512 bytes  
Sector size (logical/physical): 512 bytes / 512 bytes  
I/O size (minimum/optimal): 512 bytes / 512 bytes  
Disk identifier: 0x0fab9b9b  
  
   Device Boot      Start         End      Blocks   Id  System  
/dev/sda1            2048   204802047   102400000  83  Linux  
  
Examples :  
# df -h  
Filesystem            Size  Used Avail Use% Mounted on  
/dev/sdc1              29G   14G   14G  51% /  
tmpfs                  48G  8.0K   48G   1% /dev/shm  
/dev/sdc3              98G   40G   53G  43% /opt  
  
[root@db-172-16-3-150 flashcache-master]# flashcache_create --help  
flashcache_create: invalid option -- '-'  
Usage: flashcache_create [-v] [-p back|thru|around] [-b block size] [-m md block size] [-s cache size] [-a associativity] cachedev ssd_devname disk_devname  
Usage : flashcache_create Cache Mode back|thru|around is required argument  
Usage : flashcache_create Default units for -b, -m, -s are sectors, or specify in k/M/G. Default associativity is 512.  
  
flashcache_create -v -p back -s 20G -b 4k cachedev1 /dev/sda1 /dev/sdc  
Creates a 20GB writeback cache volume with a 4KB block size on ssd   
device /dev/sdc to cache the disk volume /dev/sdb. The name of the device   
created is "cachedev".  

如果块设备挂载了文件系统或在使用的话, 不能创建cachedev

# flashcache_create -v -p back -s 20G -b 4k cachedev1 /dev/sda1 /dev/sdc  
cachedev cachedev1, ssd_devname /dev/sda1, disk_devname /dev/sdc cache mode WRITE_BACK  
block_size 8, md_block_size 8, cache_size 41943040  
Flashcache metadata will use 110MB of your 96733MB main memory  
Loading Flashcache Module  
version string "git commit:   
"  
Creating FlashCache Volume : "echo 0 285474816 flashcache /dev/sdc /dev/sda1 cachedev1 1 2 8 41943040 512 140733193388544 8 | dmsetup create cachedev1"  
device-mapper: reload ioctl on cachedev1 failed: Device or resource busy  
Command failed  
echo 0 285474816 flashcache /dev/sdc /dev/sda1 cachedev1 1 2 8 41943040 512 140733193388544 8 | dmsetup create cachedev1 failed  

卸载后就可以使用了

[root@db-172-16-3-150 ~]# flashcache_create -v -p back -s 20G -b 4k cachedev1 /dev/sda1 /dev/sdd1  
cachedev cachedev1, ssd_devname /dev/sda1, disk_devname /dev/sdd1 cache mode WRITE_BACK  
block_size 8, md_block_size 8, cache_size 41943040  
Flashcache metadata will use 110MB of your 96733MB main memory  
Flashcache Module already loaded  
version string "git commit:   
"  
Creating FlashCache Volume : "echo 0 389543936 flashcache /dev/sdd1 /dev/sda1 cachedev1 1 2 8 41943040 512 140733193388544 8 | dmsetup create cachedev1"  

查看刚刚创建的DM设备

[root@db-172-16-3-150 ~]# dmsetup status  
cachedev1: 0 389543936 flashcache stats:   
        reads(84), writes(0)  
        read hits(1), read hit percent(1)  
        write hits(0) write hit percent(0)  
        dirty write hits(0) dirty write hit percent(0)  
        replacement(0), write replacement(0)  
        write invalidates(0), read invalidates(0)  
        pending enqueues(0), pending inval(0)  
        metadata dirties(0), metadata cleans(0)  
        metadata batch(0) metadata ssd writes(0)  
        cleanings(0) fallow cleanings(0)  
        no room(0) front merge(0) back merge(0)  
        force_clean_block(0)  
        disk reads(83), disk writes(0) ssd reads(1) ssd writes(83)  
        uncached reads(0), uncached writes(0), uncached IO requeue(0)  
        disk read errors(0), disk write errors(0) ssd read errors(0) ssd write errors(0)  
        uncached sequential reads(0), uncached sequential writes(0)  
        pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)  
        lru hot blocks(2610944), lru warm blocks(2610944)  
        lru promotions(0), lru demotions(0)  

挂载它

[root@db-172-16-3-150 ~]# mount /dev/mapper/cachedev1 /ssd1  
[root@db-172-16-3-150 ~]# df -h  
Filesystem            Size  Used Avail Use% Mounted on  
/dev/sdc1              29G   14G   14G  51% /  
tmpfs                  48G  8.0K   48G   1% /dev/shm  
/dev/sdc3              98G   40G   53G  43% /opt  
/dev/sdb1             221G   72G  138G  35% /ssd4  
/dev/mapper/cachedev1  
                      183G   49G  126G  28% /ssd1  

建议EXT4挂载项:

nobarrier,discard  

删除DM设备,

[root@db-172-16-3-150 ~]# umount /ssd1  
[root@db-172-16-3-150 ~]# dmsetup remove cachedev1  

删除flashcache设备

[root@db-172-16-3-150 ~]# flashcache_destroy /dev/sda1  
flashcache_destroy: Destroying Flashcache found on /dev/sda1. Any data will be lost !!  

使用一个机械硬盘, 加上flashcache后看看性能如何.

[root@db-172-16-3-150 ~]# df -h  
Filesystem            Size  Used Avail Use% Mounted on  
/dev/sdc1              29G   14G   14G  51% /  
tmpfs                  48G  8.0K   48G   1% /dev/shm  
/dev/sdc3              98G   40G   53G  43% /opt  
/dev/sdb1             221G   72G  138G  35% /ssd4  
[root@db-172-16-3-150 ~]# umount /opt  
  
[root@db-172-16-3-150 ~]# flashcache_create -v -p back -s 40G -b 4k cachedev1 /dev/sda1 /dev/sdc3  
cachedev cachedev1, ssd_devname /dev/sda1, disk_devname /dev/sdc3 cache mode WRITE_BACK  
block_size 8, md_block_size 8, cache_size 83886080  
Flashcache metadata will use 220MB of your 96733MB main memory  
Flashcache Module already loaded  
version string "git commit:   
"  
Creating FlashCache Volume : "echo 0 207254565 flashcache /dev/sdc3 /dev/sda1 cachedev1 1 2 8 83886080 512 140733193388544 8 | dmsetup create cachedev1"  
  
[root@db-172-16-3-150 ~]# mount /dev/mapper/cachedev1 /opt  

测试fsync性能

[root@db-172-16-3-150 ~]# /home/bdr/pgsql/bin/pg_test_fsync -f /opt/1  
5 seconds per test  
O_DIRECT supported on this platform for open_datasync and open_sync.  
  
Compare file sync methods using one 8kB write:  
(in wal_sync_method preference order, except fdatasync  
is Linux's default)  
        open_datasync                     12963.806 ops/sec      77 usecs/op  
        fdatasync                         11115.933 ops/sec      90 usecs/op  
        fsync                               412.602 ops/sec    2424 usecs/op  
        fsync_writethrough                              n/a  
        open_sync                         12989.584 ops/sec      77 usecs/op  
  
Compare file sync methods using two 8kB writes:  
(in wal_sync_method preference order, except fdatasync  
is Linux's default)  
        open_datasync                      6513.141 ops/sec     154 usecs/op  
        fdatasync                          8324.517 ops/sec     120 usecs/op  
        fsync                               405.985 ops/sec    2463 usecs/op  
        fsync_writethrough                              n/a  
        open_sync                          6530.344 ops/sec     153 usecs/op  
  
Compare open_sync with different write sizes:  
(This is designed to compare the cost of writing 16kB  
in different write open_sync sizes.)  
         1 * 16kB open_sync write          9822.855 ops/sec     102 usecs/op  
         2 *  8kB open_sync writes         6519.366 ops/sec     153 usecs/op  
         4 *  4kB open_sync writes         3918.786 ops/sec     255 usecs/op  
         8 *  2kB open_sync writes           20.625 ops/sec   48486 usecs/op  
        16 *  1kB open_sync writes           10.415 ops/sec   96012 usecs/op  
  
Test if fsync on non-write file descriptor is honored:  
(If the times are similar, fsync() can sync data written  
on a different descriptor.)  
        write, fsync, close                 678.063 ops/sec    1475 usecs/op  
        write, close, fsync                2085.175 ops/sec     480 usecs/op  
  
Non-Sync'ed 8kB writes:  
        write                            188286.273 ops/sec       5 usecs/op  

对比原机械硬盘的性能

[root@db-172-16-3-150 ~]# /home/bdr/pgsql/bin/pg_test_fsync -f /1  
5 seconds per test  
O_DIRECT supported on this platform for open_datasync and open_sync.  
  
Compare file sync methods using one 8kB write:  
(in wal_sync_method preference order, except fdatasync  
is Linux's default)  
        open_datasync                       163.266 ops/sec    6125 usecs/op  
        fdatasync                           165.646 ops/sec    6037 usecs/op  
        fsync                                53.012 ops/sec   18864 usecs/op  
        fsync_writethrough                              n/a  
        open_sync                           164.367 ops/sec    6084 usecs/op  
  
Compare file sync methods using two 8kB writes:  
(in wal_sync_method preference order, except fdatasync  
is Linux's default)  
        open_datasync                        83.180 ops/sec   12022 usecs/op  
        fdatasync                           166.243 ops/sec    6015 usecs/op  
        fsync                                53.661 ops/sec   18636 usecs/op  
        fsync_writethrough                              n/a  
        open_sync                            82.807 ops/sec   12076 usecs/op  
  
Compare open_sync with different write sizes:  
(This is designed to compare the cost of writing 16kB  
in different write open_sync sizes.)  
         1 * 16kB open_sync write           165.158 ops/sec    6055 usecs/op  
         2 *  8kB open_sync writes           82.624 ops/sec   12103 usecs/op  
         4 *  4kB open_sync writes           41.285 ops/sec   24222 usecs/op  
         8 *  2kB open_sync writes           20.781 ops/sec   48122 usecs/op  
        16 *  1kB open_sync writes           10.390 ops/sec   96242 usecs/op  
  
Test if fsync on non-write file descriptor is honored:  
(If the times are similar, fsync() can sync data written  
on a different descriptor.)  
        write, fsync, close                  52.233 ops/sec   19145 usecs/op  
        write, close, fsync                  54.324 ops/sec   18408 usecs/op  
  
Non-Sync'ed 8kB writes:  
        write                            203661.070 ops/sec       5 usecs/op  

postgresql update if exists else insert 测试模型结果

flashcache device ssd+普通机械盘

pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 16 -j 4 -T 60  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 60 s  
number of transactions actually processed: 465274  
tps = 7754.017036 (including connections establishing)  
tps = 7756.166925 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.003762        \setrandom id 1 50000000  
        2.056537        select f(:id);  

普通机械盘+raid卡rw cache.

pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -c 16 -j 4 -T 60  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 60 s  
number of transactions actually processed: 71206  
tps = 1186.007977 (including connections establishing)  
tps = 1186.820771 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.004485        \setrandom id 1 50000000  
        13.459944       select f(:id);  

参考

1. http://ftp.sjtu.edu.cn/fedora/epel/6/x86_64/

2. https://github.com/facebook/flashcache/

3. http://blog.163.com/digoal@126/blog/static/163877040201463101652528/

Flag Counter

digoal’s 大量PostgreSQL文章入口