ZFS on CentOS 6.5 x64: comparing performance with ext4 using PostgreSQL pgbench


Background

ZFS can be thought of as a combination of RAID and a filesystem.

It solves the problems of dynamic striping and of keeping data and parity consistent, so it offers very high reliability and performance.

https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/
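
As a minimal sketch of that idea (pool name tank and devices sdb, sdc, sdd are placeholders), a single command builds a RAID-Z vdev and the filesystem on top of it, with no separate mkfs or RAID setup step; zfs list then shows the pool already mounted at /tank:  
# zpool create tank raidz sdb sdc sdd  
# zfs list tank  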

Because ZFS needs to manage the disks itself, a RAID controller, if present, should be configured in JBOD mode.

http://en.wikipedia.org/wiki/Non-RAID_drive_architectures

ZFS can also store the ZIL (ZFS Intent Log) on a separate SLOG (Separate Intent Log) device to improve write performance.  
(With an SLOG, even a power failure will not leave the data inconsistent; at worst the newest data has not been flushed to disk and the filesystem shows an older state, similar to PostgreSQL's async commit.)  
(It's important to identify that all three devices listed above can maintain data persistence during a power outage. The SLOG and the ZIL are critical in getting your data to spinning platter. If a power outage occurs, and you have a volatile SLOG, the worst thing that will happen is the new data is not flushed, and you are left with old data. However, it's important to note, that in the case of a power outage, you won't have corrupted data, just lost data. Your data will still be consistent on disk.)  
  
It is recommended to put the SLOG on an SSD or NVRAM device to improve write performance.  
The SLOG can even be used for incremental filesystem recovery, somewhat like PostgreSQL's xlog.  
Because the SLOG is so important, it is usually added as a mirrored device.  
For example:  
# zpool create tank mirror /tmp/file1 /tmp/file2 mirror /tmp/file3 /tmp/file4 log mirror sde sdf cache sdg sdh  
# zpool status tank  
  pool: tank  
 state: ONLINE  
 scan: none requested  
config:  
  
	NAME            STATE     READ WRITE CKSUM  
	tank            ONLINE       0     0     0  
	  mirror-0      ONLINE       0     0     0  
	    /tmp/file1  ONLINE       0     0     0  
	    /tmp/file2  ONLINE       0     0     0  
	  mirror-1      ONLINE       0     0     0  
	    /tmp/file3  ONLINE       0     0     0  
	    /tmp/file4  ONLINE       0     0     0  
	logs  
	  mirror-2      ONLINE       0     0     0  
	    sde         ONLINE       0     0     0  
	    sdf         ONLINE       0     0     0  
	cache  
	  sdg           ONLINE       0     0     0  
	  sdh           ONLINE       0     0     0  
  
errors: No known data errors  
Another thing to note: if the SLOG is an SSD, whose lifetime depends on the number of program/erase cycles, we can estimate how long the SLOG will last from its write rate.  
SLOG Life Expectancy  
Because you will likely be using a consumer-grade SSD for your SLOG in your GNU/Linux server, we need to make some mention of the wear and tear of SSDs for write-intensive scenarios. Of course, this will largely vary based on manufacturer, but we can setup some generalities.  
  
First and foremost, ZFS has advanced wear-leveling algorithms that will evenly wear each chip on the SSD. There is no need for TRIM support, which in all reality is really just garbage collection more than anything. The wear-leveling of ZFS is inherent due to the copy-on-write nature of the filesystem.  
  
Second, various drives will be implemented with different nanometer processes. The smaller the nanometer process, the shorter the life of your SSD. As an example, the Intel 320 is a 25 nanometer MLC 300 GB SSD, and is rated at roughly 5000 P/E cycles. This means you can write to your entire SSD 5000 times if using wear leveling algorithms. This produces 1500000 GB of total written data, or 1500 TB. My ZIL maintains about 3 MB of data per second. As a result, I can maintain about 95 TB of written data per year. This gives me a life of about 15 years for this Intel SSD.  
  
However, the Intel 335 is a 20 nanometer MLC 240 GB SSD, and is rated at roughly 3000 P/E cycles. With wear leveling, this means you can write your entire SSD 3000 times, which produces 720 TB of total written data. This is only 7 years for my 3 MBps ZIL, which is less than half the life expectancy of the Intel 320. Point is, you need to keep an eye on these things when planning out your pool.  
  
Now, if you are using a battery-backed DRAM drive, then wear leveling is not a problem, and the DIMMs will likely last the duration of your server. Same might be said for 10k+ SAS or FC drives.  
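  
The arithmetic above is easy to check with a couple of bc one-liners (3 MB/s is the ZIL write rate quoted above; TB here are decimal):  
echo "scale=1; 300*5000/10^3" | bc      # Intel 320: 300 GB x 5000 P/E cycles = 1500.0 TB of endurance  
echo "scale=1; 3*86400*365/10^6" | bc   # 3 MB/s of ZIL traffic ~= 94.6 TB written per year  
echo "scale=1; 1500/94.6" | bc          # about 15.8 years before the rated endurance is reached  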
The ZIL does not need much space; the author of the article above used a 4 GB partition as the SLOG device.  
  
Another strength of ZFS is its cache algorithm (the ARC), which combines LRU and LFU so that both the most recently used and the most frequently used blocks stay in cache.  
ZFS also supports a second-level cache (L2ARC): a block device with strong IOPS capability can be used as the secondary cache device.  

https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/
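
As a hedged example (pool name tank and device sdg are placeholders), an L2ARC device can be added to an existing pool and its activity watched; the arcstats counters shown are the ones ZFS on Linux exposes under /proc:  
# zpool add tank cache sdg  
# zpool iostat -v tank 5  
# grep -E 'l2_(hits|misses)' /proc/spl/kstat/zfs/arcstats  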

When creating a zpool, use device IDs (/dev/disk/by-id/*) rather than aliases such as sda, because the aliases may change across reboots.  
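If a pool was already created with sdX names, it can usually be switched to the persistent names by exporting it and re-importing it from the by-id directory (pool name tank is a placeholder):  
# zpool export tank  
# zpool import -d /dev/disk/by-id tank  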
  
Using CentOS 6.4 x64 as an example, here is how to install ZFS.  
Download  
[root@db-172-16-3-150 soft_bak]# wget http://archive.zfsonlinux.org/downloads/zfsonlinux/spl/spl-0.6.2.tar.gz  
[root@db-172-16-3-150 soft_bak]# wget http://archive.zfsonlinux.org/downloads/zfsonlinux/zfs/zfs-0.6.2.tar.gz  
Install SPL  
[root@db-172-16-3-150 soft_bak]# tar -zxvf spl-0.6.2.tar.gz  
[root@db-172-16-3-150 soft_bak]# cd spl-0.6.2  
[root@db-172-16-3-150 spl-0.6.2]# ./autogen.sh   
[root@db-172-16-3-150 spl-0.6.2]# ./configure --prefix=/opt/spl0.6.2  
[root@db-172-16-3-150 spl-0.6.2]# make && make install  
Install ZFS  
[root@db-172-16-3-150 soft_bak]# cd /opt/soft_bak/  
[root@db-172-16-3-150 soft_bak]# tar -zxvf zfs-0.6.2.tar.gz   
[root@db-172-16-3-150 soft_bak]# cd zfs-0.6.2  
[root@db-172-16-3-150 zfs-0.6.2]# yum install -y libuuid-devel  
[root@db-172-16-3-150 zfs-0.6.2]# ./configure --prefix=/opt/zfs0.6.2  
[root@db-172-16-3-150 zfs-0.6.2]# make && make install  
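Before the zpool and zfs commands can be used, the kernel modules built above have to be loaded (modprobe zfs also pulls in the spl module it depends on); the transcript below assumes this has already been done:  
# depmod -a  
# modprobe zfs  
# lsmod | grep -E 'zfs|spl'  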
  
Test the Solaris porting layer on this system. The splat kernel module must be loaded before running splat, otherwise it reports an error.  
If a test fails, rerun with -vv for verbose output to find the cause.  
splat - Solaris Porting LAyer Tests  
[root@db-172-16-3-150 soft_bak]# cd /opt/spl0.6.2/sbin/  
[root@db-172-16-3-150 sbin]# ./splat -a  
Unable to open /dev/splatctl: 2  
Is the splat module loaded?  
[root@db-172-16-3-150 spl0.6.2]# modprobe splat  
[root@db-172-16-3-150 spl0.6.2]# /opt/spl0.6.2/sbin/splat -a  
------------------------------ Running SPLAT Tests ------------------------------  
                kmem:kmem_alloc           Pass    
                kmem:kmem_zalloc          Pass    
                kmem:vmem_alloc           Pass    
                kmem:vmem_zalloc          Pass    
                kmem:slab_small           Pass    
                kmem:slab_large           Pass    
                kmem:slab_align           Pass    
                kmem:slab_reap            Pass    
                kmem:slab_age             Pass    
                kmem:slab_lock            Pass    
                kmem:vmem_size            Pass    
                kmem:slab_reclaim         Fail  Timer expired  
               taskq:single               Pass    
               taskq:multiple             Pass    
               taskq:system               Pass    
               taskq:wait                 Pass    
               taskq:order                Pass    
               taskq:front                Pass    
               taskq:recurse              Pass    
               taskq:contention           Pass    
               taskq:delay                Pass    
               taskq:cancel               Pass    
                krng:freq                 Pass    
               mutex:tryenter             Pass    
               mutex:race                 Pass    
               mutex:owned                Pass    
               mutex:owner                Pass    
             condvar:signal1              Pass    
             condvar:broadcast1           Pass    
             condvar:signal2              Pass    
             condvar:broadcast2           Pass    
             condvar:timeout              Pass    
              thread:create               Pass    
              thread:exit                 Pass    
              thread:tsd                  Pass    
              rwlock:N-rd/1-wr            Pass    
              rwlock:0-rd/N-wr            Pass    
              rwlock:held                 Pass    
              rwlock:tryenter             Pass    
              rwlock:rw_downgrade         Pass    
              rwlock:rw_tryupgrade        Pass    
                time:time1                Pass    
                time:time2                Pass    
               vnode:vn_open              Pass    
               vnode:vn_openat            Pass    
               vnode:vn_rdwr              Pass    
               vnode:vn_rename            Pass    
               vnode:vn_getattr           Pass    
               vnode:vn_sync              Pass    
                kobj:open                 Pass    
                kobj:size/read            Pass    
              atomic:64-bit               Pass    
                list:create/destroy       Pass    
                list:ins/rm head          Pass    
                list:ins/rm tail          Pass    
                list:insert_after         Pass    
                list:insert_before        Pass    
                list:remove               Pass    
                list:active               Pass    
             generic:ddi_strtoul          Pass    
             generic:ddi_strtol           Pass    
             generic:ddi_strtoull         Pass    
             generic:ddi_strtoll          Pass    
             generic:udivdi3              Pass    
             generic:divdi3               Pass    
                cred:cred                 Pass    
                cred:kcred                Pass    
                cred:groupmember          Pass    
                zlib:compress/uncompress  Pass    
               linux:shrink_dcache        Pass    
               linux:shrink_icache        Pass    
               linux:shrinker             Pass  
  
Next, create a few test files to act as the data disks and the LOG.  
The second-level cache uses an SSD partition, while the LOG uses two files on the SSD.  
[root@db-172-16-3-150 ~]# cd /ssd4  
[root@db-172-16-3-150 ssd4]# dd if=/dev/zero of=./zfs.log1 bs=1k count=1024000  
[root@db-172-16-3-150 ssd4]# dd if=/dev/zero of=./zfs.log2 bs=1k count=1024000  
[root@db-172-16-3-150 ~]# cd /opt  
[root@db-172-16-3-150 opt]# dd if=/dev/zero of=./zfs.disk1 bs=1k count=1024000  
[root@db-172-16-3-150 opt]# dd if=/dev/zero of=./zfs.disk2 bs=1k count=1024000  
[root@db-172-16-3-150 opt]# dd if=/dev/zero of=./zfs.disk3 bs=1k count=1024000  
[root@db-172-16-3-150 opt]# dd if=/dev/zero of=./zfs.disk4 bs=1k count=1024000  
Create the zpool and test performance (the cache must be a block device; a plain file cannot be used for it).  
[root@db-172-16-3-150 ssd4]# ll /dev/disk/by-id/  
total 0  
lrwxrwxrwx 1 root root  9 Apr 23 08:59 ata-OCZ-REVODRIVE3_OCZ-886PWVEQ351TAPNH -> ../../sdb  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 ata-OCZ-REVODRIVE3_OCZ-886PWVEQ351TAPNH-part1 -> ../../sdb1  
lrwxrwxrwx 1 root root  9 Apr 23 08:59 ata-OCZ-REVODRIVE3_OCZ-Z2134R0TLQBNE659 -> ../../sda  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 ata-OCZ-REVODRIVE3_OCZ-Z2134R0TLQBNE659-part1 -> ../../sda1  
lrwxrwxrwx 1 root root  9 Apr 23 08:59 scsi-360026b902fe2ce001261fa4506592f80 -> ../../sdc  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 scsi-360026b902fe2ce001261fa4506592f80-part1 -> ../../sdc1  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 scsi-360026b902fe2ce001261fa4506592f80-part2 -> ../../sdc2  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 scsi-360026b902fe2ce001261fa4506592f80-part3 -> ../../sdc3  
lrwxrwxrwx 1 root root  9 Apr 23 08:59 scsi-360026b902fe2ce0018993f2f0c5734b3 -> ../../sdd  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 scsi-360026b902fe2ce0018993f2f0c5734b3-part1 -> ../../sdd1  
lrwxrwxrwx 1 root root  9 Apr 23 08:59 scsi-SATA_OCZ-REVODRIVE3_OCZ-886PWVEQ351TAPNH -> ../../sdb  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 scsi-SATA_OCZ-REVODRIVE3_OCZ-886PWVEQ351TAPNH-part1 -> ../../sdb1  
lrwxrwxrwx 1 root root  9 Apr 23 08:59 scsi-SATA_OCZ-REVODRIVE3_OCZ-Z2134R0TLQBNE659 -> ../../sda  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 scsi-SATA_OCZ-REVODRIVE3_OCZ-Z2134R0TLQBNE659-part1 -> ../../sda1  
lrwxrwxrwx 1 root root  9 Apr 23 08:59 wwn-0x5e83a97e5dbf17f7 -> ../../sdb  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 wwn-0x5e83a97e5dbf17f7-part1 -> ../../sdb1  
lrwxrwxrwx 1 root root  9 Apr 23 08:59 wwn-0x5e83a97e827c316e -> ../../sda  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 wwn-0x5e83a97e827c316e-part1 -> ../../sda1  
lrwxrwxrwx 1 root root  9 Apr 23 08:59 wwn-0x60026b902fe2ce001261fa4506592f80 -> ../../sdc  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 wwn-0x60026b902fe2ce001261fa4506592f80-part1 -> ../../sdc1  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 wwn-0x60026b902fe2ce001261fa4506592f80-part2 -> ../../sdc2  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 wwn-0x60026b902fe2ce001261fa4506592f80-part3 -> ../../sdc3  
lrwxrwxrwx 1 root root  9 Apr 23 08:59 wwn-0x60026b902fe2ce0018993f2f0c5734b3 -> ../../sdd  
lrwxrwxrwx 1 root root 10 Apr 23 08:59 wwn-0x60026b902fe2ce0018993f2f0c5734b3-part1 -> ../../sdd1  
  
[root@db-172-16-3-150 ssd4]# /opt/zfs0.6.2/sbin/zpool create zptest /opt/zfs.disk1 /opt/zfs.disk2 /opt/zfs.disk3 /opt/zfs.disk4 log mirror /ssd4/zfs.log1 /ssd4/zfs.log2 cache /dev/disk/by-id/scsi-SATA_OCZ-REVODRIVE3_OCZ-Z2134R0TLQBNE659-part1  
  
[root@db-172-16-3-150 ssd4]# /opt/zfs0.6.2/sbin/zpool status zptest  
  pool: zptest  
 state: ONLINE  
  scan: none requested  
config:  
  
        NAME                STATE     READ WRITE CKSUM  
        zptest              ONLINE       0     0     0  
          /opt/zfs.disk1    ONLINE       0     0     0  
          /opt/zfs.disk2    ONLINE       0     0     0  
          /opt/zfs.disk3    ONLINE       0     0     0  
          /opt/zfs.disk4    ONLINE       0     0     0  
        logs  
          mirror-4          ONLINE       0     0     0  
            /ssd4/zfs.log1  ONLINE       0     0     0  
            /ssd4/zfs.log2  ONLINE       0     0     0  
        cache  
          sda1              ONLINE       0     0     0  
  
errors: No known data errors  
  
[root@db-172-16-3-150 ssd4]# df -h  
Filesystem            Size  Used Avail Use% Mounted on  
/dev/sdc1              29G  9.3G   19G  34% /  
tmpfs                  48G     0   48G   0% /dev/shm  
/dev/sdc3              98G   34G   59G  37% /opt  
/dev/sdd1             183G   33G  142G  19% /ssd1  
/dev/sdb1             221G   42G  168G  20% /ssd4  
zptest                3.9G     0  3.9G   0% /zptest  
After a reboot, the pool's filesystem has to be mounted with the zfs command.  
# /opt/zfs0.6.2/sbin/zfs mount zptest  
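To avoid doing this by hand after every reboot, a minimal sketch (assuming, as observed above, that the pool is still known from /etc/zfs/zpool.cache and only needs mounting) is to append the steps to /etc/rc.local:  
# cat >> /etc/rc.local <<'EOF'  
/sbin/modprobe zfs  
/opt/zfs0.6.2/sbin/zfs mount -a  
EOF  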
  
Use pg_test_fsync to test the performance of the fsync interface:  
[root@db-172-16-3-150 ssd4]# cd /zptest  
[root@db-172-16-3-150 zptest]# mkdir pg93  
[root@db-172-16-3-150 zptest]# chown pg93:pg93 pg93  
[root@db-172-16-3-150 zptest]# su - pg93  
pg93@db-172-16-3-150-> cd /zptest/pg93/  
pg93@db-172-16-3-150-> pg_test_fsync   
5 seconds per test  
O_DIRECT supported on this platform for open_datasync and open_sync.  
  
Compare file sync methods using one 64kB write:  
(in wal_sync_method preference order, except fdatasync  
is Linux's default)  
        open_datasync                                n/a*  
        fdatasync                         778.117 ops/sec    1285 usecs/op  
        fsync                             756.724 ops/sec    1321 usecs/op  
        fsync_writethrough                            n/a  
        open_sync                                    n/a*  
* This file system and its mount options do not support direct  
I/O, e.g. ext4 in journaled mode.  
  
Compare file sync methods using two 64kB writes:  
(in wal_sync_method preference order, except fdatasync  
is Linux's default)  
        open_datasync                                n/a*  
        fdatasync                          96.185 ops/sec   10397 usecs/op  
        fsync                             369.918 ops/sec    2703 usecs/op  
        fsync_writethrough                            n/a  
        open_sync                                    n/a*  
* This file system and its mount options do not support direct  
I/O, e.g. ext4 in journaled mode.  
  
Compare open_sync with different write sizes:  
(This is designed to compare the cost of writing 16kB  
in different write open_sync sizes.)  
         1 * 16kB open_sync write                    n/a*  
         2 *  8kB open_sync writes                   n/a*  
         4 *  4kB open_sync writes                   n/a*  
         8 *  2kB open_sync writes                   n/a*  
        16 *  1kB open_sync writes                   n/a*  
  
Test if fsync on non-write file descriptor is honored:  
(If the times are similar, fsync() can sync data written  
on a different descriptor.)  
        write, fsync, close               746.761 ops/sec    1339 usecs/op  
        write, close, fsync               217.356 ops/sec    4601 usecs/op  
  
Non-Sync'ed 64kB writes:  
        write                           37511.610 ops/sec      27 usecs/op  
During the test, iostat shows that the writes are actually concentrated on the LOG device of zptest, which matches how the ZFS log is supposed to work.  
avg-cpu:  %user   %nice %system %iowait  %steal   %idle  
           0.00    0.00    4.05    7.59    0.00   88.35  
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util  
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00  
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00  
sdb               0.00 26688.00    0.00 2502.00     0.00 226848.00    90.67     1.29    0.52   0.31  78.10  
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00  
  
Now test reads and writes with PostgreSQL, using synchronous commit; a small shared_buffers setting makes the contribution of ZFS more visible.  
pg93@db-172-16-3-150-> initdb -E UTF8 --locale=C -D /zptest/pg93/pg_root -U postgres -W  
pg93@db-172-16-3-150-> cd /zptest/pg93/pg_root  
vi postgresql.conf  
listen_addresses = '0.0.0.0'            # what IP address(es) to listen on;  
port = 1922                             # (change requires restart)  
max_connections = 100                   # (change requires restart)  
unix_socket_directories = '.'   # comma-separated list of directories  
tcp_keepalives_idle = 60                # TCP_KEEPIDLE, in seconds;  
tcp_keepalives_interval = 10            # TCP_KEEPINTVL, in seconds;  
tcp_keepalives_count = 10               # TCP_KEEPCNT;  
shared_buffers = 32MB                   # min 128kB  
maintenance_work_mem = 512MB            # min 1MB  
vacuum_cost_delay = 10                  # 0-100 milliseconds  
vacuum_cost_limit = 10000               # 1-10000 credits  
bgwriter_delay = 10ms                   # 10-10000ms between rounds  
checkpoint_segments = 3         # in logfile segments, min 1, 16MB each  
effective_cache_size = 96000MB  
log_destination = 'csvlog'              # Valid values are combinations of  
logging_collector = on          # Enable capturing of stderr and csvlog  
log_truncate_on_rotation = off          # If on, an existing log file with the  
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose           # terse, default, or verbose messages  
log_timezone = 'PRC'  
autovacuum = on                 # Enable autovacuum subprocess?  'on'  
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and  
datestyle = 'iso, mdy'  
timezone = 'PRC'  
lc_messages = 'C'                       # locale for system error message  
lc_monetary = 'C'                       # locale for monetary formatting  
lc_numeric = 'C'                        # locale for number formatting  
lc_time = 'C'                           # locale for time formatting  
default_text_search_config = 'pg_catalog.english'  
  
pg93@db-172-16-3-150-> pg_ctl start -D /zptest/pg93/pg_root  
  
pg93@db-172-16-3-150-> psql -h /zptest/pg93/pg_root -p 1922 -U postgres postgres  
psql (9.3.3)  
Type "help" for help.  
postgres=# create table test(id int primary key, info text, crt_time timestamp);  
CREATE TABLE  
postgres=# create or replace function f_test(v_id int) returns void as $$  
declare  
begin  
  update test set info=md5(random()::text),crt_time=now() where id=v_id;  
  if not found then   
    insert into test values (v_id, md5(random()::text), now());  
  end if;  
  exception when SQLSTATE '23505' then  
    return;  
end;  
$$ language plpgsql strict;  
CREATE FUNCTION  
  
postgres=# select f_test(1);  
 f_test   
--------  
   
(1 row)  
  
postgres=# select * from test;  
 id |               info               |          crt_time            
----+----------------------------------+----------------------------  
  1 | 80fe8163f44df605621a557624740681 | 2014-05-16 22:01:38.677335  
(1 row)  
  
postgres=# select f_test(1);  
 f_test   
--------  
   
(1 row)  
  
postgres=# select * from test;  
 id |               info               |          crt_time            
----+----------------------------------+----------------------------  
  1 | 5b17eb0ba878e15f40f213716c05a3c5 | 2014-05-16 22:01:42.130284  
(1 row)  
  
pg93@db-172-16-3-150-> vi test.sql  
\setrandom id 1 500000  
select f_test(:id);  
  
pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -h /zptest/pg93/pg_root -p 1922 -U postgres -c 16 -j 8 -T 30 postgres   
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 8  
duration: 30 s  
number of transactions actually processed: 121879  
tps = 4060.366837 (including connections establishing)  
tps = 4062.166607 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.004635        \setrandom id 1 500000  
        3.930118        select f_test(:id);  
  
Next, test ext4 performance directly on the mechanical disk that holds the four ZFS data files.  
/dev/sdc3 on /opt type ext4 (rw,noatime,nodiratime)  
  
pg93@db-172-16-3-150-> pg_ctl stop -m fast -D /zptest/pg93/pg_root  
waiting for server to shut down.... done  
server stopped  
pg93@db-172-16-3-150-> exit  
logout  
[root@db-172-16-3-150 ~]# cp -r /zptest/pg93/pg_root /opt/pg_root  
[root@db-172-16-3-150 ~]# chown -R pg93:pg93 /opt/pg_root  
[root@db-172-16-3-150 ~]# su - pg93  
pg93@db-172-16-3-150-> pg_ctl start -D /opt/pg_root  
server starting  
pg93@db-172-16-3-150-> LOG:  00000: redirecting log output to logging collector process  
HINT:  Future log output will appear in directory "pg_log".  
LOCATION:  SysLogger_Start, syslogger.c:649  
pg93@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922 -U postgres postgres  
psql (9.3.3)  
Type "help" for help.  
postgres=# truncate test;  
TRUNCATE TABLE  
postgres=# \q  
  
pg93@db-172-16-3-150-> cd  
pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -h /opt/pg_root -p 1922 -U postgres -c 16 -j 8 -T 30 postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 8  
duration: 30 s  
number of transactions actually processed: 39062  
tps = 1301.492448 (including connections establishing)  
tps = 1302.085710 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.004463        \setrandom id 1 500000  
        12.278317       select f_test(:id);  
Clearly, with the LOG and second-level cache in play, ZFS wins hands down (4060 tps vs 1301 tps on the same mechanical disk).  
  
Finally, compare against ext4 directly on the SSD. Because the LOG in this example is two files sitting on a filesystem rather than raw block devices, there is a performance gap compared with ext4 running directly on the SSD:  
pg93@db-172-16-3-150-> pg_ctl stop -m fast -D /opt/pg_root  
waiting for server to shut down.... done  
server stopped  
pg93@db-172-16-3-150-> exit  
logout  
You have new mail in /var/spool/mail/root  
[root@db-172-16-3-150 ~]# cp -r /zptest/pg93/pg_root /ssd4/pg_root  
[root@db-172-16-3-150 ~]# chown -R pg93:pg93 /ssd4/pg_root  
[root@db-172-16-3-150 ~]# su - pg93  
pg93@db-172-16-3-150-> pg_ctl start -D /ssd4/pg_root  
server starting  
pg93@db-172-16-3-150-> LOG:  00000: redirecting log output to logging collector process  
HINT:  Future log output will appear in directory "pg_log".  
LOCATION:  SysLogger_Start, syslogger.c:649  
  
pg93@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922 -U postgres postgres  
psql (9.3.3)  
Type "help" for help.  
  
postgres=# truncate test;  
TRUNCATE TABLE  
postgres=# \q  
  
pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -h /ssd4/pg_root -p 1922 -U postgres -c 16 -j 8 -T 30 postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 8  
duration: 30 s  
number of transactions actually processed: 266565  
tps = 8876.805761 (including connections establishing)  
tps = 8881.884999 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.003877        \setrandom id 1 500000  
        1.793971        select f_test(:id);  
  
Next, put ext4 inside a file on the SSD (via a loop device) for comparison.  
[root@db-172-16-3-150 ~]# cd /ssd4  
[root@db-172-16-3-150 ssd4]# dd if=/dev/zero of=./test.img bs=1024k count=1024  
[root@db-172-16-3-150 ssd4]# mkfs.ext4 ./test.img   
[root@db-172-16-3-150 ssd4]# mount -o loop ./test.img /mnt  
[root@db-172-16-3-150 ssd4]# su - pg93  
pg93@db-172-16-3-150-> pg_ctl stop -m fast -D /ssd4/pg_root  
waiting for server to shut down.... done  
server stopped  
pg93@db-172-16-3-150-> exit  
logout  
[root@db-172-16-3-150 ssd4]# cp -r /ssd4/pg_root /mnt/  
[root@db-172-16-3-150 ssd4]# chown -R pg93:pg93 /mnt/pg_root  
[root@db-172-16-3-150 ssd4]# su - pg93  
pg93@db-172-16-3-150-> pg_ctl start -D /mnt/pg_root  
server starting  
pg93@db-172-16-3-150-> LOG:  00000: redirecting log output to logging collector process  
HINT:  Future log output will appear in directory "pg_log".  
LOCATION:  SysLogger_Start, syslogger.c:649  
  
pg93@db-172-16-3-150-> psql -h 127.0.0.1 -p 1922 -U postgres postgres  
psql (9.3.3)  
Type "help" for help.  
  
postgres=# truncate test;  
TRUNCATE TABLE  
postgres=# \q  
pg93@db-172-16-3-150-> pgbench -M prepared -n -r -f ./test.sql -h /mnt/pg_root -p 1922 -U postgres -c 16 -j 8 -T 30 postgres  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 8  
duration: 30 s  
number of transactions actually processed: 192445  
tps = 6410.026454 (including connections establishing)  
tps = 6412.254485 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.004118        \setrandom id 1 500000  
        2.486914        select f_test(:id);  
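  
To summarize the four pgbench runs (same workload, 16 clients, 8 threads, 30 seconds each):  
ZFS pool (file vdevs on the mechanical disk, SSD SLOG files, SSD L2ARC): 4060 tps  
ext4 on the same mechanical disk (/opt): 1301 tps  
ext4 directly on the SSD (/ssd4): 8876 tps  
ext4 on a loop file on the SSD (/mnt): 6410 tps  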

References

1. http://zfsonlinux.org/docs.html

2. http://zfsonlinux.org/

3. https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/

4. http://en.wikipedia.org/wiki/Non-RAID_drive_architectures

5. https://github.com/zfsonlinux

6. http://open-zfs.org/wiki/Main_Page

7. http://rudd-o.com/linux-and-free-software/ways-in-which-zfs-is-better-than-btrfs

8. https://java.net/projects/solaris-zfs/pages/Home

9. http://www.oracle.com/technetwork/server-storage/solaris11/documentation/index.html

