PostgreSQL OLTP on ZFS 性能优化

10 minute read

背景

环境

  PostgreSQL 9.5 rc1  
  数据块大小为8KB  
  CentOS 6.x x64  
  zfsonlinux  
  3*aliflash  
  256G内存  
  32核 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz  
  
  
pg_xlog on ext4  
  
  ext4 mount option ( defaults,noatime,nodiratime,discard,nodelalloc,data=writeback,nobarrier )  
  
$PGDATA on zfs  

zfs优化

1. 块设备对齐

fdisk -c -u /dev/dfa  
start  2048  
end  +(n*2048-1)  
  
或者使用parted分配,GPT+对齐  

2. 模块参数

cd /sys/module/zfs/parameters/  

1.1 关闭zfs prefetch,因为是OLTP系统,不需要prefetch。

echo 1 > zfs_prefetch_disable  

1.2 修改ARC脏页一次被刷出的单位(太大会导致evict arc脏页时中断响应,建议16MB)

算法 : 系统内存 除以 2^zfs_arc_shrink_shift

所以当内存为256GB是,要配置为16MB,zfs_arc_shrink_shift必须设置为14。

echo 14 > zfs_arc_shrink_shift  

1.3 对于未使用ZLOG设备的zpool,可以将sync改为always

zfs set sync=always zp1/data01  
zfs set sync=always zp1  

3. zpool 参数

2.1 ashift和数据库的块大小对齐。这里PostgreSQL使用了默认的8KB。

所以ashift选择13。

2^13=8192  
  
zpool create -o ashift=13 zp1 dfa1 dfb1 dfc1  

4. zfs 参数

recordsize 对齐数据库块大小 = 8K  
primarycache = metadata  
secondarycache = none  
atime = off  
logbias = throughput    (直接写数据,因为没有使用ZLOG,不需要用标准的)  

5. postgresql 参数

listen_addresses = '0.0.0.0'            # what IP address(es) to listen on;  
port = 1921                             # (change requires restart)  
max_connections = 1000                  # (change requires restart)  
unix_socket_directories = '.'   # comma-separated list of directories  
shared_buffers = 32GB                   # min 128kB  
maintenance_work_mem = 512MB            # min 1MB  
autovacuum_work_mem = 512MB             # min 1MB, or -1 to use maintenance_work_mem  
dynamic_shared_memory_type = posix      # the default is the first option  
bgwriter_delay = 10ms                   # 10-10000ms between rounds  
wal_level = hot_standby  # minimal, archive, hot_standby, or logical  
synchronous_commit = off                # synchronization level;  
max_wal_size = 32GB  
max_wal_senders = 10            # max number of walsender processes  
max_replication_slots = 10      # max number of replication slots  
hot_standby = on                        # "on" allows queries during recovery  
wal_receiver_status_interval = 1s       # send replies at least this often  
hot_standby_feedback = off               # send info from standby to prevent  
random_page_cost = 1.0                  # same scale as above  
effective_cache_size = 256GB  
log_destination = 'csvlog'              # Valid values are combinations of  
logging_collector = on          # Enable capturing of stderr and csvlog  
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose           # terse, default, or verbose messages  
log_timezone = 'PRC'  
datestyle = 'iso, mdy'  
timezone = 'PRC'  
lc_messages = 'C'                       # locale for system error message  
lc_monetary = 'C'                       # locale for monetary formatting  
lc_numeric = 'C'                        # locale for number formatting  
lc_time = 'C'                           # locale for time formatting  
default_text_search_config = 'pg_catalog.english'  

6. 测试PostgreSQL TPC-B

5亿 tpc-b测试数据

pgbench -i -s 5000  
pgbench -M prepared -n -r -P 5 -c 48 -j 48 -T 7200  

测试结果:

pic

zfs下的性能约为XFS的75%.

ZFS

transaction type: TPC-B (sort of)  
scaling factor: 5000  
query mode: prepared  
number of clients: 48  
number of threads: 48  
duration: 7200 s  
number of transactions actually processed: 54221472  
latency average: 6.370 ms  
latency stddev: 13.424 ms  
tps = 7530.645849 (including connections establishing)  
tps = 7530.676229 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.006580        \set nbranches 1 * :scale  
        0.001856        \set ntellers 10 * :scale  
        0.001427        \set naccounts 100000 * :scale  
        0.002671        \setrandom aid 1 :naccounts  
        0.001598        \setrandom bid 1 :nbranches  
        0.001533        \setrandom tid 1 :ntellers  
        0.001618        \setrandom delta -5000 5000  
        0.146576        BEGIN;  
        3.357134        UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
        0.199865        SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
        1.036640        UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
        0.636415        UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
        0.523942        INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
        0.434377        END;  

XFS

transaction type: TPC-B (sort of)  
scaling factor: 5000  
query mode: prepared  
number of clients: 48  
number of threads: 48  
duration: 7200 s  
number of transactions actually processed: 78512059  
latency average: 4.400 ms  
latency stddev: 10.051 ms  
tps = 10904.276312 (including connections establishing)  
tps = 10904.307274 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.003500        \set nbranches 1 * :scale  
        0.000971        \set ntellers 10 * :scale  
        0.000787        \set naccounts 100000 * :scale  
        0.001327        \setrandom aid 1 :naccounts  
        0.001081        \setrandom bid 1 :nbranches  
        0.000894        \setrandom tid 1 :ntellers  
        0.000924        \setrandom delta -5000 5000  
        0.096122        BEGIN;  
        1.521620        UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
        0.121572        SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
        1.035498        UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
        0.631052        UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
        0.524147        INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
        0.450451        END;  

附件

1.

#modinfo zfs  
filename:       /lib/modules/3.18.24/extra/zfs/zfs.ko  
version:        0.6.5.3-1  
license:        CDDL  
author:         OpenZFS on Linux  
description:    ZFS  
srcversion:     CEB8F91B3D53F4A2844D531  
depends:        spl,zcommon,znvpair,zavl,zunicode  
vermagic:       3.18.24 SMP mod_unload modversions   
parm:           zvol_inhibit_dev:Do not create zvol device nodes (uint)  
parm:           zvol_major:Major number for zvol device (uint)  
parm:           zvol_max_discard_blocks:Max number of blocks to discard (ulong)  
parm:           zvol_prefetch_bytes:Prefetch N bytes at zvol start+end (uint)  
parm:           zio_delay_max:Max zio millisec delay before posting event (int)  
parm:           zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (int)  
parm:           zfs_sync_pass_deferred_free:Defer frees starting in this pass (int)  
parm:           zfs_sync_pass_dont_compress:Don't compress starting in this pass (int)  
parm:           zfs_sync_pass_rewrite:Rewrite new bps starting in this pass (int)  
parm:           zil_replay_disable:Disable intent logging replay (int)  
parm:           zfs_nocacheflush:Disable cache flushes (int)  
parm:           zil_slog_limit:Max commit bytes to separate log device (ulong)  
parm:           zfs_read_chunk_size:Bytes to read per chunk (long)  
parm:           zfs_immediate_write_sz:Largest data block to write to zil (long)  
parm:           zfs_dbgmsg_enable:Enable ZFS debug message log (int)  
parm:           zfs_dbgmsg_maxsize:Maximum ZFS debug log size (int)  
parm:           zfs_admin_snapshot:Enable mkdir/rmdir/mv in .zfs/snapshot (int)  
parm:           zfs_expire_snapshot:Seconds to expire .zfs/snapshot (int)  
parm:           zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)  
parm:           zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)  
parm:           zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)  
parm:           zfs_vdev_max_active:Maximum number of active I/Os per vdev (int)  
parm:           zfs_vdev_async_write_active_max_dirty_percent:Async write concurrency max threshold (int)  
parm:           zfs_vdev_async_write_active_min_dirty_percent:Async write concurrency min threshold (int)  
parm:           zfs_vdev_async_read_max_active:Max active async read I/Os per vdev (int)  
parm:           zfs_vdev_async_read_min_active:Min active async read I/Os per vdev (int)  
parm:           zfs_vdev_async_write_max_active:Max active async write I/Os per vdev (int)  
parm:           zfs_vdev_async_write_min_active:Min active async write I/Os per vdev (int)  
parm:           zfs_vdev_scrub_max_active:Max active scrub I/Os per vdev (int)  
parm:           zfs_vdev_scrub_min_active:Min active scrub I/Os per vdev (int)  
parm:           zfs_vdev_sync_read_max_active:Max active sync read I/Os per vdev (int)  
parm:           zfs_vdev_sync_read_min_active:Min active sync read I/Os per vdev (int)  
parm:           zfs_vdev_sync_write_max_active:Max active sync write I/Os per vdev (int)  
parm:           zfs_vdev_sync_write_min_active:Min active sync write I/Os per vdev (int)  
parm:           zfs_vdev_mirror_switch_us:Switch mirrors every N usecs (int)  
parm:           zfs_vdev_scheduler:I/O scheduler (charp)  
parm:           zfs_vdev_cache_max:Inflate reads small than max (int)  
parm:           zfs_vdev_cache_size:Total size of the per-disk cache (int)  
parm:           zfs_vdev_cache_bshift:Shift size to inflate reads too (int)  
parm:           metaslabs_per_vdev:Divide added vdev into approximately (but no more than) this number of metaslabs (int)  
parm:           zfs_txg_timeout:Max seconds worth of delta per txg (int)  
parm:           zfs_read_history:Historic statistics for the last N reads (int)  
parm:           zfs_read_history_hits:Include cache hits in read history (int)  
parm:           zfs_txg_history:Historic statistics for the last N txgs (int)  
parm:           zfs_flags:Set additional debugging flags (uint)  
parm:           zfs_recover:Set to attempt to recover from fatal errors (int)  
parm:           zfs_free_leak_on_eio:Set to ignore IO errors during free and permanently leak the space (int)  
parm:           zfs_deadman_synctime_ms:Expiration time in milliseconds (ulong)  
parm:           zfs_deadman_enabled:Enable deadman timer (int)  
parm:           spa_asize_inflation:SPA size estimate multiplication factor (int)  
parm:           spa_slop_shift:Reserved free space in pool (int)  
parm:           spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)  
parm:           zfs_autoimport_disable:Disable pool import at module load (int)  
parm:           spa_load_verify_maxinflight:Max concurrent traversal I/Os while verifying pool during import -X (int)  
parm:           spa_load_verify_metadata:Set to traverse metadata on pool import (int)  
parm:           spa_load_verify_data:Set to traverse data on pool import (int)  
parm:           metaslab_aliquot:allocation granularity (a.k.a. stripe size) (ulong)  
parm:           metaslab_debug_load:load all metaslabs when pool is first opened (int)  
parm:           metaslab_debug_unload:prevent metaslabs from being unloaded (int)  
parm:           metaslab_preload_enabled:preload potential metaslabs during reassessment (int)  
parm:           zfs_mg_noalloc_threshold:percentage of free space for metaslab group to allow allocation (int)  
parm:           zfs_mg_fragmentation_threshold:fragmentation for metaslab group to allow allocation (int)  
parm:           zfs_metaslab_fragmentation_threshold:fragmentation for metaslab to allow allocation (int)  
parm:           metaslab_fragmentation_factor_enabled:use the fragmentation metric to prefer less fragmented metaslabs (int)  
parm:           metaslab_lba_weighting_enabled:prefer metaslabs with lower LBAs (int)  
parm:           metaslab_bias_enabled:enable metaslab group biasing (int)  
parm:           zfs_zevent_len_max:Max event queue length (int)  
parm:           zfs_zevent_cols:Max event column width (int)  
parm:           zfs_zevent_console:Log events to the console (int)  
parm:           zfs_top_maxinflight:Max I/Os per top-level (int)  
parm:           zfs_resilver_delay:Number of ticks to delay resilver (int)  
parm:           zfs_scrub_delay:Number of ticks to delay scrub (int)  
parm:           zfs_scan_idle:Idle window in clock ticks (int)  
parm:           zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)  
parm:           zfs_free_min_time_ms:Min millisecs to free per txg (int)  
parm:           zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)  
parm:           zfs_no_scrub_io:Set to disable scrub I/O (int)  
parm:           zfs_no_scrub_prefetch:Set to disable scrub prefetching (int)  
parm:           zfs_free_max_blocks:Max number of blocks freed in one txg (ulong)  
parm:           zfs_dirty_data_max_percent:percent of ram can be dirty (int)  
parm:           zfs_dirty_data_max_max_percent:zfs_dirty_data_max upper bound as % of RAM (int)  
parm:           zfs_delay_min_dirty_percent:transaction delay threshold (int)  
parm:           zfs_dirty_data_max:determines the dirty space limit (ulong)  
parm:           zfs_dirty_data_max_max:zfs_dirty_data_max upper bound in bytes (ulong)  
parm:           zfs_dirty_data_sync:sync txg when this much dirty data (ulong)  
parm:           zfs_delay_scale:how quickly delay approaches infinity (ulong)  
parm:           zfs_max_recordsize:Max allowed record size (int)  
parm:           zfs_prefetch_disable:Disable all ZFS prefetching (int)  
parm:           zfetch_max_streams:Max number of streams per zfetch (uint)  
parm:           zfetch_min_sec_reap:Min time before stream reclaim (uint)  
parm:           zfetch_block_cap:Max number of blocks to fetch at a time (uint)  
parm:           zfetch_array_rd_sz:Number of bytes in a array_read (ulong)  
parm:           zfs_pd_bytes_max:Max number of bytes to prefetch (int)  
parm:           zfs_send_corrupt_data:Allow sending corrupt data (int)  
parm:           zfs_mdcomp_disable:Disable meta data compression (int)  
parm:           zfs_nopwrite_enabled:Enable NOP writes (int)  
parm:           zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)  
parm:           zfs_dbuf_state_index:Calculate arc header index (int)  
parm:           zfs_arc_min:Min arc size (ulong)  
parm:           zfs_arc_max:Max arc size (ulong)  
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)  
parm:           zfs_arc_meta_min:Min arc metadata (ulong)  
parm:           zfs_arc_meta_prune:Meta objects to scan for prune (int)  
parm:           zfs_arc_meta_adjust_restarts:Limit number of restarts in arc_adjust_meta (int)  
parm:           zfs_arc_meta_strategy:Meta reclaim strategy (int)  
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)  
parm:           zfs_arc_p_aggressive_disable:disable aggressive arc_p grow (int)  
parm:           zfs_arc_p_dampener_disable:disable arc_p adapt dampener (int)  
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)  
parm:           zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)  
parm:           zfs_disable_dup_eviction:disable duplicate buffer eviction (int)  
parm:           zfs_arc_average_blocksize:Target average block size (int)  
parm:           zfs_arc_min_prefetch_lifespan:Min life of prefetch block (int)  
parm:           zfs_arc_num_sublists_per_state:Number of sublists used in each of the ARC state lists (int)  
parm:           l2arc_write_max:Max write bytes per interval (ulong)  
parm:           l2arc_write_boost:Extra write bytes during device warmup (ulong)  
parm:           l2arc_headroom:Number of max device writes to precache (ulong)  
parm:           l2arc_headroom_boost:Compressed l2arc_headroom multiplier (ulong)  
parm:           l2arc_feed_secs:Seconds between L2ARC writing (ulong)  
parm:           l2arc_feed_min_ms:Min feed interval in milliseconds (ulong)  
parm:           l2arc_noprefetch:Skip caching prefetched buffers (int)  
parm:           l2arc_nocompress:Skip compressing L2ARC buffers (int)  
parm:           l2arc_feed_again:Turbo L2ARC warmup (int)  
parm:           l2arc_norw:No reads during writes (int)  
parm:           zfs_arc_lotsfree_percent:System free memory I/O throttle in bytes (int)  
parm:           zfs_arc_sys_free:System free memory target size in bytes (ulong)  
  
  
#zfs get all zp1/data01  
NAME        PROPERTY              VALUE                  SOURCE  
zp1/data01  type                  filesystem             -  
zp1/data01  creation              Tue Dec 29 11:11 2015  -  
zp1/data01  used                  73.3G                  -  
zp1/data01  available             5.01T                  -  
zp1/data01  referenced            73.3G                  -  
zp1/data01  compressratio         1.00x                  -  
zp1/data01  mounted               yes                    -  
zp1/data01  quota                 none                   default  
zp1/data01  reservation           none                   default  
zp1/data01  recordsize            8K                     inherited from zp1  
zp1/data01  mountpoint            /data01                local  
zp1/data01  sharenfs              off                    default  
zp1/data01  checksum              on                     default  
zp1/data01  compression           off                    default  
zp1/data01  atime                 off                    inherited from zp1  
zp1/data01  devices               on                     default  
zp1/data01  exec                  on                     default  
zp1/data01  setuid                on                     default  
zp1/data01  readonly              off                    default  
zp1/data01  zoned                 off                    default  
zp1/data01  snapdir               hidden                 default  
zp1/data01  aclinherit            restricted             default  
zp1/data01  canmount              on                     default  
zp1/data01  xattr                 on                     default  
zp1/data01  copies                1                      default  
zp1/data01  version               5                      -  
zp1/data01  utf8only              off                    -  
zp1/data01  normalization         none                   -  
zp1/data01  casesensitivity       sensitive              -  
zp1/data01  vscan                 off                    default  
zp1/data01  nbmand                off                    default  
zp1/data01  sharesmb              off                    default  
zp1/data01  refquota              none                   default  
zp1/data01  refreservation        none                   default  
zp1/data01  primarycache          metadata               local  
zp1/data01  secondarycache        none                   local  
zp1/data01  usedbysnapshots       0                      -  
zp1/data01  usedbydataset         73.3G                  -  
zp1/data01  usedbychildren        0                      -  
zp1/data01  usedbyrefreservation  0                      -  
zp1/data01  logbias               throughput             local  
zp1/data01  dedup                 off                    default  
zp1/data01  mlslabel              none                   default  
zp1/data01  sync                  standard               default  
zp1/data01  refcompressratio      1.00x                  -  
zp1/data01  written               73.3G                  -  
zp1/data01  logicalused           72.8G                  -  
zp1/data01  logicalreferenced     72.8G                  -  
zp1/data01  filesystem_limit      none                   default  
zp1/data01  snapshot_limit        none                   default  
zp1/data01  filesystem_count      none                   default  
zp1/data01  snapshot_count        none                   default  
zp1/data01  snapdev               hidden                 default  
zp1/data01  acltype               off                    default  
zp1/data01  context               none                   default  
zp1/data01  fscontext             none                   default  
zp1/data01  defcontext            none                   default  
zp1/data01  rootcontext           none                   default  
zp1/data01  relatime              off                    default  
zp1/data01  redundant_metadata    all                    default  
zp1/data01  overlay               off                    default  
  
#zpool get all zp1  
NAME  PROPERTY                    VALUE                       SOURCE  
zp1   size                        5.25T                       -  
zp1   capacity                    1%                          -  
zp1   altroot                     -                           default  
zp1   health                      ONLINE                      -  
zp1   guid                        5337829197153510332         default  
zp1   version                     -                           default  
zp1   bootfs                      -                           default  
zp1   delegation                  on                          default  
zp1   autoreplace                 off                         default  
zp1   cachefile                   -                           default  
zp1   failmode                    wait                        default  
zp1   listsnapshots               off                         default  
zp1   autoexpand                  off                         default  
zp1   dedupditto                  0                           default  
zp1   dedupratio                  1.00x                       -  
zp1   free                        5.18T                       -  
zp1   allocated                   74.1G                       -  
zp1   readonly                    off                         -  
zp1   ashift                      13                          local  
zp1   comment                     -                           default  
zp1   expandsize                  -                           -  
zp1   freeing                     0                           default  
zp1   fragmentation               5%                          -  
zp1   leaked                      0                           default  
zp1   feature@async_destroy       enabled                     local  
zp1   feature@empty_bpobj         active                      local  
zp1   feature@lz4_compress        active                      local  
zp1   feature@spacemap_histogram  active                      local  
zp1   feature@enabled_txg         active                      local  
zp1   feature@hole_birth          active                      local  
zp1   feature@extensible_dataset  enabled                     local  
zp1   feature@embedded_data       active                      local  
zp1   feature@bookmarks           enabled                     local  
zp1   feature@filesystem_limits   enabled                     local  
zp1   feature@large_blocks        enabled                     local  

参考

https://www.cupfighter.net/2012/10/default-nexenta-zfs-settings-you-want-to-change

https://www.cupfighter.net/2013/03/default-nexenta-zfs-settings-you-want-to-change-part-2

http://open-zfs.org/wiki/Performance_tuning

https://docs.oracle.com/cd/E26502_01/html/E29022/chapterzfs-db1.html#chapterzfs-db2

http://docs.oracle.com/cd/E19253-01/819-5461/givdo/index.html

https://icesquare.com/wordpress/how-to-improve-zfs-performance/

http://zfsonlinux.org/faq.html#HowDoesZFSonLinuxHandlesAdvacedFormatDrives

http://serverfault.com/users/283629/brian-thomas

http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/

<PostgreSQL + ZFS Best Practices and Standard Procedures>

Flag Counter

digoal’s 大量PostgreSQL文章入口