ZFS case: top shows 100% sys CPU, triggered when free memory runs out

Background

Recently a system has repeatedly seen its load suddenly spike to several hundred and then drop back down.

Matching the database logs against the times when the load spiked, those time points show a large number of entries like the following:

"UPDATE waiting",2015-01-09 01:38:47 CST,979/7,2927976054,LOG,00000,"process 26366 still waiting for ExclusiveLock on extension of relation 686062002 of database 35078604 after 1117.676 ms",,,,,,"  
"INSERT waiting",2015-01-09 01:38:36 CST,541/8,2927976307,LOG,00000,"process 25936 still waiting for ExclusiveLock on extension of relation 686062002 of database 35078604 after 1219.762 ms",,,,,,"  
"INSERT waiting",2015-01-09 01:38:48 CST,1018/64892,2929458056,LOG,00000,"process 26439 still waiting for ExclusiveLock on extension of relation 686061993 of database 35078604 after 1000.105 ms",  
.........  

These correspond to relation-extension waits on a few objects:

select 686062002::regclass;  
          regclass             
-----------------------------  
 pg_toast.pg_toast_686061993  
(1 row)  
select relname from pg_class where reltoastrelid=686062002;  
               relname                 
-------------------------------------  
 tbl_xxx_20150109  
(1 row)  
Time: 4.643 ms  
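
While the problem is happening, the sessions blocked on relation extension can also be inspected live. A minimal sketch (the database name is a placeholder; pg_stat_activity.query assumes PostgreSQL 9.2 or later):

psql -d mydb -c "  
SELECT l.pid, l.relation::regclass, a.query  
  FROM pg_locks l JOIN pg_stat_activity a USING (pid)  
 WHERE l.locktype = 'extend' AND NOT l.granted;"  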

At the same time, dmesg on the system showed messages like:

postgres: page allocation failure. order:1, mode:0x20  
Pid: 20427, comm: postgres Tainted: P           ---------------    2.6.32-504.el6.x86_64 #1  
Call Trace:  
 <IRQ>  [<ffffffff8113438a>] ? __alloc_pages_nodemask+0x74a/0x8d0  
 [<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170  
 [<ffffffff81173332>] ? kmem_getpages+0x62/0x170  
 [<ffffffff81173f4a>] ? fallback_alloc+0x1ba/0x270  
 [<ffffffff8117399f>] ? cache_grow+0x2cf/0x320  
 [<ffffffff81173cc9>] ? ____cache_alloc_node+0x99/0x160  
 [<ffffffff81174c4b>] ? kmem_cache_alloc+0x11b/0x190  
 [<ffffffff8144c768>] ? sk_prot_alloc+0x48/0x1c0  
 [<ffffffff8144d992>] ? sk_clone+0x22/0x2e0  
 [<ffffffff814a1b76>] ? inet_csk_clone+0x16/0xd0  
 [<ffffffff814bb713>] ? tcp_create_openreq_child+0x23/0x470  
 [<ffffffff814b8ecd>] ? tcp_v4_syn_recv_sock+0x4d/0x310  
 [<ffffffff814bb4b6>] ? tcp_check_req+0x226/0x460  
 [<ffffffff814b890b>] ? tcp_v4_do_rcv+0x35b/0x490  
 [<ffffffffa0207557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]  
 [<ffffffff814ba1a2>] ? tcp_v4_rcv+0x522/0x900  
 [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0  
 [<ffffffff81496ded>] ? ip_local_deliver_finish+0xdd/0x2d0  
 [<ffffffff81497078>] ? ip_local_deliver+0x98/0xa0  
 [<ffffffff8149653d>] ? ip_rcv_finish+0x12d/0x440  
 [<ffffffff81496ac5>] ? ip_rcv+0x275/0x350  
 [<ffffffff8145c88b>] ? __netif_receive_skb+0x4ab/0x750  
 [<ffffffff81460588>] ? netif_receive_skb+0x58/0x60  
 [<ffffffff81460690>] ? napi_skb_finish+0x50/0x70  
 [<ffffffff81461f69>] ? napi_gro_receive+0x39/0x50  
 [<ffffffffa01a7d91>] ? igb_poll+0x981/0x1010 [igb]  
 [<ffffffff814b59c0>] ? tcp_delack_timer+0x0/0x270  
 [<ffffffff814b3af9>] ? tcp_send_ack+0xd9/0x120  
 [<ffffffff81462083>] ? net_rx_action+0x103/0x2f0  
 [<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0  
 [<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170  
 [<ffffffff8107d90f>] ? __do_softirq+0x11f/0x1e0  
 [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30  
 [<ffffffff8100fc15>] ? do_softirq+0x65/0xa0  
 [<ffffffff8107d765>] ? irq_exit+0x85/0x90  
 [<ffffffff81533b45>] ? do_IRQ+0x75/0xf0  
 [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11  
 <EOI>  [<ffffffff8116f5f9>] ? compaction_alloc+0x269/0x4b0  
 [<ffffffff8116f552>] ? compaction_alloc+0x1c2/0x4b0  
 [<ffffffff811799fa>] ? migrate_pages+0xaa/0x480  
 [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13  
 [<ffffffff8116f390>] ? compaction_alloc+0x0/0x4b0  
 [<ffffffff8116e9ea>] ? compact_zone+0x61a/0xba0  
 [<ffffffff8116f01c>] ? compact_zone_order+0xac/0x100  
 [<ffffffff8116f151>] ? try_to_compact_pages+0xe1/0x120  
 [<ffffffff81133b6a>] ? __alloc_pages_direct_compact+0xda/0x1b0  
 [<ffffffff81134055>] ? __alloc_pages_nodemask+0x415/0x8d0  
 [<ffffffff8116c79a>] ? alloc_pages_vma+0x9a/0x150  
 [<ffffffff8118845d>] ? do_huge_pmd_anonymous_page+0x14d/0x3b0  
 [<ffffffff8114fdb0>] ? handle_mm_fault+0x2f0/0x300  
 [<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480  
 [<ffffffff8152ae5e>] ? mutex_lock+0x1e/0x50  
 [<ffffffff8152ffbe>] ? do_page_fault+0x3e/0xa0  
 [<ffffffff8152d375>] ? page_fault+0x25/0x30  

This is a log table with 4 indexes. One of its variable-length columns stores fairly long values (hence TOAST storage is used), for example:

DxxxxxxxxxxxxzwLlyyDd7xGd7^7xxwLDxyD@5xHB7^if5^vv4&DJCEL7xxxCFyhsxxxd4x~j2%$BB%ChkzHlzzvxBwqn5^DDCFexzwC@zyLDz  
zC~zyDDzyCbAyyh3M~v5^DDCHvBBy%j0%iL4^fJB%K1xxxB%G1wz~h2M%B4%qn5&7xxwyPs!$xJ!Dd7xCb3^DFCGLnzyzlzyP7zyCJ7x)Lx^xxxxxxxy73$rLB&DND  
zL5zy~xxyCt4xPj4%DJCE~DzyP#zyLPzyypxxx3&~DB^P1zzC5zye5wzz10MCb3^Gp4^DLCEiNywi$yzvxBwL73&$F7%7xzwG5zyy5wyah4MbzB%C1DzL9zyf5yyG9z  
y!1zyLJxyCt4xPP5%nLB&xxxx  
&7xxwjHzzi#yyi$yzi$yzmHIPm^K@CbAzzh5MDBCGLxxxxwz~h5M$JB%DxDzGlzyH5zyL7zzylzyC9AyLxxx7xxwC5AyzNxxxxxx5%Pp0  
^~d0&6NzwK1wyzN1M%xxxx^P74^DJCGD7yyvBByiF4&Pt0%~d0&6hwwvDByiF4&et5xxxxxx17%$$1^DBCFGfwxzh2M!j1%qv5^DLCF!NzyH5zzvJBy%h4  
%aD4&%v4^61zwDnyyK5wyzN0Mxxxx5$71zwCd7xCv0&fj5&(h1%yNc%mf7%71zwxxx%a@4%rpd^a5d$71zwCt5x!l1$~^l^LDx&K1wzmh5MxxxxxxxxxxGr4&Dvd&$jl  
%LBx%K1wzPh5M$Ll^yn1&Ht6+fxxxxxxwC@5xqnxxx&~90^fj5+Oh5%71zwDJ7xH~j&yDh$G@5^7xxxx^Gp4@DLCEf1yw7b0~bn0^%^i&HDzyebz  
y6)zyLBzyfdyyvxBwH#5%GP5^nvd&$LB^DN5zqHA^%P1^nLlEDL5%$@n^i#4^$J7%nPn+bzF@Ct4xD~1&GB4^add%7xxw!7zybBxyC@5xqn5xxxxx!DLCEvBxxxd6&!vL@7xx  
w)dyECn5&)B1zxxxxxOz)D&HND&C9DOOl1yzpD&xxxxxx1^6hzwLrzyf3zxxx&L73+Gp4@DLCEiNywi$yzvxBw$h5%Hdf&(l9%zh0%nHB^D1zz~7zyvzB  
yHl6%jl4!!Fg^rLB%DhDzC5xxxxb7j&aH4^)txxxDzvxBw$v0%DJCEzNxxxxzyzr1y~Lzx(vA&(@zyrtCyy5DxxxHl5OHbDO$3DxxxyyN0MD~1%afd&71z  
wDd7xj17xxxxx$)7^7N2wq8*=  

A new table is created every day, so data blocks are constantly being extended. In theory extension is fast and should not cause the behavior above; besides, the data volume and concurrency at the time of the problem were normal.

Regarding this kind of wait, see an earlier article about the extend-lock wait performance problem encountered during bulk loading:

http://blog.163.com/digoal@126/blog/static/163877040201392641033482

That case is unrelated to the performance problem in this article.

It looked like a ZFS problem, and the investigation eventually found the trigger:

the memory reported as free kept shrinking, and as soon as it dropped to 0 the load spiked immediately.
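
A minimal sketch of one way to watch free memory, ARC size and load together (the field names come from /proc/meminfo, /proc/spl/kstat/zfs/arcstats and /proc/loadavg; the one-minute interval is arbitrary):

#!/bin/bash  
# log free memory, ZFS ARC size and 1-minute load so the drop towards 0  
# can later be correlated with the load spikes  
while true; do  
    free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)  
    arc_bytes=$(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats)  
    load1=$(cut -d' ' -f1 /proc/loadavg)  
    echo "$(date '+%F %T') MemFree=${free_kb}kB ARC=${arc_bytes}B load1=${load1}"  
    sleep 60  
done  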

Environment:

CentOS 6.x x64  
2.6.32-504.el6.x86_64  

ZFS version:

zfs-0.6.3-1.1.el6.x86_64  
libzfs2-0.6.3-1.1.el6.x86_64  
zfs-dkms-0.6.3-1.1.el6.noarch  

Server memory: 384 GB.

Database settings: shared_buffers = 20GB, maintenance_work_mem = 2GB, autovacuum_max_workers = 6.

Excluding work_mem, the database will use at most about 32 GB of memory (20 GB of shared_buffers plus 6 autovacuum workers × 2 GB maintenance_work_mem).

That leaves 300+ GB for the OS and ZFS.

The ZFS module parameters are as follows:

cd /sys/module/zfs/parameters  
# grep '' *|sort   
l2arc_feed_again:1  
l2arc_feed_min_ms:200  
l2arc_feed_secs:1  
l2arc_headroom:2  
l2arc_headroom_boost:200  
l2arc_nocompress:0  
l2arc_noprefetch:1  
l2arc_norw:0  
l2arc_write_boost:8388608  
l2arc_write_max:8388608  
metaslab_debug_load:0  
metaslab_debug_unload:0  
spa_asize_inflation:24  
spa_config_path:/etc/zfs/zpool.cache  
zfetch_array_rd_sz:1048576  
zfetch_block_cap:256  
zfetch_max_streams:8  
zfetch_min_sec_reap:2  
zfs_arc_grow_retry:5  
zfs_arc_max:10240000000  
zfs_arc_memory_throttle_disable:1  
zfs_arc_meta_limit:0  
zfs_arc_meta_prune:1048576  
zfs_arc_min:0  
zfs_arc_min_prefetch_lifespan:1000  
zfs_arc_p_aggressive_disable:1  
zfs_arc_p_dampener_disable:1  
zfs_arc_shrink_shift:5  
zfs_autoimport_disable:0  
zfs_dbuf_state_index:0  
zfs_deadman_enabled:1  
zfs_deadman_synctime_ms:1000000  
zfs_dedup_prefetch:1  
zfs_delay_min_dirty_percent:60  
zfs_delay_scale:500000  
zfs_dirty_data_max:10240000000  
zfs_dirty_data_max_max:101595342848  
zfs_dirty_data_max_max_percent:25  
zfs_dirty_data_max_percent:10  
zfs_dirty_data_sync:67108864  
zfs_disable_dup_eviction:0  
zfs_expire_snapshot:300  
zfs_flags:1  
zfs_free_min_time_ms:1000  
zfs_immediate_write_sz:32768  
zfs_mdcomp_disable:0  
zfs_nocacheflush:0  
zfs_nopwrite_enabled:1  
zfs_no_scrub_io:0  
zfs_no_scrub_prefetch:0  
zfs_pd_blks_max:100  
zfs_prefetch_disable:0  
zfs_read_chunk_size:1048576  
zfs_read_history:0  
zfs_read_history_hits:0  
zfs_recover:0  
zfs_resilver_delay:2  
zfs_resilver_min_time_ms:3000  
zfs_scan_idle:50  
zfs_scan_min_time_ms:1000  
zfs_scrub_delay:4  
zfs_send_corrupt_data:0  
zfs_sync_pass_deferred_free:2  
zfs_sync_pass_dont_compress:5  
zfs_sync_pass_rewrite:2  
zfs_top_maxinflight:32  
zfs_txg_history:0  
zfs_txg_timeout:5  
zfs_vdev_aggregation_limit:131072  
zfs_vdev_async_read_max_active:3  
zfs_vdev_async_read_min_active:1  
zfs_vdev_async_write_active_max_dirty_percent:60  
zfs_vdev_async_write_active_min_dirty_percent:30  
zfs_vdev_async_write_max_active:10  
zfs_vdev_async_write_min_active:1  
zfs_vdev_cache_bshift:16  
zfs_vdev_cache_max:16384  
zfs_vdev_cache_size:0  
zfs_vdev_max_active:1000  
zfs_vdev_mirror_switch_us:10000  
zfs_vdev_read_gap_limit:32768  
zfs_vdev_scheduler:noop  
zfs_vdev_scrub_max_active:2  
zfs_vdev_scrub_min_active:1  
zfs_vdev_sync_read_max_active:10  
zfs_vdev_sync_read_min_active:10  
zfs_vdev_sync_write_max_active:10  
zfs_vdev_sync_write_min_active:10  
zfs_vdev_write_gap_limit:4096  
zfs_zevent_cols:80  
zfs_zevent_console:0  
zfs_zevent_len_max:768  
zil_replay_disable:0  
zil_slog_limit:1048576  
zio_bulk_flags:0  
zio_delay_max:30000  
zio_injection_enabled:0  
zio_requeue_io_start_cut_in_line:1  
zvol_inhibit_dev:0  
zvol_major:230  
zvol_max_discard_blocks:16384  
zvol_threads:32  

These parameters are documented in:

man /usr/share/man/man5/zfs-module-parameters.5.gz  

zpool properties:

# zpool get all zp1  
NAME  PROPERTY               VALUE                  SOURCE  
zp1   size                   40T                    -  
zp1   capacity               2%                     -  
zp1   altroot                -                      default  
zp1   health                 ONLINE                 -  
zp1   guid                   15254203672861282738   default  
zp1   version                -                      default  
zp1   bootfs                 -                      default  
zp1   delegation             on                     default  
zp1   autoreplace            off                    default  
zp1   cachefile              -                      default  
zp1   failmode               wait                   default  
zp1   listsnapshots          off                    default  
zp1   autoexpand             off                    default  
zp1   dedupditto             0                      default  
zp1   dedupratio             1.00x                  -  
zp1   free                   39.0T                  -  
zp1   allocated              995G                   -  
zp1   readonly               off                    -  
zp1   ashift                 12                     local  
zp1   comment                -                      default  
zp1   expandsize             0                      -  
zp1   freeing                0                      default  
zp1   feature@async_destroy  enabled                local  
zp1   feature@empty_bpobj    active                 local  
zp1   feature@lz4_compress   active                 local  

zfs dataset properties:

# zfs get all zp1/data_a0  
NAME         PROPERTY              VALUE                  SOURCE  
zp1/data_a0  type                  filesystem             -  
zp1/data_a0  creation              Thu Dec 18 10:30 2014  -  
zp1/data_a0  used                  98.8G                  -  
zp1/data_a0  available             34.1T                  -  
zp1/data_a0  referenced            98.8G                  -  
zp1/data_a0  compressratio         1.00x                  -  
zp1/data_a0  mounted               yes                    -  
zp1/data_a0  quota                 none                   default  
zp1/data_a0  reservation           none                   default  
zp1/data_a0  recordsize            128K                   default  
zp1/data_a0  mountpoint            /data_a0               local  
zp1/data_a0  sharenfs              off                    default  
zp1/data_a0  checksum              on                     default  
zp1/data_a0  compression           off                    local  
zp1/data_a0  atime                 off                    inherited from zp1  
zp1/data_a0  devices               on                     default  
zp1/data_a0  exec                  on                     default  
zp1/data_a0  setuid                on                     default  
zp1/data_a0  readonly              off                    default  
zp1/data_a0  zoned                 off                    default  
zp1/data_a0  snapdir               hidden                 default  
zp1/data_a0  aclinherit            restricted             default  
zp1/data_a0  canmount              on                     default  
zp1/data_a0  xattr                 sa                     local  
zp1/data_a0  copies                1                      default  
zp1/data_a0  version               5                      -  
zp1/data_a0  utf8only              off                    -  
zp1/data_a0  normalization         none                   -  
zp1/data_a0  casesensitivity       sensitive              -  
zp1/data_a0  vscan                 off                    default  
zp1/data_a0  nbmand                off                    default  
zp1/data_a0  sharesmb              off                    default  
zp1/data_a0  refquota              none                   default  
zp1/data_a0  refreservation        none                   default  
zp1/data_a0  primarycache          metadata               local  
zp1/data_a0  secondarycache        all                    local  
zp1/data_a0  usedbysnapshots       0                      -  
zp1/data_a0  usedbydataset         98.8G                  -  
zp1/data_a0  usedbychildren        0                      -  
zp1/data_a0  usedbyrefreservation  0                      -  
zp1/data_a0  logbias               latency                default  
zp1/data_a0  dedup                 off                    default  
zp1/data_a0  mlslabel              none                   default  
zp1/data_a0  sync                  standard               default  
zp1/data_a0  refcompressratio      1.00x                  -  
zp1/data_a0  written               98.8G                  -  
zp1/data_a0  logicalused           98.7G                  -  
zp1/data_a0  logicalreferenced     98.7G                  -  
zp1/data_a0  snapdev               hidden                 default  
zp1/data_a0  acltype               off                    default  
zp1/data_a0  context               none                   default  
zp1/data_a0  fscontext             none                   default  
zp1/data_a0  defcontext            none                   default  
zp1/data_a0  rootcontext           none                   default  
zp1/data_a0  relatime              off                    default  

Solving this probably has to start with the ARC.

For background on how the ARC works, see:

https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/

man zfs-module-parameters  

ARC tuning case studies:

http://dtrace.org/blogs/brendan/2014/02/11/another-10-performance-wins/

https://www.cupfighter.net/2013/03/default-nexenta-zfs-settings-you-want-to-change-part-2

On machines with a lot of memory it is advisable to lower the ARC shrink shift (so that a single shrink reclaims only about 100 MB):

       zfs_arc_shrink_shift (int)  
                   log2(fraction of arc to reclaim)  
                   Default value: 5.  

The default is 5, i.e. 1/32. If the ARC has grown close to the 384 GB of RAM, that is up to 12 GB, and shrinking 12 GB of ARC in one pass can hang for a long time.

It is recommended to bring this down to roughly 100 MB per shrink; setting zfs_arc_shrink_shift = 11 means 1/2048, which here works out to about 187.5 MB.
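
The amount reclaimed in one pass is roughly arc_size >> zfs_arc_shrink_shift. A quick sketch of that arithmetic, taking the worst case where the ARC has grown to nearly all 384 GB of RAM (binary units, so the result is slightly larger than the 187.5 MB quoted above):

# one ARC shrink reclaims about arc_size / 2^zfs_arc_shrink_shift  
arc_size=$((384 * 1024 * 1024 * 1024))      # worst case: ARC grown to ~all of RAM  
for shift in 5 11; do  
    bytes=$(( arc_size >> shift ))  
    echo "zfs_arc_shrink_shift=$shift -> $bytes bytes ($(( bytes / 1024 / 1024 )) MB) per shrink"  
done  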

Description: Semi-regular spikes in I/O latency on an SSD postgres server.  
Analysis: The customer reported multi-second I/O latency for a server with flash memory-based solid state disks (SSDs). Since this SSD type was new in production, it was feared that there may be a new drive or firmware problem causing high latency. ZFS latency counters, measured at the VFS interface, confirmed that I/O latency was dismal, sometimes reaching 10 seconds for I/O. The DTrace-based iosnoop tool (DTraceToolkit) was used to trace at the block device level, however, no seriously slow I/O was observed from the SSDs. I plotted the iosnoop traces using R for evidence of queueing behind TXG flushes, but they didn’t support that theory either.  
This was difficult to investigate since the slow I/O was intermittent, sometimes only occurring once per hour. Instead of a typical interactive investigation, I developed various ways to log activity from DTrace and kstats, so that clues for the issue could be examined afterwards from the logs. This included capturing which processes were executed using execsnoop, and dumping ZFS metrics from kstat, including arcstats. This showed that various maintenance processes were executing during the hour, and, the ZFS ARC, which was around 210 Gbytes, would sometimes drop by around 6 Gbytes. Having worked performance issues with shrinking ARCs before, I developed a DTrace script to trace ARC reaping along with process execution, and found that it was a match with a cp(1) command. This was part of the maintenance task, which was copying a 30 Gbyte file, hitting the ARC limit and triggering an ARC shrink. Shrinking involves holding ARC hash locks, which can cause latency, especially when shrinking 6 Gbytes worth of buffers. The zfs:zfs_arc_shrink_shift tunable was adjusted to reduce the shrink size, which also made them more frequent. The worst-case I/O improved from 10s to 100ms.  
  
ARC shrink shift  
Every second a process runs which checks if data can be removed from the ARC and evicts it. Default max 1/32nd of the ARC can be evicted at a time. This is limited because evicting large amounts of data from ARC stalls all other processes. Back when 8GB was a lot of memory 1/32nd meant 256MB max at a time. When you have 196GB of memory 1/32nd is 6.3GB, which can cause up to 20-30 seconds of unresponsiveness (depending on the record size).  
This 1/32nd needs to be changed to make sure the max is set to ~100-200MB again, by adding the following to /etc/system:  
set zfs:zfs_arc_shrink_shift=11  
(where 11 is 1/2^11 or 1/2048th, 10 is 1/2^10 or 1/1024th etc. Change depending on amount of RAM in your system).  

Combining the ARC behavior with the delay applied to asynchronous dirty writes, the tuning is as follows:

       zfs_vdev_async_write_active_min_dirty_percent (int)  
                   When  the  pool  has  less  than  zfs_vdev_async_write_active_min_dirty_percent  dirty  data,   use  
                   zfs_vdev_async_write_min_active to limit active async writes.  If the dirty data is between min and  
                   max, the active I/O limit is linearly interpolated. See the section "ZFS I/O SCHEDULER".  
                   Default value: 30.  
       zfs_vdev_async_write_active_max_dirty_percent (int)  
                   When  the  pool  has  more  than  zfs_vdev_async_write_active_max_dirty_percent  dirty  data,   use  
                   zfs_vdev_async_write_max_active to limit active async writes.  If the dirty data is between min and  
                   max, the active I/O limit is linearly interpolated. See the section "ZFS I/O SCHEDULER".  
                   Default value: 60.  
       zfs_vdev_async_write_max_active (int)  
                    Maximum asynchronous write I/Os active to each device.  See the section "ZFS I/O SCHEDULER".  
                   Default value: 10.  
       zfs_vdev_async_write_min_active (int)  
                   Minimum asynchronous write I/Os active to each device.  See the section "ZFS I/O SCHEDULER".  
                   Default value: 1.  

The diagram below shows how asynchronous dirty writes are ramped up and throttled. Lowering zfs_vdev_async_write_active_min_dirty_percent shrinks the region where writes are held at the minimum rate, and lowering zfs_vdev_async_write_active_max_dirty_percent makes the maximum rate kick in earlier, so dirty data is flushed faster.

The trade-off is possible I/O contention with synchronous writes.

              |              o---------| <-- zfs_vdev_async_write_max_active  
         ^    |             /^         |  
         |    |            / |         |  
       active |           /  |         |  
        I/O   |          /   |         |  
       count  |         /    |         |  
              |        /     |         |  
              |-------o      |         | <-- zfs_vdev_async_write_min_active  
             0|_______^______|_________|  
              0%      |      |       100% of zfs_dirty_data_max  
                      |      |  
                      |      ‘-- zfs_vdev_async_write_active_max_dirty_percent  
                      ‘--------- zfs_vdev_async_write_active_min_dirty_percent  
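
A minimal sketch of the linear interpolation described in the man page excerpt, using the default values quoted above (dirty_pct stands for the current dirty data as a percentage of zfs_dirty_data_max):

# active async write limit as a function of the dirty-data percentage  
min_active=1;  max_active=10     # zfs_vdev_async_write_{min,max}_active  
min_pct=30;    max_pct=60        # zfs_vdev_async_write_active_{min,max}_dirty_percent  
async_write_limit() {  
    local dirty_pct=$1  
    if   [ "$dirty_pct" -le "$min_pct" ]; then echo "$min_active"  
    elif [ "$dirty_pct" -ge "$max_pct" ]; then echo "$max_active"  
    else echo $(( min_active + (max_active - min_active) * (dirty_pct - min_pct) / (max_pct - min_pct) ))  
    fi  
}  
for p in 10 30 45 60 90; do  
    echo "dirty=${p}% -> up to $(async_write_limit $p) active async writes per vdev"  
done  

Lowering min_pct and max_pct (to 10 and 30, as is done further below) simply shifts this ramp to the left, so the full async write rate is reached while less dirty data has accumulated.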

On the other hand, we also need to cap the ARC size (arc max, not to be confused with the dirty data max).

Since the database already takes a large share of memory, an uncapped ZFS ARC would grow without restraint.

Some articles suggest limiting the ARC to 40% of total memory. (Total memory is 384 GB; PostgreSQL shared_buffers takes 20 GB.)

http://blog.163.com/digoal@126/blog/static/163877040201462204333503

So what exactly should it be set to?

Look at the current state, with the database already running:

# free  
             total       used       free     shared    buffers     cached  
Mem:     396856808  228812456  168044352   21633868      58744   45380060  

The system has about 168 GB of free memory.

The ARC is already using about 20 GB:

# cat /proc/spl/kstat/zfs/arcstats |grep size  
size                            4    19751851104  

So, out of the current free memory, if I keep another 48 GB in reserve for the OS and the database, ZFS still has about 120 GB available.

Adding the roughly 20 GB already in use, ZFS can use about 140 GB.

Set the ARC max to 140 GB (roughly 40% of total memory):

# echo 140000000000 > /sys/module/zfs/parameters/zfs_arc_max  
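
The sizing above can be written down as a small calculation (a sketch; the 48 GB reserve is the judgment call made here, not a ZFS rule):

# suggested zfs_arc_max = current ARC size + (free memory - reserve for OS/DB)  
reserve_gb=48  
free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)  
arc_bytes=$(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats)  
suggested=$(( arc_bytes + free_kb * 1024 - reserve_gb * 1024 * 1024 * 1024 ))  
echo "suggested zfs_arc_max = $suggested bytes"  
# with the numbers above: ~20 GB + (168 GB - 48 GB) = ~140 GB  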

Next, set the dirty-data related parameters.

Lower zfs_dirty_data_max to 1/5 of the ARC max = 28000000000 (this can be adjusted dynamically).

Speed up asynchronous writes:

zfs_vdev_async_write_active_min_dirty_percent=10  
zfs_vdev_async_write_active_max_dirty_percent=30  (must be less than zfs_delay_min_dirty_percent)  
zfs_delay_min_dirty_percent=60  

After adjusting these dynamically, it is also recommended to set them as module load parameters:

# cd /sys/module/zfs/parameters/  
# echo 140000000000 >zfs_arc_max  
# echo 28000000000 >zfs_dirty_data_max  
# echo 10 > zfs_vdev_async_write_active_min_dirty_percent  
# echo 30 > zfs_vdev_async_write_active_max_dirty_percent  
# echo 60 > zfs_delay_min_dirty_percent  
# echo 11 > zfs_arc_shrink_shift  

ZFS module load options:

# vi /etc/modprobe.d/zfs.conf  
options zfs zfs_arc_max=140000000000  
options zfs zfs_dirty_data_max=28000000000  
options zfs zfs_vdev_async_write_active_min_dirty_percent=10  
options zfs zfs_vdev_async_write_active_max_dirty_percent=30  
options zfs zfs_delay_min_dirty_percent=60  
options zfs zfs_arc_shrink_shift=11  

Observation period...

Still the same picture: memory gets used up, and then the CPU spikes just as before.

However, per-process memory consumption is normal, while the sar data further down shows that it is the cache (kbcached) that keeps growing:

# ps -e --width=1024 -o pid,%mem,rss,size,sz,vsz,cmd --sort rss  
rss        RSS      resident set size, the non-swapped physical memory that a task has used (in kiloBytes).  
                    (alias rssize, rsz).  
size       SZ       approximate amount of swap space that would be required if the process were to dirty all writable  
                    pages and then be swapped out. This number is very rough!  
sz         SZ       size in physical pages of the core image of the process. This includes text, data, and stack  
                    space. Device mappings are currently excluded; this is subject to change. See vsz and rss.  
vsz        VSZ      virtual memory size of the process in KiB (1024-byte units). Device mappings are currently  
                    excluded; this is subject to change. (alias vsize).  
  
06:10:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  
06:20:01 AM 219447748 177409060     44.70     23196  22965260  27351828      6.75  
06:30:01 AM 219304016 177552792     44.74     24628  23080756  27348820      6.75  
06:40:01 AM 218698000 178158808     44.89     26276  23638736  27365764      6.75  
06:50:01 AM 218454732 178402076     44.95     27588  23852552  27365664      6.75  
07:00:01 AM 218211060 178645748     45.02     28840  24066384  27365736      6.75  
07:10:01 AM 218006588 178850220     45.07     30144  24231036  27366528      6.75  
07:20:01 AM 217784072 179072736     45.12     31424  24412084  27365496      6.75  
07:30:01 AM 217128620 179728188     45.29     32752  24970064  27370048      6.75  
07:40:01 AM 216704964 180151844     45.39     34372  25331396  27369700      6.75  
07:50:01 AM 216372456 180484352     45.48     35740  25610760  27371348      6.75  
08:00:01 AM 216028392 180828416     45.57     37060  25890136  27393748      6.76  
08:10:01 AM 214706196 182150612     45.90     38808  27120088  27400288      6.76  
08:20:01 AM 213981920 182874888     46.08     42712  27798924  27413000      6.76  
08:30:01 AM 213551104 183305704     46.19     44268  28193028  27411516      6.76  
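
To double-check that no process group accounts for the disappearing memory, the resident set sizes can be summed per command name (a rough sketch; memory held by the kernel, including the ZFS ARC and slab, never shows up in RSS):

# total RSS per command name, largest first  
ps -e -o rss=,comm= | awk '{ rss[$2] += $1 }  
    END { for (c in rss) printf "%10.1f MB  %s\n", rss[c]/1024, c }' | sort -nr | head  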

Adjust the kernel's tendency to reclaim cache:

vfs_cache_pressure  
------------------  
  
This percentage value controls the tendency of the kernel to reclaim  
the memory which is used for caching of directory and inode objects.  
  
At the default value of vfs_cache_pressure=100 the kernel will attempt to  
reclaim dentries and inodes at a "fair" rate with respect to pagecache and  
swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer  
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will  
never reclaim dentries and inodes due to memory pressure and this can easily  
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100  
causes the kernel to prefer to reclaim dentries and inodes.  
  
Increasing vfs_cache_pressure significantly beyond 100 may have negative  
performance impact. Reclaim code needs to take various locks to find freeable  
directory and inode objects. With vfs_cache_pressure=1000, it will look for  
ten times more freeable objects than there are.  
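
The value can be changed on the fly for testing, for example:

# sysctl -w vm.vfs_cache_pressure=1  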

Even with vfs_cache_pressure set to 1, the cache apparently still keeps growing.

Since dirty data is not involved, there is no need to adjust the kernel's dirty-data parameters either:

# cat /proc/meminfo |grep -i -E "dirt|back"  
Dirty:                 0 kB  
Writeback:             0 kB  
WritebackTmp:          0 kB  
  
==============================================================  
  
dirty_background_bytes  
  
Contains the amount of dirty memory at which the background kernel  
flusher threads will start writeback.  
  
If dirty_background_bytes is written, dirty_background_ratio becomes a function  
of its value (dirty_background_bytes / the amount of dirtyable system memory).  
  
==============================================================  
  
dirty_background_ratio  
  
Contains, as a percentage of total system memory, the number of pages at which  
the background kernel flusher threads will start writing out dirty data.  
  
==============================================================  
  
dirty_bytes  
  
Contains the amount of dirty memory at which a process generating disk writes  
will itself start writeback.  
  
If dirty_bytes is written, dirty_ratio becomes a function of its value  
(dirty_bytes / the amount of dirtyable system memory).  
  
Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any  
value lower than this limit will be ignored and the old configuration will be  
retained.  
  
==============================================================  
  
dirty_expire_centisecs  
  
This tunable is used to define when dirty data is old enough to be eligible  
for writeout by the kernel flusher threads.  It is expressed in 100'ths  
of a second.  Data which has been dirty in-memory for longer than this  
interval will be written out next time a flusher thread wakes up.  
  
==============================================================  
  
dirty_ratio  
  
Contains, as a percentage of total system memory, the number of pages at which  
a process which is generating disk writes will itself start writing out dirty  
data.  
  
==============================================================  
  
dirty_writeback_centisecs  
  
The kernel flusher threads will periodically wake up and write `old' data  
out to disk.  This tunable expresses the interval between those wakeups, in  
100'ths of a second.  
  
Setting this to zero disables periodic writeback altogether.  

For now, add a script that automatically frees the cache during idle hours.

/usr/share/doc/kernel-doc-2.6.32/Documentation/sysctl/vm.txt  
drop_caches  
  
Writing to this will cause the kernel to drop clean caches, dentries and  
inodes from memory, causing that memory to become free.  
  
To free pagecache:  
        echo 1 > /proc/sys/vm/drop_caches  
To free dentries and inodes:  
        echo 2 > /proc/sys/vm/drop_caches  
To free pagecache, dentries and inodes:  
        echo 3 > /proc/sys/vm/drop_caches  
  
As this is a non-destructive operation and dirty objects are not freeable, the  
user should run `sync' first.  
  
crontab -e  
30 4 * * * /usr/local/bin/free.sh >>/tmp/free.log 2>&1  
  
# cat /usr/local/bin/free.sh  
#!/bin/bash  
  
. /root/.bash_profile  
. /etc/profile  
  
echo "`date '+%F %T'` start drop cache."  
free  
# dirty objects are not freeable, so flush them to disk first  
sync  
echo 3 > /proc/sys/vm/drop_caches  
echo "`date '+%F %T'` end drop cache."  
free  

The final set of adjusted parameters is shown below.

The load returned to normal.

Reduce the dirty-data limits and flush dirty data more frequently.

Change the ARC to store only metadata, not data pages.

sysctl -w vm.zone_reclaim_mode=1  
sysctl -w vm.dirty_background_bytes=102400000  
sysctl -w vm.dirty_bytes=102400000  
sysctl -w vm.dirty_expire_centisecs=10  
sysctl -w vm.dirty_writeback_centisecs=10  
sysctl -w vm.swappiness=0  
sysctl -w vm.vfs_cache_pressure=80  
  
# vi /etc/sysctl.conf  
vm.zone_reclaim_mode=1  
vm.dirty_background_bytes=102400000  
vm.dirty_bytes=102400000  
vm.dirty_expire_centisecs=10  
vm.dirty_writeback_centisecs=10  
vm.swappiness=0  
vm.vfs_cache_pressure=80  
  
  
# cd /sys/module/zfs/parameters/  
# cat zfs_arc_max   
10240000000  

Looking at the ARC statistics in /proc/spl/kstat/zfs/arcstats, metadata uses less than 2 GB, so giving it 10 GB is about right.

If that turns out not to be enough, it can be raised later.
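
For example, the metadata figure can be pulled straight out of arcstats (the meta_size field name matches the output below):

# grep '^meta_size' /proc/spl/kstat/zfs/arcstats  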

meta_size                       4    1952531968  
  
# cat /etc/modprobe.d/zfs.conf   
options zfs zfs_arc_max=10240000000  
options zfs zfs_dirty_data_max=800000000  
options zfs zfs_vdev_async_write_active_min_dirty_percent=10  
options zfs zfs_vdev_async_write_active_max_dirty_percent=30  
options zfs zfs_delay_min_dirty_percent=60  
options zfs zfs_arc_shrink_shift=11  

primarycache is set to metadata because Linux already maintains its own page cache, so there is no point caching the same data at multiple levels.

ZFS has the same double-caching issue as PostgreSQL, unless direct I/O is used.

# zfs set primarycache=metadata zp1  
# zfs set primarycache=metadata zp1/data_a0  
# zfs set primarycache=metadata zp1/data_a1  
# zfs set primarycache=metadata zp1/data_b0  
# zfs set primarycache=metadata zp1/data_b1  
# zfs set primarycache=metadata zp1/data_c0  
# zfs set primarycache=metadata zp1/data_c1  
# zfs set primarycache=metadata zp1/data_ssd0  
# zfs set primarycache=metadata zp1/data_ssd1  

Set the record size to match the database block size:

# zfs set recordsize=16k zp1/data_a0    # to match wal_block_size=16k  
# zfs set recordsize=8k zp1/data_a0     # to match block_size=8k  

References

1. http://blog.163.com/digoal@126/blog/static/163877040201392641033482

2. http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance

3. https://github.com/zfsonlinux/zfs/issues/258

4. http://blog.163.com/digoal@126/blog/static/163877040201462204333503

5. https://github.com/spacelama

6. https://github.com/mharsch

7. https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/

8. man zfs-module-parameters (rpm -ql zfs)

9. http://dtrace.org/blogs/brendan/2014/02/11/another-10-performance-wins/

10. https://www.cupfighter.net/2013/03/default-nexenta-zfs-settings-you-want-to-change-part-2

11. /proc/spl/*

