ZFS case: top shows 100% sys CPU, triggered when free memory runs out.
Background
Recently a system kept hitting load spikes: the load would suddenly climb to several hundred and then drop back down.
Correlating the spike times with the database logs, the corresponding periods contain large numbers of entries like the following:
"UPDATE waiting",2015-01-09 01:38:47 CST,979/7,2927976054,LOG,00000,"process 26366 still waiting for ExclusiveLock on extension of relation 686062002 of database 35078604 after 1117.676 ms",,,,,,"
"INSERT waiting",2015-01-09 01:38:36 CST,541/8,2927976307,LOG,00000,"process 25936 still waiting for ExclusiveLock on extension of relation 686062002 of database 35078604 after 1219.762 ms",,,,,,"
"INSERT waiting",2015-01-09 01:38:48 CST,1018/64892,2929458056,LOG,00000,"process 26439 still waiting for ExclusiveLock on extension of relation 686061993 of database 35078604 after 1000.105 ms",
.........
The waits are on relation extension of a few objects:
select 686062002::regclass;
regclass
-----------------------------
pg_toast.pg_toast_686061993
(1 row)
select relname from pg_class where reltoastrelid=686062002;
relname
-------------------------------------
tbl_xxx_20150109
(1 row)
Time: 4.643 ms
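While the waits are happening, one way to see which backends are queued on relation extension is to query pg_locks for locktype 'extend'. A minimal sketch, run with psql in the affected database (the database name mydb is a placeholder, not from this case):
# psql -d mydb -c "select pid, relation::regclass, mode, granted from pg_locks where locktype = 'extend' order by granted;"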
At the same time, dmesg on the system showed:
postgres: page allocation failure. order:1, mode:0x20
Pid: 20427, comm: postgres Tainted: P --------------- 2.6.32-504.el6.x86_64 #1
Call Trace:
<IRQ> [<ffffffff8113438a>] ? __alloc_pages_nodemask+0x74a/0x8d0
[<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170
[<ffffffff81173332>] ? kmem_getpages+0x62/0x170
[<ffffffff81173f4a>] ? fallback_alloc+0x1ba/0x270
[<ffffffff8117399f>] ? cache_grow+0x2cf/0x320
[<ffffffff81173cc9>] ? ____cache_alloc_node+0x99/0x160
[<ffffffff81174c4b>] ? kmem_cache_alloc+0x11b/0x190
[<ffffffff8144c768>] ? sk_prot_alloc+0x48/0x1c0
[<ffffffff8144d992>] ? sk_clone+0x22/0x2e0
[<ffffffff814a1b76>] ? inet_csk_clone+0x16/0xd0
[<ffffffff814bb713>] ? tcp_create_openreq_child+0x23/0x470
[<ffffffff814b8ecd>] ? tcp_v4_syn_recv_sock+0x4d/0x310
[<ffffffff814bb4b6>] ? tcp_check_req+0x226/0x460
[<ffffffff814b890b>] ? tcp_v4_do_rcv+0x35b/0x490
[<ffffffffa0207557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
[<ffffffff814ba1a2>] ? tcp_v4_rcv+0x522/0x900
[<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
[<ffffffff81496ded>] ? ip_local_deliver_finish+0xdd/0x2d0
[<ffffffff81497078>] ? ip_local_deliver+0x98/0xa0
[<ffffffff8149653d>] ? ip_rcv_finish+0x12d/0x440
[<ffffffff81496ac5>] ? ip_rcv+0x275/0x350
[<ffffffff8145c88b>] ? __netif_receive_skb+0x4ab/0x750
[<ffffffff81460588>] ? netif_receive_skb+0x58/0x60
[<ffffffff81460690>] ? napi_skb_finish+0x50/0x70
[<ffffffff81461f69>] ? napi_gro_receive+0x39/0x50
[<ffffffffa01a7d91>] ? igb_poll+0x981/0x1010 [igb]
[<ffffffff814b59c0>] ? tcp_delack_timer+0x0/0x270
[<ffffffff814b3af9>] ? tcp_send_ack+0xd9/0x120
[<ffffffff81462083>] ? net_rx_action+0x103/0x2f0
[<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0
[<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170
[<ffffffff8107d90f>] ? __do_softirq+0x11f/0x1e0
[<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
[<ffffffff8100fc15>] ? do_softirq+0x65/0xa0
[<ffffffff8107d765>] ? irq_exit+0x85/0x90
[<ffffffff81533b45>] ? do_IRQ+0x75/0xf0
[<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
<EOI> [<ffffffff8116f5f9>] ? compaction_alloc+0x269/0x4b0
[<ffffffff8116f552>] ? compaction_alloc+0x1c2/0x4b0
[<ffffffff811799fa>] ? migrate_pages+0xaa/0x480
[<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
[<ffffffff8116f390>] ? compaction_alloc+0x0/0x4b0
[<ffffffff8116e9ea>] ? compact_zone+0x61a/0xba0
[<ffffffff8116f01c>] ? compact_zone_order+0xac/0x100
[<ffffffff8116f151>] ? try_to_compact_pages+0xe1/0x120
[<ffffffff81133b6a>] ? __alloc_pages_direct_compact+0xda/0x1b0
[<ffffffff81134055>] ? __alloc_pages_nodemask+0x415/0x8d0
[<ffffffff8116c79a>] ? alloc_pages_vma+0x9a/0x150
[<ffffffff8118845d>] ? do_huge_pmd_anonymous_page+0x14d/0x3b0
[<ffffffff8114fdb0>] ? handle_mm_fault+0x2f0/0x300
[<ffffffff8104d0d8>] ? __do_page_fault+0x138/0x480
[<ffffffff8152ae5e>] ? mutex_lock+0x1e/0x50
[<ffffffff8152ffbe>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152d375>] ? page_fault+0x25/0x30
This is a log table with 4 indexes. One variable-length column stores fairly long values (so TOAST storage is involved), for example:
DxxxxxxxxxxxxzwLlyyDd7xGd7^7xxwLDxyD@5xHB7^if5^vv4&DJCEL7xxxCFyhsxxxd4x~j2%$BB%ChkzHlzzvxBwqn5^DDCFexzwC@zyLDz
zC~zyDDzyCbAyyh3M~v5^DDCHvBBy%j0%iL4^fJB%K1xxxB%G1wz~h2M%B4%qn5&7xxwyPs!$xJ!Dd7xCb3^DFCGLnzyzlzyP7zyCJ7x)Lx^xxxxxxxy73$rLB&DND
zL5zy~xxyCt4xPj4%DJCE~DzyP#zyLPzyypxxx3&~DB^P1zzC5zye5wzz10MCb3^Gp4^DLCEiNywi$yzvxBwL73&$F7%7xzwG5zyy5wyah4MbzB%C1DzL9zyf5yyG9z
y!1zyLJxyCt4xPP5%nLB&xxxx
&7xxwjHzzi#yyi$yzi$yzmHIPm^K@CbAzzh5MDBCGLxxxxwz~h5M$JB%DxDzGlzyH5zyL7zzylzyC9AyLxxx7xxwC5AyzNxxxxxx5%Pp0
^~d0&6NzwK1wyzN1M%xxxx^P74^DJCGD7yyvBByiF4&Pt0%~d0&6hwwvDByiF4&et5xxxxxx17%$$1^DBCFGfwxzh2M!j1%qv5^DLCF!NzyH5zzvJBy%h4
%aD4&%v4^61zwDnyyK5wyzN0Mxxxx5$71zwCd7xCv0&fj5&(h1%yNc%mf7%71zwxxx%a@4%rpd^a5d$71zwCt5x!l1$~^l^LDx&K1wzmh5MxxxxxxxxxxGr4&Dvd&$jl
%LBx%K1wzPh5M$Ll^yn1&Ht6+fxxxxxxwC@5xqnxxx&~90^fj5+Oh5%71zwDJ7xH~j&yDh$G@5^7xxxx^Gp4@DLCEf1yw7b0~bn0^%^i&HDzyebz
y6)zyLBzyfdyyvxBwH#5%GP5^nvd&$LB^DN5zqHA^%P1^nLlEDL5%$@n^i#4^$J7%nPn+bzF@Ct4xD~1&GB4^add%7xxw!7zybBxyC@5xqn5xxxxx!DLCEvBxxxd6&!vL@7xx
w)dyECn5&)B1zxxxxxOz)D&HND&C9DOOl1yzpD&xxxxxx1^6hzwLrzyf3zxxx&L73+Gp4@DLCEiNywi$yzvxBw$h5%Hdf&(l9%zh0%nHB^D1zz~7zyvzB
yHl6%jl4!!Fg^rLB%DhDzC5xxxxb7j&aH4^)txxxDzvxBw$v0%DJCEzNxxxxzyzr1y~Lzx(vA&(@zyrtCyy5DxxxHl5OHbDO$3DxxxyyN0MD~1%afd&71z
wDd7xj17xxxxx$)7^7N2wq8*=
A new table is created every day, so data blocks are constantly being extended. In theory extension is fast and should not cause the behavior above; moreover, at the times the problem occurred the data volume and concurrency were normal.
For this kind of wait, see an earlier article about extend-lock waits encountered during bulk loading:
http://blog.163.com/digoal@126/blog/static/163877040201392641033482
That case is not related to the performance case in this article.
It looked like a ZFS problem, and the investigation eventually confirmed it:
free memory kept shrinking, and the moment it reached 0 the load shot up.
Environment:
CentOS 6.x x64
2.6.32-504.el6.x86_64
ZFS version:
zfs-0.6.3-1.1.el6.x86_64
libzfs2-0.6.3-1.1.el6.x86_64
zfs-dkms-0.6.3-1.1.el6.noarch
Server memory: 384 GB
Database: shared_buffers = 20GB, maintenance_work_mem = 2GB, autovacuum_max_workers = 6
Not counting work_mem, the database will use at most about 32 GB of memory.
That leaves well over 300 GB for the OS and ZFS (a quick sketch of the arithmetic follows).
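A rough sketch of the memory budget, using the figures above (the 32 GB assumes shared_buffers plus one maintenance_work_mem per autovacuum worker, which is how the numbers in this post add up):
total_gb=384
shared_buffers_gb=20
maintenance_work_mem_gb=2
autovacuum_max_workers=6
db_gb=$(( shared_buffers_gb + maintenance_work_mem_gb * autovacuum_max_workers ))  # ~32 GB, ignoring work_mem
echo "database: ~${db_gb} GB, left for OS + ZFS: ~$(( total_gb - db_gb )) GB"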
The ZFS module parameters:
cd /sys/module/zfs/parameters
# grep '' *|sort
l2arc_feed_again:1
l2arc_feed_min_ms:200
l2arc_feed_secs:1
l2arc_headroom:2
l2arc_headroom_boost:200
l2arc_nocompress:0
l2arc_noprefetch:1
l2arc_norw:0
l2arc_write_boost:8388608
l2arc_write_max:8388608
metaslab_debug_load:0
metaslab_debug_unload:0
spa_asize_inflation:24
spa_config_path:/etc/zfs/zpool.cache
zfetch_array_rd_sz:1048576
zfetch_block_cap:256
zfetch_max_streams:8
zfetch_min_sec_reap:2
zfs_arc_grow_retry:5
zfs_arc_max:10240000000
zfs_arc_memory_throttle_disable:1
zfs_arc_meta_limit:0
zfs_arc_meta_prune:1048576
zfs_arc_min:0
zfs_arc_min_prefetch_lifespan:1000
zfs_arc_p_aggressive_disable:1
zfs_arc_p_dampener_disable:1
zfs_arc_shrink_shift:5
zfs_autoimport_disable:0
zfs_dbuf_state_index:0
zfs_deadman_enabled:1
zfs_deadman_synctime_ms:1000000
zfs_dedup_prefetch:1
zfs_delay_min_dirty_percent:60
zfs_delay_scale:500000
zfs_dirty_data_max:10240000000
zfs_dirty_data_max_max:101595342848
zfs_dirty_data_max_max_percent:25
zfs_dirty_data_max_percent:10
zfs_dirty_data_sync:67108864
zfs_disable_dup_eviction:0
zfs_expire_snapshot:300
zfs_flags:1
zfs_free_min_time_ms:1000
zfs_immediate_write_sz:32768
zfs_mdcomp_disable:0
zfs_nocacheflush:0
zfs_nopwrite_enabled:1
zfs_no_scrub_io:0
zfs_no_scrub_prefetch:0
zfs_pd_blks_max:100
zfs_prefetch_disable:0
zfs_read_chunk_size:1048576
zfs_read_history:0
zfs_read_history_hits:0
zfs_recover:0
zfs_resilver_delay:2
zfs_resilver_min_time_ms:3000
zfs_scan_idle:50
zfs_scan_min_time_ms:1000
zfs_scrub_delay:4
zfs_send_corrupt_data:0
zfs_sync_pass_deferred_free:2
zfs_sync_pass_dont_compress:5
zfs_sync_pass_rewrite:2
zfs_top_maxinflight:32
zfs_txg_history:0
zfs_txg_timeout:5
zfs_vdev_aggregation_limit:131072
zfs_vdev_async_read_max_active:3
zfs_vdev_async_read_min_active:1
zfs_vdev_async_write_active_max_dirty_percent:60
zfs_vdev_async_write_active_min_dirty_percent:30
zfs_vdev_async_write_max_active:10
zfs_vdev_async_write_min_active:1
zfs_vdev_cache_bshift:16
zfs_vdev_cache_max:16384
zfs_vdev_cache_size:0
zfs_vdev_max_active:1000
zfs_vdev_mirror_switch_us:10000
zfs_vdev_read_gap_limit:32768
zfs_vdev_scheduler:noop
zfs_vdev_scrub_max_active:2
zfs_vdev_scrub_min_active:1
zfs_vdev_sync_read_max_active:10
zfs_vdev_sync_read_min_active:10
zfs_vdev_sync_write_max_active:10
zfs_vdev_sync_write_min_active:10
zfs_vdev_write_gap_limit:4096
zfs_zevent_cols:80
zfs_zevent_console:0
zfs_zevent_len_max:768
zil_replay_disable:0
zil_slog_limit:1048576
zio_bulk_flags:0
zio_delay_max:30000
zio_injection_enabled:0
zio_requeue_io_start_cut_in_line:1
zvol_inhibit_dev:0
zvol_major:230
zvol_max_discard_blocks:16384
zvol_threads:32
These parameters are documented in:
man /usr/share/man/man5/zfs-module-parameters.5.gz
zpool properties:
# zpool get all zp1
NAME PROPERTY VALUE SOURCE
zp1 size 40T -
zp1 capacity 2% -
zp1 altroot - default
zp1 health ONLINE -
zp1 guid 15254203672861282738 default
zp1 version - default
zp1 bootfs - default
zp1 delegation on default
zp1 autoreplace off default
zp1 cachefile - default
zp1 failmode wait default
zp1 listsnapshots off default
zp1 autoexpand off default
zp1 dedupditto 0 default
zp1 dedupratio 1.00x -
zp1 free 39.0T -
zp1 allocated 995G -
zp1 readonly off -
zp1 ashift 12 local
zp1 comment - default
zp1 expandsize 0 -
zp1 freeing 0 default
zp1 feature@async_destroy enabled local
zp1 feature@empty_bpobj active local
zp1 feature@lz4_compress active local
zfs dataset properties:
# zfs get all zp1/data_a0
NAME PROPERTY VALUE SOURCE
zp1/data_a0 type filesystem -
zp1/data_a0 creation Thu Dec 18 10:30 2014 -
zp1/data_a0 used 98.8G -
zp1/data_a0 available 34.1T -
zp1/data_a0 referenced 98.8G -
zp1/data_a0 compressratio 1.00x -
zp1/data_a0 mounted yes -
zp1/data_a0 quota none default
zp1/data_a0 reservation none default
zp1/data_a0 recordsize 128K default
zp1/data_a0 mountpoint /data_a0 local
zp1/data_a0 sharenfs off default
zp1/data_a0 checksum on default
zp1/data_a0 compression off local
zp1/data_a0 atime off inherited from zp1
zp1/data_a0 devices on default
zp1/data_a0 exec on default
zp1/data_a0 setuid on default
zp1/data_a0 readonly off default
zp1/data_a0 zoned off default
zp1/data_a0 snapdir hidden default
zp1/data_a0 aclinherit restricted default
zp1/data_a0 canmount on default
zp1/data_a0 xattr sa local
zp1/data_a0 copies 1 default
zp1/data_a0 version 5 -
zp1/data_a0 utf8only off -
zp1/data_a0 normalization none -
zp1/data_a0 casesensitivity sensitive -
zp1/data_a0 vscan off default
zp1/data_a0 nbmand off default
zp1/data_a0 sharesmb off default
zp1/data_a0 refquota none default
zp1/data_a0 refreservation none default
zp1/data_a0 primarycache metadata local
zp1/data_a0 secondarycache all local
zp1/data_a0 usedbysnapshots 0 -
zp1/data_a0 usedbydataset 98.8G -
zp1/data_a0 usedbychildren 0 -
zp1/data_a0 usedbyrefreservation 0 -
zp1/data_a0 logbias latency default
zp1/data_a0 dedup off default
zp1/data_a0 mlslabel none default
zp1/data_a0 sync standard default
zp1/data_a0 refcompressratio 1.00x -
zp1/data_a0 written 98.8G -
zp1/data_a0 logicalused 98.7G -
zp1/data_a0 logicalreferenced 98.7G -
zp1/data_a0 snapdev hidden default
zp1/data_a0 acltype off default
zp1/data_a0 context none default
zp1/data_a0 fscontext none default
zp1/data_a0 defcontext none default
zp1/data_a0 rootcontext none default
zp1/data_a0 relatime off default
Solving the problem probably has to start with the ARC.
For ARC fundamentals, see:
https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/
man zfs-module-parameters
ARC tuning case studies:
http://dtrace.org/blogs/brendan/2014/02/11/another-10-performance-wins/
https://www.cupfighter.net/2013/03/default-nexenta-zfs-settings-you-want-to-change-part-2
On machines with a lot of memory it is advisable to adjust the ARC shrink shift (so that each shrink reclaims only around 100 MB):
zfs_arc_shrink_shift (int)
log2(fraction of arc to reclaim)
Default value: 5.
The default is 5, i.e. 1/32 of the ARC. With 384 GB of memory that can be as much as 12 GB, and shrinking 12 GB of ARC in one go can hang for quite a while.
It is better to bring this down to roughly 100-200 MB per shrink: zfs_arc_shrink_shift = 11 means 1/2048, about 187.5 MB.
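A quick sketch of the arithmetic behind that recommendation (bytes reclaimed per shrink step = ARC size >> zfs_arc_shrink_shift; the 384 GB figure is the worst case where the ARC has grown close to total RAM):
arc_size=$(( 384 * 1024 * 1024 * 1024 ))   # worst case: ARC close to total RAM
for shift in 5 11; do
  echo "zfs_arc_shrink_shift=$shift -> $(( arc_size >> shift )) bytes per shrink"
done
# shift=5  -> ~12 GiB reclaimed at once (long stalls while ARC locks are held)
# shift=11 -> ~192 MiB reclaimed at once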
Description: Semi-regular spikes in I/O latency on an SSD postgres server.
Analysis: The customer reported multi-second I/O latency for a server with flash memory-based solid state disks (SSDs). Since this SSD type was new in production, it was feared that there may be a new drive or firmware problem causing high latency. ZFS latency counters, measured at the VFS interface, confirmed that I/O latency was dismal, sometimes reaching 10 seconds for I/O. The DTrace-based iosnoop tool (DTraceToolkit) was used to trace at the block device level, however, no seriously slow I/O was observed from the SSDs. I plotted the iosnoop traces using R for evidence of queueing behind TXG flushes, but they didn’t support that theory either.
This was difficult to investigate since the slow I/O was intermittent, sometimes only occurring once per hour. Instead of a typical interactive investigation, I developed various ways to log activity from DTrace and kstats, so that clues for the issue could be examined afterwards from the logs. This included capturing which processes were executed using execsnoop, and dumping ZFS metrics from kstat, including arcstats. This showed that various maintenance processes were executing during the hour, and, the ZFS ARC, which was around 210 Gbytes, would sometimes drop by around 6 Gbytes. Having worked performance issues with shrinking ARCs before, I developed a DTrace script to trace ARC reaping along with process execution, and found that it was a match with a cp(1) command. This was part of the maintenance task, which was copying a 30 Gbyte file, hitting the ARC limit and triggering an ARC shrink. Shrinking involves holding ARC hash locks, which can cause latency, especially when shrinking 6 Gbytes worth of buffers. The zfs:zfs_arc_shrink_shift tunable was adjusted to reduce the shrink size, which also made them more frequent. The worst-case I/O improved from 10s to 100ms.
ARC shrink shift
Every second a process runs which checks if data can be removed from the ARC and evicts it. Default max 1/32nd of the ARC can be evicted at a time. This is limited because evicting large amounts of data from ARC stalls all other processes. Back when 8GB was a lot of memory 1/32nd meant 256MB max at a time. When you have 196GB of memory 1/32nd is 6.3GB, which can cause up to 20-30 seconds of unresponsiveness (depending on the record size).
This 1/32nd needs to be changed to make sure the max is set to ~100-200MB again, by adding the following to /etc/system:
set zfs:zfs_arc_shrink_shift=11
(where 11 is 1/2^11 or 1/2048th, 10 is 1/2^10 or 1/1024th etc. Change depending on amount of RAM in your system).
Combining how the ARC works with the asynchronous dirty-write delay behavior, the tuning is as follows:
zfs_vdev_async_write_active_min_dirty_percent (int)
When the pool has less than zfs_vdev_async_write_active_min_dirty_percent dirty data, use
zfs_vdev_async_write_min_active to limit active async writes. If the dirty data is between min and
max, the active I/O limit is linearly interpolated. See the section "ZFS I/O SCHEDULER".
Default value: 30.
zfs_vdev_async_write_active_max_dirty_percent (int)
When the pool has more than zfs_vdev_async_write_active_max_dirty_percent dirty data, use
zfs_vdev_async_write_max_active to limit active async writes. If the dirty data is between min and
max, the active I/O limit is linearly interpolated. See the section "ZFS I/O SCHEDULER".
Default value: 60.
zfs_vdev_async_write_max_active (int)
Maximum asynchronous write I/Os active to each device. See the section "ZFS I/O SCHEDULER".
Default value: 10.
zfs_vdev_async_write_min_active (int)
Minimum asynchronous write I/Os active to each device. See the section "ZFS I/O SCHEDULER".
Default value: 1.
The diagram below shows how asynchronous dirty writes are throttled and then sped up. Lowering zfs_vdev_async_write_active_min_dirty_percent narrows the region where writes are held to the minimum rate, and lowering zfs_vdev_async_write_active_max_dirty_percent makes the maximum rate kick in earlier, so dirty data gets flushed faster.
The trade-off is possible I/O contention with synchronous writes.
       |              o---------| <-- zfs_vdev_async_write_max_active
  ^    |             /^         |
  |    |            / |         |
active |           /  |         |
 I/O   |          /   |         |
count  |         /    |         |
       |        /     |         |
       |-------o      |         | <-- zfs_vdev_async_write_min_active
      0|_______^______|_________|
       0%      |      |       100% of zfs_dirty_data_max
               |      |
               |      `-- zfs_vdev_async_write_active_max_dirty_percent
               `--------- zfs_vdev_async_write_active_min_dirty_percent
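To make the interpolation described in the man page excerpt and the diagram concrete, here is a small sketch (not ZFS code, just the same piecewise-linear rule with illustrative numbers):
min_active=1;  max_active=10        # zfs_vdev_async_write_{min,max}_active defaults
min_pct=30;    max_pct=60           # dirty-percent thresholds (this post later lowers them to 10/30)
dirty_pct=45                        # hypothetical: dirty data as a % of zfs_dirty_data_max
if   [ "$dirty_pct" -le "$min_pct" ]; then active=$min_active
elif [ "$dirty_pct" -ge "$max_pct" ]; then active=$max_active
else
  active=$(( min_active + (max_active - min_active) * (dirty_pct - min_pct) / (max_pct - min_pct) ))
fi
echo "active async writes per vdev: $active"   # prints 5 for 45% dirty with the defaults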
On the other hand, we also need to set arc max (note: the ARC size limit, not the dirty-data limit).
Since the database also takes a large share of memory, an uncapped ZFS ARC will grow without restraint.
Some articles suggest limiting the ARC to 40% of total memory. (Total memory here is 384 GB; PostgreSQL shared_buffers uses 20 GB.)
http://blog.163.com/digoal@126/blog/static/163877040201462204333503
So what value should it be set to?
Check the current state, with the database already running:
# free
total used free shared buffers cached
Mem: 396856808 228812456 168044352 21633868 58744 45380060
The system has about 168 GB of free memory.
The ARC is already using about 20 GB:
# cat /proc/spl/kstat/zfs/arcstats |grep size
size 4 19751851104
So, given the current free memory, if I keep another 48 GB in reserve for the OS and the database, ZFS still has about 120 GB available.
Adding the 20 GB already in use, ZFS can be allowed about 140 GB.
Set arc max to 140 GB (roughly 40% of total memory):
# echo 140000000000 > /sys/module/zfs/parameters/zfs_arc_max
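To confirm the new ceiling took effect, the arcstats c_max field (the ARC's target maximum) should now report the value just written; a small sketch using the same arcstats file shown earlier:
# awk '$1 == "c_max" || $1 == "size"' /proc/spl/kstat/zfs/arcstats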
Next, the dirty-data parameters.
Lower zfs_dirty_data_max to 1/5 of arc max = 28000000000 (this can be changed on the fly).
Parameters to speed up asynchronous writes:
zfs_vdev_async_write_active_min_dirty_percent=10
zfs_vdev_async_write_active_max_dirty_percent=30 (must stay below zfs_delay_min_dirty_percent)
zfs_delay_min_dirty_percent=60
Apply the changes at runtime first; persisting them as module load options is recommended as well (shown after):
# cd /sys/module/zfs/parameters/
# echo 140000000000 >zfs_arc_max
# echo 28000000000 >zfs_dirty_data_max
# echo 10 > zfs_vdev_async_write_active_min_dirty_percent
# echo 30 > zfs_vdev_async_write_active_max_dirty_percent
# echo 60 > zfs_delay_min_dirty_percent
# echo 11 > zfs_arc_shrink_shift
ZFS module load options:
# vi /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=140000000000
options zfs zfs_dirty_data_max=28000000000
options zfs zfs_vdev_async_write_active_min_dirty_percent=10
options zfs zfs_vdev_async_write_active_max_dirty_percent=30
options zfs zfs_delay_min_dirty_percent=60
options zfs zfs_arc_shrink_shift=11
Observation period.....
Same as before: memory still gets used up, and then CPU shoots up again.
But per-process memory usage is normal:
# ps -e --width=1024 -o pid,%mem,rss,size,sz,vsz,cmd --sort rss
rss RSS resident set size, the non-swapped physical memory that a task has used (in kiloBytes).
(alias rssize, rsz).
size SZ approximate amount of swap space that would be required if the process were to dirty all writable
pages and then be swapped out. This number is very rough!
sz SZ size in physical pages of the core image of the process. This includes text, data, and stack
space. Device mappings are currently excluded; this is subject to change. See vsz and rss.
vsz VSZ virtual memory size of the process in KiB (1024-byte units). Device mappings are currently
excluded; this is subject to change. (alias vsize).
06:10:01 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
06:20:01 AM 219447748 177409060 44.70 23196 22965260 27351828 6.75
06:30:01 AM 219304016 177552792 44.74 24628 23080756 27348820 6.75
06:40:01 AM 218698000 178158808 44.89 26276 23638736 27365764 6.75
06:50:01 AM 218454732 178402076 44.95 27588 23852552 27365664 6.75
07:00:01 AM 218211060 178645748 45.02 28840 24066384 27365736 6.75
07:10:01 AM 218006588 178850220 45.07 30144 24231036 27366528 6.75
07:20:01 AM 217784072 179072736 45.12 31424 24412084 27365496 6.75
07:30:01 AM 217128620 179728188 45.29 32752 24970064 27370048 6.75
07:40:01 AM 216704964 180151844 45.39 34372 25331396 27369700 6.75
07:50:01 AM 216372456 180484352 45.48 35740 25610760 27371348 6.75
08:00:01 AM 216028392 180828416 45.57 37060 25890136 27393748 6.76
08:10:01 AM 214706196 182150612 45.90 38808 27120088 27400288 6.76
08:20:01 AM 213981920 182874888 46.08 42712 27798924 27413000 6.76
08:30:01 AM 213551104 183305704 46.19 44268 28193028 27411516 6.76
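Since the problem only shows up when free memory finally hits zero, a simple watcher that logs free memory alongside the ARC size can pinpoint the moment it happens. A sketch (the log path /tmp/arc_free.log and the 60-second interval are arbitrary choices; MemFree comes from /proc/meminfo and size from the arcstats file shown earlier):
while true; do
  echo "$(date '+%F %T') free_kb=$(awk '/^MemFree:/{print $2}' /proc/meminfo) arc_bytes=$(awk '$1=="size"{print $3}' /proc/spl/kstat/zfs/arcstats)"
  sleep 60
done >> /tmp/arc_free.log &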
Adjust how aggressively the kernel reclaims its caches:
vfs_cache_pressure
------------------
This percentage value controls the tendency of the kernel to reclaim
the memory which is used for caching of directory and inode objects.
At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.
Increasing vfs_cache_pressure significantly beyond 100 may have negative
performance impact. Reclaim code needs to take various locks to find freeable
directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.
Even with this set to 1, the cache still seems to keep growing.
Since the issue has nothing to do with dirty data, the kernel dirty-data parameters don't need adjusting either:
# cat /proc/meminfo |grep -i -E "dirt|back"
Dirty: 0 kB
Writeback: 0 kB
WritebackTmp: 0 kB
==============================================================
dirty_background_bytes
Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.
If dirty_background_bytes is written, dirty_background_ratio becomes a function
of its value (dirty_background_bytes / the amount of dirtyable system memory).
==============================================================
dirty_background_ratio
Contains, as a percentage of total system memory, the number of pages at which
the background kernel flusher threads will start writing out dirty data.
==============================================================
dirty_bytes
Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.
If dirty_bytes is written, dirty_ratio becomes a function of its value
(dirty_bytes / the amount of dirtyable system memory).
Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.
==============================================================
dirty_expire_centisecs
This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in 100'ths
of a second. Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.
==============================================================
dirty_ratio
Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.
==============================================================
dirty_writeback_centisecs
The kernel flusher threads will periodically wake up and write `old' data
out to disk. This tunable expresses the interval between those wakeups, in
100'ths of a second.
Setting this to zero disables periodic writeback altogether.
For now, add a script that frees the caches automatically during off-peak hours.
/usr/share/doc/kernel-doc-2.6.32/Documentation/sysctl/vm.txt
drop_caches
Writing to this will cause the kernel to drop clean caches, dentries and
inodes from memory, causing that memory to become free.
To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
echo 3 > /proc/sys/vm/drop_caches
As this is a non-destructive operation and dirty objects are not freeable, the
user should run `sync' first.
crontab -e
30 4 * * * /usr/local/bin/free.sh >>/tmp/free.log 2>&1
# cat /usr/local/bin/free.sh
#!/bin/bash
. /root/.bash_profile
. /etc/profile
echo "`date +%F%T` start drop cache."
free
sync
echo 3 > /proc/sys/vm/drop_caches
echo "`date +%F%T` end drop cache."
free
The final parameters are listed below; the load returned to normal.
Reduce the dirty-data thresholds and increase the dirty-data flush frequency.
Change the ARC to cache only metadata, not data pages.
sysctl -w vm.zone_reclaim_mode=1
sysctl -w vm.dirty_background_bytes=102400000
sysctl -w vm.dirty_bytes=102400000
sysctl -w vm.dirty_expire_centisecs=10
sysctl -w vm.dirty_writeback_centisecs=10
sysctl -w vm.swappiness=0
sysctl -w vm.vfs_cache_pressure=80
# vi /etc/sysctl.conf
vm.zone_reclaim_mode=1
vm.dirty_background_bytes=102400000
vm.dirty_bytes=102400000
vm.dirty_expire_centisecs=10
vm.dirty_writeback_centisecs=10
vm.swappiness=0
vm.vfs_cache_pressure=80
# cd /sys/module/zfs/parameters/
# cat zfs_arc_max
10240000000
Looking at the ARC statistics in /proc/spl/kstat/zfs/arcstats, metadata uses less than 2 GB, so giving the ARC 10 GB is about right.
If that turns out to be too little, it can be raised later.
meta_size 4 1952531968
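For readability, the same figure converted to GiB (a one-liner sketch against the same arcstats file):
# awk '$1=="meta_size"{printf "%.2f GiB\n", $3/1024/1024/1024}' /proc/spl/kstat/zfs/arcstats   # ~1.82 GiB here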
# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=10240000000
options zfs zfs_dirty_data_max=800000000
options zfs zfs_vdev_async_write_active_min_dirty_percent=10
options zfs zfs_vdev_async_write_active_max_dirty_percent=30
options zfs zfs_delay_min_dirty_percent=60
options zfs zfs_arc_shrink_shift=11
Set primarycache to metadata, because Linux already has its own page cache and there is no need to cache the same data multiple times.
ZFS has the same double-caching problem as PostgreSQL shared buffers, unless direct I/O is used.
# zfs set primarycache=metadata zp1
# zfs set primarycache=metadata zp1/data_a0
# zfs set primarycache=metadata zp1/data_a1
# zfs set primarycache=metadata zp1/data_b0
# zfs set primarycache=metadata zp1/data_b1
# zfs set primarycache=metadata zp1/data_c0
# zfs set primarycache=metadata zp1/data_c1
# zfs set primarycache=metadata zp1/data_ssd0
# zfs set primarycache=metadata zp1/data_ssd1
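The same effect can be had with a loop over every dataset in the pool instead of listing them one by one (a sketch; note that once the property is set on zp1 itself, child datasets inherit it unless they carry a local override):
for ds in $(zfs list -H -o name -r zp1); do
  zfs set primarycache=metadata "$ds"
done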
Set recordsize to match the database block size (16k for a WAL filesystem, since wal_block_size=16k; 8k for the data filesystems, since block_size=8k):
# zfs set recordsize=16k zp1/data_a0
# zfs set recordsize=8k zp1/data_a0
References
1. http://blog.163.com/digoal@126/blog/static/163877040201392641033482
2. http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance
3. https://github.com/zfsonlinux/zfs/issues/258
4. http://blog.163.com/digoal@126/blog/static/163877040201462204333503
5. https://github.com/spacelama
6. https://github.com/mharsch
7. https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/
8. man zfs-module-parameters (installed by the zfs package; see rpm -ql zfs)
9. http://dtrace.org/blogs/brendan/2014/02/11/another-10-performance-wins/
10. https://www.cupfighter.net/2013/03/default-nexenta-zfs-settings-you-want-to-change-part-2
11. /proc/spl/*