Handling Linux page allocation failures - zone_reclaim_mode
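Background
The Linux kernel fails to allocate memory. Symptoms:
Once memory usage reaches a certain level, the system hangs.
dmesg may contain errors like the ones below; the system hangs, cannot be reached over the network, and needs a reboot to recover.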
page allocation failure
Oct 24 11:27:42 kernel: : [21289.479063] python2.6: page allocation failure. order:1, mode:0x20
kernel: swapper: page allocation failure. order:1, mode:0x20
kernel: Pid: 0, comm: swapper Not tainted 2.6.32-358.2.1.el6.x86_64 #1
kernel: Call Trace:
kernel: <IRQ> [<ffffffff8112c207>] ? __alloc_pages_nodemask+0x757/0x8d0
kernel: [<ffffffff81166ab2>] ? kmem_getpages+0x62/0x170
kernel: [<ffffffff811676ca>] ? fallback_alloc+0x1ba/0x270
kernel: [<ffffffff8116711f>] ? cache_grow+0x2cf/0x320
kernel: [<ffffffff81167449>] ? ____cache_alloc_node+0x99/0x160
kernel: [<ffffffff811683cb>] ? kmem_cache_alloc+0x11b/0x190
kernel: [<ffffffff81439d58>] ? sk_prot_alloc+0x48/0x1c0
kernel: [<ffffffff8143ae32>] ? sk_clone+0x22/0x2e0
kernel: [<ffffffff81489d66>] ? inet_csk_clone+0x16/0xd0
kernel: [<ffffffff814a2c73>] ? tcp_create_openreq_child+0x23/0x450
kernel: [<ffffffff814a046d>] ? tcp_v4_syn_recv_sock+0x4d/0x310
kernel: [<ffffffff814a2a16>] ? tcp_check_req+0x226/0x460
kernel: [<ffffffff8149ff0b>] ? tcp_v4_do_rcv+0x35b/0x430
kernel: [<ffffffff81082034>] ? mod_timer+0x144/0x220
kernel: [<ffffffff814a171e>] ? tcp_v4_rcv+0x4fe/0x8d0
kernel: [<ffffffff814a171e>] ? tcp_v4_rcv+0x4fe/0x8d0
kernel: [<ffffffff8147f50d>] ? ip_local_deliver_finish+0xdd/0x2d0
kernel: [<ffffffff8147f798>] ? ip_local_deliver+0x98/0xa0
kernel: [<ffffffff8147ec5d>] ? ip_rcv_finish+0x12d/0x440
kernel: [<ffffffff8147f1e5>] ? ip_rcv+0x275/0x350
kernel: [<ffffffff814483bb>] ? __netif_receive_skb+0x4ab/0x750
kernel: [<ffffffff8144a798>] ? netif_receive_skb+0x58/0x60
kernel: [<ffffffffa008b975>] ? vmxnet3_rq_rx_complete+0x365/0x890 [vmxnet3]
kernel: [<ffffffff8128d2b0>] ? swiotlb_map_page+0x0/0x100
kernel: [<ffffffffa008c0f3>] ? vmxnet3_poll_rx_only+0x43/0xc0 [vmxnet3]
kernel: [<ffffffff8144cf63>] ? net_rx_action+0x103/0x2f0
kernel: [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
kernel: [<ffffffff810e1720>] ? handle_IRQ_event+0x60/0x170
kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
kernel: [<ffffffff81076d95>] ? irq_exit+0x85/0x90
kernel: [<ffffffff81516f15>] ? do_IRQ+0x75/0xf0
kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
kernel: <EOI> [<ffffffff8103b90b>] ? native_safe_halt+0xb/0x10
kernel: [<ffffffff8101495d>] ? default_idle+0x4d/0xb0
kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
kernel: [<ffffffff81506d9c>] ? start_secondary+0x2ac/0x2ef
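The first line of each report carries the two most useful fields: order (the request is for 2^order contiguous pages) and mode (the gfp flags of the request). The following is a minimal sketch for decoding them, not a definitive tool. It assumes a 4 KiB page size and uses only the four gfp bits that I understand the 2.6.32-era kernel to define at 0x10 through 0x80; the function name parse_failure is hypothetical.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB base pages (typical for x86_64)

# Assumption: a subset of 2.6.32-era gfp bits; other bits are reported as unknown.
GFP_BITS = {
    0x10: "__GFP_WAIT",  # caller may sleep
    0x20: "__GFP_HIGH",  # high priority; GFP_ATOMIC in this kernel generation
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}

LINE_RE = re.compile(
    r"page allocation failure\. order:(\d+), mode:(0x[0-9a-fA-F]+)"
)


def parse_failure(line):
    """Decode order and mode from a 'page allocation failure' log line."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    order = int(m.group(1))
    mode = int(m.group(2), 16)
    known = sum(GFP_BITS)  # sum of the dict keys, i.e. a mask of the known bits
    return {
        "order": order,
        "pages": 1 << order,                # contiguous pages requested
        "bytes": (1 << order) * PAGE_SIZE,  # size under the 4 KiB assumption
        "flags": [n for b, n in sorted(GFP_BITS.items()) if mode & b],
        "unknown_bits": mode & ~known,
        "can_sleep": bool(mode & 0x10),     # False means an atomic context
    }


if __name__ == "__main__":
    sample = "kernel: swapper: page allocation failure. order:1, mode:0x20"
    print(parse_failure(sample))
```

For the log above, order:1 means two physically contiguous pages were requested, and mode:0x20 without the wait bit means the caller (here the network receive path in softirq context) could not sleep and wait for reclaim.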
Solution - upgrade the kernel
1. Upgrade to kernel-2.6.32-358.el6 or later. (This does not eliminate the problem completely; it only mitigates it.)
Update to kernel-2.6.32-358.el6 or higher, which contains the enhancement described in the Root Cause section below.
Please note, this update (or newer) does not completely eliminate the possibility of the occurrence of the page allocation failure.
The workaround described below also works on 2.6.32-358.el6 and newer if the issue persists after the update.
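Solution - adjust kernel parameters
To make the settings persistent, add them to /etc/sysctl.conf or to a file under /etc/sysctl.d/:
vi /etc/sysctl.conf or vi /etc/sysctl.d/xxx.conf
vm.zone_reclaim_mode = 1
vm.min_free_kbytes = 512000
To apply them to the running system immediately:
sysctl -w vm.zone_reclaim_mode=1
sysctl -w vm.min_free_kbytes=512000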
The following tunables can be used in an attempt to alleviate or prevent the reported condition:
Increase vm.min_free_kbytes value, for example to a higher value than a single allocation request.
Change vm.zone_reclaim_mode to 1 if it's set to zero, so the system can reclaim back memory from cached memory.
Both settings can be set in /etc/sysctl.conf, and loaded using sysctl -p /etc/sysctl.conf.
For more information on these tunables, install the kernel-doc package and refer to file
/usr/share/doc/kernel-doc-2.6.32/Documentation/sysctl/vm.txt.
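Before changing anything it can help to see what the running kernel currently uses. The sketch below reads the two tunables from /proc/sys/vm and prints sysctl.conf-style lines only for values that differ from the ones used in this article. The helper names read_vm_tunable and pending_changes are hypothetical, and the target values are simply the article's examples, not a recommendation for every system.

```python
import os

# Assumption: target values copied from the example settings in this article.
WANTED = {"zone_reclaim_mode": 1, "min_free_kbytes": 512000}


def read_vm_tunable(name, root="/proc/sys/vm"):
    """Return the integer value of a vm.* tunable, or None if unreadable."""
    try:
        with open(os.path.join(root, name)) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def pending_changes(wanted, root="/proc/sys/vm"):
    """Return sysctl.conf-style lines for tunables whose live value differs."""
    lines = []
    for name, value in sorted(wanted.items()):
        if read_vm_tunable(name, root) != value:
            lines.append("vm.%s = %d" % (name, value))
    return lines


if __name__ == "__main__":
    for line in pending_changes(WANTED):
        print(line)
```

The script only reports differences; applying them is still done with sysctl -w or sysctl -p as described above.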
Root cause
In releases before 6.4, kswapd does not handle this case:
Before RHEL 6.4, kswapd does not try to free contiguous pages.
This can cause GFP_ATOMIC allocation requests to fail repeatedly,
when nothing else in the system defragments memory.
With RHEL 6.4 and newer, kswapd will compact (defragment) free memory, when required.
Please note that allocation failures can still happen.
For example, when a large burst of GFP_ATOMIC allocations occurs that kswapd may struggle to keep up with.
However, these allocations should eventually succeed.
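One way to see whether free memory is fragmented is /proc/buddyinfo, which lists, per node and zone, the number of free blocks of each order starting at order 0. The sketch below is an illustration under assumptions: the function names are hypothetical, and the sample line in the demo is made up, not output from the affected system. A request of order N can be served from any free block of order N or higher, so a zone with plenty of order-0 pages but nothing above them cannot satisfy an order-1 atomic request.

```python
def parse_buddyinfo(text):
    """Parse /proc/buddyinfo text into (node, zone, counts_per_order) tuples."""
    zones = []
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        counts = [int(x) for x in parts[4:]]
        zones.append((node, parts[3], counts))
    return zones


def free_blocks_at_or_above(counts, order):
    """Number of free blocks that could satisfy a request of this order."""
    return sum(counts[order:])


if __name__ == "__main__":
    # Made-up sample: many single pages free, no contiguous blocks.
    sample = "Node 0, zone   Normal   3173   0   0   0   0   0   0   0   0   0   0"
    for node, zone, counts in parse_buddyinfo(sample):
        usable = free_blocks_at_or_above(counts, 1)
        print("node %d zone %s: blocks usable for order 1 = %d" % (node, zone, usable))
```

On a live system the input would come from reading /proc/buddyinfo instead of the sample string.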
There are also other more specific cases that can result in page allocation failures and cause additional issues.
Please refer to the following articles for more information
zone_reclaim_mode explained
Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.
The value is a bitwise OR of the following:
1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
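Because the setting is a bitmask, a value such as 3 means bits 1 and 2 are both set. A minimal sketch of decoding it, using the three bit meanings listed above (the function name describe_zone_reclaim_mode is hypothetical):

```python
# Bit meanings as listed in the kernel documentation quoted above.
ZONE_RECLAIM_BITS = [
    (1, "zone reclaim on"),
    (2, "zone reclaim writes dirty pages out"),
    (4, "zone reclaim swaps pages"),
]


def describe_zone_reclaim_mode(value):
    """Return the list of behaviours enabled by a zone_reclaim_mode value."""
    if value == 0:
        return ["no zone reclaim"]
    return [label for bit, label in ZONE_RECLAIM_BITS if value & bit]


if __name__ == "__main__":
    for v in (0, 1, 3, 7):
        print(v, describe_zone_reclaim_mode(v))
```

The value 1 used in this article therefore enables zone reclaim of easily reusable page cache only, without writing out dirty pages or swapping.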
zone_reclaim_mode is set during bootup to 1 if it is determined that pages
from remote zones will cause a measurable performance reduction. The
page allocator will then reclaim easily reusable pages (those page
cache pages that are currently not used) before allocating off node pages.
0: It may be beneficial to switch off zone reclaim if the system is
used for a file server and all of memory should be used for caching files
from disk. In that case the caching effect is more important than
data locality.
2: Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore, but it preserves the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.
4: Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.
References
http://www.zbuse.com/2014/07/837.html
https://serverfault.com/questions/236170/page-allocation-failure-am-i-running-out-of-memory
https://access.redhat.com/solutions/90883
"Handling Linux page allocation failures - lowmem_reserve_ratio" (companion article)