如何度量Kernel Resources for PostgreSQL
背景
对于操作系统来说,数据库算是比较大型的应用,往往需要耗费大量的系统资源,特别是在内部进程间通信这块的资源。
操作系统默认的配置可能无法满足数据库对资源使用的需求。
那么应该如何根据数据库的需要,设置操作系统相关资源参数呢?
PostgreSQL 对系统资源的需求计算
在讲资源分配前,大家可以参考阅读一下
https://www.postgresql.org/docs/9.5/static/kernel-resources.html#SYSVIPC
https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
kernel-doc-xxx/Documentation/sysctl/kernel.txt
https://en.wikipedia.org/wiki/UNIX_System_V
这一篇主要讲的是进程间通信
https://docs.oracle.com/cd/E19455-01/806-4750/6jdqdflta/index.html
共享内存和信号量
关于PostgreSQL的共享内存管理,早期的 PostgreSQL 版本共享内存分配只支持sysv的方式,数据库启动时,需要分配一块共享内存段,这个需求主要与配置多大的shared_buffers 有关。
社区在9.3的版本做了一个变动,使用mmap分配共享内存,大幅降低了系统对System V共享内存的需求量。
https://www.postgresql.org/docs/9.3/static/release-9-3.html
但是mmap并不好,会带来额外的IO开销,所以PostgreSQL 9.4开始,又做了一个变动,支持动态分配共享内存,主要是为多核并行计算做的铺垫,而且默认的共享内存分配的方法变成了posix(如果环境支持),同样不需要启动时分配大的共享内存段 。
https://www.postgresql.org/docs/9.4/static/release-9-4.html
从9.4开始,共享内存分配方法通过参数 dynamic_shared_memory_type 控制。
不同的值,创建共享内存段的方法也不一样,例如posix使用shm_open,sysv使用shmget,mmap使用mmap。
正是由于创建共享内存段的方法不一样,所以需要配置的操作系统内核参数也不一样。
dynamic_shared_memory_type (enum)
Specifies the dynamic shared memory implementation that the server should use.
Possible values are : (支持如下)
posix (for POSIX shared memory allocated using shm_open),
sysv (for System V shared memory allocated via shmget),
windows (for Windows shared memory),
mmap (to simulate shared memory using memory-mapped files stored in the data directory),
none (to disable this feature).
Not all values are supported on all platforms;
the first supported option is the default for that platform.
如果不强制指定,默认使用第一种支持的方式。
The use of the mmap option, which is not the default on any platform, is generally discouraged because the operating system may write modified pages back to disk repeatedly, increasing system I/O load;
however, it may be useful for debugging, when the pg_dynshmem directory is stored on a RAM disk, or when other shared memory facilities are not available.
建议不要使用mmap,除非你想调试PG的共享内存。
不同的共享内存分配方法,对操作系统的内核参数配置要求也不一样。
涉及的资源以及计算方法如下
Name | Description | Reasonable values |
---|---|---|
SHMMAX | 单个共享内存段最大允许多大 (bytes) | 见另一张表,或者直接设置为内存的80% |
SHMMIN | 单个共享内存段最小允许多小 (bytes) | 1 |
SHMALL | 整个系统允许分配多少共享内存,(所有共享内存段相加) (bytes or pages) | 需考虑其他需要分配共享内存的应用,确保大于所有应用的需求量,通常可以设置为实际内存大小 |
SHMSEG | 每个进程允许分配多少个共享内存段 | only 1 segment is needed, but the default is much higher, 所以不需要设置 |
SHMMNI | 整个系统允许分配多少个共享内存段 | 需要分配共享内存的进程数 * SHMSEG |
SEMMNI | 允许分配多少组信号量ID (i.e., sets) | at least ceil((max_connections + autovacuum_max_workers + 5) / 16) ,PostgreSQL每16个进程一组 |
SEMMNS | 允许分配多少个信号量 | ceil((max_connections + autovacuum_max_workers + 5) / 16) * 17 plus room for other applications,每组信号量需要17字节,加上其他软件的需求。实际设置时设置为SEMMNI*SEMMSL |
SEMMSL | 每组允许开多少信号量 | at least 17 |
SEMMAP | Number of entries in semaphore map | see text |
SEMVMX | Maximum value of semaphore | at least 1000 (The default is often 32767; do not change unless necessary) |
共享内存段 shmmax 计算方法
-
共享内存 SysV 管理 (适用于 < 9.3 的版本)
https://www.postgresql.org/docs/9.2/static/kernel-resources.html
因此对于9.2以及更低版本的共享内存SysV管理的情况,shmmax的需求计算方法如下,将所有项相加。
Usage | Approximate shared memory bytes required (as of 8.3) |
---|---|
Connections | (1800 + 270 * max_locks_per_transaction) * max_connections |
Autovacuum workers | (1800 + 270 * max_locks_per_transaction) * autovacuum_max_workers |
Prepared transactions | (770 + 270 * max_locks_per_transaction) * max_prepared_transactions |
Shared disk buffers | (block_size + 208) * shared_buffers |
WAL buffers | (wal_block_size + 8) * wal_buffers |
Fixed space requirements | 770 kB |
-
共享内存 SysV 管理 (适用于 >= 9.3 的版本)
对于9.3以及更高版本的PostgreSQL, 即使使用SysV,也不需要这么多共享内存。后面会有实测。
通常需要4KB左右。
-
共享内存 posix, mmap, none 管理
一个PostgreSQL集群只需要56字节(实测)的共享内存段大小
PostgreSQL requires a few bytes of System V shared memory (typically 48 bytes, on 64-bit platforms) for each copy of the server.
On most modern operating systems, this amount can easily be allocated.
However, if you are running many copies of the server, or if other applications are also using System V shared memory :
it may be necessary to increase SHMMAX, the maximum size in bytes of a shared memory segment,
SHMALL, the total amount of System V shared memory system-wide.
Note that SHMALL is measured in pages rather than bytes on many systems.
小结
9.3 以下版本,设置这3个内核参数
(9.3 以及以上版本,需要的shmmax没那么大,所以也可以使用以上设置。 )
kernel.shmall = 实际内存大小 (如果单位为page, bytes/PAGE_SIZE)
kernel.shmmax >= shared_buffer (bytes)
kernel.shmmni >= 实际数据库集群数*2(>=9.4版本,使用SysV时每个PostgreSQL数据库集群需要2个共享内存段)
如果一台服务器中要启动多个PostgreSQL集群,则每个集群都需要
shmmin和shmseg不需要设置,从shmget的开发者手册也可以得到证实
shmmin = 1 # 1字节,但实际最小是1 PAGE(4KB)
shmseg = unlimited
系统页大小(未使用huge page时)
# getconf PAGE_SIZE
4096
man shmget
The following limits on shared memory segment resources affect the shmget() call:
SHMALL System wide maximum of shared memory pages (on Linux, this limit can be read and modified via /proc/sys/kernel/shmall).
SHMMAX Maximum size in bytes for a shared memory segment: policy dependent (on Linux, this limit can be read and modified via /proc/sys/kernel/shmmax).
SHMMIN Minimum size in bytes for a shared memory segment: implementation dependent (currently 1 byte, though PAGE_SIZE is the effective minimum size).
SHMMNI System wide maximum number of shared memory segments: implementation dependent (currently 4096, was 128 before Linux 2.3.99; on Linux, this limit can be read and modified via /proc/sys/kernel/shmmni).
The implementation has no specific limits for the per-process maximum number of shared memory segments (SHMSEG).
信号量 计算方法
信号量的需求,和数据库版本无关,计算方法如下。
-
需要多少组
SEMMNI >= (max_connections + max_worker_processes + autovacuum_max_workers + 5) / 16
-
需要多少信号量
SEMMNS >= ((max_connections + max_worker_processes + autovacuum_max_workers + 5) / 16) * 17 + 其他程序的需求
-
每组需要多少信号量
SEMMSL >= 17
对应系统内核配置举例
# sysctl -w kernel.sem="1234 150994944 512000 7890"
含义分别为
max number of arrays = 7890 对应 semmni
max semaphores per array = 1234 对应 semmsl
max semaphores system wide = 150994944 对应 semmns = semmni*semmsl
max ops per semop call = 512000
如何查看当前系统设置的SysV资源限制
# ipcs -l
------ Messages Limits --------
max queues system wide = 32768
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384
------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 19531250
max total shared memory (kbytes) = 4096000000
min seg size (bytes) = 1
------ Semaphore Limits --------
max number of arrays = 7890
max semaphores per array = 1234
max semaphores system wide = 150994944
max ops per semop call = 512000
semaphore max value = 32767
如何查看已使用的SysV资源
# ipcs -u
------ Messages Status --------
allocated queues = 0
used headers = 0
used space = 0 bytes
------ Shared Memory Status --------
segments allocated 2
pages allocated 2
pages resident 2
pages swapped 0
Swap performance: 0 attempts 0 successes
------ Semaphore Status --------
used arrays = 8
allocated semaphores = 136
实测
-
shmmax与信号量实测
-
共享内存管理方法 posix, mmap, none 实测 shmmax 需求 如下
sysctl -w kernel.shmmax=1024
#
postgresql.conf
shared_buffer=16GB
启动数据库, 查看IPC
$ ipcs
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x001d4fe9 294912 digoal 600 56 5
#
如果sysctl -w kernel.shmmax=48,启动报错,需要56字节
FATAL: could not create shared memory segment: Invalid argument
DETAIL: Failed system call was shmget(key=1921001, size=56, 03600).
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter, or possibly that it is less than your kernel's SHMMIN parameter.
The PostgreSQL documentation contains more information about shared memory configuration.
- 共享内存管理方法 sysv 实测 shmmax 需求 如下
postgresql.conf
dynamic_shared_memory_type = sysv
如果设置低于数据库的需求,会报错
sysctl -w kernel.shmmax=1024
报错
pg_ctl start
server starting
FATAL: could not get shared memory segment: Invalid argument
把shm加到到20GB,
9.5的版本,启动时实际需要的内存并不多,如果你在9.2或者更低版本测试,那会需要很多
sysctl -w kernel.shmmax=2000000000 # 单位byte
sysctl -w kernel.shmall=2000000000 # 单位page
#
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x001d4fe9 360448 digoal 600 56 5
0x6b8b4567 393217 digoal 600 2396 5
PostgreSQL 9.2 shared_buffer=16GB , 启动时需要申请大量的内存.
ipcs
#
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x001d53d1 1114112 digoal 600 17605672960 5
ulimit 设置
ulimit 设置主要是限制进程级别对系统资源的使用。
ulimit [-HSTabcdefilmnpqrstuvx [limit]]
Provides control over the resources available to the shell and to processes started by it, on systems that allow such control. The -H and -S options specify that the hard or soft limit is set for the given resource.
A hard limit cannot be increased by a non-root user once it is set; a soft limit may be increased up to the value of the hard limit. If neither -H nor -S is specified, both the soft and hard limits are set. The value of limit can be a number in the unit specified for the resource or one of the special values hard, soft, or unlimited, which stand for the current hard limit, the current soft limit, and no limit, respectively.
If limit is omitted, the current value of the soft limit of the resource is printed, unless the -H option is given. When more than one resource is specified, the limit name and unit are printed before the value.
Other options are interpreted as follows:
-a All current limits are reported
-b The maximum socket buffer size
-c The maximum size of core files created
-d The maximum size of a process's data segment
-e The maximum scheduling priority ("nice")
-f The maximum size of files written by the shell and its children
-i The maximum number of pending signals
-l The maximum size that may be locked into memory
-m The maximum resident set size (many systems do not honor this limit)
-n The maximum number of open file descriptors (most systems do not allow this value to be set)
-p The pipe size in 512-byte blocks (this may not be set)
-q The maximum number of bytes in POSIX message queues
-r The maximum real-time scheduling priority
-s The maximum stack size
-t The maximum amount of cpu time in seconds
-u The maximum number of processes available to a single user
-v The maximum amount of virtual memory available to the shell and, on some systems, to its children
-x The maximum number of file locks
-T The maximum number of threads
If limit is given, it is the new value of the specified resource (the -a option is display only). If no option is given, then -f is assumed. Values are in 1024-byte increments, except for -t, which is in seconds,
-p, which is in units of 512-byte blocks, and -T, -b, -n, and -u, which are unscaled values. The return status is 0 unless an invalid option or argument is supplied, or an error occurs while setting a new limit.
配置文件举例
/etc/security/limits.conf
#<domain> <type> <item> <value>
#
#Where:
#<domain> can be:
# - a user name
# - a group name, with @group syntax
# - the wildcard *, for default entry
# - the wildcard %, can be also used with %group syntax,
# for maxlogin limit
#
#<type> can have the two values:
# - "soft" for enforcing the soft limits
# - "hard" for enforcing hard limits
#
#<item> can be one of the following:
# - core - limits the core file size (KB)
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open files
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - nproc - max number of processes
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority
#
#<domain> <type> <item> <value>
#
#* soft core 0
#* hard rss 10000
#@student hard nproc 20
#@faculty soft nproc 20
#@faculty hard nproc 50
#ftp hard nproc 0
#@student - maxlogins 4
# End of file
* soft nofile 655360
* hard nofile 655360
* soft nproc 655360
* hard nproc 655360
* soft memlock unlimited
* hard memlock unlimited
* soft core unlimited
* hard core unlimited
查看进程设置
# cat /proc/$PID/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size unlimited unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 655360 655360 processes
Max open files 655360 655360 files
Max locked memory unlimited unlimited bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 513997 513997 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
查看当前用户的limit配置
# ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 513997
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 655360
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 655360
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
PostgreSQL 推荐设置
* soft nofile 655360 # The maximum number of open file descriptors
* hard nofile 655360
* soft nproc 655360 # The maximum number of processes available to a single user
* hard nproc 655360
* soft memlock unlimited # The maximum size that may be locked into memory
* hard memlock unlimited
* soft core unlimited # The maximum size of core files created
* hard core unlimited
core dump 相关内核设置
kernel.core_pattern = /xxx/xxx/core_%e_%u_%t_%s.%p
kernel.core_uses_pid = 1
OOM score adj 设置
PostgreSQL 的守护进程是postgres,如果它挂了,数据库就挂了,其他进程挂了它会负责crash recovery,自动重启数据库(默认设置了 restart_after_crash = on )
所以如果要防止系统OOM时杀掉postgres主进程,需要在启动数据库前,使用root用户设置self脚本进程的oom_score_adj,然后启动数据库。
echo -1000 > /proc/self/oom_score_adj
or
echo -17 > /proc/self/oom_score_adj
例子
# echo -1000 > /proc/self/oom_score_adj
启动是需要在启动环境中设置这两个环境变量
# export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj # 设置postgres主进程oom_score_adj
# export PG_OOM_ADJUST_VALUE=0 # 设置子进程oom_score_adj
# su - digoal -c "export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj;export PG_OOM_ADJUST_VALUE=0;. ~/env.sh; pg_ctl start"
# ps -efw|grep digoal
digoal 2492 1 9 23:22 ? 00:00:00 /home/digoal/pgsql9.5/bin/postgres
digoal 2493 2492 0 23:22 ? 00:00:00 postgres: logger process
digoal 2495 2492 0 23:22 ? 00:00:00 postgres: checkpointer process
digoal 2496 2492 0 23:22 ? 00:00:00 postgres: writer process
digoal 2497 2492 0 23:22 ? 00:00:00 postgres: wal writer process
digoal 2498 2492 0 23:22 ? 00:00:00 postgres: autovacuum launcher process
digoal 2499 2492 0 23:22 ? 00:00:00 postgres: stats collector process
# cat /proc/2492/oom_score_adj
-1000
# cat /proc/2493/oom_score_adj
0
参考
-
https://www.postgresql.org/docs/9.5/static/kernel-resources.html#SYSVIPC
-
https://www.postgresql.org/docs/9.5/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
-
man shm_open, shmget, mmap, semctl, sem_overview
小结
本文主要帮助大家理解PostgreSQL数据库对操作系统资源的需求,以及计算方法。
如果用户需要在一个系统中运行多个数据库集群,则需要将所有集群的需求加起来。
PostgreSQL 9.2以及以前的版本,在数据库启动时对SysV共享内存段的需求很大,所以要设得比较大,需要用户注意。
祝大家玩得开心,欢迎随时来 阿里云促膝长谈业务需求 ,恭候光临。
阿里云的小伙伴们加油,努力 做好内核与服务,打造最贴地气的云数据库 。