xfs logdev 完美解决cgroup iops限制时ext4 data=writeback才能解决的问题
背景
Linux下面ext4和xfs都是带journal的文件系统,在写metadata前,必须先写metadata的journal。
(journal类似数据库的REDO LOG,可用于灾难恢复)
metadata则包含了文件系统的inodes, directories, indirect blocks信息。创建文件(含目录),改变文件大小,改变文件修改时间都涉及metadata的写操作。
在ext4,xfs文件系统中,metadata的journal操作是串行的,这点和redo log也类似。
cgroup的blkio模块,可以控制进程对指定块设备的读,写IOPS,吞吐率等。
当我们对iops进行限制时,由于”metadata的journal操作是串行的”,可能遇到相互干扰的问题。
例如:
有1个块设备,找到它的major,minor号。
#ll /dev/mapper/aliflash-lv0*
lrwxrwxrwx 1 root root 7 Jan 7 11:12 /dev/mapper/aliflash-lv01 -> ../dm-0
#ll /dev/dm-0
brw-rw---- 1 root disk 253, 0 Jan 7 11:22 /dev/dm-0
在这个块设备上创建xfs或ext4文件系统,并挂载到/data01。
初始化两个PostgreSQL数据库实例,分别放在/data01不同目录中。
限制其中一个PostgreSQL集群对(253:0)这个块设备的写IOPS到100。
ps -ewf|grep postgres
digoal 24259 1 0 12:58 pts/4 00:00:00 /home/digoal/pgsql9.5/bin/postgres -- 监听1921
digoal 24260 24259 0 12:58 ? 00:00:00 postgres: logger process
digoal 24262 24259 0 12:58 ? 00:00:00 postgres: checkpointer process
digoal 24263 24259 0 12:58 ? 00:00:00 postgres: writer process
digoal 24264 24259 0 12:58 ? 00:00:00 postgres: wal writer process
digoal 24265 24259 0 12:58 ? 00:00:00 postgres: autovacuum launcher process
digoal 24266 24259 0 12:58 ? 00:00:00 postgres: stats collector process
digoal 24293 1 0 12:58 pts/4 00:00:00 /home/digoal/pgsql9.5/bin/postgres -D /data01/digoal/pg_root -- 监听1922
digoal 24294 24293 0 12:58 ? 00:00:00 postgres: logger process
digoal 24296 24293 0 12:58 ? 00:00:20 postgres: checkpointer process
digoal 24297 24293 0 12:58 ? 00:00:00 postgres: writer process
digoal 24298 24293 0 12:58 ? 00:00:00 postgres: wal writer process
digoal 24299 24293 0 12:58 ? 00:00:00 postgres: autovacuum launcher process
digoal 24300 24293 0 12:58 ? 00:00:00 postgres: stats collector process
限制1921实例的IOPS
cd /sys/fs/cgroup/blkio/
mkdir cg1
cd cg1
echo "253:0 100" > blkio.throttle.write_iops_device
echo 24259 > tasks
echo 24260 > tasks
echo 24262 > tasks
echo 24263 > tasks
echo 24264 > tasks
echo 24265 > tasks
echo 24266 > tasks
开启一个会大量修改metadata的压测。使用create database即可。
(create database会大量的COPY模板库的数据文件,调用fsync。从而产生大量的metadata修改的动作,触发metadata journal的修改。)
vi test.sh
#!/bin/bash
for ((i=0;i<100;i++))
do
psql -h 127.0.0.1 -p 1921 -c "create database $i"
done
. ./test.sh
观察块设备IOPS,被写IOPS限制在100了。
iostat -x 1
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.03 3.12 0.00 96.84
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 0.00 100.00 0.00 1600.00 16.00 0.00 0.00 0.00 0.00
现在连到1922的实例,去压测性能:
pgbench -i -s 100 -h 127.0.0.1 -p 1922
pgbench -M prepared -n -r -P 1 -c 96 -j 96 -T 100 -h 127.0.0.1 -p 1922
progress: 1.0 s, 33.0 tps, lat 2.841 ms stddev 1.746
progress: 2.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 3.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 4.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 5.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 6.0 s, 197.2 tps, lat 2884.437 ms stddev 2944.982
progress: 7.0 s, 556.6 tps, lat 33.527 ms stddev 34.798
progress: 8.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 9.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 10.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 11.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 12.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 13.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 14.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 15.0 s, 0.0 tps, lat -nan ms stddev -nan
可以看到,1922的性能受到了1921的影响,实际上块设备的IO能力有几十万。
为什么?
由于metadata journal是串行操作的,当1921实例操作metadata journal变慢之后,影响了1922实例对该文件系统的metadata journal的操作。
甚至select 1;这种操作都会受到影响,原因是每次front process与PostgreSQL在建立连接时,需要创建一个临时catalog文件global/pg_internal.init.pid。
跟踪第二个数据库实例的postmaster进程
[root@digoal ~]# strace -T -f -p 24293 >./conn 2>&1
连接第二个数据库实例
postgres@digoal-> strace -T psql -h 127.0.0.1 -p 1922
execve("/opt/pgsql/bin/psql", ["psql", "-h", "127.0.0.1", "-p", "1922"], [/* 34 vars */]) = 0 <0.009976>
brk(0) = 0x1747000 <0.000007>
...
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) // 会卡在这里
此时在系统中可以看到startup进程,是postmaster fork出来的,注意这个进程号,和后面的conn文件对应起来。
[root@digoal postgresql-9.4.4]# ps -efw|grep start
postgres 46147 24293 0 19:43 ? 00:00:00 postgres: postgres postgres 127.0.0.1(17947) startup
strace -T psql -h 127.0.0.1 -p 1922的输出截取:
setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0 <0.000008>
connect(3, {sa_family=AF_INET, sin_port=htons(1922), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000943>
poll([{fd=3, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}]) <0.000011>
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000124>
getsockname(3, {sa_family=AF_INET, sin_port=htons(17947), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000015>
poll([{fd=3, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}]) <0.000008>
sendto(3, "\0\0\0\10\4\322\26/", 8, MSG_NOSIGNAL, NULL, 0) = 8 <0.000050>
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}]) <0.000600>
recvfrom(3, "N", 16384, 0, NULL, NULL) = 1 <0.000010>
poll([{fd=3, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}]) <0.000007>
sendto(3, "\0\0\0T\0\3\0\0user\0postgres\0database\0p"..., 84, MSG_NOSIGNAL, NULL, 0) = 84 <0.000020>
poll响应时间达到了67秒
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}]) <67.436925> , 响应时间达到了67秒
recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0\32application_name\0p"..., 16384, 0, NULL, NULL) = 322 <0.000017>
当建立连接后,查看postmaster进程的跟踪情况。可以看到startup进程46147,这个进程调用write花了66秒,因为这次调用write时触发了写修改metadata的动作。
[root@digoal ~]# grep "pid 46147" conn|less
[pid 46147] mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f1403d000 <0.000012>
[pid 46147] unlink("global/pg_internal.init.46147") = -1 ENOENT (No such file or directory) <0.000059>
[pid 46147] open("global/pg_internal.init.46147", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 14 <0.000068>
[pid 46147] fstat(14, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000013>
[pid 46147] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f1403c000 <0.000020>
[pid 46147] write(14, "f2W\0008\1\0\0\0\0\0\0\200\6\0\0\0\0\0\0U2\0\0\0\0\0\0\0\0\0\0"..., 4096 <unfinished ...>
[pid 46147] <... write resumed> ) = 4096 <66.317440>
[pid 46147] --- SIGALRM (Alarm clock) @ 0 (0) ---
找到对应的代码:
write_relcache_init_file@src/backend/utils/cache/relcache.c
重新跟踪这个c文件:
[root@digoal ~]# cat trc.stp
global f_start[999999]
probe process("/opt/pgsql/bin/postgres").function("*@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c").call {
f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
}
probe process("/opt/pgsql/bin/postgres").function("*@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c").return {
t=gettimeofday_ms()
a=execname()
b=cpu()
c=pid()
d=pp()
e=tid()
if (f_start[a,c,e,b] && t-f_start[a,c,e,b]>1) {
# printf("time:%d, execname:%s, pp:%s, par:%s\n", t - f_start[a,c,e,b], a, d, $$locals$$)
printf("time:%d, execname:%s, pp:%s\n", t - f_start[a,c,e,b], a, d)
}
}
因为startup进程是动态生成的,所以只能这样跟踪:
[root@digoal ~]# cat t.sh
#!/bin/bash
for ((i=0;i<1;i=0)) do
PID=`ps -ewf|grep start|grep -v grep|awk '{print $2}'`
stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x $PID
done
重新跟踪如下:
postgres@digoal-> strace -T psql -h 127.0.0.1 -p 1922
[root@digoal ~]# . ./t.sh
Pass 1: parsed user script and 111 library script(s) using 209296virt/36828res/3172shr/34516data kb, in 180usr/20sys/199real ms.
Pass 2: analyzed script: 102 probe(s), 7 function(s), 4 embed(s), 1 global(s) using 223800virt/51400res/4172shr/49020data kb, in 80usr/60sys/142real ms.
Pass 3: translated to C into "/tmp/stapbw7MDq/stap_b17f8a3318ccf4b972f4b84491bbdc1e_54060_src.c" using 223800virt/51744res/4504shr/49020data kb, in 10usr/40sys/57real ms.
Pass 4: compiled C into "stap_b17f8a3318ccf4b972f4b84491bbdc1e_54060.ko" in 1440usr/370sys/1640real ms.
Pass 5: starting run.
time:6134, execname:postgres, pp:process("/opt/pgsql9.4.4/bin/postgres").function("write_item@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c:4979").return
time:3, execname:postgres, pp:process("/opt/pgsql9.4.4/bin/postgres").function("write_item@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c:4979").return
time:6, execname:postgres, pp:process("/opt/pgsql9.4.4/bin/postgres").function("write_item@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c:4979").return
......
以上问题怎么解决?如何隔离数据库实例的IOPS不会产生相互干扰呢?
解决办法1
不同的实例使用不同的文件系统。
例如
mkfs.ext4 /dev/mapper/vgdata01-lv01
mkfs.ext4 /dev/mapper/vgdata01-lv02
mount /dev/mapper/vgdata01-lv01 /data01
mount /dev/mapper/vgdata01-lv02 /data02
两个数据库实例分别放在/data01和/data02
限制/dev/mapper/vgdata01-lv01的IOPS,不会影响另一个文件系统。
这种方法的弊端:如果实例数很多,需要拆分成很多个小的文件系统,不适合空间弹性管理和共用。
解决办法2
针对ext4
正常情况下写数据的顺序如果你要修改metadata,必须确保metadata对应的块改变已经落盘,因此可能出现写metadata被迫要刷dirty data page的情况。
如果dirty data page刷盘很慢,就会导致metadata写受堵。而写metadata journal又是串行的,势必影响其他进程对metadata journal的修改。
使用 data=writeback 加载ext4文件系统。
这个方法的原理是写metadata前,不需要等待data写完。从而可能出现metadata是新的,但是数据是旧的情况。(例如inode是新的,data是旧的,可能某些inode引用的块不存在或者是旧的已删除的块)
写metadata不等待写data,好处就是串行操作不好因为data受堵塞而堵塞。
data={journal|ordered|writeback}
Specifies the journalling mode for file data. Metadata is always journaled. To use modes other than ordered on the root filesystem, pass the mode to the kernel as boot parameter, e.g. root-
flags=data=journal.
journal
All data is committed into the journal prior to being written into the main filesystem.
ordered
This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.
writeback
Data ordering is not preserved - data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It
guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.
弊端,文件系统或操作系统crash后,可能导致metadata和data不一致,出现脏块。
解决办法3
将用作journal块设备独立出来,在限制IOPS时,不限制journal块设备的IO(因为metadata journal的操作很少,也很快,没有必要限制),只限制data块设备的IO。
这种方法只适合xfs文件系统。
ext4文件系统使用这种方法未达到效果,ext4分开journal dev方法如下,但是没有效果,你可以尝试一下。
创建逻辑卷,一个放DATA,一个放journal
#pvcreate /dev/dfa
#pvcreate /dev/dfb
#pvcreate /dev/dfc
#vgcreate aliflash /dev/dfa /dev/dfb /dev/dfc
#lvcreate -i 3 -I 8 -L 1T -n lv01 aliflash
#lvcreate -i 3 -I 8 -L 2G -n lv02 aliflash
创建journal块设备
#mkfs.ext4 -O journal_dev -b 4096 /dev/mapper/aliflash-lv02
mke2fs 1.41.12 (17-May-2010)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=6 blocks
0 inodes, 525312 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:
Zeroing journal device: done
创建ext4文件系统
#mkfs.ext4 -E stride=16,stripe-width=48 -J device=/dev/mapper/aliflash-lv02 /dev/mapper/aliflash-lv01
mke2fs 1.41.12 (17-May-2010)
Using journal device's blocksize: 4096
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=16 blocks, Stripe width=48 blocks
67117056 inodes, 268437504 blocks
13421875 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
8193 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Writing inode tables: done
Adding journal to device /dev/mapper/aliflash-lv02: done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 31 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
#ll /dev/mapper/aliflash-lv0*
lrwxrwxrwx 1 root root 7 Jan 7 11:12 /dev/mapper/aliflash-lv01 -> ../dm-0
lrwxrwxrwx 1 root root 7 Jan 7 11:12 /dev/mapper/aliflash-lv02 -> ../dm-1
#ll /dev/dm-0
brw-rw---- 1 root disk 253, 0 Jan 7 11:22 /dev/dm-0
#ll /dev/dm-1
brw-rw---- 1 root disk 253, 1 Jan 7 11:22 /dev/dm-1
挂载文件系统
#mount -o nobarrier,noatime,nodiratime,discard,defaults,nodelalloc /dev/mapper/aliflash-lv01 /data01
使用本文开头的方法,只限制/dev/mapper/vgdata01-lv01的IOPS,测试不能解决问题。
XFS文件系统使用journal dev的方法
# mkfs.xfs -f -b size=4096 -l logdev=/dev/mapper/vgdata01-lv02,size=2136997888,sunit=16 -d agcount=9000,sunit=16,swidth=48 /dev/mapper/vgdata01-lv01
# mount -t xfs -o nobarrier,nolargeio,logbsize=262144,noatime,nodiratime,swalloc,logdev=/dev/mapper/vgdata01-lv02 /dev/mapper/vgdata01-lv01 /data01
使用本文开头的方法,只限制/dev/mapper/vgdata01-lv01的IOPS,测试,问题得到解决。
被限制IOPS的实例,IO使用确实只能达到限制值,所以在不改变原有需求的情况下,实现了相互不干扰。