xfs logdev 完美解决cgroup iops限制时ext4 data=writeback才能解决的问题

6 minute read

背景

Linux下面ext4和xfs都是带journal的文件系统,在写metadata前,必须先写metadata的journal。

(journal类似数据库的REDO LOG,可用于灾难恢复)

metadata则包含了文件系统的inodes, directories, indirect blocks信息。创建文件(含目录),改变文件大小,改变文件修改时间都涉及metadata的写操作。

在ext4,xfs文件系统中,metadata的journal操作是串行的,这点和redo log也类似。

cgroup的blkio模块,可以控制进程对指定块设备的读,写IOPS,吞吐率等。

当我们对iops进行限制时,由于”metadata的journal操作是串行的”,可能遇到相互干扰的问题。

例如:

有1个块设备,找到它的major,minor号。

#ll /dev/mapper/aliflash-lv0*  
lrwxrwxrwx 1 root root 7 Jan  7 11:12 /dev/mapper/aliflash-lv01 -> ../dm-0  
#ll /dev/dm-0  
brw-rw---- 1 root disk 253, 0 Jan  7 11:22 /dev/dm-0  

在这个块设备上创建xfs或ext4文件系统,并挂载到/data01。

初始化两个PostgreSQL数据库实例,分别放在/data01不同目录中。

限制其中一个PostgreSQL集群对(253:0)这个块设备的写IOPS到100。

ps -ewf|grep postgres  
digoal 24259     1  0 12:58 pts/4    00:00:00 /home/digoal/pgsql9.5/bin/postgres  -- 监听1921  
digoal 24260 24259  0 12:58 ?        00:00:00 postgres: logger process              
digoal 24262 24259  0 12:58 ?        00:00:00 postgres: checkpointer process        
digoal 24263 24259  0 12:58 ?        00:00:00 postgres: writer process              
digoal 24264 24259  0 12:58 ?        00:00:00 postgres: wal writer process          
digoal 24265 24259  0 12:58 ?        00:00:00 postgres: autovacuum launcher process     
digoal 24266 24259  0 12:58 ?        00:00:00 postgres: stats collector process     
digoal 24293     1  0 12:58 pts/4    00:00:00 /home/digoal/pgsql9.5/bin/postgres -D /data01/digoal/pg_root  -- 监听1922  
digoal 24294 24293  0 12:58 ?        00:00:00 postgres: logger process                                          
digoal 24296 24293  0 12:58 ?        00:00:20 postgres: checkpointer process                                    
digoal 24297 24293  0 12:58 ?        00:00:00 postgres: writer process                                          
digoal 24298 24293  0 12:58 ?        00:00:00 postgres: wal writer process                                      
digoal 24299 24293  0 12:58 ?        00:00:00 postgres: autovacuum launcher process                             
digoal 24300 24293  0 12:58 ?        00:00:00 postgres: stats collector process   

限制1921实例的IOPS

cd /sys/fs/cgroup/blkio/  
mkdir cg1  
cd cg1  
echo "253:0 100" > blkio.throttle.write_iops_device  
echo 24259 > tasks  
echo 24260 > tasks  
echo 24262 > tasks  
echo 24263 > tasks  
echo 24264 > tasks  
echo 24265 > tasks  
echo 24266 > tasks  

开启一个会大量修改metadata的压测。使用create database即可。

(create database会大量的COPY模板库的数据文件,调用fsync。从而产生大量的metadata修改的动作,触发metadata journal的修改。)

vi test.sh  
#!/bin/bash  
  
for ((i=0;i<100;i++))  
do  
psql -h 127.0.0.1 -p 1921 -c "create database $i"  
done  
  
. ./test.sh  

观察块设备IOPS,被写IOPS限制在100了。

iostat -x 1  
avg-cpu:  %user   %nice %system %iowait  %steal   %idle  
           0.00    0.00    0.03    3.12    0.00   96.84  
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util  
dm-0              0.00     0.00    0.00  100.00     0.00  1600.00    16.00     0.00    0.00   0.00   0.00  

现在连到1922的实例,去压测性能:

pgbench -i -s 100 -h 127.0.0.1 -p 1922  
  
pgbench -M prepared -n -r -P 1 -c 96 -j 96 -T 100 -h 127.0.0.1 -p 1922  
progress: 1.0 s, 33.0 tps, lat 2.841 ms stddev 1.746  
progress: 2.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 3.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 4.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 5.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 6.0 s, 197.2 tps, lat 2884.437 ms stddev 2944.982  
progress: 7.0 s, 556.6 tps, lat 33.527 ms stddev 34.798  
progress: 8.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 9.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 10.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 11.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 12.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 13.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 14.0 s, 0.0 tps, lat -nan ms stddev -nan  
progress: 15.0 s, 0.0 tps, lat -nan ms stddev -nan  

可以看到,1922的性能受到了1921的影响,实际上块设备的IO能力有几十万。

为什么?

由于metadata journal是串行操作的,当1921实例操作metadata journal变慢之后,影响了1922实例对该文件系统的metadata journal的操作。

甚至select 1;这种操作都会受到影响,原因是每次front process与PostgreSQL在建立连接时,需要创建一个临时catalog文件global/pg_internal.init.pid。

跟踪第二个数据库实例的postmaster进程

[root@digoal ~]# strace -T -f -p 24293 >./conn 2>&1  

连接第二个数据库实例

postgres@digoal-> strace -T psql -h 127.0.0.1 -p 1922  
execve("/opt/pgsql/bin/psql", ["psql", "-h", "127.0.0.1", "-p", "1922"], [/* 34 vars */]) = 0 <0.009976>  
brk(0)                                  = 0x1747000 <0.000007>  
...  
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) // 会卡在这里  

此时在系统中可以看到startup进程,是postmaster fork出来的,注意这个进程号,和后面的conn文件对应起来。

[root@digoal postgresql-9.4.4]# ps -efw|grep start  
postgres 46147 24293  0 19:43 ?        00:00:00 postgres: postgres postgres 127.0.0.1(17947) startup  
  
strace -T psql -h 127.0.0.1 -p 1922的输出截取:  
setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0 <0.000008>  
connect(3, {sa_family=AF_INET, sin_port=htons(1922), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000943>  
poll([{fd=3, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}]) <0.000011>  
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000124>  
getsockname(3, {sa_family=AF_INET, sin_port=htons(17947), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000015>  
poll([{fd=3, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}]) <0.000008>  
sendto(3, "\0\0\0\10\4\322\26/", 8, MSG_NOSIGNAL, NULL, 0) = 8 <0.000050>  
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}]) <0.000600>  
recvfrom(3, "N", 16384, 0, NULL, NULL)  = 1 <0.000010>  
poll([{fd=3, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}]) <0.000007>  
sendto(3, "\0\0\0T\0\3\0\0user\0postgres\0database\0p"..., 84, MSG_NOSIGNAL, NULL, 0) = 84 <0.000020>  
  
poll响应时间达到了67秒  
  
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}]) <67.436925>  , 响应时间达到了67秒  
recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0\32application_name\0p"..., 16384, 0, NULL, NULL) = 322 <0.000017>  

当建立连接后,查看postmaster进程的跟踪情况。可以看到startup进程46147,这个进程调用write花了66秒,因为这次调用write时触发了写修改metadata的动作。

[root@digoal ~]# grep "pid 46147" conn|less  
[pid 46147] mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f1403d000 <0.000012>  
[pid 46147] unlink("global/pg_internal.init.46147") = -1 ENOENT (No such file or directory) <0.000059>  
[pid 46147] open("global/pg_internal.init.46147", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 14 <0.000068>  
[pid 46147] fstat(14, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000013>  
[pid 46147] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f1403c000 <0.000020>  
[pid 46147] write(14, "f2W\0008\1\0\0\0\0\0\0\200\6\0\0\0\0\0\0U2\0\0\0\0\0\0\0\0\0\0"..., 4096 <unfinished ...>  
[pid 46147] <... write resumed> )       = 4096 <66.317440>  
[pid 46147] --- SIGALRM (Alarm clock) @ 0 (0) ---  

找到对应的代码:

write_relcache_init_file@src/backend/utils/cache/relcache.c

重新跟踪这个c文件:

[root@digoal ~]# cat trc.stp   
global f_start[999999]  
  
probe process("/opt/pgsql/bin/postgres").function("*@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c").call {   
  f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()  
}  
  
probe process("/opt/pgsql/bin/postgres").function("*@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c").return {   
  t=gettimeofday_ms()  
  a=execname()  
  b=cpu()  
  c=pid()  
  d=pp()  
  e=tid()  
  if (f_start[a,c,e,b] && t-f_start[a,c,e,b]>1) {  
#    printf("time:%d, execname:%s, pp:%s, par:%s\n", t - f_start[a,c,e,b], a, d, $$locals$$)  
    printf("time:%d, execname:%s, pp:%s\n", t - f_start[a,c,e,b], a, d)  
  }  
    
}  

因为startup进程是动态生成的,所以只能这样跟踪:

[root@digoal ~]# cat t.sh  
#!/bin/bash  
  
for ((i=0;i<1;i=0)) do  
PID=`ps -ewf|grep start|grep -v grep|awk '{print $2}'`  
stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x $PID  
done  

重新跟踪如下:

postgres@digoal-> strace -T psql -h 127.0.0.1 -p 1922  
  
[root@digoal ~]# . ./t.sh  
Pass 1: parsed user script and 111 library script(s) using 209296virt/36828res/3172shr/34516data kb, in 180usr/20sys/199real ms.  
Pass 2: analyzed script: 102 probe(s), 7 function(s), 4 embed(s), 1 global(s) using 223800virt/51400res/4172shr/49020data kb, in 80usr/60sys/142real ms.  
Pass 3: translated to C into "/tmp/stapbw7MDq/stap_b17f8a3318ccf4b972f4b84491bbdc1e_54060_src.c" using 223800virt/51744res/4504shr/49020data kb, in 10usr/40sys/57real ms.  
Pass 4: compiled C into "stap_b17f8a3318ccf4b972f4b84491bbdc1e_54060.ko" in 1440usr/370sys/1640real ms.  
Pass 5: starting run.  
time:6134, execname:postgres, pp:process("/opt/pgsql9.4.4/bin/postgres").function("write_item@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c:4979").return  
time:3, execname:postgres, pp:process("/opt/pgsql9.4.4/bin/postgres").function("write_item@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c:4979").return  
time:6, execname:postgres, pp:process("/opt/pgsql9.4.4/bin/postgres").function("write_item@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c:4979").return  
......  

以上问题怎么解决?如何隔离数据库实例的IOPS不会产生相互干扰呢?

解决办法1

不同的实例使用不同的文件系统。

例如

mkfs.ext4 /dev/mapper/vgdata01-lv01  
mkfs.ext4 /dev/mapper/vgdata01-lv02  
mount /dev/mapper/vgdata01-lv01 /data01  
mount /dev/mapper/vgdata01-lv02 /data02  

两个数据库实例分别放在/data01和/data02

限制/dev/mapper/vgdata01-lv01的IOPS,不会影响另一个文件系统。

这种方法的弊端:如果实例数很多,需要拆分成很多个小的文件系统,不适合空间弹性管理和共用。

解决办法2

针对ext4

正常情况下写数据的顺序如果你要修改metadata,必须确保metadata对应的块改变已经落盘,因此可能出现写metadata被迫要刷dirty data page的情况。

pic

如果dirty data page刷盘很慢,就会导致metadata写受堵。而写metadata journal又是串行的,势必影响其他进程对metadata journal的修改。

使用 data=writeback 加载ext4文件系统。

这个方法的原理是写metadata前,不需要等待data写完。从而可能出现metadata是新的,但是数据是旧的情况。(例如inode是新的,data是旧的,可能某些inode引用的块不存在或者是旧的已删除的块)

写metadata不等待写data,好处就是串行操作不好因为data受堵塞而堵塞。

       data={journal|ordered|writeback}  
              Specifies the journalling mode for file data.  Metadata is always journaled.  To use modes other than ordered on the root filesystem, pass the mode to  the  kernel  as  boot  parameter,  e.g.   root-  
              flags=data=journal.  
  
              journal  
                     All data is committed into the journal prior to being written into the main filesystem.  
  
              ordered  
                     This is the default mode.  All data is forced directly out to the main file system prior to its metadata being committed to the journal.  
  
              writeback  
                     Data  ordering  is not preserved - data may be written into the main filesystem after its metadata has been committed to the journal.  This is rumoured to be the highest-throughput option.  It  
                     guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.  

弊端,文件系统或操作系统crash后,可能导致metadata和data不一致,出现脏块。

解决办法3

将用作journal块设备独立出来,在限制IOPS时,不限制journal块设备的IO(因为metadata journal的操作很少,也很快,没有必要限制),只限制data块设备的IO。

这种方法只适合xfs文件系统。

ext4文件系统使用这种方法未达到效果,ext4分开journal dev方法如下,但是没有效果,你可以尝试一下。

创建逻辑卷,一个放DATA,一个放journal

#pvcreate /dev/dfa  
#pvcreate /dev/dfb  
#pvcreate /dev/dfc  
#vgcreate aliflash /dev/dfa /dev/dfb /dev/dfc  
#lvcreate -i 3 -I 8 -L 1T -n lv01 aliflash  
#lvcreate -i 3 -I 8 -L 2G -n lv02 aliflash  

创建journal块设备

#mkfs.ext4 -O journal_dev -b 4096 /dev/mapper/aliflash-lv02  
mke2fs 1.41.12 (17-May-2010)  
Discarding device blocks: done                              
Filesystem label=  
OS type: Linux  
Block size=4096 (log=2)  
Fragment size=4096 (log=2)  
Stride=2 blocks, Stripe width=6 blocks  
0 inodes, 525312 blocks  
0 blocks (0.00%) reserved for the super user  
First data block=0  
0 block group  
32768 blocks per group, 32768 fragments per group  
0 inodes per group  
Superblock backups stored on blocks:   
  
Zeroing journal device: done     

创建ext4文件系统

#mkfs.ext4 -E stride=16,stripe-width=48 -J device=/dev/mapper/aliflash-lv02 /dev/mapper/aliflash-lv01  
mke2fs 1.41.12 (17-May-2010)  
Using journal device's blocksize: 4096  
Discarding device blocks: done                              
Filesystem label=  
OS type: Linux  
Block size=4096 (log=2)  
Fragment size=4096 (log=2)  
Stride=16 blocks, Stripe width=48 blocks  
67117056 inodes, 268437504 blocks  
13421875 blocks (5.00%) reserved for the super user  
First data block=0  
Maximum filesystem blocks=4294967296  
8193 block groups  
32768 blocks per group, 32768 fragments per group  
8192 inodes per group  
Superblock backups stored on blocks:   
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,   
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,   
        102400000, 214990848  
  
Writing inode tables: done                              
Adding journal to device /dev/mapper/aliflash-lv02: done  
Writing superblocks and filesystem accounting information: done  
  
This filesystem will be automatically checked every 31 mounts or  
180 days, whichever comes first.  Use tune2fs -c or -i to override.  
  
#ll /dev/mapper/aliflash-lv0*  
lrwxrwxrwx 1 root root 7 Jan  7 11:12 /dev/mapper/aliflash-lv01 -> ../dm-0  
lrwxrwxrwx 1 root root 7 Jan  7 11:12 /dev/mapper/aliflash-lv02 -> ../dm-1  
#ll /dev/dm-0  
brw-rw---- 1 root disk 253, 0 Jan  7 11:22 /dev/dm-0  
#ll /dev/dm-1  
brw-rw---- 1 root disk 253, 1 Jan  7 11:22 /dev/dm-1  

挂载文件系统

#mount -o nobarrier,noatime,nodiratime,discard,defaults,nodelalloc /dev/mapper/aliflash-lv01 /data01  

使用本文开头的方法,只限制/dev/mapper/vgdata01-lv01的IOPS,测试不能解决问题。

XFS文件系统使用journal dev的方法

# mkfs.xfs -f -b size=4096 -l logdev=/dev/mapper/vgdata01-lv02,size=2136997888,sunit=16 -d agcount=9000,sunit=16,swidth=48 /dev/mapper/vgdata01-lv01   
# mount -t xfs -o nobarrier,nolargeio,logbsize=262144,noatime,nodiratime,swalloc,logdev=/dev/mapper/vgdata01-lv02 /dev/mapper/vgdata01-lv01 /data01  

使用本文开头的方法,只限制/dev/mapper/vgdata01-lv01的IOPS,测试,问题得到解决。

被限制IOPS的实例,IO使用确实只能达到限制值,所以在不改变原有需求的情况下,实现了相互不干扰。

Flag Counter

digoal’s 大量PostgreSQL文章入口