USE blockdev –setra 0 and systemtap test real BLOCKDEV iops

3 minute read

背景

我在上一篇文章中讨论了关于调整CPU亲和来降低systemtap带来的性能影响, 本文是一个扩展, 还是用到那个例子, 但是我增加了IO响应时间的柱状图输出, 从柱状图分析IO的一些”奇妙”的东西.

首先启动数据库, 亲和设置为1

pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.1/bin/postgres >/dev/null 2>&1  

然后清除OS CACHE

[root@db-172-16-3-150 ~]# sync; echo 3 > /proc/sys/vm/drop_caches  

然后启动psql

pg93@db-172-16-3-150-> psql  
psql (9.3.1)  
Type "help" for help.  
digoal=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
           5167  
(1 row)  

然后启动stap

[root@db-172-16-3-150 ~]# taskset -c 7 stap -e '  
global a  
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__start") {  
  delete a  
  println("query__start ", user_string($arg1), "pid:", pid())  
}  
probe vfs.read.return {  
  t = gettimeofday_ns() - @entry(gettimeofday_ns())  
  # if (execname() == "postgres" && devname != "N/A")  
    a[pid()] <<< t  
}  
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__done") {  
  if (@count(a[pid()]))   
    printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))  
  println("query__done ", user_string($arg1), "pid:", pid())  
  if (@count(a[pid()])) {  
    println(@hist_log(a[pid()]))  
    #println(@hist_linear(a[pid()],1024,4096,100))  
  }  
  delete a  
}' -x 5167  

然后执行一条查询

digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;  
                                                               QUERY PLAN                                                             
      
------------------------------------------------------------------------------------------------------------------------------------  
----  
 Seq Scan on postgres.tbl_cost_align  (cost=0.00..195393.00 rows=10100000 width=45) (actual time=0.683..3252.479 rows=10100000 loops  
=1)  
   Output: id, info, crt_time  
   Buffers: shared read=94393  
 Total runtime: 4325.405 ms  
(4 rows)  

stap的输出如下

query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:5167  
5167**94418**14464  
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:5167  
  value |-------------------------------------------------- count  
   1024 |                                                       0  
   2048 |                                                       0  
   4096 |                                                     109  
   8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  86181  
  16384 |@                                                   2045  
  32768 |                                                      90  
  65536 |@@@                                                 5903  
 131072 |                                                      72  
 262144 |                                                       7  
 524288 |                                                       9  
1048576 |                                                       1  
2097152 |                                                       1  
4194304 |                                                       0  
8388608 |                                                       0  

平均的IO响应时间为14464纳秒.

但是从IO响应时间分布图, 我们看到了比较奇妙的现象.

响应时间在8192纳秒以内的操作数有86181次, 占了所有IO的绝大部分.

而另一较大部分32768 到65536纳秒之间的操作数是5903个.

这是为什么呢? 原因是块设备的预读.

例如本例的tbl_cost_align表对应的磁盘为/dev/sdb

digoal=# \db  
              List of tablespaces  
    Name    |  Owner   |       Location          
------------+----------+-----------------------  
 pg_default | postgres |   
 pg_global  | postgres |   
 tbs_digoal | postgres | /ssd3/pg93/tbs_digoal  
 tbs_idx    | postgres | /ssd4/pg93/tbs_idx  
(4 rows)  
  
digoal=# \d tbl_cost_align   
          Table "postgres.tbl_cost_align"  
  Column  |            Type             | Modifiers   
----------+-----------------------------+-----------  
 id       | integer                     |   
 info     | text                        |   
 crt_time | timestamp without time zone |   
Indexes:  
    "idx_tbl_cost_align_id" btree (id), tablespace "tbs_idx"  
Tablespace: "tbs_idx"  
  
digoal=# \q  
pg93@db-172-16-3-150-> df -h  
Filesystem            Size  Used Avail Use% Mounted on  
/dev/sdc1              29G  7.6G   20G  28% /  
tmpfs                  48G     0   48G   0% /dev/shm  
/dev/sdc3              98G   27G   67G  29% /opt  
/dev/sdd1             183G   18G  157G  10% /ssd1  
/dev/sde1             469G  1.7G  444G   1% /ssd2  
/dev/sda1             221G   12G  198G   6% /ssd3  
/dev/sdb1             221G   26G  184G  13% /ssd4  

sdb的预读是多少呢?

[root@db-172-16-3-150 ~]# blockdev --getra /dev/sdb  
256  

表的大小

digoal=# \dt+ tbl_cost_align   
                          List of relations  
  Schema  |      Name      | Type  |  Owner   |  Size  | Description   
----------+----------------+-------+----------+--------+-------------  
 postgres | tbl_cost_align | table | postgres | 738 MB |   
(1 row)  

在柱状图中的5903个操作, 如果算上预读的话刚好和表的大小差不多.

digoal=# select 5903*256*512/1024/1024.0;  
       ?column?         
----------------------  
 737.8750000000000000  
(1 row)  

柱状图反映的是一个真实的IO获取的响应时间的现象, 同时也体现了块设备readahead在这里所的作用.

接下来我把read ahead设置为0 , 那么每次响应就变得很慢了. 来看结果.

[root@db-172-16-3-150 ~]# blockdev --setra 0 /dev/sdb  
  
[root@db-172-16-3-150 ~]# sync; echo 3 > /proc/sys/vm/drop_caches  
  
pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.1/bin/postgres >/dev/null 2>&1  
  
pg93@db-172-16-3-150-> psql  
psql (9.3.1)  
Type "help" for help.  
digoal=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
           5330  
(1 row)  
  
[root@db-172-16-3-150 ~]# taskset -c 7 stap -e '  
global a  
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__start") {  
  delete a  
  println("query__start ", user_string($arg1), "pid:", pid())  
}  
probe vfs.read.return {  
  t = gettimeofday_ns() - @entry(gettimeofday_ns())  
  # if (execname() == "postgres" && devname != "N/A")  
    a[pid()] <<< t  
}  
probe process("/home/pg93/pgsql9.3.1/bin/postgres").mark("query__done") {  
  if (@count(a[pid()]))   
    printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))  
  println("query__done ", user_string($arg1), "pid:", pid())  
  if (@count(a[pid()])) {  
    println(@hist_log(a[pid()]))  
    #println(@hist_linear(a[pid()],1024,4096,100))  
  }  
  delete a  
}' -x 5330  
  
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;  
                                                               QUERY PLAN                                                             
       
------------------------------------------------------------------------------------------------------------------------------------  
-----  
 Seq Scan on postgres.tbl_cost_align  (cost=0.00..195393.00 rows=10100000 width=45) (actual time=1.115..28424.672 rows=10100000 loop  
s=1)  
   Output: id, info, crt_time  
   Buffers: shared read=94393  
 Total runtime: 29471.988 ms  
(4 rows)  
  
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:5330  
5330**94418**277500  
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:5330  
  value |-------------------------------------------------- count  
   2048 |                                                       0  
   4096 |                                                       0  
   8192 |                                                       7  
  16384 |                                                       0  
  32768 |                                                       1  
  65536 |                                                       0  
 131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  62765  
 262144 |@@@@@@@@@@@@@@@@@@@@@@                             28131  
 524288 |@@                                                  3509  
1048576 |                                                       2  
2097152 |                                                       3  
4194304 |                                                       0  
8388608 |                                                       0  

从结果看到, 平均IO响应时间变成277500纳秒了, 换算成IOPS只有3603.

参考

1. man blockdev

       --setra N  
              Set readahead to N 512-byte sectors.  
  
       --getra  
              Print readahead (in 512-byte sectors).  

2. 《设置进程亲和 - numactl 或 taskset - retrieve or set a process’s CPU affinity (affect SYSTEMTAP TIME)》

Flag Counter

digoal’s 大量PostgreSQL文章入口