优化器成本因子校对(disk,ssd,memory IO开销精算) - PostgreSQL real seq_page_cost & random_page_cost in disks,ssd,memory

7 minute read

背景

在PostgreSQL中全表扫描时, 计算page扫描的成本因子为seq_page_cost, 例如扫描100个数据块, 那么扫描这些数据块的成本就是seq_page_cost*100;

如果使用的是索引扫描(如index scan, index only scan, bitmap index scan, bitmap heap scan)这些都是随机页面扫描, 在计算成本时使用random_page_cost成本因子, 例如使用这些扫描方法扫描了100个数据块, 则扫描这些数据块的成本就是random_page_cost*100;

我早些时间写了一篇BLOG专门介绍成本因子的校准的方法, 包括CPU的成本因子和PAGE SCAN的成本因子.

有兴趣的朋友可参考 :

《优化器成本因子校对 - PostgreSQL explain cost constants alignment to timestamp》

本文将对比几种硬件的连续扫描和离散扫描的实际时间. 用的方法和上面这篇BLOG一致, 都是使用systemtap来跟踪数据块读取的实际时间.

在测试机得到的测试结果如下 :

机械硬盘连续块扫描的平均时间197881纳秒(0.197881毫秒)  
  
机械硬盘离散块扫描的平均时间4957678纳秒(4.957678毫秒)  
  
SSD硬盘连续块扫描的平均时间49577纳秒(0.049577毫秒)  
  
SSD硬盘离散块扫描的平均时间483504纳秒(0.483504毫秒)  
  
当数据在内存中时, 连续块扫描的平均时间11445纳秒(0.011445毫秒)  
  
当数据在内存中时, 离散块扫描的平均时间18563纳秒(0.018563毫秒)  

从结论得知, 机械硬盘离散扫描的性能下降最严重, 比连续扫描降25倍.

SSD的离散扫描性能比连续扫描性能降9.7倍.

内存的离散扫描性能比连续扫描性能降1.6倍.

相对取单条TUPLE的CPU成本, 数据块扫描的成本大概是他们的百倍到万倍的区间 :

(以下结果为1.6GHZ频率的CPU的测试结果)

cpu_tuple_cost = 0.00018884145574257426   #毫秒  
cpu_index_tuple_cost = 0.00433497085216479990   #毫秒  
cpu_operator_cost = 0.00216748542608239995   #毫秒  

详细的测试过程参考

测试环境如下 :

OS: CentOS 6.4 x64  
MEM: 96GB  
DB: PostgreSQL 9.3.3  
CPU: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz  
DISK: SAS 2.5寸 1万转, 146GB, 开启RAID R|W cache.  
SSD: OCZ RevoDrive3 240GB  

要对比的是连续块扫描和离散块扫描在机械硬盘, SSD硬盘以及内存中的实际时间.

分区如下 :

pg93@db-172-16-3-150-> df -h  
Filesystem            Size  Used Avail Use% Mounted on  
/dev/sdc1              29G  9.2G   19G  34% /  
tmpfs                  48G     0   48G   0% /dev/shm  
/dev/sdc3              98G   30G   64G  32% /opt  
/dev/sdd1             183G   34G  141G  20% /ssd1  
/dev/sda1             221G  8.0G  201G   4% /ssd3  
/dev/sdb1             221G   34G  176G  17% /ssd4  

数据库安装在SSD上面 :

pg93@db-172-16-3-150-> cd $PGDATA  
pg93@db-172-16-3-150-> pwd  
/ssd4/pg93/pg_root  
pg93@db-172-16-3-150-> psql  
psql (9.3.3)  
Type "help" for help.  
digoal=# \db  
       List of tablespaces  
    Name    |  Owner   | Location   
------------+----------+----------  
 pg_default | postgres |   
 pg_global  | postgres |   
(2 rows)  

为了测试机械硬盘, 在机械硬盘上加一个表空间

[root@db-172-16-3-150 ssd4]# mkdir -p /opt/pg93/tbs_digoal  
[root@db-172-16-3-150 ssd4]# chown -R pg93:pg93 /opt/pg93  
[root@db-172-16-3-150 ssd4]# su - pg93  
pg93@db-172-16-3-150-> psql  
psql (9.3.3)  
Type "help" for help.  
digoal=# create tablespace tbs_digoal location '/opt/pg93/tbs_digoal';  
CREATE TABLESPACE  

首先要测试的是SSD硬盘 :

创建测试数据和索引

digoal=# create table tbl_cost_align (id int, info text, crt_time timestamp);  
CREATE TABLE  
digoal=# insert into tbl_cost_align select (random()*2000000000)::int, md5(random()::text), clock_timestamp() from generate_series(1,10000000);  
INSERT 0 10000000  
digoal=# create index idx_tbl_cost_align_id on tbl_cost_align(id);  
digoal=# analyze tbl_cost_align;  
ANALYZE  
digoal=# select relpages from pg_class where relname='tbl_cost_align';  
 relpages   
----------  
    23256  
(1 row)  
digoal=# checkpoint;  

关闭数据库, 并删除系统层面的缓存, 然后重启数据库 :

pg93@db-172-16-3-150-> pg_ctl stop -m fast  
waiting for server to shut down.... done  
server stopped  
[root@db-172-16-3-150 ~]# sync; echo 3 > /proc/sys/vm/drop_caches  
pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.3/bin/postgres >/dev/null 2>&1  

开启一个psql进行测试 :

pg93@db-172-16-3-150-> psql  
psql (9.3.3)  
Type "help" for help.  
digoal=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
           2699  
(1 row)  

跟踪这个psql的pid

taskset -c 7 stap -e '  
global a  
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__start") {  
  delete a  
  println("query__start ", user_string($arg1), "pid:", pid())  
}  
probe vfs.read.return {  
  t = gettimeofday_ns() - @entry(gettimeofday_ns())  
  # if (execname() == "postgres" && devname != "N/A")  
    a[pid()] <<< t  
}  
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__done") {  
  if (@count(a[pid()]))   
    printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))  
  println("query__done ", user_string($arg1), "pid:", pid())  
  if (@count(a[pid()])) {  
    println(@hist_log(a[pid()]))  
    #println(@hist_linear(a[pid()],1024,4096,100))  
  }  
  delete a  
}' -x 2699  

确认stap模块已经加载到内核 :

[root@db-172-16-3-150 ~]# lsmod|grep stap  
stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3022   149604  2   
uprobes                28877  2 stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3022  

在psql中执行一个全表扫描的查询 :

digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;  
                                                              QUERY PLAN                                                              
    
------------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on public.tbl_cost_align  (cost=0.00..123256.00 rows=10000000 width=45) (actual time=0.793..2986.001 rows=10000000 loops=1)  
   Output: id, info, crt_time  
   Buffers: shared read=23256  
 Total runtime: 4046.410 ms  
(4 rows)  

systemtap跟踪的结果如下 :

平均扫描时间为49577纳秒(0.049577毫秒).

query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:2699  
2699**23269**49577  
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:2699  
  value |-------------------------------------------------- count  
   4096 |                                                      0  
   8192 |                                                      0  
  16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  8888  
  32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    8509  
  65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   5787  
 131072 |                                                     73  
 262144 |                                                      2  
 524288 |                                                      6  
1048576 |                                                      1  
2097152 |                                                      3  
4194304 |                                                      0  
8388608 |                                                      0  

随机块扫描测试 :

[root@db-172-16-3-150 ~]# sync; echo 3 > /proc/sys/vm/drop_caches  

重启数据库

pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.3/bin/postgres >/dev/null 2>&1  

重新测试

pg93@db-172-16-3-150-> psql  
psql (9.3.3)  
Type "help" for help.  
digoal=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
           3596  
(1 row)  
  
[root@db-172-16-3-150 ~]# taskset -c 7 stap -e '  
global a  
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__start") {  
  delete a  
  println("query__start ", user_string($arg1), "pid:", pid())  
}  
probe vfs.read.return {  
  t = gettimeofday_ns() - @entry(gettimeofday_ns())  
  # if (execname() == "postgres" && devname != "N/A")  
    a[pid()] <<< t  
}  
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__done") {  
  if (@count(a[pid()]))   
    printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))  
  println("query__done ", user_string($arg1), "pid:", pid())  
  if (@count(a[pid()])) {  
    println(@hist_log(a[pid()]))  
    #println(@hist_linear(a[pid()],1024,4096,100))  
  }  
  delete a  
}' -x 3596  
  
[root@db-172-16-3-150 ~]# lsmod|grep stap  
stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3619   149604  2   
uprobes                28877  2 stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3619  
  
digoal=# set enable_seqscan=off;   
digoal=# set enable_bitmapscan=off;   
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;  
                                                                        QUERY PLAN                                                    
                        
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_tbl_cost_align_id on public.tbl_cost_align  (cost=0.43..4609.95 rows=5003 width=45) (actual time=0.711..2240.410 rows=5007 loops=1)  
   Output: id, info, crt_time  
   Index Cond: (tbl_cost_align.id > 1998999963)  
   Buffers: shared hit=481 read=4532  
 Total runtime: 2242.254 ms  
(5 rows)  
  
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:3596  
3596**4563**483504  
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:3596  
  value |-------------------------------------------------- count  
   4096 |                                                      0  
   8192 |                                                      0  
  16384 |                                                     21  
  32768 |@@@                                                 175  
  65536 |                                                     26  
 131072 |                                                      1  
 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2491  
 524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               1800  
1048576 |                                                     47  
2097152 |                                                      2  
4194304 |                                                      0  
8388608 |                                                      0  

随机块扫描的平均时间为483504纳秒(0.483504毫秒);

接下来测试的是机械硬盘的连续块扫描和随机块扫描的实际性能.

digoal=# alter table tbl_cost_align set tablespace tbs_digoal;  
ALTER TABLE  
digoal=# alter index idx_tbl_cost_align_id set tablespace tbs_digoal;  
ALTER INDEX  
digoal=# checkpoint;  
CHECKPOINT  
  
[root@db-172-16-3-150 ~]# sync; echo 3 > /proc/sys/vm/drop_caches  
pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.3/bin/postgres >/dev/null 2>&1  
  
su - pg93  
^[[A^[[Apg93@db-172-16-3-150-> psql  
psql (9.3.3)  
Type "help" for help.  
digoal=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
           3739  
(1 row)  
  
[root@db-172-16-3-150 ~]# taskset -c 7 stap -e '  
global a  
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__start") {  
  delete a  
  println("query__start ", user_string($arg1), "pid:", pid())  
}  
probe vfs.read.return {  
  t = gettimeofday_ns() - @entry(gettimeofday_ns())  
  # if (execname() == "postgres" && devname != "N/A")  
    a[pid()] <<< t  
}  
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__done") {  
  if (@count(a[pid()]))   
    printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))  
  println("query__done ", user_string($arg1), "pid:", pid())  
  if (@count(a[pid()])) {  
    println(@hist_log(a[pid()]))  
    #println(@hist_linear(a[pid()],1024,4096,100))  
  }  
  delete a  
}' -x 3739  
  
[root@db-172-16-3-150 ~]# lsmod|grep stap  
stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3755   149604  2   
uprobes                28877  2 stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3755  
  
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;  
                                                              QUERY PLAN                                                              
     
------------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on public.tbl_cost_align  (cost=0.00..123256.00 rows=10000000 width=45) (actual time=12.592..6378.106 rows=10000000 loops=1)  
   Output: id, info, crt_time  
   Buffers: shared read=23256  
 Total runtime: 7405.526 ms  
(4 rows)  
  
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:3739  
3739**23272**197881  
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:3739  
   value |-------------------------------------------------- count  
    4096 |                                                       0  
    8192 |                                                       0  
   16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  10176  
   32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 7189  
   65536 |@@@@@@@@@                                           2036  
  131072 |                                                     104  
  262144 |@@@@@@@@@                                           1858  
  524288 |@@@@@@@                                             1472  
 1048576 |                                                      91  
 2097152 |                                                      83  
 4194304 |@                                                    261  
 8388608 |                                                       2  
16777216 |                                                       0  
33554432 |                                                       0  
  
[root@db-172-16-3-150 ~]# sync; echo 3 > /proc/sys/vm/drop_caches  
pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.3/bin/postgres >/dev/null 2>&1  
  
  
[root@db-172-16-3-150 ~]# su - pg93  
psqlpg93@db-172-16-3-150-> psql  
psql (9.3.3)  
Type "help" for help.  
digoal=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
           3868  
(1 row)  
  
  
[root@db-172-16-3-150 ~]# taskset -c 7 stap -e '  
global a  
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__start") {  
  delete a  
  println("query__start ", user_string($arg1), "pid:", pid())  
}  
probe vfs.read.return {  
  t = gettimeofday_ns() - @entry(gettimeofday_ns())  
  # if (execname() == "postgres" && devname != "N/A")  
    a[pid()] <<< t  
}  
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__done") {  
  if (@count(a[pid()]))   
    printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))  
  println("query__done ", user_string($arg1), "pid:", pid())  
  if (@count(a[pid()])) {  
    println(@hist_log(a[pid()]))  
    #println(@hist_linear(a[pid()],1024,4096,100))  
  }  
  delete a  
}' -x 3868  
  
[root@db-172-16-3-150 ~]# lsmod|grep stap  
stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3881   149604  2   
uprobes                28877  2 stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3881  
  
digoal=# set enable_seqscan=off;   
digoal=# set enable_bitmapscan=off;   
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;  
                                                                        QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_tbl_cost_align_id on public.tbl_cost_align  (cost=0.43..4609.95 rows=5003 width=45) (actual time=5.462..22646.769 rows=5007 loops=1)  
   Output: id, info, crt_time  
   Index Cond: (tbl_cost_align.id > 1998999963)  
   Buffers: shared hit=481 read=4532  
 Total runtime: 22648.805 ms  
(5 rows)  
  
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:3868  
3868**4563**4957678  
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:3868  
    value |-------------------------------------------------- count  
     4096 |                                                      0  
     8192 |                                                      0  
    16384 |                                                     27  
    32768 |@@@                                                 169  
    65536 |                                                     26  
   131072 |                                                      2  
   262144 |                                                     17  
   524288 |                                                     40  
  1048576 |@@@                                                 200  
  2097152 |@@@@@@@@@@@@@@@@@@@@@@@@@                          1338  
  4194304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2568  
  8388608 |@@@                                                 159  
 16777216 |                                                     16  
 33554432 |                                                      0  
 67108864 |                                                      1  
134217728 |                                                      0  
268435456 |                                                      0  

机械硬盘连续块扫描的平均时间197881纳秒(0.197881毫秒)

离散块扫描的平均时间4957678纳秒(4.957678毫秒)

最后要测试的是在内存中的连续块扫描和离散块扫描的性能.

要用到pgfincore这个插件, 同时需要修改数据库的shared buffer, 避免在shared buffer中命中, 无法跟踪到vfs系统调用的情况.

digoal=# create extension pgfincore;  
CREATE EXTENSION  
digoal=# select * from pgfadvise_willneed('idx_tbl_cost_align_id');  
                   relpath                    | os_page_size | rel_os_pages | os_pages_free   
----------------------------------------------+--------------+--------------+---------------  
 pg_tblspc/25324/PG_9.3_201306121/16384/33521 |         4096 |        54416 |      19013540  
(1 row)  
  
digoal=# select * from pgfadvise_willneed('tbl_cost_align');  
                   relpath                    | os_page_size | rel_os_pages | os_pages_free   
----------------------------------------------+--------------+--------------+---------------  
 pg_tblspc/25324/PG_9.3_201306121/16384/33518 |         4096 |       186048 |      18884781  
(1 row)  
  
vi postgresql.conf  
shared_buffers = 32MB  
pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.3/bin/postgres >/dev/null 2>&1  
  
pg93@db-172-16-3-150-> psql  
psql (9.3.3)  
Type "help" for help.  
digoal=# select pg_backend_pid();  
 pg_backend_pid   
----------------  
           4538  
(1 row)  
  
[root@db-172-16-3-150 ~]# taskset -c 7 stap -e '  
global a  
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__start") {  
  delete a  
  println("query__start ", user_string($arg1), "pid:", pid())  
}  
probe vfs.read.return {      
  t = gettimeofday_ns() - @entry(gettimeofday_ns())  
  # if (execname() == "postgres" && devname != "N/A")  
    a[pid()] <<< t  
}                                                                          
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__done") {  
  if (@count(a[pid()]))                                      
    printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))   
  println("query__done ", user_string($arg1), "pid:", pid())  
  if (@count(a[pid()])) {         
    println(@hist_log(a[pid()]))  
    #println(@hist_linear(a[pid()],1024,4096,100))  
  }         
  delete a  
}' -x 4538  
  
[root@db-172-16-3-150 ~]# lsmod|grep stap  
stap_ba670c311fb1c365e5ccb4b6e6e9d92a__4556   149604  2   
uprobes                28877  2 stap_ba670c311fb1c365e5ccb4b6e6e9d92a__4556  
  
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;  
                                                              QUERY PLAN                                                              
------------------------------------------------------------------------------------------------------------------------------------  
 Seq Scan on public.tbl_cost_align  (cost=0.00..123256.00 rows=10000000 width=45) (actual time=0.066..2086.471 rows=10000000 loops=1)  
   Output: id, info, crt_time  
   Buffers: shared read=23256  
 Total runtime: 3128.600 ms  
(4 rows)  
  
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:4538  
4538**23272**11445  
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:4538  
  value |-------------------------------------------------- count  
   2048 |                                                       0  
   4096 |                                                       0  
   8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  23018  
  16384 |                                                     224  
  32768 |                                                      26  
  65536 |                                                       3  
 131072 |                                                       0  
 262144 |                                                       0  
 524288 |                                                       1  
1048576 |                                                       0  
2097152 |                                                       0  
  
digoal=# set enable_seqscan=off;   
digoal=# set enable_bitmapscan=off;   
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;  
                                                                       QUERY PLAN                                                     
------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_tbl_cost_align_id on public.tbl_cost_align  (cost=0.43..4609.95 rows=5003 width=45) (actual time=0.157..110.887 rows=5007 loops=1)  
   Output: id, info, crt_time  
   Index Cond: (tbl_cost_align.id > 1998999963)  
   Buffers: shared hit=175 read=4838  
 Total runtime: 111.819 ms  
(5 rows)  
  
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:4538  
4538**4855**18563  
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:4538  
  value |-------------------------------------------------- count  
   2048 |                                                      0  
   4096 |                                                      0  
   8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  3762  
  16384 |@@@@                                                367  
  32768 |@@@@@@@@@                                           717  
  65536 |                                                      2  
 131072 |                                                      6  
 262144 |                                                      1  
 524288 |                                                      0  
1048576 |                                                      0  

当数据在内存中时, 连续块扫描的平均时间11445纳秒(0.011445毫秒)

离散块扫描的平均时间18563纳秒(0.018563毫秒)

参考

1. 《优化器成本因子校对 - PostgreSQL explain cost constants alignment to timestamp》

Flag Counter

digoal’s 大量PostgreSQL文章入口