优化器成本因子校对(disk,ssd,memory IO开销精算) - PostgreSQL real seq_page_cost & random_page_cost in disks,ssd,memory
背景
在PostgreSQL中全表扫描时, 计算page扫描的成本因子为seq_page_cost, 例如扫描100个数据块, 那么扫描这些数据块的成本就是seq_page_cost*100;
如果使用的是索引扫描(如index scan, index only scan, bitmap index scan, bitmap heap scan)这些都是随机页面扫描, 在计算成本时使用random_page_cost成本因子, 例如使用这些扫描方法扫描了100个数据块, 则扫描这些数据块的成本就是random_page_cost*100;
我早些时间写了一篇BLOG专门介绍成本因子的校准的方法, 包括CPU的成本因子和PAGE SCAN的成本因子.
有兴趣的朋友可参考 :
《优化器成本因子校对 - PostgreSQL explain cost constants alignment to timestamp》
本文将对比几种硬件的连续扫描和离散扫描的实际时间. 用的方法和上面这篇BLOG一致, 都是使用systemtap来跟踪数据块读取的实际时间.
在测试机得到的测试结果如下 :
机械硬盘连续块扫描的平均时间197881纳秒(0.197881毫秒)
机械硬盘离散块扫描的平均时间4957678纳秒(4.957678毫秒)
SSD硬盘连续块扫描的平均时间49577纳秒(0.049577毫秒)
SSD硬盘离散块扫描的平均时间483504纳秒(0.483504毫秒)
当数据在内存中时, 连续块扫描的平均时间11445纳秒(0.011445毫秒)
当数据在内存中时, 离散块扫描的平均时间18563纳秒(0.018563毫秒)
从结论得知, 机械硬盘离散扫描的性能下降最严重, 比连续扫描降25倍.
SSD的离散扫描性能比连续扫描性能降9.7倍.
内存的离散扫描性能比连续扫描性能降1.6倍.
相对取单条TUPLE的CPU成本, 数据块扫描的成本大概是他们的百倍到万倍的区间 :
(以下结果为1.6GHZ频率的CPU的测试结果)
cpu_tuple_cost = 0.00018884145574257426 #毫秒
cpu_index_tuple_cost = 0.00433497085216479990 #毫秒
cpu_operator_cost = 0.00216748542608239995 #毫秒
详细的测试过程参考
测试环境如下 :
OS: CentOS 6.4 x64
MEM: 96GB
DB: PostgreSQL 9.3.3
CPU: Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
DISK: SAS 2.5寸 1万转, 146GB, 开启RAID R|W cache.
SSD: OCZ RevoDrive3 240GB
要对比的是连续块扫描和离散块扫描在机械硬盘, SSD硬盘以及内存中的实际时间.
分区如下 :
pg93@db-172-16-3-150-> df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 29G 9.2G 19G 34% /
tmpfs 48G 0 48G 0% /dev/shm
/dev/sdc3 98G 30G 64G 32% /opt
/dev/sdd1 183G 34G 141G 20% /ssd1
/dev/sda1 221G 8.0G 201G 4% /ssd3
/dev/sdb1 221G 34G 176G 17% /ssd4
数据库安装在SSD上面 :
pg93@db-172-16-3-150-> cd $PGDATA
pg93@db-172-16-3-150-> pwd
/ssd4/pg93/pg_root
pg93@db-172-16-3-150-> psql
psql (9.3.3)
Type "help" for help.
digoal=# \db
List of tablespaces
Name | Owner | Location
------------+----------+----------
pg_default | postgres |
pg_global | postgres |
(2 rows)
为了测试机械硬盘, 在机械硬盘上加一个表空间
[root@db-172-16-3-150 ssd4]# mkdir -p /opt/pg93/tbs_digoal
[root@db-172-16-3-150 ssd4]# chown -R pg93:pg93 /opt/pg93
[root@db-172-16-3-150 ssd4]# su - pg93
pg93@db-172-16-3-150-> psql
psql (9.3.3)
Type "help" for help.
digoal=# create tablespace tbs_digoal location '/opt/pg93/tbs_digoal';
CREATE TABLESPACE
首先要测试的是SSD硬盘 :
创建测试数据和索引
digoal=# create table tbl_cost_align (id int, info text, crt_time timestamp);
CREATE TABLE
digoal=# insert into tbl_cost_align select (random()*2000000000)::int, md5(random()::text), clock_timestamp() from generate_series(1,10000000);
INSERT 0 10000000
digoal=# create index idx_tbl_cost_align_id on tbl_cost_align(id);
digoal=# analyze tbl_cost_align;
ANALYZE
digoal=# select relpages from pg_class where relname='tbl_cost_align';
relpages
----------
23256
(1 row)
digoal=# checkpoint;
关闭数据库, 并删除系统层面的缓存, 然后重启数据库 :
pg93@db-172-16-3-150-> pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
[root@db-172-16-3-150 ~]# sync; echo 3 > /proc/sys/vm/drop_caches
pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.3/bin/postgres >/dev/null 2>&1
开启一个psql进行测试 :
pg93@db-172-16-3-150-> psql
psql (9.3.3)
Type "help" for help.
digoal=# select pg_backend_pid();
pg_backend_pid
----------------
2699
(1 row)
跟踪这个psql的pid
taskset -c 7 stap -e '
global a
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__start") {
delete a
println("query__start ", user_string($arg1), "pid:", pid())
}
probe vfs.read.return {
t = gettimeofday_ns() - @entry(gettimeofday_ns())
# if (execname() == "postgres" && devname != "N/A")
a[pid()] <<< t
}
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__done") {
if (@count(a[pid()]))
printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))
println("query__done ", user_string($arg1), "pid:", pid())
if (@count(a[pid()])) {
println(@hist_log(a[pid()]))
#println(@hist_linear(a[pid()],1024,4096,100))
}
delete a
}' -x 2699
确认stap模块已经加载到内核 :
[root@db-172-16-3-150 ~]# lsmod|grep stap
stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3022 149604 2
uprobes 28877 2 stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3022
在psql中执行一个全表扫描的查询 :
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.tbl_cost_align (cost=0.00..123256.00 rows=10000000 width=45) (actual time=0.793..2986.001 rows=10000000 loops=1)
Output: id, info, crt_time
Buffers: shared read=23256
Total runtime: 4046.410 ms
(4 rows)
systemtap跟踪的结果如下 :
平均扫描时间为49577纳秒(0.049577毫秒).
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:2699
2699**23269**49577
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:2699
value |-------------------------------------------------- count
4096 | 0
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8888
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8509
65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5787
131072 | 73
262144 | 2
524288 | 6
1048576 | 1
2097152 | 3
4194304 | 0
8388608 | 0
随机块扫描测试 :
[root@db-172-16-3-150 ~]# sync; echo 3 > /proc/sys/vm/drop_caches
重启数据库
pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.3/bin/postgres >/dev/null 2>&1
重新测试
pg93@db-172-16-3-150-> psql
psql (9.3.3)
Type "help" for help.
digoal=# select pg_backend_pid();
pg_backend_pid
----------------
3596
(1 row)
[root@db-172-16-3-150 ~]# taskset -c 7 stap -e '
global a
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__start") {
delete a
println("query__start ", user_string($arg1), "pid:", pid())
}
probe vfs.read.return {
t = gettimeofday_ns() - @entry(gettimeofday_ns())
# if (execname() == "postgres" && devname != "N/A")
a[pid()] <<< t
}
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__done") {
if (@count(a[pid()]))
printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))
println("query__done ", user_string($arg1), "pid:", pid())
if (@count(a[pid()])) {
println(@hist_log(a[pid()]))
#println(@hist_linear(a[pid()],1024,4096,100))
}
delete a
}' -x 3596
[root@db-172-16-3-150 ~]# lsmod|grep stap
stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3619 149604 2
uprobes 28877 2 stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3619
digoal=# set enable_seqscan=off;
digoal=# set enable_bitmapscan=off;
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_tbl_cost_align_id on public.tbl_cost_align (cost=0.43..4609.95 rows=5003 width=45) (actual time=0.711..2240.410 rows=5007 loops=1)
Output: id, info, crt_time
Index Cond: (tbl_cost_align.id > 1998999963)
Buffers: shared hit=481 read=4532
Total runtime: 2242.254 ms
(5 rows)
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:3596
3596**4563**483504
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:3596
value |-------------------------------------------------- count
4096 | 0
8192 | 0
16384 | 21
32768 |@@@ 175
65536 | 26
131072 | 1
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2491
524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1800
1048576 | 47
2097152 | 2
4194304 | 0
8388608 | 0
随机块扫描的平均时间为483504纳秒(0.483504毫秒);
接下来测试的是机械硬盘的连续块扫描和随机块扫描的实际性能.
digoal=# alter table tbl_cost_align set tablespace tbs_digoal;
ALTER TABLE
digoal=# alter index idx_tbl_cost_align_id set tablespace tbs_digoal;
ALTER INDEX
digoal=# checkpoint;
CHECKPOINT
[root@db-172-16-3-150 ~]# sync; echo 3 > /proc/sys/vm/drop_caches
pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.3/bin/postgres >/dev/null 2>&1
su - pg93
^[[A^[[Apg93@db-172-16-3-150-> psql
psql (9.3.3)
Type "help" for help.
digoal=# select pg_backend_pid();
pg_backend_pid
----------------
3739
(1 row)
[root@db-172-16-3-150 ~]# taskset -c 7 stap -e '
global a
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__start") {
delete a
println("query__start ", user_string($arg1), "pid:", pid())
}
probe vfs.read.return {
t = gettimeofday_ns() - @entry(gettimeofday_ns())
# if (execname() == "postgres" && devname != "N/A")
a[pid()] <<< t
}
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__done") {
if (@count(a[pid()]))
printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))
println("query__done ", user_string($arg1), "pid:", pid())
if (@count(a[pid()])) {
println(@hist_log(a[pid()]))
#println(@hist_linear(a[pid()],1024,4096,100))
}
delete a
}' -x 3739
[root@db-172-16-3-150 ~]# lsmod|grep stap
stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3755 149604 2
uprobes 28877 2 stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3755
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.tbl_cost_align (cost=0.00..123256.00 rows=10000000 width=45) (actual time=12.592..6378.106 rows=10000000 loops=1)
Output: id, info, crt_time
Buffers: shared read=23256
Total runtime: 7405.526 ms
(4 rows)
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:3739
3739**23272**197881
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:3739
value |-------------------------------------------------- count
4096 | 0
8192 | 0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10176
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 7189
65536 |@@@@@@@@@ 2036
131072 | 104
262144 |@@@@@@@@@ 1858
524288 |@@@@@@@ 1472
1048576 | 91
2097152 | 83
4194304 |@ 261
8388608 | 2
16777216 | 0
33554432 | 0
[root@db-172-16-3-150 ~]# sync; echo 3 > /proc/sys/vm/drop_caches
pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.3/bin/postgres >/dev/null 2>&1
[root@db-172-16-3-150 ~]# su - pg93
psqlpg93@db-172-16-3-150-> psql
psql (9.3.3)
Type "help" for help.
digoal=# select pg_backend_pid();
pg_backend_pid
----------------
3868
(1 row)
[root@db-172-16-3-150 ~]# taskset -c 7 stap -e '
global a
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__start") {
delete a
println("query__start ", user_string($arg1), "pid:", pid())
}
probe vfs.read.return {
t = gettimeofday_ns() - @entry(gettimeofday_ns())
# if (execname() == "postgres" && devname != "N/A")
a[pid()] <<< t
}
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__done") {
if (@count(a[pid()]))
printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))
println("query__done ", user_string($arg1), "pid:", pid())
if (@count(a[pid()])) {
println(@hist_log(a[pid()]))
#println(@hist_linear(a[pid()],1024,4096,100))
}
delete a
}' -x 3868
[root@db-172-16-3-150 ~]# lsmod|grep stap
stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3881 149604 2
uprobes 28877 2 stap_ba670c311fb1c365e5ccb4b6e6e9d92a__3881
digoal=# set enable_seqscan=off;
digoal=# set enable_bitmapscan=off;
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_tbl_cost_align_id on public.tbl_cost_align (cost=0.43..4609.95 rows=5003 width=45) (actual time=5.462..22646.769 rows=5007 loops=1)
Output: id, info, crt_time
Index Cond: (tbl_cost_align.id > 1998999963)
Buffers: shared hit=481 read=4532
Total runtime: 22648.805 ms
(5 rows)
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:3868
3868**4563**4957678
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:3868
value |-------------------------------------------------- count
4096 | 0
8192 | 0
16384 | 27
32768 |@@@ 169
65536 | 26
131072 | 2
262144 | 17
524288 | 40
1048576 |@@@ 200
2097152 |@@@@@@@@@@@@@@@@@@@@@@@@@ 1338
4194304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2568
8388608 |@@@ 159
16777216 | 16
33554432 | 0
67108864 | 1
134217728 | 0
268435456 | 0
机械硬盘连续块扫描的平均时间197881纳秒(0.197881毫秒)
离散块扫描的平均时间4957678纳秒(4.957678毫秒)
最后要测试的是在内存中的连续块扫描和离散块扫描的性能.
要用到pgfincore这个插件, 同时需要修改数据库的shared buffer, 避免在shared buffer中命中, 无法跟踪到vfs系统调用的情况.
digoal=# create extension pgfincore;
CREATE EXTENSION
digoal=# select * from pgfadvise_willneed('idx_tbl_cost_align_id');
relpath | os_page_size | rel_os_pages | os_pages_free
----------------------------------------------+--------------+--------------+---------------
pg_tblspc/25324/PG_9.3_201306121/16384/33521 | 4096 | 54416 | 19013540
(1 row)
digoal=# select * from pgfadvise_willneed('tbl_cost_align');
relpath | os_page_size | rel_os_pages | os_pages_free
----------------------------------------------+--------------+--------------+---------------
pg_tblspc/25324/PG_9.3_201306121/16384/33518 | 4096 | 186048 | 18884781
(1 row)
vi postgresql.conf
shared_buffers = 32MB
pg93@db-172-16-3-150-> taskset -c 1 /home/pg93/pgsql9.3.3/bin/postgres >/dev/null 2>&1
pg93@db-172-16-3-150-> psql
psql (9.3.3)
Type "help" for help.
digoal=# select pg_backend_pid();
pg_backend_pid
----------------
4538
(1 row)
[root@db-172-16-3-150 ~]# taskset -c 7 stap -e '
global a
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__start") {
delete a
println("query__start ", user_string($arg1), "pid:", pid())
}
probe vfs.read.return {
t = gettimeofday_ns() - @entry(gettimeofday_ns())
# if (execname() == "postgres" && devname != "N/A")
a[pid()] <<< t
}
probe process("/home/pg93/pgsql9.3.3/bin/postgres").mark("query__done") {
if (@count(a[pid()]))
printdln("**", pid(), @count(a[pid()]), @avg(a[pid()]))
println("query__done ", user_string($arg1), "pid:", pid())
if (@count(a[pid()])) {
println(@hist_log(a[pid()]))
#println(@hist_linear(a[pid()],1024,4096,100))
}
delete a
}' -x 4538
[root@db-172-16-3-150 ~]# lsmod|grep stap
stap_ba670c311fb1c365e5ccb4b6e6e9d92a__4556 149604 2
uprobes 28877 2 stap_ba670c311fb1c365e5ccb4b6e6e9d92a__4556
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.tbl_cost_align (cost=0.00..123256.00 rows=10000000 width=45) (actual time=0.066..2086.471 rows=10000000 loops=1)
Output: id, info, crt_time
Buffers: shared read=23256
Total runtime: 3128.600 ms
(4 rows)
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:4538
4538**23272**11445
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align;pid:4538
value |-------------------------------------------------- count
2048 | 0
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 23018
16384 | 224
32768 | 26
65536 | 3
131072 | 0
262144 | 0
524288 | 1
1048576 | 0
2097152 | 0
digoal=# set enable_seqscan=off;
digoal=# set enable_bitmapscan=off;
digoal=# explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_tbl_cost_align_id on public.tbl_cost_align (cost=0.43..4609.95 rows=5003 width=45) (actual time=0.157..110.887 rows=5007 loops=1)
Output: id, info, crt_time
Index Cond: (tbl_cost_align.id > 1998999963)
Buffers: shared hit=175 read=4838
Total runtime: 111.819 ms
(5 rows)
query__start explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:4538
4538**4855**18563
query__done explain (analyze,verbose,costs,buffers,timing) select * from tbl_cost_align where id>1998999963;pid:4538
value |-------------------------------------------------- count
2048 | 0
4096 | 0
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3762
16384 |@@@@ 367
32768 |@@@@@@@@@ 717
65536 | 2
131072 | 6
262144 | 1
524288 | 0
1048576 | 0
当数据在内存中时, 连续块扫描的平均时间11445纳秒(0.011445毫秒)
离散块扫描的平均时间18563纳秒(0.018563毫秒)
参考
1. 《优化器成本因子校对 - PostgreSQL explain cost constants alignment to timestamp》