PostgreSQL 9.2 add array elements statistics
背景
PostgreSQL 9.2 在统计信息中增加了 array类型中 element的统计信息,以提供给优化器参考。
例如@>, &&, 和<@ 操作符的使用更加优化。
以下是PostgreSQL 9.1 的pg_stats包含的列信息
digoal=# \d pg_stats
View "pg_catalog.pg_stats"
Column | Type | Modifiers
-------------------+----------+-----------
schemaname | name |
tablename | name |
attname | name |
inherited | boolean |
null_frac | real |
avg_width | integer |
n_distinct | real |
most_common_vals | anyarray |
most_common_freqs | real[] |
histogram_bounds | anyarray |
correlation | real |
以下是PostgreSQL 9.2 的pg_stats包含的列信息, 可以看到增加了最后3列, 就是存放array类型的统计信息.
digoal=# \d pg_stats
View "pg_catalog.pg_stats"
Column | Type | Modifiers
------------------------+----------+-----------
schemaname | name |
tablename | name |
attname | name |
inherited | boolean |
null_frac | real |
avg_width | integer |
n_distinct | real |
most_common_vals | anyarray |
most_common_freqs | real[] |
histogram_bounds | anyarray |
correlation | real |
most_common_elems | anyarray |
most_common_elem_freqs | real[] |
elem_count_histogram | real[] |
创建测试表
digoal=# create table array_gist_test (id int,class text[]);
CREATE TABLE
插入测试数据
digoal=# insert into array_gist_test values (1,array['digoal','francs','david','hank','dage']);
INSERT 0 1
digoal=# insert into array_gist_test values (2,array['test1']);
INSERT 0 1
digoal=# insert into array_gist_test values (3,array['test2']);
INSERT 0 1
digoal=# insert into array_gist_test select generate_series(4,100000),array['test4'];
INSERT 0 99997
创建gin索引
digoal=# create index idx_array_test on array_gist_test using gin (class);
CREATE INDEX
为了便于比较,在测试前都把数据刷出OS缓存,并重启数据库.
echo 3 > /proc/sys/vm/drop_caches
以下是PostgreSQL 9.1中的执行计划测试
查询包含’test4’的数据
digoal=# explain analyze select * from array_gist_test where class @> array['test4'];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on array_gist_test (cost=3.77..100.03 rows=100 width=37) (actual time=31.899..91.692 rows=99997 loops=1)
Recheck Cond: (class @> '{test4}'::text[])
-> Bitmap Index Scan on idx_array_test (cost=0.00..3.75 rows=100 width=0) (actual time=31.574..31.574 rows=99997 loops=1)
Index Cond: (class @> '{test4}'::text[])
Total runtime: 101.623 ms
(5 rows)
查询包含’test3’的数据
digoal=# explain analyze select * from array_gist_test where class @> array['test3'];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on array_gist_test (cost=3.77..100.03 rows=100 width=37) (actual time=0.014..0.014 rows=0 loops=1)
Recheck Cond: (class @> '{test3}'::text[])
-> Bitmap Index Scan on idx_array_test (cost=0.00..3.75 rows=100 width=0) (actual time=0.012..0.012 rows=0 loops=1)
Index Cond: (class @> '{test3}'::text[])
Total runtime: 0.055 ms
(5 rows)
注意看, 在PostgreSQL9.1中test3和test4查询都走了bitmap index scan计划, 同时执行计划中的cost是一样的. 因为PostgreSQL 9.1没有收集array级的统计信息, 当然也没有办法依靠这个来计算成本.
以下是PostgreSQL 9.2中的执行计划测试
查询包含’test4’的数据
digoal=# explain analyze select * from array_gist_test where class @> array['test4'];
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on array_gist_test (cost=0.00..2084.00 rows=100000 width=37) (actual time=7.175..88.462 rows=99997 loops=1)
Filter: (class @> '{test4}'::text[])
Rows Removed by Filter: 3
Total runtime: 106.384 ms
(4 rows)
查询包含’test3’的数据
digoal=# explain analyze select * from array_gist_test where class @> array['test3'];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on array_gist_test (cost=6.88..398.12 rows=500 width=37) (actual time=0.074..0.074 rows=0 loops=1)
Recheck Cond: (class @> '{test3}'::text[])
-> Bitmap Index Scan on idx_array_test (cost=0.00..6.75 rows=500 width=0) (actual time=0.071..0.071 rows=0 loops=1)
Index Cond: (class @> '{test3}'::text[])
Total runtime: 0.134 ms
(5 rows)
注意到, 查询条件是test4时, 数据库选择了seq_scan的执行计划. 但是执行时间上超过了PostgreSQL 9.1用到的bitmap scan计划.
查询条件是test3时走的是bitmap index scan.
那么为什么9.2和9.1的查询时间不一样呢?
1. 走索引不需要从TUPLE中取出class字段的值进行过滤.需要消耗掉CPU的时间.
2. 走索引会多一些IO的操作, 但是这些IO可能落在(HDS)存储的CACHE里面了, 所以在本例中看不出来, 因此IO在整个过程中的时间占比就显得比较小, 而从TUPLE中取出class字段过滤的动作在时间上占比更突出.
3. 走索引的过滤比走全表的过滤简单,需要的CPU运算少.
接下来我强制9.2在查询条件是test4时不使用seq_scan
digoal=# set enable_seqscan=off;
SET
digoal=# explain analyze select * from array_gist_test where class @> array['test4'];
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on array_gist_test (cost=851.00..2935.00 rows=100000 width=37) (actual time=43.921..69.214 rows=99997 loops=1)
Recheck Cond: (class @> '{test4}'::text[])
-> Bitmap Index Scan on idx_array_test (cost=0.00..826.00 rows=100000 width=0) (actual time=43.719..43.719 rows=99997 loops=1)
Index Cond: (class @> '{test4}'::text[])
Total runtime: 77.961 ms
(5 rows)
注意看, 这时PostgreSQL 9.2在查询条件为test4时也走了bitmap index scan, 而且COST可以看出比seq_scan要大, 而且此时的COST和前面test3条件时的COST也不一样, 因为9.2有了array列的统计信息, 所以可以拿来做成本计算.
解读pg_stats中的array类型的统计信息
例如一个pg_stats中array_gist_test.class列的统计信息如下 :
digoal=# select * from pg_stats where tablename='array_gist_test' and attname='class';
-[ RECORD 1 ]----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
schemaname | public
tablename | array_gist_test
attname | class
inherited | f
null_frac | 0
avg_width | 33
n_distinct | 5
most_common_vals | {"{test7}","{test8}","{test5}","{test6}","{test4}"}
most_common_freqs | {0.202267,0.2015,0.1997,0.1995,0.197033}
histogram_bounds |
correlation | 1
most_common_elems | {test4,test5,test6,test7,test8}
most_common_elem_freqs | {0.197033,0.1997,0.1995,0.202267,0.2015,0.197033,0.202267,0}
elem_count_histogram | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}
most_common_elems中表示出现最频繁的几个元素.
most_common_elem_freqs的除掉最后3个剩余的则分别对应common_elems的元素的占比。
例如第一个test4元素统计出来出现的频率是0.197033
我们可以来计算一下看看和实际是否相符.
首先查看这个表有多少条记录.
digoal=# select reltuples from pg_class where relname='array_gist_test';
-[ RECORD 1 ]-----
reltuples | 499988
然后根据这个记录数乘以频率就等于这个记录有多少条.
digoal=# select 0.197033*499988;
-[ RECORD 1 ]----------
?column? | 98514.135604
看看真实的test4有多少条.
digoal=# select count(*) from array_gist_test where class @> array['test4'];
-[ RECORD 1 ]
count | 99997
与统计到的数据基本一致.
前面5个值加起来刚好等于1, 最后一位是0.
digoal=# select 0.197033+0.1997+0.1995+0.202267+0.2015;
-[ RECORD 1 ]------
?column? | 1.000000
好了,那么剩余的最后3个值0.197033,0.202267,0分别是什么意思呢?
0.197033是指前面所有频率值的最小值。
0.202267是指前面所有频率值的最大值。
0表示除去这些comm_elems之外的elems占的比例.
我们可以来看看到底是不是0.
digoal=# select class,count(*) from array_gist_test group by class order by count(*);
class | count
---------------------------------+-------
{test2} | 1
{test1} | 1
{digoal,francs,david,hank,dage} | 1
{test7} | 99997
{test8} | 99997
{test4} | 99997
{test6} | 99997
{test5} | 99997
从结果上可以看出, 其他elems为 : test1, test2, digoal, francs, david, hank, dage总共出现在3行中. 占比0.000006, 如下.
digoal=# select count(*) from array_gist_test ;
count
--------
499988
(1 row)
digoal=# select 3/499988.0;
?column?
----------------------------
0.000006000144003456082946
(1 row)
为什么most_common_elem_freqs的最后一位不是0.000006呢, 原因是在analyze时根本就没有取到这3行.
重置一个比较大的set default_statistics_target=10000. 重新分析就能看到更全面的数据了.
digoal=# set default_statistics_target=10000;
digoal=# analyze array_gist_test ;
digoal=# select * from pg_stats where tablename='array_gist_test' and attname='class';
most_common_elems | {dage,david,digoal,francs,hank,test1,test2,test4,test5,test6,test7,test8}
most_common_elem_freqs | {2.00005e-06,2.00005e-06,2.00005e-06,2.00005e-06,2.00005e-06,2.00005e-06,2.00005e-06,0.199999,0.199999,0.199999,0.199999,0.199999,2.00005e-06,0.199999,0}
最后解释一下elem_count_histogram的意思,A histogram of the counts of distinct non-null element values within the values of the column, followed by the average number of distinct non-null elements.就说和default_statistics_target有关系,比如默认是100,那elem_count_histogram就有101个值, 前100个值表示分成均等的100份后,每份中class列包含的distinct non-null element的个数。最后一位是平均的distinct non-null elements的值. 不过个人感觉这个统计没有什么实质的意义. 应该再配合其他的统计信息使用.