PostgreSQL 9.2 add array elements statistics

3 minute read

背景

PostgreSQL 9.2 在统计信息中增加了 array类型中 element的统计信息,以提供给优化器参考。

例如@>, &&, 和<@ 操作符的使用更加优化。

以下是PostgreSQL 9.1 的pg_stats包含的列信息

digoal=# \d pg_stats  
        View "pg_catalog.pg_stats"  
      Column       |   Type   | Modifiers   
-------------------+----------+-----------  
 schemaname        | name     |   
 tablename         | name     |   
 attname           | name     |   
 inherited         | boolean  |   
 null_frac         | real     |   
 avg_width         | integer  |   
 n_distinct        | real     |   
 most_common_vals  | anyarray |   
 most_common_freqs | real[]   |   
 histogram_bounds  | anyarray |   
 correlation       | real     |   

以下是PostgreSQL 9.2 的pg_stats包含的列信息, 可以看到增加了最后3列, 就是存放array类型的统计信息.

digoal=# \d pg_stats  
          View "pg_catalog.pg_stats"  
         Column         |   Type   | Modifiers   
------------------------+----------+-----------  
 schemaname             | name     |   
 tablename              | name     |   
 attname                | name     |   
 inherited              | boolean  |   
 null_frac              | real     |   
 avg_width              | integer  |   
 n_distinct             | real     |   
 most_common_vals       | anyarray |   
 most_common_freqs      | real[]   |   
 histogram_bounds       | anyarray |   
 correlation            | real     |   
 most_common_elems      | anyarray |   
 most_common_elem_freqs | real[]   |   
 elem_count_histogram   | real[]   |   

创建测试表

digoal=# create table array_gist_test (id int,class text[]);  
CREATE TABLE  

插入测试数据

digoal=# insert into array_gist_test values (1,array['digoal','francs','david','hank','dage']);  
INSERT 0 1  
digoal=# insert into array_gist_test values (2,array['test1']);  
INSERT 0 1  
digoal=# insert into array_gist_test values (3,array['test2']);  
INSERT 0 1  
digoal=# insert into array_gist_test select generate_series(4,100000),array['test4'];  
INSERT 0 99997  

创建gin索引

digoal=# create index idx_array_test on array_gist_test using gin (class);  
CREATE INDEX  

为了便于比较,在测试前都把数据刷出OS缓存,并重启数据库.

echo 3 > /proc/sys/vm/drop_caches  

以下是PostgreSQL 9.1中的执行计划测试

查询包含’test4’的数据

digoal=# explain analyze select * from array_gist_test where class @> array['test4'];  
                                                          QUERY PLAN                                                             
-------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on array_gist_test  (cost=3.77..100.03 rows=100 width=37) (actual time=31.899..91.692 rows=99997 loops=1)  
   Recheck Cond: (class @> '{test4}'::text[])  
   ->  Bitmap Index Scan on idx_array_test  (cost=0.00..3.75 rows=100 width=0) (actual time=31.574..31.574 rows=99997 loops=1)  
         Index Cond: (class @> '{test4}'::text[])  
 Total runtime: 101.623 ms  
(5 rows)  

查询包含’test3’的数据

digoal=# explain analyze select * from array_gist_test where class @> array['test3'];  
                                                       QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on array_gist_test  (cost=3.77..100.03 rows=100 width=37) (actual time=0.014..0.014 rows=0 loops=1)  
   Recheck Cond: (class @> '{test3}'::text[])  
   ->  Bitmap Index Scan on idx_array_test  (cost=0.00..3.75 rows=100 width=0) (actual time=0.012..0.012 rows=0 loops=1)  
         Index Cond: (class @> '{test3}'::text[])  
 Total runtime: 0.055 ms  
(5 rows)  

注意看, 在PostgreSQL9.1中test3和test4查询都走了bitmap index scan计划, 同时执行计划中的cost是一样的. 因为PostgreSQL 9.1没有收集array级的统计信息, 当然也没有办法依靠这个来计算成本.

以下是PostgreSQL 9.2中的执行计划测试

查询包含’test4’的数据

digoal=# explain analyze select * from array_gist_test where class @> array['test4'];  
                                                      QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------------------------  
 Seq Scan on array_gist_test  (cost=0.00..2084.00 rows=100000 width=37) (actual time=7.175..88.462 rows=99997 loops=1)  
   Filter: (class @> '{test4}'::text[])  
   Rows Removed by Filter: 3  
 Total runtime: 106.384 ms  
(4 rows)  

查询包含’test3’的数据

digoal=# explain analyze select * from array_gist_test where class @> array['test3'];  
                                                       QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on array_gist_test  (cost=6.88..398.12 rows=500 width=37) (actual time=0.074..0.074 rows=0 loops=1)  
   Recheck Cond: (class @> '{test3}'::text[])  
   ->  Bitmap Index Scan on idx_array_test  (cost=0.00..6.75 rows=500 width=0) (actual time=0.071..0.071 rows=0 loops=1)  
         Index Cond: (class @> '{test3}'::text[])  
 Total runtime: 0.134 ms  
(5 rows)  

注意到, 查询条件是test4时, 数据库选择了seq_scan的执行计划. 但是执行时间上超过了PostgreSQL 9.1用到的bitmap scan计划.

查询条件是test3时走的是bitmap index scan.

那么为什么9.2和9.1的查询时间不一样呢?

1. 走索引不需要从TUPLE中取出class字段的值进行过滤.需要消耗掉CPU的时间.

2. 走索引会多一些IO的操作, 但是这些IO可能落在(HDS)存储的CACHE里面了, 所以在本例中看不出来, 因此IO在整个过程中的时间占比就显得比较小, 而从TUPLE中取出class字段过滤的动作在时间上占比更突出.

3. 走索引的过滤比走全表的过滤简单,需要的CPU运算少.

接下来我强制9.2在查询条件是test4时不使用seq_scan

digoal=# set enable_seqscan=off;  
SET  
digoal=# explain analyze select * from array_gist_test where class @> array['test4'];  
                                                             QUERY PLAN                                                               
------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on array_gist_test  (cost=851.00..2935.00 rows=100000 width=37) (actual time=43.921..69.214 rows=99997 loops=1)  
   Recheck Cond: (class @> '{test4}'::text[])  
   ->  Bitmap Index Scan on idx_array_test  (cost=0.00..826.00 rows=100000 width=0) (actual time=43.719..43.719 rows=99997 loops=1)  
         Index Cond: (class @> '{test4}'::text[])  
 Total runtime: 77.961 ms  
(5 rows)  

注意看, 这时PostgreSQL 9.2在查询条件为test4时也走了bitmap index scan, 而且COST可以看出比seq_scan要大, 而且此时的COST和前面test3条件时的COST也不一样, 因为9.2有了array列的统计信息, 所以可以拿来做成本计算.

解读pg_stats中的array类型的统计信息

例如一个pg_stats中array_gist_test.class列的统计信息如下 :

digoal=# select * from pg_stats where tablename='array_gist_test' and attname='class';  
-[ RECORD 1 ]----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
schemaname             | public  
tablename              | array_gist_test  
attname                | class  
inherited              | f  
null_frac              | 0  
avg_width              | 33  
n_distinct             | 5  
most_common_vals       | {"{test7}","{test8}","{test5}","{test6}","{test4}"}  
most_common_freqs      | {0.202267,0.2015,0.1997,0.1995,0.197033}  
histogram_bounds       |   
correlation            | 1  
most_common_elems      | {test4,test5,test6,test7,test8}  
most_common_elem_freqs | {0.197033,0.1997,0.1995,0.202267,0.2015,0.197033,0.202267,0}  
elem_count_histogram   | {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1}  

most_common_elems中表示出现最频繁的几个元素.

most_common_elem_freqs的除掉最后3个剩余的则分别对应common_elems的元素的占比。

例如第一个test4元素统计出来出现的频率是0.197033

我们可以来计算一下看看和实际是否相符.

首先查看这个表有多少条记录.

digoal=# select reltuples from pg_class where relname='array_gist_test';  
-[ RECORD 1 ]-----  
reltuples | 499988  

然后根据这个记录数乘以频率就等于这个记录有多少条.

digoal=# select 0.197033*499988;  
-[ RECORD 1 ]----------  
?column? | 98514.135604  

看看真实的test4有多少条.

digoal=# select count(*) from array_gist_test where class @> array['test4'];  
-[ RECORD 1 ]  
count | 99997  

与统计到的数据基本一致.

前面5个值加起来刚好等于1, 最后一位是0.

digoal=# select 0.197033+0.1997+0.1995+0.202267+0.2015;  
-[ RECORD 1 ]------  
?column? | 1.000000  

好了,那么剩余的最后3个值0.197033,0.202267,0分别是什么意思呢?

0.197033是指前面所有频率值的最小值。

0.202267是指前面所有频率值的最大值。

0表示除去这些comm_elems之外的elems占的比例.

我们可以来看看到底是不是0.

digoal=# select class,count(*) from array_gist_test group by class order by count(*);  
              class              | count   
---------------------------------+-------  
 {test2}                         |     1  
 {test1}                         |     1  
 {digoal,francs,david,hank,dage} |     1  
 {test7}                         | 99997  
 {test8}                         | 99997  
 {test4}                         | 99997  
 {test6}                         | 99997  
 {test5}                         | 99997  

从结果上可以看出, 其他elems为 : test1, test2, digoal, francs, david, hank, dage总共出现在3行中. 占比0.000006, 如下.

digoal=# select count(*) from array_gist_test ;  
 count    
--------  
 499988  
(1 row)  
digoal=# select 3/499988.0;  
          ?column?            
----------------------------  
 0.000006000144003456082946  
(1 row)  

为什么most_common_elem_freqs的最后一位不是0.000006呢, 原因是在analyze时根本就没有取到这3行.

重置一个比较大的set default_statistics_target=10000. 重新分析就能看到更全面的数据了.

digoal=# set default_statistics_target=10000;  
digoal=# analyze array_gist_test ;  
digoal=# select * from pg_stats where tablename='array_gist_test' and attname='class';  
most_common_elems      | {dage,david,digoal,francs,hank,test1,test2,test4,test5,test6,test7,test8}  
most_common_elem_freqs | {2.00005e-06,2.00005e-06,2.00005e-06,2.00005e-06,2.00005e-06,2.00005e-06,2.00005e-06,0.199999,0.199999,0.199999,0.199999,0.199999,2.00005e-06,0.199999,0}  

最后解释一下elem_count_histogram的意思,A histogram of the counts of distinct non-null element values within the values of the column, followed by the average number of distinct non-null elements.就说和default_statistics_target有关系,比如默认是100,那elem_count_histogram就有101个值, 前100个值表示分成均等的100份后,每份中class列包含的distinct non-null element的个数。最后一位是平均的distinct non-null elements的值. 不过个人感觉这个统计没有什么实质的意义. 应该再配合其他的统计信息使用.

Flag Counter

digoal’s 大量PostgreSQL文章入口