PostgreSQL 11 preview - 索引优化。filter智能消除、分区索引智能合并

5 minute read

背景

当数据量较大时,对数据分区是一种比较不错的数据管理手段。

每个分区都有对应的边界定义,在PostgreSQL中我们可以把它理解为分区的constraint。

目前PostgreSQL在优化器的细节方面进行打磨,constraint可以被优化器利用来做一些优化的工作。

1、例如:

每个班级一个分区,你要找在某个班级并且学号=1的同学。

假设你在学号上创建了一个索引。

那么真实的执行计划会怎么样呢?在没有优化的情况下,可能是这样的

    ->  Index Scan using pi1 on 班级101分区表  (cost=0.29..8.31 rows=1 width=8)  
          Index Cond: (学号 = 1)  
          Filter: (班级=101)  

由于索引上没有班级这个字段的内容,所以被放到了filter里面,意味着每一条符合学号=1的记录都需要再次被filter一遍。而实际上这是板上钉钉的,因为你已经在这个分区里面了。不需要再filter。

显然,优化器可以对这样的场景进行优化,去除Filter。

再看下面这个例子,分区上面都有K字段的范围约束,选择分区表时优化器根据输入条件,将数据收敛到了对应的分区,但是在索引扫描时,FILTRER没有被去除。

create table base (k integer primary key, v integer);  
create table part1 (check (k between 1 and 10000)) inherits (base);  
create table part2 (check (k between 10001 and 20000)) inherits (base);  
create index pi1 on part1(v);  
create index pi2 on part2(v);  
insert int part1 values (generate series(1,10000), random());  
insert into part2 values (generate_series(10001,20000), random());  
explain select * from base where k between 1 and 20000 and v = 100;  
                               QUERY PLAN  
-----------------------------------------------------------------------  
  Append  (cost=0.00..15.65 rows=3 width=8)  
    ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)  
          Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))  
    ->  Index Scan using pi1 on part1  (cost=0.29..8.31 rows=1 width=8)  
          Index Cond: (v = 100)  
          Filter: ((k >= 1) AND (k <= 20000))  
    ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)  
          Index Cond: (v = 100)  
          Filter: ((k >= 1) AND (k <= 20000))  

2、对于partial index也是一样的道理,当可以选择多个partial index时,PostgreSQL目前的版本,不能智能的进行最优索引的选择。

create table t (k integer primary key, v integer);  
insert into t values (generate_series(1,20000),random());  
create index i1 on t(v) where k between 1 and 10000;  
create index i2 on t(v) where k between 10001 and 20000;  
postgres=# explain select * from t where k between 1 and 10000 and v = 100;  
                          QUERY PLAN  
------------------------------------------------------------  
  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)  
    Index Cond: (v = 100)  
(2 rows)  
  
  
Here we get perfect plan. Let's try to extend search interval:  
  
  
postgres=# explain select * from t where k between 1 and 20000 and v = 100;  
                             QUERY PLAN  
------------------------------------------------------------------  
  Index Scan using t_pkey on t  (cost=0.29..760.43 rows=1 width=8)  
    Index Cond: ((k >= 1) AND (k <= 20000))  
    Filter: (v = 100)  
(3 rows)  

实际上可以使用多个partial index,从而实现最优的成本。

Unfortunately in this case Postgres is not able to apply partial indexes.  
And this is what I expected to get:  
  
postgres=# explain select * from t where k between 1 and 10000 and v =   
100 union all select * from t where k between 10001 and 20000 and v = 100;  
                               QUERY PLAN  
----------------------------------------------------------------------  
  Append  (cost=0.29..14.58 rows=2 width=8)  
    ->  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)  
          Index Cond: (v = 100)  
    ->  Index Scan using i2 on t t_1  (cost=0.29..7.28 rows=1 width=8)  
          Index Cond: (v = 100)  

PostgreSQL 11 在优化器方面,可能针对这个场景会进行优化。patch如下。

https://commitfest.postgresql.org/17/1264/

https://www.postgresql.org/message-id/flat/27516421-5afa-203c-e22a-8407e9187327@postgrespro.ru#27516421-5afa-203c-e22a-8407e9187327@postgrespro.ru

Hi hackers,  
  
I am trying to compare different ways of optimizing work with huge   
append-only tables in PostgreSQL where primary key is something like   
timestamp and queries are usually accessing most recent data using some   
secondary keys. Size of secondary index is one of the most critical   
factors limiting  insert/search performance. As far as data is inserted   
in timestamp ascending order, access to primary key is well localized   
and accessed tables are present in memory. But if we create secondary   
key for the whole table, then access to it will require random reads   
from the disk and significantly decrease performance.  
  
There are two well known solutions of the problem:  
1. Table partitioning  
2. Partial indexes  
  
This approaches I want to compare. First of all I want to check if   
optimizer is able to generate efficient query execution plan covering   
different time intervals.  
Unfortunately in both cases generated plan is not optimal.  
  
1. Table partitioning:  
  
create table base (k integer primary key, v integer);  
create table part1 (check (k between 1 and 10000)) inherits (base);  
create table part2 (check (k between 10001 and 20000)) inherits (base);  
create index pi1 on part1(v);  
create index pi2 on part2(v);  
insert int part1 values (generate series(1,10000), random());  
insert into part2 values (generate_series(10001,20000), random());  
explain select * from base where k between 1 and 20000 and v = 100;  
                               QUERY PLAN  
-----------------------------------------------------------------------  
  Append  (cost=0.00..15.65 rows=3 width=8)  
    ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)  
          Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))  
    ->  Index Scan using pi1 on part1  (cost=0.29..8.31 rows=1 width=8)  
          Index Cond: (v = 100)  
          Filter: ((k >= 1) AND (k <= 20000))  
    ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)  
          Index Cond: (v = 100)  
          Filter: ((k >= 1) AND (k <= 20000))  
  
Questions:  
- Is there some way to avoid sequential scan of parent table? Yes, it is   
empty and so sequential scan will not take much time, but ... it still   
requires some additional actions and so increasing query execution time.  
- Why index scan of partition indexes includes filter condition if it is   
guaranteed by check constraint that all records of this partition match   
search predicate?  
  
  
2. Partial indexes:  
  
create table t (k integer primary key, v integer);  
insert into t values (generate_series(1,20000),random());  
create index i1 on t(v) where k between 1 and 10000;  
create index i2 on t(v) where k between 10001 and 20000;  
postgres=# explain select * from t where k between 1 and 10000 and v = 100;  
                          QUERY PLAN  
------------------------------------------------------------  
  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)  
    Index Cond: (v = 100)  
(2 rows)  
  
  
Here we get perfect plan. Let's try to extend search interval:  
  
  
postgres=# explain select * from t where k between 1 and 20000 and v = 100;  
                             QUERY PLAN  
------------------------------------------------------------------  
  Index Scan using t_pkey on t  (cost=0.29..760.43 rows=1 width=8)  
    Index Cond: ((k >= 1) AND (k <= 20000))  
    Filter: (v = 100)  
(3 rows)  
  
Unfortunately in this case Postgres is not able to apply partial indexes.  
And this is what I expected to get:  
  
postgres=# explain select * from t where k between 1 and 10000 and v =   
100 union all select * from t where k between 10001 and 20000 and v = 100;  
                               QUERY PLAN  
----------------------------------------------------------------------  
  Append  (cost=0.29..14.58 rows=2 width=8)  
    ->  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)  
          Index Cond: (v = 100)  
    ->  Index Scan using i2 on t t_1  (cost=0.29..7.28 rows=1 width=8)  
          Index Cond: (v = 100)  
  
  
I wonder if there are some principle problems in supporting this two   
things in optimizer:  
1. Remove search condition for primary key if it is fully satisfied by   
derived table check constraint.  
2. Append index scans of several partial indexes if specified interval   
is covered by their conditions.  
  
I wonder if someone is familiar with this part of optimizer ad can   
easily fix it.  
Otherwise I am going to spend some time on solving this problems (if   
community think that such optimizations will be useful).  
  
--   
  
Konstantin Knizhnik  
Postgres Professional: http://www.postgrespro.com  
The Russian Postgres Company  

小结

PostgreSQL 11在优化器细节方面进行了很多优化,包括前面提到的分区表的智能并行JOIN,group, 聚合等。

以及本文提到的,在索引扫描时,

1、根据分区表的约束条件,如果索引字段未包含分区键,会自动消除filter。

2、在输入条件可以用到多个partial index时,自动选择union all扫描。(这项优化,将来也能用在分区索引上面,partial index是分区索引的一个雏形。)

参考

https://commitfest.postgresql.org/17/1264/

https://www.postgresql.org/message-id/flat/27516421-5afa-203c-e22a-8407e9187327@postgrespro.ru#27516421-5afa-203c-e22a-8407e9187327@postgrespro.ru

Flag Counter

digoal’s 大量PostgreSQL文章入口