PostgreSQL 空间包含 OR 组合查询、 IN 大量重复值 CASE 优化 - 消重、消bitmapOr、消filter

1 minute read

背景

简单的等值查询，空间包含查询，很容易想到使用索引加速。

但是当条件升级到IN，或者多个空间包含的OR查询时，数据库可能会选择多次索引扫描然后BitmapOr的方法，这种方法虽然用了索引，但是由于索引返回的是BLOCKID而不是CTID，所以会引入recheck。

如果IN内数值，或者空间本身存在大量的重叠区间，那么性能会下降更严重。

postgres=# create table abc(id int primary key, info text);  
  
postgres=# insert into abc select generate_series(1,10000000), 'test';  
  
-- 查询100万个重复ID  
postgres=# do language plpgsql $$                                                                       
declare  
sql text;  
begin  
  select string_agg('1',',') into sql from generate_series(1,1000000);  
  sql := format('select * from abc where id in (%s);', sql);   
  raise notice '%', clock_timestamp();  
  execute sql;   
  raise notice '%', clock_timestamp();  
end;  
$$;  
NOTICE:  2018-06-22 02:35:54.731764+08  
NOTICE:  2018-06-22 02:35:55.841688+08  
DO  
Time: 1336.696 ms (00:01.337)  

实际场景中多个空间包含组合搜索还蛮常见：

查询所有连锁店覆盖的空间区域的数据，

查询多个空间圈选后包含的数据，

存在同样的问题，例如两个st_covers的or组合查询，变成了bitmap index scan

select * from xxx where st_covers(geo1,loc) or st_covers(geo2,loc) or .... st_covers(geo?,loc);
  
 Bitmap Heap Scan on xxx  (cost=8.84..13.37 rows=1 width=1194) (actual time=0.940..2.547 rows=215 loops=1)  
   Output: xxx  
   Recheck Cond: (('0103000020E610000001000000050000004607CD68194F5E40A30954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40A30954F216CA4240'::geography && xxx.location) OR ('0103000020E610000001000000050000004607CD68194F5E40250954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40250954F216CA4240'::geography && xxx.location))  
   Filter: ((('0103000020E610000001000000050000004607CD68194F5E40A30954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40A30954F216CA4240'::geography && xxx.location) AND _st_covers('0103000020E610000001000000050000004607CD68194F5E40A30954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40A30954F216CA4240'::geography, xxx.location)) OR (('0103000020E610000001000000050000004607CD68194F5E40250954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40250954F216CA4240'::geography && xxx.location) AND _st_covers('0103000020E610000001000000050000004607CD68194F5E40250954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40250954F216CA4240'::geography, xxx.location)))  
   Rows Removed by Filter: 18  
   Heap Blocks: exact=241  
   Buffers: shared hit=259  
   ->  BitmapOr  (cost=8.84..8.84 rows=1 width=0) (actual time=0.862..0.862 rows=0 loops=1)  
         Buffers: shared hit=18  
         ->  Bitmap Index Scan on xxx  (cost=0.00..4.42 rows=1 width=0) (actual time=0.465..0.465 rows=242 loops=1)  
               Index Cond: ('0103000020E610000001000000050000004607CD68194F5E40A30954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40A30954F216CA4240'::geography && xxx.location)  
               Buffers: shared hit=9  
         ->  Bitmap Index Scan on xxx  (cost=0.00..4.42 rows=1 width=0) (actual time=0.397..0.397 rows=242 loops=1)  
               Index Cond: ('0103000020E610000001000000050000004607CD68194F5E40250954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40250954F216CA4240'::geography && xxx.location)  
               Buffers: shared hit=9  
 Planning time: 0.719 ms  
 Execution time: 2.590 ms  
(17 rows)  

优化

数值IN大量重复值

1、SQL重写请参考:

《PostgreSQL in 与 = any 的SQL语法异同与性能优化》

2、客户端对输入条件去重，然后再拼接SQL进行输入

空间包含or组合查询

1、客户端对空间进行重叠处理去重，

2、然后根据空间BOX进行split，拆成多个空间对象，切割的目的是提高有效空间的占比。减少不规则空间对象使用GiST索引扫描时，BOUND BOX引入的无效记录的占比。提高效率。

3、将原来的OR改成，多个空间对象的包含查询，

4、union all组合多个查询。

select * from xxx where st_cover(geo1, loc)   
union all  
select * from xxx where st_cover(geo2, loc)   
union all  
...  
select * from xxx where st_cover(geo?, loc) ;  

执行计划会变成类似这样(注意下面演示的执行计划没有对空间去重(只是演示问题，不是真实问题)，你需要关注的是空间去重后，recheck没有了，bitmap scan没有了，只有append， INDEX SCAN。真实情况下按bound box切割时多个index scan扫描的BLOCK是完全隔离的，不会产生冗余扫描。)

同时，原始的查询条件如果有100个OR，实际上在空间处理后换成union all的查询可能没有100个UNION ALL，根据实际的空间SPLIT情况来定。

 Append  (cost=0.41..17.40 rows=2 width=1178) (actual time=0.126..3.267 rows=430 loops=1)
   Buffers: shared hit=500
   ->  Index Scan using xxx on xxx.xxx  (cost=0.41..8.69 rows=1 width=1194) (actual time=0.126..1.821 rows=215 loops=1)
         Output: xxx
         Index Cond: ('0103000020E610000001000000050000004607CD68194F5E40A30954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40A30954F216CA4240'::geography && xxx.location)
         Filter: _st_covers('0103000020E610000001000000050000004607CD68194F5E40A30954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40A30954F216CA4240'::geography, xxx.location)
         Rows Removed by Filter: 18
         Buffers: shared hit=250
   ->  Index Scan using xxx on xxx.xxx xxx_1  (cost=0.41..8.69 rows=1 width=1194) (actual time=0.101..1.402 rows=215 loops=1)
         Output: xxx
         Index Cond: ('0103000020E610000001000000050000004607CD68194F5E40250954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40250954F216CA4240'::geography && xxx_1.location)
         Filter: _st_covers('0103000020E610000001000000050000004607CD68194F5E40250954F216CA42404607CD68194F5E4017898ACF2AC742401DBB4CC022505E4017898ACF2AC742401DBB4CC022505E40A30954F216CA42404607CD68194F5E40250954F216CA4240'::geography, xxx_1.location)
         Rows Removed by Filter: 18
         Buffers: shared hit=250

如果客户端的计算都可以在PostGIS中实现，那么也可以交给PG来处理（引入一些空间计算的开销），但是数据库数据扫描与RECHECK,FILTER的开销则降低了。

在数据库端实现的优化例子:

《PostgreSQL 空间切割(st_split, ST_Subdivide)功能扩展 - 空间对象网格化 (多边形GiST优化)》

《PostgreSQL 空间st_contains，st_within空间包含搜索优化 - 降IO和降CPU(bound box) (多边形GiST优化)》