PostgreSQL 单列组合查询优化 - 多个多边形查询优化
背景
在某些空间数据查询需求中,有一些这样的请求,例如查找与某些多边形中的任意一个相交的空间对象。
比如在菜鸟、新零售的业务中,查询某几个商场多边形,或者某几个小区多边形内覆盖的点。
SQL写法可能是这样的
select geo_point,* from table where ST_Within(geo_point, polygon_1) or ST_Within(geo_point, polygon_2) or ... ST_Within(geo_point, polygon_n);
PostgreSQL支持空间索引,同时支持bitmapAnd, bitmapOr index scan。也就是说只要geo_point字段有索引,不管多少个查询条件,都可以走index scan。
这个查询有什么优化空间么?
在讲这个空间优化前,我们来看另一个例子。
单列组合条件查询
单列组合条件查询与前面提到的需求类似,即一个字段,多个查询条件。是不是类似于一个字段,多个多边形匹配呢?
表结构如下
postgres=# \d+ a
Table "public.a"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
id | integer | | not null | | plain | |
info | text | | | | extended | |
crt_time | timestamp without time zone | | | | plain | |
Indexes:
"a_pkey" PRIMARY KEY, btree (id)
查询需求如下,ID字段上有4个条件,任意条件满足即返回结果。
PostgreSQL根据统计信息,基于CBO成本优化,采用了bitmapOr,重复使用了多次单列索引,合并所有条件对应的数据块。最后进行一次recheck得到所要的结果。
bitmapAnd,bitmapOr是PostgreSQL数据库独有的特性,可以在多个查询条件的组合查询中使用多个索引进行数据块的合并和消除扫描,非常赞。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id <100 or id<200 or id<300 or id between 10 and 100 or id between 100 and 30000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.a (cost=39.85..3118.98 rows=2610 width=44) (actual time=5.099..15.355 rows=3154 loops=1)
Output: id, info, crt_time
Recheck Cond: ((a.id < 100) OR (a.id < 200) OR (a.id < 300) OR ((a.id >= 10) AND (a.id <= 100)) OR ((a.id >= 100) AND (a.id <= 30000)))
Heap Blocks: exact=38
Buffers: shared hit=16 read=40
-> BitmapOr (cost=39.85..39.85 rows=2610 width=0) (actual time=5.089..5.089 rows=0 loops=1)
Buffers: shared hit=15 read=3
-> Bitmap Index Scan on a_pkey (cost=0.00..1.70 rows=8 width=0) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: (a.id < 100)
Buffers: shared hit=3
-> Bitmap Index Scan on a_pkey (cost=0.00..1.77 rows=17 width=0) (actual time=0.001..0.001 rows=1 loops=1)
Index Cond: (a.id < 200)
Buffers: shared hit=3
-> Bitmap Index Scan on a_pkey (cost=0.00..1.84 rows=26 width=0) (actual time=0.001..0.001 rows=1 loops=1)
Index Cond: (a.id < 300)
Buffers: shared hit=3
-> Bitmap Index Scan on a_pkey (cost=0.00..1.72 rows=8 width=0) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: ((a.id >= 10) AND (a.id <= 100))
Buffers: shared hit=3
-> Bitmap Index Scan on a_pkey (cost=0.00..29.55 rows=2551 width=0) (actual time=5.079..5.079 rows=3153 loops=1)
Index Cond: ((a.id >= 100) AND (a.id <= 30000))
Buffers: shared hit=3 read=3
Planning time: 0.251 ms
Execution time: 15.699 ms
(24 rows)
而实际上这4个条件,我们可以在逻辑上合成一个条件。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id <= 30000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.a (cost=23.88..3006.82 rows=2560 width=44) (actual time=0.110..0.568 rows=3154 loops=1)
Output: id, info, crt_time
Recheck Cond: (a.id <= 30000)
Heap Blocks: exact=38
Buffers: shared hit=44
-> Bitmap Index Scan on a_pkey (cost=0.00..23.24 rows=2560 width=0) (actual time=0.102..0.102 rows=3154 loops=1)
Index Cond: (a.id <= 30000)
Buffers: shared hit=6
Planning time: 0.127 ms
Execution time: 0.835 ms
(10 rows)
合成后,走精确索引扫描,性能提升几十倍。
实际上,空间类型的数据,也是一样的,而且合成起来更方便,通过st_union即可。
空间合成优化
select geo_point,* from table where ST_Within(geo_point, polygon_1) or ST_Within(geo_point, polygon_2) or ... ST_Within(geo_point, polygon_n);
优化为
select geo_point,* from table where ST_Within(geo_point, st_union(polygon_1,polygon_2,...polygon_n) );
性能提升也是非常明显的。
参考
http://postgis.net/docs/manual-2.3/ST_Union.html
https://www.postgresql.org/docs/9.6/static/indexes-bitmap-scans.html