Greenplum 空间(GIS)数据检索 b-tree & GiST 索引实践 - 阿里云HybridDB for PostgreSQL最佳实践

4 minute read

背景

气象数据、地震数据、室内定位、室外定位、手机、车联网、还有我们最喜欢的“左划不喜欢、右划喜欢”,越来越多的位置属性的数据。将来会越来越多。

基于GIS的数据分析、OLTP业务也越来越受到决策者的青睐,例如商场的选址决策,O2O的广告营销等。有很多基于多边形、时间、用户对象属性过滤的需求。

阿里云HybridDB for PostgreSQL是一个支持GIS数据类型处理的MPP分布式数据库,支持海量的GIS数据的存储和分析处理。

支持三种索引接口:

bitmap

btree

gist

三种索引的原理请参考

《PostgreSQL 9种索引的原理和应用场景》

在Greenplum中,GiST索引比较重(是R-Tree结构的空间索引),但是它支持几乎所有的空间搜索,但是overhead相比btree也更大一些:

1、平面、三维、多维对象 几何相交、不相交、相邻。

2、平面、三维、多维对象的方位判断(相交或严格在左边、右边、上边、下边),类似数值的大于、小于、大于等于、小于等于。

3、平面、三维、多维对象 包含 另一个对象

4、平面、三维、多维对象 等于 另一个对象

5、平面、三维、多维对象 与另一个对象的(边、最近、中心点的)距离,按距离排序输出满足条件的行,输出距离在XX以内的行。

PostgreSQL 比Greenplum支持的索引接口更多,比如BRIN是一个很好的块级索引,对于基于多边形的群体分析非常有效。但是Greenplum中没有BRIN。

《PostGIS空间索引(GiST、BRIN、R-Tree)选择、优化 - 阿里云RDS PostgreSQL最佳实践》

那么Greenplum中能否有类似BRIN块级索引的功能的,在阿里云HybridDB for PostgreSQL中是有这个功能的:

《解密上帝之手 - 阿里云HDB for PostgreSQL数据库metascan特性(存储级、块级、batch级过滤与数据编排)》

但是我们先来说说社区版本的Greenplum,能否用b-tree替代overhead更大的GiST?

答案必须是肯定的。

精度不是那么高的GEOHASH

《geohash vs PostGIS》

pic

geohash的精度随着它的位数变化,例如4位时,它是一个20公里大的BOX。(也就是说一个点,被模糊化为1个方圆20公里的BOX。够粗糙的,但是我们就要用这个粗糙来过滤数据。)

GEOHASH B-Tree一重过滤

使用geohash一重过滤,得到一个大范围的BOX内的数据。

由于GEOHASH是text类型,支持b-tree索引。

当POINT在某个范围(box)内时,它的prefix会和这个box的prefix重叠。

查询某些纵横20公里内的点。  
  
select * from table where st_geohash(pos,15) ~ '^abcd';  

st_distancespheroid二重过滤

一重过滤会得到一个较大范围,例如我们要查询5公里内的人群,实际上返回的是20公里内的人群。

二重过滤,使用距离函数,计算真实的距离,过滤不符合条件的记录。

Name  
  
ST_DistanceSpheroid — Returns the minimum distance between two lon/lat geometries given a particular spheroid. PostGIS versions prior to 1.5 only support points.  
  
Synopsis  
  
float ST_DistanceSpheroid(geometry geomlonlatA, geometry geomlonlatB, spheroid measurement_spheroid);  

性能评测

1、建表,存储geometry类型的POINT

postgres=# create unlogged table test(id int, pos geometry);  
CREATE TABLE  

2、写入5000万测试数据

postgres=# insert into test select generate_series(1,50000000), st_setsrid(st_makepoint(random()*360-180, random()*180-90), 4326) ;  

3、创建geohash表达式索引

create index idx_test_pos on test using btree(st_geohash(pos, 11));  

4、查询ST_Point(100,90)方圆5公里内的点,我们需要采用4位精度(一重过滤收缩到20公里)。

postgresql 语法:

postgres=# explain (analyze,verbose,timing,costs,buffers)   
select *,  
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')   
from test  
where st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;  
                                                                             QUERY PLAN                                                                                
---------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_test_pos on public.test  (cost=0.56..2532.74 rows=5000 width=44) (actual time=0.224..7.279 rows=46 loops=1)  
   Output: id, pos, st_distancespheroid(pos, '0101000020E610000000000000000059400000000000805640'::geometry, 'SPHEROID("WGS84",6378137,298.257223562997)'::spheroid)  
   Index Cond: ((st_geohash(test.pos, 11) >= 'ypzp'::text) AND (st_geohash(test.pos, 11) < 'ypzq'::text))  
   Filter: (st_geohash(test.pos, 11) ~ '^ypzp'::text)  
   Buffers: shared hit=50  
 Planning time: 0.194 ms  
 Execution time: 7.311 ms  
(7 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers)   
select *   
from test  
where st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;  
                                                           QUERY PLAN                                                             
--------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_test_pos on public.test  (cost=0.56..32.74 rows=5000 width=36) (actual time=0.031..0.144 rows=46 loops=1)  
   Output: id, pos  
   Index Cond: ((st_geohash(test.pos, 11) >= 'ypzp'::text) AND (st_geohash(test.pos, 11) < 'ypzq'::text))  
   Filter: (st_geohash(test.pos, 11) ~ '^ypzp'::text)  
   Buffers: shared hit=50  
 Planning time: 0.173 ms  
 Execution time: 0.176 ms  
(7 rows)  
  
postgres=#  select id,st_astext(pos),  
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')   
from test where   
st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;  
    id    |                st_astext                 | st_distancespheroid   
----------+------------------------------------------+---------------------  
 25173094 | POINT(99.8623717390001 89.8564164061099) |    16037.4226616061  
 15256453 | POINT(99.886408187449 89.8609424661845)  |    15531.8890312462  
 42820155 | POINT(99.900087621063 89.8426755331457)  |    17572.1953385406  
 11941582 | POINT(99.9024446122348 89.8641818203032) |    15170.0726995738  
 27302558 | POINT(99.8545220866799 89.9049358069897) |    10618.0979324084  
 12748345 | POINT(99.8604090325534 89.9055569060147) |    10548.7249125223  
 23691474 | POINT(99.9338417127728 89.8551008664072) |    16184.3605168626  
  2789094 | POINT(99.9521030299366 89.8572400957346) |    15945.4214953106  
 13014164 | POINT(99.9811399541795 89.8462957609445) |    17167.8377188967  
 30480729 | POINT(99.9596883170307 89.899117089808)  |    11268.0135911417  
 23922308 | POINT(99.9793708696961 89.8840416502208) |    12951.8493711937  
 37865268 | POINT(99.9939758330584 89.8977239336818) |    11423.6207380738  
 44349686 | POINT(99.8703192919493 89.9472489114851) |    5891.97898548146  
 24627681 | POINT(99.8947213590145 89.9413335509598) |    6552.68913675189  
 29549671 | POINT(99.8913882113993 89.9539441242814) |    5144.16402989278  
 19377493 | POINT(99.9157874286175 89.935091631487)  |    7249.87395455637  
 15823642 | POINT(99.8874770477414 89.9763751029968) |    2638.75876144951  
 35541120 | POINT(99.9283183738589 89.9659094586968) |    3807.70821893377  
  5136659 | POINT(99.9655172601342 89.9136384855956) |    9646.06115073887  
 35330398 | POINT(99.9899900704622 89.9316252116114) |     7637.0521799929  
 18139347 | POINT(99.9744066037238 89.9915046896785) |    948.875017323096  
 39258118 | POINT(99.9836320616305 89.9618403799832) |    4262.19981177446  
 30534377 | POINT(99.9944747239351 89.9620758276433) |    4235.90172575841  
  5400987 | POINT(99.9796806648374 89.9704348482192) |     3302.2494557095  
 27994107 | POINT(100.035087894648 89.835905469954)  |    18328.3705781338  
 11267788 | POINT(100.044458862394 89.8578487057239) |    15877.4434278569  
 26289436 | POINT(100.046943258494 89.8645990714431) |    15123.4682619251  
 44089622 | POINT(100.072270520031 89.8646929487586) |    15112.9827315541  
  3894728 | POINT(100.086387153715 89.8635132797062) |    15244.7446550829  
 35587898 | POINT(100.044751726091 89.8876157775521) |    12552.6408821271  
  2297611 | POINT(100.070957243443 89.8893462214619) |    12359.3607228444  
 31932031 | POINT(100.064905174077 89.8981667496264) |    11374.1608646003  
 22491539 | POINT(100.117400866002 89.8279081285)    |    19221.6253932898  
 22237545 | POINT(100.126622300595 89.8635710310191) |    15238.2941814952  
 36179231 | POINT(100.174575503916 89.8459538631141) |    17206.0256453898  
  7990607 | POINT(100.183915123343 89.8694989643991) |     14576.179748045  
 42321853 | POINT(100.080158896744 89.9232382792979) |    8573.82201116562  
 11478472 | POINT(100.07952991873 89.929187502712)   |    7909.32958390792  
 11105634 | POINT(100.022092089057 89.9607635568827) |    4382.47446858912  
  2532812 | POINT(100.037098880857 89.9638177547604) |    4041.33895476991  
 29379668 | POINT(100.023544505239 89.9718571733683) |     3143.3842999806  
 29884935 | POINT(100.043103173375 89.9965709634125) |    383.002742505678  
  1076756 | POINT(100.098143275827 89.9763003364205) |    2647.10973787711  
 22469249 | POINT(100.096434876323 89.9840644095093) |    1779.90951806841  
 14234913 | POINT(100.1863790676 89.9419186916202)   |    6487.33244847654  
 33483358 | POINT(100.13643046841 89.9627213180065)  |    4163.80433863277  
(46 rows)  

Greenplum 语法(因为Greenplum版本较老,不支持正则表达式的索引检索,我们需要人为转换一下):

select id,st_astext(pos),  
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')   
from test where   
st_geohash(pos,11) >= st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)  
and   
st_geohash(pos,11) <   
substring(st_geohash(st_setsrid(st_makepoint(100,90),4326), 4),1, length(st_geohash(st_setsrid(st_makepoint(100,90),4326), 4))-1 )  
||  
chr(ascii(substring(st_geohash(st_setsrid(st_makepoint(100,90),4326), 4), '(.)$'))+1);  
  
  
    id    |                st_astext                 | st_distancespheroid   
----------+------------------------------------------+---------------------  
 25173094 | POINT(99.8623717390001 89.8564164061099) |    16037.4226616061  
 15256453 | POINT(99.886408187449 89.8609424661845)  |    15531.8890312462  
 42820155 | POINT(99.900087621063 89.8426755331457)  |    17572.1953385406  
 11941582 | POINT(99.9024446122348 89.8641818203032) |    15170.0726995738  
 27302558 | POINT(99.8545220866799 89.9049358069897) |    10618.0979324084  
 12748345 | POINT(99.8604090325534 89.9055569060147) |    10548.7249125223  
 23691474 | POINT(99.9338417127728 89.8551008664072) |    16184.3605168626  
  2789094 | POINT(99.9521030299366 89.8572400957346) |    15945.4214953106  
 13014164 | POINT(99.9811399541795 89.8462957609445) |    17167.8377188967  
 30480729 | POINT(99.9596883170307 89.899117089808)  |    11268.0135911417  
 23922308 | POINT(99.9793708696961 89.8840416502208) |    12951.8493711937  
 37865268 | POINT(99.9939758330584 89.8977239336818) |    11423.6207380738  
 44349686 | POINT(99.8703192919493 89.9472489114851) |    5891.97898548146  
 24627681 | POINT(99.8947213590145 89.9413335509598) |    6552.68913675189  
 29549671 | POINT(99.8913882113993 89.9539441242814) |    5144.16402989278  
 19377493 | POINT(99.9157874286175 89.935091631487)  |    7249.87395455637  
 15823642 | POINT(99.8874770477414 89.9763751029968) |    2638.75876144951  
 35541120 | POINT(99.9283183738589 89.9659094586968) |    3807.70821893377  
  5136659 | POINT(99.9655172601342 89.9136384855956) |    9646.06115073887  
 35330398 | POINT(99.9899900704622 89.9316252116114) |     7637.0521799929  
 18139347 | POINT(99.9744066037238 89.9915046896785) |    948.875017323096  
 39258118 | POINT(99.9836320616305 89.9618403799832) |    4262.19981177446  
 30534377 | POINT(99.9944747239351 89.9620758276433) |    4235.90172575841  
  5400987 | POINT(99.9796806648374 89.9704348482192) |     3302.2494557095  
 27994107 | POINT(100.035087894648 89.835905469954)  |    18328.3705781338  
 11267788 | POINT(100.044458862394 89.8578487057239) |    15877.4434278569  
 26289436 | POINT(100.046943258494 89.8645990714431) |    15123.4682619251  
 44089622 | POINT(100.072270520031 89.8646929487586) |    15112.9827315541  
  3894728 | POINT(100.086387153715 89.8635132797062) |    15244.7446550829  
 35587898 | POINT(100.044751726091 89.8876157775521) |    12552.6408821271  
  2297611 | POINT(100.070957243443 89.8893462214619) |    12359.3607228444  
 31932031 | POINT(100.064905174077 89.8981667496264) |    11374.1608646003  
 22491539 | POINT(100.117400866002 89.8279081285)    |    19221.6253932898  
 22237545 | POINT(100.126622300595 89.8635710310191) |    15238.2941814952  
 36179231 | POINT(100.174575503916 89.8459538631141) |    17206.0256453898  
  7990607 | POINT(100.183915123343 89.8694989643991) |     14576.179748045  
 42321853 | POINT(100.080158896744 89.9232382792979) |    8573.82201116562  
 11478472 | POINT(100.07952991873 89.929187502712)   |    7909.32958390792  
 11105634 | POINT(100.022092089057 89.9607635568827) |    4382.47446858912  
  2532812 | POINT(100.037098880857 89.9638177547604) |    4041.33895476991  
 29379668 | POINT(100.023544505239 89.9718571733683) |     3143.3842999806  
 29884935 | POINT(100.043103173375 89.9965709634125) |    383.002742505678  
  1076756 | POINT(100.098143275827 89.9763003364205) |    2647.10973787711  
 22469249 | POINT(100.096434876323 89.9840644095093) |    1779.90951806841  
 14234913 | POINT(100.1863790676 89.9419186916202)   |    6487.33244847654  
 33483358 | POINT(100.13643046841 89.9627213180065)  |    4163.80433863277  
(46 rows)  

5、CLUSTER优化,按geohash重排后,扫描的块降低,速度大幅提升。

postgres=# cluster test using idx_test_pos ;  

优化后效率如下,从扫描50个块,降低到扫描5个块。扫描成本直线下降10倍,对SHARED BUFFER的需求也下降10倍。

explain (analyze,verbose,timing,costs,buffers)   
select *,  
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')   
from test  
where st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;  
  
                                                                             QUERY PLAN                                                                                
---------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_test_pos on public.test  (cost=0.56..2532.74 rows=5000 width=44) (actual time=0.214..7.172 rows=46 loops=1)  
   Output: id, pos, st_distancespheroid(pos, '0101000020E610000000000000000059400000000000805640'::geometry, 'SPHEROID("WGS84",6378137,298.257223562997)'::spheroid)  
   Index Cond: ((st_geohash(test.pos, 11) >= 'ypzp'::text) AND (st_geohash(test.pos, 11) < 'ypzq'::text))  
   Filter: (st_geohash(test.pos, 11) ~ '^ypzp'::text)  
   Buffers: shared hit=5  
 Planning time: 0.174 ms  
 Execution time: 7.205 ms  
(7 rows)  

云端相关产品

阿里云 RDS PostgreSQL

阿里云 HybridDB for PostgreSQL

小结

对于Greenplum来说,如果你觉得GiST索引引入的数据导入OVERHEAD不值得,那么建议改用B-Tree索引,使用geohash的btree来代替GiST索引,通过一重索引过滤,二重CPU过滤来进行空间数据的筛选。

在不太影响导入性能的同时,又能获得不错的空间数据筛选性能。何乐而不为呢?

最后,我们可能还有时间、空间、对象属性多维度查询的需求(可以使用多级分区表、多字段索引bitmap合并扫描等数据库黑科技),方法请参考:

《PostgreSQL\GPDB 毫秒级海量多维数据透视 案例分享》

《时间、空间、对象多维属性 海量数据任意多维 高效检索 - 阿里云RDS PostgreSQL最佳实践》

《(新零售)商户网格化运营 - 阿里云RDS PostgreSQL、HybridDB for PostgreSQL最佳实践》

《PostGIS空间索引(GiST、BRIN、R-Tree)选择、优化 - 阿里云RDS PostgreSQL最佳实践》

《PostgreSQL 单列组合查询优化 - 多个多边形查询优化》

二重过滤的CPU开销较大,如果这部分操作放到程序端实现,那么数据库的查询耗时将降低到0.17毫秒

-- 以下运算,建议放到程序过滤。(在一重过滤数据量可控的情况下,例如查询2.4公里内的点,放大到5公里操作,5公里内比如有100万个点,加上其他属性的过滤,可能过滤到50万个点,实际上这样返回给客户端,客户端再次过滤是可以接受的。)
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')

参考

《PostGIS 空间数据学习建议》

http://postgis.net/docs/ST_Distance_Spheroid.html

Flag Counter

digoal’s 大量PostgreSQL文章入口