Greenplum 空间(GIS)数据检索 b-tree & GiST 索引实践 - 阿里云HybridDB for PostgreSQL最佳实践
背景
气象数据、地震数据、室内定位、室外定位、手机、车联网、还有我们最喜欢的“左划不喜欢、右划喜欢”,越来越多的位置属性的数据。将来会越来越多。
基于GIS的数据分析、OLTP业务也越来越受到决策者的青睐,例如商场的选址决策,O2O的广告营销等。有很多基于多边形、时间、用户对象属性过滤的需求。
阿里云HybridDB for PostgreSQL是一个支持GIS数据类型处理的MPP分布式数据库,支持海量的GIS数据的存储和分析处理。
支持三种索引接口:
bitmap
btree
gist
三种索引的原理请参考
在Greenplum中,GiST索引比较重(是R-Tree结构的空间索引),但是它支持几乎所有的空间搜索,但是overhead相比btree也更大一些:
1、平面、三维、多维对象 几何相交、不相交、相邻。
2、平面、三维、多维对象的方位判断(相交或严格在左边、右边、上边、下边),类似数值的大于、小于、大于等于、小于等于。
3、平面、三维、多维对象 包含 另一个对象
4、平面、三维、多维对象 等于 另一个对象
5、平面、三维、多维对象 与另一个对象的(边、最近、中心点的)距离,按距离排序输出满足条件的行,输出距离在XX以内的行。
PostgreSQL 比Greenplum支持的索引接口更多,比如BRIN是一个很好的块级索引,对于基于多边形的群体分析非常有效。但是Greenplum中没有BRIN。
《PostGIS空间索引(GiST、BRIN、R-Tree)选择、优化 - 阿里云RDS PostgreSQL最佳实践》
那么Greenplum中能否有类似BRIN块级索引的功能的,在阿里云HybridDB for PostgreSQL中是有这个功能的:
《解密上帝之手 - 阿里云HDB for PostgreSQL数据库metascan特性(存储级、块级、batch级过滤与数据编排)》
但是我们先来说说社区版本的Greenplum,能否用b-tree替代overhead更大的GiST?
答案必须是肯定的。
精度不是那么高的GEOHASH
geohash的精度随着它的位数变化,例如4位时,它是一个20公里大的BOX。(也就是说一个点,被模糊化为1个方圆20公里的BOX。够粗糙的,但是我们就要用这个粗糙来过滤数据。)
GEOHASH B-Tree一重过滤
使用geohash一重过滤,得到一个大范围的BOX内的数据。
由于GEOHASH是text类型,支持b-tree索引。
当POINT在某个范围(box)内时,它的prefix会和这个box的prefix重叠。
查询某些纵横20公里内的点。
select * from table where st_geohash(pos,15) ~ '^abcd';
st_distancespheroid二重过滤
一重过滤会得到一个较大范围,例如我们要查询5公里内的人群,实际上返回的是20公里内的人群。
二重过滤,使用距离函数,计算真实的距离,过滤不符合条件的记录。
Name
ST_DistanceSpheroid — Returns the minimum distance between two lon/lat geometries given a particular spheroid. PostGIS versions prior to 1.5 only support points.
Synopsis
float ST_DistanceSpheroid(geometry geomlonlatA, geometry geomlonlatB, spheroid measurement_spheroid);
性能评测
1、建表,存储geometry类型的POINT
postgres=# create unlogged table test(id int, pos geometry);
CREATE TABLE
2、写入5000万测试数据
postgres=# insert into test select generate_series(1,50000000), st_setsrid(st_makepoint(random()*360-180, random()*180-90), 4326) ;
3、创建geohash表达式索引
create index idx_test_pos on test using btree(st_geohash(pos, 11));
4、查询ST_Point(100,90)方圆5公里内的点,我们需要采用4位精度(一重过滤收缩到20公里)。
postgresql 语法:
postgres=# explain (analyze,verbose,timing,costs,buffers)
select *,
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')
from test
where st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_test_pos on public.test (cost=0.56..2532.74 rows=5000 width=44) (actual time=0.224..7.279 rows=46 loops=1)
Output: id, pos, st_distancespheroid(pos, '0101000020E610000000000000000059400000000000805640'::geometry, 'SPHEROID("WGS84",6378137,298.257223562997)'::spheroid)
Index Cond: ((st_geohash(test.pos, 11) >= 'ypzp'::text) AND (st_geohash(test.pos, 11) < 'ypzq'::text))
Filter: (st_geohash(test.pos, 11) ~ '^ypzp'::text)
Buffers: shared hit=50
Planning time: 0.194 ms
Execution time: 7.311 ms
(7 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers)
select *
from test
where st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_test_pos on public.test (cost=0.56..32.74 rows=5000 width=36) (actual time=0.031..0.144 rows=46 loops=1)
Output: id, pos
Index Cond: ((st_geohash(test.pos, 11) >= 'ypzp'::text) AND (st_geohash(test.pos, 11) < 'ypzq'::text))
Filter: (st_geohash(test.pos, 11) ~ '^ypzp'::text)
Buffers: shared hit=50
Planning time: 0.173 ms
Execution time: 0.176 ms
(7 rows)
postgres=# select id,st_astext(pos),
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')
from test where
st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;
id | st_astext | st_distancespheroid
----------+------------------------------------------+---------------------
25173094 | POINT(99.8623717390001 89.8564164061099) | 16037.4226616061
15256453 | POINT(99.886408187449 89.8609424661845) | 15531.8890312462
42820155 | POINT(99.900087621063 89.8426755331457) | 17572.1953385406
11941582 | POINT(99.9024446122348 89.8641818203032) | 15170.0726995738
27302558 | POINT(99.8545220866799 89.9049358069897) | 10618.0979324084
12748345 | POINT(99.8604090325534 89.9055569060147) | 10548.7249125223
23691474 | POINT(99.9338417127728 89.8551008664072) | 16184.3605168626
2789094 | POINT(99.9521030299366 89.8572400957346) | 15945.4214953106
13014164 | POINT(99.9811399541795 89.8462957609445) | 17167.8377188967
30480729 | POINT(99.9596883170307 89.899117089808) | 11268.0135911417
23922308 | POINT(99.9793708696961 89.8840416502208) | 12951.8493711937
37865268 | POINT(99.9939758330584 89.8977239336818) | 11423.6207380738
44349686 | POINT(99.8703192919493 89.9472489114851) | 5891.97898548146
24627681 | POINT(99.8947213590145 89.9413335509598) | 6552.68913675189
29549671 | POINT(99.8913882113993 89.9539441242814) | 5144.16402989278
19377493 | POINT(99.9157874286175 89.935091631487) | 7249.87395455637
15823642 | POINT(99.8874770477414 89.9763751029968) | 2638.75876144951
35541120 | POINT(99.9283183738589 89.9659094586968) | 3807.70821893377
5136659 | POINT(99.9655172601342 89.9136384855956) | 9646.06115073887
35330398 | POINT(99.9899900704622 89.9316252116114) | 7637.0521799929
18139347 | POINT(99.9744066037238 89.9915046896785) | 948.875017323096
39258118 | POINT(99.9836320616305 89.9618403799832) | 4262.19981177446
30534377 | POINT(99.9944747239351 89.9620758276433) | 4235.90172575841
5400987 | POINT(99.9796806648374 89.9704348482192) | 3302.2494557095
27994107 | POINT(100.035087894648 89.835905469954) | 18328.3705781338
11267788 | POINT(100.044458862394 89.8578487057239) | 15877.4434278569
26289436 | POINT(100.046943258494 89.8645990714431) | 15123.4682619251
44089622 | POINT(100.072270520031 89.8646929487586) | 15112.9827315541
3894728 | POINT(100.086387153715 89.8635132797062) | 15244.7446550829
35587898 | POINT(100.044751726091 89.8876157775521) | 12552.6408821271
2297611 | POINT(100.070957243443 89.8893462214619) | 12359.3607228444
31932031 | POINT(100.064905174077 89.8981667496264) | 11374.1608646003
22491539 | POINT(100.117400866002 89.8279081285) | 19221.6253932898
22237545 | POINT(100.126622300595 89.8635710310191) | 15238.2941814952
36179231 | POINT(100.174575503916 89.8459538631141) | 17206.0256453898
7990607 | POINT(100.183915123343 89.8694989643991) | 14576.179748045
42321853 | POINT(100.080158896744 89.9232382792979) | 8573.82201116562
11478472 | POINT(100.07952991873 89.929187502712) | 7909.32958390792
11105634 | POINT(100.022092089057 89.9607635568827) | 4382.47446858912
2532812 | POINT(100.037098880857 89.9638177547604) | 4041.33895476991
29379668 | POINT(100.023544505239 89.9718571733683) | 3143.3842999806
29884935 | POINT(100.043103173375 89.9965709634125) | 383.002742505678
1076756 | POINT(100.098143275827 89.9763003364205) | 2647.10973787711
22469249 | POINT(100.096434876323 89.9840644095093) | 1779.90951806841
14234913 | POINT(100.1863790676 89.9419186916202) | 6487.33244847654
33483358 | POINT(100.13643046841 89.9627213180065) | 4163.80433863277
(46 rows)
Greenplum 语法(因为Greenplum版本较老,不支持正则表达式的索引检索,我们需要人为转换一下):
select id,st_astext(pos),
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')
from test where
st_geohash(pos,11) >= st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)
and
st_geohash(pos,11) <
substring(st_geohash(st_setsrid(st_makepoint(100,90),4326), 4),1, length(st_geohash(st_setsrid(st_makepoint(100,90),4326), 4))-1 )
||
chr(ascii(substring(st_geohash(st_setsrid(st_makepoint(100,90),4326), 4), '(.)$'))+1);
id | st_astext | st_distancespheroid
----------+------------------------------------------+---------------------
25173094 | POINT(99.8623717390001 89.8564164061099) | 16037.4226616061
15256453 | POINT(99.886408187449 89.8609424661845) | 15531.8890312462
42820155 | POINT(99.900087621063 89.8426755331457) | 17572.1953385406
11941582 | POINT(99.9024446122348 89.8641818203032) | 15170.0726995738
27302558 | POINT(99.8545220866799 89.9049358069897) | 10618.0979324084
12748345 | POINT(99.8604090325534 89.9055569060147) | 10548.7249125223
23691474 | POINT(99.9338417127728 89.8551008664072) | 16184.3605168626
2789094 | POINT(99.9521030299366 89.8572400957346) | 15945.4214953106
13014164 | POINT(99.9811399541795 89.8462957609445) | 17167.8377188967
30480729 | POINT(99.9596883170307 89.899117089808) | 11268.0135911417
23922308 | POINT(99.9793708696961 89.8840416502208) | 12951.8493711937
37865268 | POINT(99.9939758330584 89.8977239336818) | 11423.6207380738
44349686 | POINT(99.8703192919493 89.9472489114851) | 5891.97898548146
24627681 | POINT(99.8947213590145 89.9413335509598) | 6552.68913675189
29549671 | POINT(99.8913882113993 89.9539441242814) | 5144.16402989278
19377493 | POINT(99.9157874286175 89.935091631487) | 7249.87395455637
15823642 | POINT(99.8874770477414 89.9763751029968) | 2638.75876144951
35541120 | POINT(99.9283183738589 89.9659094586968) | 3807.70821893377
5136659 | POINT(99.9655172601342 89.9136384855956) | 9646.06115073887
35330398 | POINT(99.9899900704622 89.9316252116114) | 7637.0521799929
18139347 | POINT(99.9744066037238 89.9915046896785) | 948.875017323096
39258118 | POINT(99.9836320616305 89.9618403799832) | 4262.19981177446
30534377 | POINT(99.9944747239351 89.9620758276433) | 4235.90172575841
5400987 | POINT(99.9796806648374 89.9704348482192) | 3302.2494557095
27994107 | POINT(100.035087894648 89.835905469954) | 18328.3705781338
11267788 | POINT(100.044458862394 89.8578487057239) | 15877.4434278569
26289436 | POINT(100.046943258494 89.8645990714431) | 15123.4682619251
44089622 | POINT(100.072270520031 89.8646929487586) | 15112.9827315541
3894728 | POINT(100.086387153715 89.8635132797062) | 15244.7446550829
35587898 | POINT(100.044751726091 89.8876157775521) | 12552.6408821271
2297611 | POINT(100.070957243443 89.8893462214619) | 12359.3607228444
31932031 | POINT(100.064905174077 89.8981667496264) | 11374.1608646003
22491539 | POINT(100.117400866002 89.8279081285) | 19221.6253932898
22237545 | POINT(100.126622300595 89.8635710310191) | 15238.2941814952
36179231 | POINT(100.174575503916 89.8459538631141) | 17206.0256453898
7990607 | POINT(100.183915123343 89.8694989643991) | 14576.179748045
42321853 | POINT(100.080158896744 89.9232382792979) | 8573.82201116562
11478472 | POINT(100.07952991873 89.929187502712) | 7909.32958390792
11105634 | POINT(100.022092089057 89.9607635568827) | 4382.47446858912
2532812 | POINT(100.037098880857 89.9638177547604) | 4041.33895476991
29379668 | POINT(100.023544505239 89.9718571733683) | 3143.3842999806
29884935 | POINT(100.043103173375 89.9965709634125) | 383.002742505678
1076756 | POINT(100.098143275827 89.9763003364205) | 2647.10973787711
22469249 | POINT(100.096434876323 89.9840644095093) | 1779.90951806841
14234913 | POINT(100.1863790676 89.9419186916202) | 6487.33244847654
33483358 | POINT(100.13643046841 89.9627213180065) | 4163.80433863277
(46 rows)
5、CLUSTER优化,按geohash重排后,扫描的块降低,速度大幅提升。
postgres=# cluster test using idx_test_pos ;
优化后效率如下,从扫描50个块,降低到扫描5个块。扫描成本直线下降10倍,对SHARED BUFFER的需求也下降10倍。
explain (analyze,verbose,timing,costs,buffers)
select *,
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')
from test
where st_geohash(pos,11) ~ ('^'||st_geohash(st_setsrid(st_makepoint(100,90),4326), 4)) ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_test_pos on public.test (cost=0.56..2532.74 rows=5000 width=44) (actual time=0.214..7.172 rows=46 loops=1)
Output: id, pos, st_distancespheroid(pos, '0101000020E610000000000000000059400000000000805640'::geometry, 'SPHEROID("WGS84",6378137,298.257223562997)'::spheroid)
Index Cond: ((st_geohash(test.pos, 11) >= 'ypzp'::text) AND (st_geohash(test.pos, 11) < 'ypzq'::text))
Filter: (st_geohash(test.pos, 11) ~ '^ypzp'::text)
Buffers: shared hit=5
Planning time: 0.174 ms
Execution time: 7.205 ms
(7 rows)
云端相关产品
小结
对于Greenplum来说,如果你觉得GiST索引引入的数据导入OVERHEAD不值得,那么建议改用B-Tree索引,使用geohash的btree来代替GiST索引,通过一重索引过滤,二重CPU过滤来进行空间数据的筛选。
在不太影响导入性能的同时,又能获得不错的空间数据筛选性能。何乐而不为呢?
最后,我们可能还有时间、空间、对象属性多维度查询的需求(可以使用多级分区表、多字段索引bitmap合并扫描等数据库黑科技),方法请参考:
《PostgreSQL\GPDB 毫秒级海量多维数据透视 案例分享》
《时间、空间、对象多维属性 海量数据任意多维 高效检索 - 阿里云RDS PostgreSQL最佳实践》
《(新零售)商户网格化运营 - 阿里云RDS PostgreSQL、HybridDB for PostgreSQL最佳实践》
《PostGIS空间索引(GiST、BRIN、R-Tree)选择、优化 - 阿里云RDS PostgreSQL最佳实践》
《PostgreSQL 单列组合查询优化 - 多个多边形查询优化》
二重过滤的CPU开销较大,如果这部分操作放到程序端实现,那么数据库的查询耗时将降低到0.17毫秒
-- 以下运算,建议放到程序过滤。(在一重过滤数据量可控的情况下,例如查询2.4公里内的点,放大到5公里操作,5公里内比如有100万个点,加上其他属性的过滤,可能过滤到50万个点,实际上这样返回给客户端,客户端再次过滤是可以接受的。)
ST_DistanceSpheroid(pos,ST_SetSRID(ST_Point(100,90),4326),'SPHEROID["WGS84",6378137,298.257223563]')
参考
http://postgis.net/docs/ST_Distance_Spheroid.html