Greenplum 轨迹相似(伴随分析)
背景
《阿里云 PostgreSQL 产品生态;案例、开发实践、管理实践、学习资料、学习视频 - 珍藏级》
以上有个CASE是讲:如何找出在同一辆车内的人,实际上就是通过车辆、人物的轨迹点数据,进行多轮的求交集,将结果收敛到一个较小的范围。
Greenplum伴随分析测试
1、创建测试表
create table tbl_pos (uid int8, phonenum text, ts timestamp, pos geometry, ghash text);
2、写入测试数据3亿条左右
insert into tbl_pos
select uid, md5(uid::text), ts, pos, st_geohash(pos,8) from
(
select
(random()*10000000)::int8 as uid,
ts,
st_setsrid(st_makepoint(120.2+0.5-random(), 30.3+0.5-random()), 4326) as pos
from
generate_series('2018-01-01'::timestamp, '2019-01-01'::timestamp, interval '0.1 sec')
t (ts)
) t;
3、重复生成数据到100亿
postgres=> insert into tbl_pos select * from tbl_pos;
INSERT 0 315360001
postgres=> \timing
Timing is on.
postgres=> insert into tbl_pos select * from tbl_pos;
INSERT 0 630720002
Time: 8992.130 ms (00:08.992)
postgres=> insert into tbl_pos select * from tbl_pos;
INSERT 0 1261440004
Time: 19517.465 ms (00:19.517)
postgres=> insert into tbl_pos select * from tbl_pos;
INSERT 0 2522880008
Time: 37244.890 ms (00:37.245)
postgres=> insert into tbl_pos select * from tbl_pos;
INSERT 0 5045760016
Time: 73391.309 ms (01:13.391)
4、新增索引
create index idx_tbl_pos_1 on tbl_pos(ghash);
create index idx_tbl_pos_2 on tbl_pos using gist(pos);
create index idx_tbl_pos_3 on tbl_pos (ts);
5、收集统计信息
postgres=> vacuum analyze tbl_pos;
6、伴随分析
postgres=> select count(*) from tbl_pos where ghash >= 'wtmm1k' and ghash < 'wtmm1'||chr(ascii('k')+1);
count
--------
609632
(1 row)
Time: 60.275 ms
postgres=> select count(*) from tbl_pos where ts between '2018-01-01 00:06:25.2' and '2018-01-01 01:06:25.2' and ghash >= 'wtmmmm' and ghash < 'wtmmm'||chr(ascii('m')+1);
count
-------
69984
(1 row)
Time: 70.161 ms
postgres=> explain select count(*) from tbl_pos where ts between '2018-01-01 00:06:25.2' and '2018-01-01 01:06:25.2' and ghash >= 'wtmmmm' and ghash < 'wtmmm'||chr(ascii('m')+1);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=10681940.85..10681940.86 rows=1 width=8)
-> Gather Motion 1024:1 (slice1; segments: 1024) (cost=10681930.57..10681940.83 rows=1 width=8)
-> Aggregate (cost=10681930.57..10681930.58 rows=1 width=8)
-> Bitmap Heap Scan on tbl_pos (cost=10520118.62..10681889.69 rows=16 width=0)
Recheck Cond: ts >= '2018-01-01 00:06:25.2'::timestamp without time zone AND ts <= '2018-01-01 01:06:25.2'::timestamp without time zone AND ghash >= 'wtmmmm'::text AND ghash < 'wtmmmn'::text
-> BitmapAnd (cost=10520118.62..10520118.62 rows=2 width=0)
-> Bitmap Index Scan on idx_tbl_pos_3 (cost=0.00..1751.33 rows=161 width=0)
Index Cond: ts >= '2018-01-01 00:06:25.2'::timestamp without time zone AND ts <= '2018-01-01 01:06:25.2'::timestamp without time zone
-> Bitmap Index Scan on idx_tbl_pos_1 (cost=0.00..10518358.87 rows=97466 width=0)
Index Cond: ghash >= 'wtmmmm'::text AND ghash < 'wtmmmn'::text
Settings: effective_cache_size=8GB; enable_seqscan=off; gp_statistics_use_fkeys=on; optimizer=off; work_mem=64MB
Optimizer status: legacy query optimizer
(12 rows)
在某5天出现在某个范围的人的交集:
select uid from
(select distinct uid from tbl_pos where ts between '2018-01-01 00:06:25.2' and '2018-01-01 01:06:25.2' and ghash >= 'wtmmmm' and ghash < 'wtmmm'||chr(ascii('m')+1)) t1
join
(select distinct uid from tbl_pos where ts between '2018-02-01 00:06:25.2' and '2018-02-01 01:06:25.2' and ghash >= 'wtmmmm' and ghash < 'wtmmm'||chr(ascii('m')+1)) t2
using (uid)
join
(select distinct uid from tbl_pos where ts between '2018-03-01 00:06:25.2' and '2018-03-01 01:06:25.2' and ghash >= 'wtmmmm' and ghash < 'wtmmm'||chr(ascii('m')+1)) t3
using (uid)
join
(select distinct uid from tbl_pos where ts between '2018-04-01 00:06:25.2' and '2018-04-01 01:06:25.2' and ghash >= 'wtmmmm' and ghash < 'wtmmm'||chr(ascii('m')+1)) t4
using (uid)
join
(select distinct uid from tbl_pos where ts between '2018-05-01 00:06:25.2' and '2018-05-01 01:06:25.2' and ghash >= 'wtmmmm' and ghash < 'wtmmm'||chr(ascii('m')+1)) t5
using (uid);
uid
-----
(0 rows)
Time: 207.750 ms
小结
1、PostgreSQL bitmap scan支持将多个索引合并,在本例中体现在时间、空间、(其他)条件。
《PostgreSQL 数据库多列复合索引的字段顺序选择原理》
《PostgreSQL bitmapAnd, bitmapOr, bitmap index scan, bitmap heap scan》
2、除了使用这种暴力求交集的方法,还有别的方法来计算伴随么?相似轨迹分析可能也是一个不错的选择
参考
《阿里云 PostgreSQL 产品生态;案例、开发实践、管理实践、学习资料、学习视频 - 珍藏级》