A look at the PostgreSQL optimizer - how does PostgreSQL handle duplicate values inside IN?
Background
Suppose a business app collects user location data, so the database holds behavior logs recording the places each user has visited.
Now we want to find out which places a certain group of users went to today.
What happens then?
A single person may report many records at one location, and different people may visit the same place, so the same place name can appear in many duplicate records.
A query like the following therefore ends up with a large number of duplicate values inside the IN list.
select * from address_table where address in (select visited_place from user_tracking_table where <some group of users>);
There are many ways to optimize this; I have written a few articles you can refer to:
《distinct xx和count(distinct xx)的变态递归优化方法 - 索引收敛(skip scan)扫描》
《用PostgreSQL找回618秒逝去的青春 - 递归收敛优化》
《时序数据合并场景加速分析和实现 - 复合索引,窗口分组查询加速,变态递归加速》
This article is not about those optimization techniques; instead, it looks at what the PostgreSQL optimizer itself does when it runs into this situation.
Scenario simulation
Table A
The address table; ID is the unique identifier of a place.
Table B
The user behavior table; ID is the user ID and LID is the ID of a place the user visited.
Create the tables
postgres=# create table a(id int primary key, info text);
CREATE TABLE
postgres=# create table b(id int, lid int, info text);
CREATE TABLE
Insert test data
10 million addresses:
postgres=# insert into a select generate_series(1,10000000), 'test';
10 million user behavior rows, where most LID values are duplicates (only address IDs in the 0-1000 range are generated, i.e. about 1001 distinct values):
postgres=# insert into b select 1000000*random(), 1000*random(), 'test' from generate_series(1,10000000);
INSERT 0 10000000
Test: which places have these users been to?
(This merely simulates the scenario; there are many ways to optimize it, but today we just watch what PG's own optimizer does.)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id in (select lid from b);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=179057.62..181506.16 rows=1001 width=9) (actual time=3816.572..3820.227 rows=1000 loops=1)
Output: a.id, a.info
Buffers: shared hit=58062
-> HashAggregate (cost=179057.19..179067.20 rows=1001 width=4) (actual time=3816.541..3816.763 rows=1001 loops=1)
Output: b.lid
Group Key: b.lid
Buffers: shared hit=54055
-> Seq Scan on public.b (cost=0.00..154056.75 rows=10000175 width=4) (actual time=0.019..1227.238 rows=10000000 loops=1)
Output: b.id, b.lid, b.info
Buffers: shared hit=54055
-> Index Scan using a_pkey on public.a (cost=0.43..2.43 rows=1 width=9) (actual time=0.003..0.003 rows=1 loops=1001)
Output: a.id, a.info
Index Cond: (a.id = b.lid)
Buffers: shared hit=4007
Planning time: 0.200 ms
Execution time: 3820.401 ms
(16 rows)
Analysis
PostgreSQL chose to hash-aggregate the LID column of table B first (the aggregate, including the sequential scan feeding it, took about 3.8 seconds), which turns the IN into a match against a small set of values. Because table A's ID column is indexed, the matching itself was cheap: 1001 index lookups at about 0.003 ms each, roughly 3 ms in total.
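Where does the rows=1001 estimate on the HashAggregate come from? The planner's n_distinct statistic for b.lid, which you can inspect via the standard pg_stats view:

-- refresh the statistics, then check the estimated number of distinct lid values
analyze b;
select n_distinct
from pg_stats
where schemaname = 'public' and tablename = 'b' and attname = 'lid';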
Of course, PG's optimization is cost-based (CBO), so if your cost factors are not set well you may get a different plan, such as a merge join; the session-local experiment sketched below shows one way to see an alternative.
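A diagnostic sketch (not a tuning recommendation) using the standard planner setting enable_hashagg; depending on your cost factors you may then see a sort-based or merge-join plan:

begin;
-- disable hash aggregation for this transaction only
set local enable_hashagg = off;
explain select * from a where id in (select lid from b);
rollback;  -- the setting is discarded with the transaction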
Now for a more inventive optimization: since table B contains so many duplicate values, recursive convergence (a recursive skip scan) can be applied.
For details, see
《用PostgreSQL找回618秒逝去的青春 - 递归收敛优化》
Strongly recommended reading.
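A minimal sketch of the recursive convergence idea from that article, assuming an index on b(lid) so that every recursion step is a single index probe (the index name is mine):

-- an index so that min(lid) / next-lid lookups are index probes
create index if not exists idx_b_lid on b (lid);

-- walk the distinct lid values one index probe at a time
with recursive t(lid) as (
  (select min(lid) from b)
  union all
  select (select min(lid) from b where lid > t.lid)
  from t
  where t.lid is not null
)
select a.*
from a
where a.id in (select lid from t where lid is not null);

With roughly 1000 distinct values among 10 million rows, this replaces the full sequential scan of b with about one index descent per distinct lid.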
Other tests
1. How does PG handle duplicate constants inside IN?
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id in (0,0,0,0,0,0,0);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using a_pkey on public.a (cost=0.43..11.17 rows=7 width=9) (actual time=0.021..0.021 rows=0 loops=1)
Output: id, info
Index Cond: (a.id = ANY ('{0,0,0,0,0,0,0}'::integer[]))
Buffers: shared hit=3
Planning time: 0.116 ms
Execution time: 0.057 ms
(6 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id in (0,0,0,0,0,0,0,1);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using a_pkey on public.a (cost=0.43..12.62 rows=8 width=9) (actual time=0.029..0.030 rows=1 loops=1)
Output: id, info
Index Cond: (a.id = ANY ('{0,0,0,0,0,0,0,1}'::integer[]))
Buffers: shared hit=7
Planning time: 0.116 ms
Execution time: 0.067 ms
(6 rows)
These values usually come from the client, so it is advisable for the client to aggregate and deduplicate them before sending them to the database.
In practice, if the client sends only a few values, this hardly matters.
If the client sends a very large number of values and you do not want to deduplicate on the client side, you can also load them into a temporary table; PG can then deduplicate them with a hash aggregate, as sketched below.
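A hypothetical sketch of the temp-table approach (table and column names are mine):

begin;
create temp table tmp_lids (lid int) on commit drop;
-- in practice the client would COPY or batch-insert its raw, duplicate-laden list
insert into tmp_lids values (0),(0),(0),(0),(1);
select a.* from a where a.id in (select lid from tmp_lids);
commit;

Written as a subquery, the IN list lets PG deduplicate tmp_lids server-side with a hash aggregate, just as it did for table b above.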
2. SRF (set-returning function)
When the IN subquery selects a constant from an SRF, PostgreSQL can propagate that constant into the index condition on table A; if the probe on A finds nothing, the SRF side of the semi join is never executed at all.
For example:
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id in (select 0 from generate_series(1,100000000));
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Nested Loop Semi Join (cost=0.44..12.47 rows=1 width=9) (actual time=0.026..0.026 rows=0 loops=1)
Output: a.id, a.info
Buffers: shared hit=3
-> Index Scan using a_pkey on public.a (cost=0.43..2.45 rows=1 width=9) (actual time=0.024..0.024 rows=0 loops=1)
Output: a.id, a.info
Index Cond: (a.id = 0)
Buffers: shared hit=3
-> Function Scan on pg_catalog.generate_series (cost=0.00..10.00 rows=1000 width=0) (never executed)
Output: generate_series.generate_series
Function Call: generate_series(1, 100000000)
Planning time: 0.146 ms
Execution time: 0.073 ms
(12 rows)
Table A contains no row with id = 0 (its IDs start at 1), so the outer index scan returns nothing and the generate_series function scan is never executed (note the "never executed" in the plan).
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id in (select 1 from generate_series(1,1000000));
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Semi Join (cost=0.44..12.47 rows=1 width=9) (actual time=133.626..133.635 rows=1 loops=1)
Output: a.id, a.info
Buffers: shared hit=4
-> Index Scan using a_pkey on public.a (cost=0.43..2.45 rows=1 width=9) (actual time=0.028..0.036 rows=1 loops=1)
Output: a.id, a.info
Index Cond: (a.id = 1)
Buffers: shared hit=4
-> Function Scan on pg_catalog.generate_series (cost=0.00..10.00 rows=1000 width=0) (actual time=133.591..133.591 rows=1 loops=1)
Output: generate_series.generate_series
Function Call: generate_series(1, 1000000)
Planning time: 0.148 ms
Execution time: 184.720 ms
(12 rows)
id = 1, on the other hand, does exist in table A, so the function scan was executed.
But if the 0 branch is wrapped in a UNION ALL, the function scans are executed after all; here the planner uses a hash aggregate to collapse the 11 million generated rows to 2 distinct values before probing the index.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id in (select 0 from generate_series(1,10000000) union all select 1 from generate_series(1,1000000));
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=25.44..519.50 rows=5000088 width=9) (actual time=8042.867..8042.873 rows=1 loops=1)
Output: a.id, a.info
Buffers: shared hit=7, temp read=18801 written=18799
-> HashAggregate (cost=25.00..27.00 rows=200 width=4) (actual time=8042.823..8042.824 rows=2 loops=1)
Output: (0)
Group Key: (0)
Buffers: temp read=18801 written=18799
-> Append (cost=0.00..20.00 rows=2000 width=4) (actual time=1571.580..5694.255 rows=11000000 loops=1)
Buffers: temp read=18801 written=18799
-> Function Scan on pg_catalog.generate_series (cost=0.00..10.00 rows=1000 width=4) (actual time=1571.577..3656.432 rows=10000000 loops=1)
Output: 0
Function Call: generate_series(1, 10000000)
Buffers: temp read=17091 written=17090
-> Function Scan on pg_catalog.generate_series generate_series_1 (cost=0.00..10.00 rows=1000 width=4) (actual time=156.018..365.703 rows=1000000 loops=1)
Output: 1
Function Call: generate_series(1, 1000000)
Buffers: temp read=1710 written=1709
-> Index Scan using a_pkey on public.a (cost=0.43..2.45 rows=1 width=9) (actual time=0.018..0.019 rows=0 loops=2)
Output: a.id, a.info
Index Cond: (a.id = (0))
Buffers: shared hit=7
Planning time: 0.323 ms
Execution time: 8087.089 ms
(23 rows)
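Note the temp read/written buffers in this plan: materializing the generate_series output (and the aggregate) spilled to temp files. As a session-local experiment, raising work_mem may keep it in memory (the 256MB figure is an assumed illustration, not a recommendation):

set work_mem = '256MB';
explain (analyze, buffers)
select * from a
where id in (select 0 from generate_series(1,10000000)
             union all
             select 1 from generate_series(1,1000000));
reset work_mem;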
References
《distinct xx和count(distinct xx)的变态递归优化方法 - 索引收敛(skip scan)扫描》
《用PostgreSQL找回618秒逝去的青春 - 递归收敛优化》
《时序数据合并场景加速分析和实现 - 复合索引,窗口分组查询加速,变态递归加速》