PostgreSQL 模糊查询+大量重复值匹配 实践 - 分区索引 = any (array())
背景
在一些场景中,可能需要对大量的数据TBL_A进行A字段模糊查询,同时将匹配记录的另一个字段B,作为条件输入项,匹配另一张表TBL_B,输出另一张表TBL_B的匹配记录。
例子
小表,表示每个ID的详细信息
create table a (
id int primary key,
xx text
);
insert into a select id,'test'||id from generate_series(1,100000) t(id);
大表,表示FEED日志,一个AID可能产生很多条记录
create table b (
aid int,
info text
);
写入1000万数据
insert into b select random()*9, md5(random()::text) from generate_series(1,10000000);
create extension pg_trgm;
create index idx_b_2 on b using gin (info gin_trgm_ops);
如果需要用户需要输入AID字段的条件,则建议用这样的复合索引
create extension btree_gin;
create index idx_b_1 on b using gin (aid, info gin_trgm_ops);
查询info包含某个字符串(前后模糊查询)的数据,找出它们的AID对应的详细信息。
方法1、
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id = any (array(
select aid from b where info ~~ '%abc%'
));
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Index Scan using a_pkey on public.a (cost=22849.95..22865.95 rows=10 width=13) (actual time=72.259..72.267 rows=9 loops=1)
Output: a.id, a.xx
Index Cond: (a.id = ANY ($0))
Buffers: shared hit=20209
InitPlan 1 (returns $0)
-> Bitmap Heap Scan on public.b (cost=796.03..22849.65 rows=101010 width=4) (actual time=15.562..61.768 rows=73204 loops=1)
Output: b.aid
Recheck Cond: (b.info ~~ '%abc%'::text)
Heap Blocks: exact=20180
Buffers: shared hit=20188
-> Bitmap Index Scan on idx_b_1 (cost=0.00..770.78 rows=101010 width=0) (actual time=12.658..12.658 rows=73204 loops=1)
Index Cond: (b.info ~~ '%abc%'::text)
Buffers: shared hit=8
Planning Time: 0.157 ms
Execution Time: 72.315 ms
(15 rows)
方法2、
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where exists (select 1 from b where info ~~ '%abc%' and a.id=b.aid);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=23102.75..23103.04 rows=10 width=13) (actual time=74.316..74.325 rows=9 loops=1)
Output: a.id, a.xx
Inner Unique: true
Merge Cond: (a.id = b.aid)
Buffers: shared hit=20191
-> Index Scan using a_pkey on public.a (cost=0.29..1719.49 rows=100000 width=13) (actual time=0.018..0.021 rows=10 loops=1)
Output: a.id, a.xx
Buffers: shared hit=3
-> Sort (cost=23102.44..23102.47 rows=10 width=4) (actual time=74.291..74.292 rows=10 loops=1)
Output: b.aid
Sort Key: b.aid
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=20188
-> HashAggregate (cost=23102.18..23102.28 rows=10 width=4) (actual time=74.279..74.281 rows=10 loops=1)
Output: b.aid
Group Key: b.aid
Buffers: shared hit=20188
-> Bitmap Heap Scan on public.b (cost=796.03..22849.65 rows=101010 width=4) (actual time=15.935..61.383 rows=73204 loops=1)
Output: b.aid
Recheck Cond: (b.info ~~ '%abc%'::text)
Heap Blocks: exact=20180
Buffers: shared hit=20188
-> Bitmap Index Scan on idx_b_1 (cost=0.00..770.78 rows=101010 width=0) (actual time=13.027..13.027 rows=73204 loops=1)
Index Cond: (b.info ~~ '%abc%'::text)
Buffers: shared hit=8
Planning Time: 0.344 ms
Execution Time: 74.380 ms
(27 rows)
结果
postgres=# select * from a where id = any (array(
select aid from b where info ~~ '%abc%'
));
id | xx
----+-------
1 | test1
2 | test2
3 | test3
4 | test4
5 | test5
6 | test6
7 | test7
8 | test8
9 | test9
(9 rows)
Time: 57.789 ms
其他方法
1、使用AID+模糊查询,两个条件,实际效果并不好,因为GIN复合索引实际上是内部BITMAP,并不是真正意义上的多颗树结构(分区结构)。
do language plpgsql $$
declare
v_id int;
v_xx text;
begin
for v_id,v_xx in select id,xx from a loop
perform 1 from b where aid=v_id and info ~~ '%abc%' limit 1;
if found then
raise notice '%, %', v_id, v_xx;
end if;
end loop;
end;
$$;
2、如果aid集合较小,并且可以穷举,也可以使用分区索引(多partial index)。
分区索引模拟,如下
nohup psql -c "create index idx_b_3 on b using gin (info gin_trgm_ops) where aid=1;" >/dev/null 2>&1 &
nohup psql -c "create index idx_b_4 on b using gin (info gin_trgm_ops) where aid=2;" >/dev/null 2>&1 &
nohup psql -c "create index idx_b_5 on b using gin (info gin_trgm_ops) where aid=3;" >/dev/null 2>&1 &
nohup psql -c "create index idx_b_6 on b using gin (info gin_trgm_ops) where aid=4;" >/dev/null 2>&1 &
nohup psql -c "create index idx_b_7 on b using gin (info gin_trgm_ops) where aid=5;" >/dev/null 2>&1 &
nohup psql -c "create index idx_b_8 on b using gin (info gin_trgm_ops) where aid=6;" >/dev/null 2>&1 &
nohup psql -c "create index idx_b_9 on b using gin (info gin_trgm_ops) where aid=7;" >/dev/null 2>&1 &
nohup psql -c "create index idx_b_10 on b using gin (info gin_trgm_ops) where aid=8;" >/dev/null 2>&1 &
nohup psql -c "create index idx_b_11 on b using gin (info gin_trgm_ops) where aid=9;" >/dev/null 2>&1 &
nohup psql -c "create index idx_b_12 on b using gin (info gin_trgm_ops) where aid not in (1,2,3,4,5,6,7,8,9);" >/dev/null 2>&1 &
查询语句模拟如下
select * from a where id = any (array(
select * from (select aid from b where aid=1 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=2 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=3 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=4 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=5 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=6 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=7 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=8 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=9 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid not in (1,2,3,4,5,6,7,8,9) and info ~~ '%abc%' limit 1) t
));
id | xx
----+-------
1 | test1
2 | test2
3 | test3
4 | test4
5 | test5
6 | test6
7 | test7
8 | test8
9 | test9
(9 rows)
Time: 4.970 ms
执行计划
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from a where id = any (array(
select * from (select aid from b where aid=1 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=2 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=3 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=4 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=5 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=6 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=7 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=8 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid=9 and info ~~ '%abc%' limit 1) t
union all
select * from (select aid from b where aid not in (1,2,3,4,5,6,7,8,9) and info ~~ '%abc%' limit 1) t
));
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Index Scan using a_pkey on public.a (cost=199.57..215.57 rows=10 width=13) (actual time=2.833..2.842 rows=9 loops=1)
Output: a.id, a.xx
Index Cond: (a.id = ANY ($0))
Buffers: shared hit=55
InitPlan 1 (returns $0)
-> Append (cost=0.00..199.27 rows=10 width=4) (actual time=0.249..2.799 rows=10 loops=1)
Buffers: shared hit=34
-> Limit (cost=0.00..15.31 rows=1 width=4) (actual time=0.249..0.249 rows=1 loops=1)
Output: b.aid
Buffers: shared hit=3
-> Seq Scan on public.b (cost=0.00..170791.00 rows=11158 width=4) (actual time=0.247..0.247 rows=1 loops=1)
Output: b.aid
Filter: ((b.info ~~ '%abc%'::text) AND (b.aid = 1))
Rows Removed by Filter: 1131
Buffers: shared hit=3
-> Limit (cost=0.00..14.75 rows=1 width=4) (actual time=0.243..0.244 rows=1 loops=1)
Output: b_1.aid
Buffers: shared hit=3
-> Seq Scan on public.b b_1 (cost=0.00..170791.00 rows=11576 width=4) (actual time=0.243..0.243 rows=1 loops=1)
Output: b_1.aid
Filter: ((b_1.info ~~ '%abc%'::text) AND (b_1.aid = 2))
Rows Removed by Filter: 1208
Buffers: shared hit=3
-> Limit (cost=0.00..15.39 rows=1 width=4) (actual time=0.023..0.023 rows=1 loops=1)
Output: b_2.aid
Buffers: shared hit=1
-> Seq Scan on public.b b_2 (cost=0.00..170791.00 rows=11098 width=4) (actual time=0.023..0.023 rows=1 loops=1)
Output: b_2.aid
Filter: ((b_2.info ~~ '%abc%'::text) AND (b_2.aid = 3))
Rows Removed by Filter: 70
Buffers: shared hit=1
-> Limit (cost=0.00..14.96 rows=1 width=4) (actual time=0.738..0.738 rows=1 loops=1)
Output: b_3.aid
Buffers: shared hit=8
-> Seq Scan on public.b b_3 (cost=0.00..170791.00 rows=11414 width=4) (actual time=0.738..0.738 rows=1 loops=1)
Output: b_3.aid
Filter: ((b_3.info ~~ '%abc%'::text) AND (b_3.aid = 4))
Rows Removed by Filter: 3563
Buffers: shared hit=8
-> Limit (cost=0.00..15.18 rows=1 width=4) (actual time=0.068..0.068 rows=1 loops=1)
Output: b_4.aid
Buffers: shared hit=1
-> Seq Scan on public.b b_4 (cost=0.00..170791.00 rows=11253 width=4) (actual time=0.068..0.068 rows=1 loops=1)
Output: b_4.aid
Filter: ((b_4.info ~~ '%abc%'::text) AND (b_4.aid = 5))
Rows Removed by Filter: 311
Buffers: shared hit=1
-> Limit (cost=0.00..15.09 rows=1 width=4) (actual time=0.178..0.178 rows=1 loops=1)
Output: b_5.aid
Buffers: shared hit=2
-> Seq Scan on public.b b_5 (cost=0.00..170791.00 rows=11316 width=4) (actual time=0.178..0.178 rows=1 loops=1)
Output: b_5.aid
Filter: ((b_5.info ~~ '%abc%'::text) AND (b_5.aid = 6))
Rows Removed by Filter: 894
Buffers: shared hit=2
-> Limit (cost=0.00..15.50 rows=1 width=4) (actual time=0.323..0.323 rows=1 loops=1)
Output: b_6.aid
Buffers: shared hit=4
-> Seq Scan on public.b b_6 (cost=0.00..170791.00 rows=11020 width=4) (actual time=0.322..0.322 rows=1 loops=1)
Output: b_6.aid
Filter: ((b_6.info ~~ '%abc%'::text) AND (b_6.aid = 7))
Rows Removed by Filter: 1599
Buffers: shared hit=4
-> Limit (cost=0.00..15.25 rows=1 width=4) (actual time=0.449..0.449 rows=1 loops=1)
Output: b_7.aid
Buffers: shared hit=5
-> Seq Scan on public.b b_7 (cost=0.00..170791.00 rows=11199 width=4) (actual time=0.448..0.448 rows=1 loops=1)
Output: b_7.aid
Filter: ((b_7.info ~~ '%abc%'::text) AND (b_7.aid = 8))
Rows Removed by Filter: 2231
Buffers: shared hit=5
-> Limit (cost=0.00..32.33 rows=1 width=4) (actual time=0.130..0.131 rows=1 loops=1)
Output: b_8.aid
Buffers: shared hit=2
-> Seq Scan on public.b b_8 (cost=0.00..170791.00 rows=5283 width=4) (actual time=0.130..0.130 rows=1 loops=1)
Output: b_8.aid
Filter: ((b_8.info ~~ '%abc%'::text) AND (b_8.aid = 9))
Rows Removed by Filter: 623
Buffers: shared hit=2
-> Limit (cost=0.00..45.36 rows=1 width=4) (actual time=0.392..0.392 rows=1 loops=1)
Output: b_9.aid
Buffers: shared hit=5
-> Seq Scan on public.b b_9 (cost=0.00..258291.00 rows=5694 width=4) (actual time=0.392..0.392 rows=1 loops=1)
Output: b_9.aid
Filter: ((b_9.info ~~ '%abc%'::text) AND (b_9.aid <> ALL ('{1,2,3,4,5,6,7,8,9}'::integer[])))
Rows Removed by Filter: 1931
Buffers: shared hit=5
Planning Time: 1.411 ms
Execution Time: 2.963 ms
(89 rows)
可见,分区索引,效果是比较好的。期待PG尽快支持。
另外,对于本例,用户需要IN很多值时,建议使用较大的work_mem,同时使用any(array())的写法。
work_mem='4MB'
参考
《HTAP数据库 PostgreSQL 场景与性能测试之 25 - (OLTP) IN , EXISTS 查询》
《HTAP数据库 PostgreSQL 场景与性能测试之 26 - (OLTP) NOT IN、NOT EXISTS 查询》