采用 部分索引、表达式索引 提高搜索效率
背景
在现实场景中,经常有搜索的需求,例如搜索附近的店铺,搜索通常会有一些搜索的附带条件,例如搜索附近的美食类店铺,加油站等。
这里实际上涉及两类搜索需求,一类是距离,一类是属性。
如果将属性枚举掉,那么搜索时可以变成只按距离搜索。建立空间索引即可。
而如果属性无法枚举,那么需要同时搜索空间和属性,可以建立 “空间+属性” 的“复合索引”,或者建立“多索引”,PG内部会使用bitmap自动将多个索引的过滤结果进行合并。
以上的做法都挺好理解,但是在PG里面还有更丰富的玩法。
例如部分索引、表达式索引。下面举例说明具体的场景和玩法。
问题1,搜索离我最近的女性用户
使用部分索引,只对女性用户建立索引。
postgres=# create table user_pos(
id int primary key, -- 主键
pos point, -- 位置
sex char(1) -- 性别
);
CREATE TABLE
postgres=# insert into user_pos select generate_series(1,10000000), point(random()*10000,random()*10000), (random())::int::text;
INSERT 0 10000000
postgres=# select * from user_pos limit 10;
id | pos | sex
----+-------------------------------------+-----
1 | (2447.73048441857,5153.31742353737) | 0
2 | (6969.8447175324,5497.46428150684) | 1
3 | (4143.54857057333,9740.06621632725) | 1
4 | (6990.53473770618,5271.83207217604) | 0
5 | (5196.75491377711,5041.81199707091) | 1
6 | (1515.07906615734,1538.82524929941) | 0
7 | (1805.89218158275,7099.36406929046) | 1
8 | (383.995678275824,1186.2367298454) | 1
9 | (9107.82004706562,9367.1752139926) | 0
10 | (9713.49926199764,9380.74112869799) | 1
(10 rows)
女性=0
男性=1
建立部分索引的方法如下,
postgres=# create index idx_user_pos_0 on user_pos using gist(pos) where sex='0'; -- 只对女性建立索引
CREATE INDEX
postgres=# create index idx_user_pos_1 on user_pos using gist(pos) where sex='1'; -- 只对男性建立索引
CREATE INDEX
表和索引的大小如下
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+----------+-------+----------+--------+-------------
public | user_pos | table | postgres | 575 MB |
(1 row)
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+----------------+-------+----------+----------+--------+-------------
public | idx_user_pos_0 | index | postgres | user_pos | 354 MB |
public | idx_user_pos_1 | index | postgres | user_pos | 354 MB |
public | user_pos_pkey | index | postgres | user_pos | 214 MB |
(3 rows)
搜索里某个位置最近的女性用户,男性用户。
使用到了部分索引,0.3x毫秒
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos where sex='0' order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.41..5.99 rows=100 width=30) (actual time=0.067..0.290 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Buffers: shared hit=106
-> Index Scan using idx_user_pos_0 on public.user_pos (cost=0.41..277775.58 rows=4983333 width=30) (actual time=0.066..0.277 rows=100 loops=1)
Output: id, pos, sex, (pos <-> '(5000,5000)'::point)
Order By: (user_pos.pos <-> '(5000,5000)'::point)
Buffers: shared hit=106
Planning time: 0.116 ms
Execution time: 0.324 ms
(9 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos where sex='1' order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.41..5.97 rows=100 width=30) (actual time=0.118..0.342 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Buffers: shared hit=107
-> Index Scan using idx_user_pos_1 on public.user_pos (cost=0.41..278530.69 rows=5016667 width=30) (actual time=0.117..0.330 rows=100 loops=1)
Output: id, pos, sex, (pos <-> '(5000,5000)'::point)
Order By: (user_pos.pos <-> '(5000,5000)'::point)
Buffers: shared hit=107
Planning time: 0.122 ms
Execution time: 0.377 ms
(9 rows)
当不带部分索引所示条件(性别条件)进行搜索时,无法使用部分索引
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=580722.81..580723.06 rows=100 width=30) (actual time=3995.670..3995.692 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Buffers: shared hit=73530
-> Sort (cost=580722.81..605722.81 rows=10000000 width=30) (actual time=3995.668..3995.679 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Sort Key: ((user_pos.pos <-> '(5000,5000)'::point))
Sort Method: top-N heapsort Memory: 38kB
Buffers: shared hit=73530
-> Seq Scan on public.user_pos (cost=0.00..198530.00 rows=10000000 width=30) (actual time=0.013..1962.988 rows=10000000 loops=1)
Output: id, pos, sex, (pos <-> '(5000,5000)'::point)
Buffers: shared hit=73530
Planning time: 0.091 ms
Execution time: 3995.733 ms
(13 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos where sex in ('0', '1') order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=605722.81..605723.06 rows=100 width=30) (actual time=5044.095..5044.119 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Buffers: shared hit=73530
-> Sort (cost=605722.81..630722.81 rows=10000000 width=30) (actual time=5044.093..5044.106 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Sort Key: ((user_pos.pos <-> '(5000,5000)'::point))
Sort Method: top-N heapsort Memory: 38kB
Buffers: shared hit=73530
-> Seq Scan on public.user_pos (cost=0.00..223530.00 rows=10000000 width=30) (actual time=0.013..2978.314 rows=10000000 loops=1)
Output: id, pos, sex, (pos <-> '(5000,5000)'::point)
Filter: (user_pos.sex = ANY ('{0,1}'::bpchar[]))
Buffers: shared hit=73530
Planning time: 0.170 ms
Execution time: 5044.160 ms
(14 rows)
对于这种情况,只能使用UNION ALL变通
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ((select * from user_pos where sex='0' order by pos <-> point(5000,5000) limit 100) union all (select * from user_pos where sex='1' order by pos <-> point(5000,5000) limit 100)) t order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=24.10..24.35 rows=100 width=30) (actual time=0.784..0.800 rows=100 loops=1)
Output: "*SELECT* 1".id, "*SELECT* 1".pos, "*SELECT* 1".sex, (("*SELECT* 1".pos <-> '(5000,5000)'::point))
Buffers: shared hit=213
-> Sort (cost=24.10..24.60 rows=200 width=30) (actual time=0.783..0.789 rows=100 loops=1)
Output: "*SELECT* 1".id, "*SELECT* 1".pos, "*SELECT* 1".sex, (("*SELECT* 1".pos <-> '(5000,5000)'::point))
Sort Key: (("*SELECT* 1".pos <-> '(5000,5000)'::point))
Sort Method: quicksort Memory: 40kB
Buffers: shared hit=213
-> Result (cost=0.41..16.46 rows=200 width=30) (actual time=0.073..0.693 rows=200 loops=1)
Output: "*SELECT* 1".id, "*SELECT* 1".pos, "*SELECT* 1".sex, ("*SELECT* 1".pos <-> '(5000,5000)'::point)
Buffers: shared hit=213
-> Append (cost=0.41..13.96 rows=200 width=22) (actual time=0.072..0.650 rows=200 loops=1)
Buffers: shared hit=213
-> Subquery Scan on "*SELECT* 1" (cost=0.41..6.99 rows=100 width=22) (actual time=0.071..0.302 rows=100 loops=1)
Output: "*SELECT* 1".id, "*SELECT* 1".pos, "*SELECT* 1".sex
Buffers: shared hit=106
-> Limit (cost=0.41..5.99 rows=100 width=30) (actual time=0.071..0.289 rows=100 loops=1)
Output: user_pos.id, user_pos.pos, user_pos.sex, ((user_pos.pos <-> '(5000,5000)'::point))
Buffers: shared hit=106
-> Index Scan using idx_user_pos_0 on public.user_pos (cost=0.41..277775.58 rows=4983333 width=30) (actual time=0.070..0.278 rows=100 loops=1)
Output: user_pos.id, user_pos.pos, user_pos.sex, (user_pos.pos <-> '(5000,5000)'::point)
Order By: (user_pos.pos <-> '(5000,5000)'::point)
Buffers: shared hit=106
-> Subquery Scan on "*SELECT* 2" (cost=0.41..6.97 rows=100 width=22) (actual time=0.110..0.327 rows=100 loops=1)
Output: "*SELECT* 2".id, "*SELECT* 2".pos, "*SELECT* 2".sex
Buffers: shared hit=107
-> Limit (cost=0.41..5.97 rows=100 width=30) (actual time=0.110..0.314 rows=100 loops=1)
Output: user_pos_1.id, user_pos_1.pos, user_pos_1.sex, ((user_pos_1.pos <-> '(5000,5000)'::point))
Buffers: shared hit=107
-> Index Scan using idx_user_pos_1 on public.user_pos user_pos_1 (cost=0.41..278530.69 rows=5016667 width=30) (actual time=0.109..0.303 rows=100 loops=1)
Output: user_pos_1.id, user_pos_1.pos, user_pos_1.sex, (user_pos_1.pos <-> '(5000,5000)'::point)
Order By: (user_pos_1.pos <-> '(5000,5000)'::point)
Buffers: shared hit=107
Planning time: 0.235 ms
Execution time: 0.861 ms
(35 rows)
将前面的部分索引,换成全索引
postgres=# drop index idx_user_pos_0;
DROP INDEX
postgres=# drop index idx_user_pos_1;
DROP INDEX
postgres=# create index idx_user_pos_0 on user_pos using gist(pos);
CREATE INDEX
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+----------------+-------+----------+----------+--------+-------------
public | idx_user_pos_0 | index | postgres | user_pos | 711 MB |
public | user_pos_pkey | index | postgres | user_pos | 214 MB |
(2 rows)
使用同样的查询,相比部分索引,性能下降一半,因为索引数据包含了所有性别,需要通过FILTER来处理。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos where sex='0' order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..9.63 rows=100 width=30) (actual time=0.163..0.688 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Buffers: shared hit=205
-> Index Scan using idx_user_pos_0 on public.user_pos (cost=0.42..459271.75 rows=4983333 width=30) (actual time=0.161..0.674 rows=100 loops=1)
Output: id, pos, sex, (pos <-> '(5000,5000)'::point)
Order By: (user_pos.pos <-> '(5000,5000)'::point)
Filter: (user_pos.sex = '0'::bpchar)
Rows Removed by Filter: 94
Buffers: shared hit=205
Planning time: 0.142 ms
Execution time: 0.736 ms
(11 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos where sex='1' order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..9.57 rows=100 width=30) (actual time=0.110..0.578 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Buffers: shared hit=218
-> Index Scan using idx_user_pos_0 on public.user_pos (cost=0.42..459355.08 rows=5016667 width=30) (actual time=0.109..0.566 rows=100 loops=1)
Output: id, pos, sex, (pos <-> '(5000,5000)'::point)
Order By: (user_pos.pos <-> '(5000,5000)'::point)
Filter: (user_pos.sex = '1'::bpchar)
Rows Removed by Filter: 107
Buffers: shared hit=218
Planning time: 0.105 ms
Execution time: 0.614 ms
(11 rows)
当不输入性别条件时,依旧可以使用全索引,性能0.3x毫秒,意料之中
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..4.89 rows=100 width=30) (actual time=0.095..0.318 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Buffers: shared hit=107
-> Index Scan using idx_user_pos_0 on public.user_pos (cost=0.42..446813.42 rows=10000000 width=30) (actual time=0.093..0.307 rows=100 loops=1)
Output: id, pos, sex, (pos <-> '(5000,5000)'::point)
Order By: (user_pos.pos <-> '(5000,5000)'::point)
Buffers: shared hit=107
Planning time: 0.088 ms
Execution time: 0.353 ms
(9 rows)
将索引换成 空间+属性 复合全索引
postgres=# drop index idx_user_pos_0 ;
DROP INDEX
postgres=# create extension btree_gist;
CREATE EXTENSION
postgres=# create index idx_user_pos_0 on user_pos using gist(sex,pos);
CREATE INDEX
postgres=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+----------------+-------+----------+----------+--------+-------------
public | idx_user_pos_0 | index | postgres | user_pos | 843 MB |
public | user_pos_pkey | index | postgres | user_pos | 214 MB |
(2 rows)
使用同样的查询,相比部分索引,性能下降一些。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos where sex='0' order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..6.50 rows=100 width=30) (actual time=0.174..0.434 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Buffers: shared hit=106
-> Index Scan using idx_user_pos_0 on public.user_pos (cost=0.42..303079.91 rows=4983333 width=30) (actual time=0.173..0.422 rows=100 loops=1)
Output: id, pos, sex, (pos <-> '(5000,5000)'::point)
Index Cond: (user_pos.sex = '0'::bpchar)
Order By: (user_pos.pos <-> '(5000,5000)'::point)
Buffers: shared hit=106
Planning time: 0.194 ms
Execution time: 0.471 ms
(10 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos where sex='1' order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..6.48 rows=100 width=30) (actual time=0.205..0.457 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Buffers: shared hit=107
-> Index Scan using idx_user_pos_0 on public.user_pos (cost=0.42..304368.42 rows=5016667 width=30) (actual time=0.203..0.445 rows=100 loops=1)
Output: id, pos, sex, (pos <-> '(5000,5000)'::point)
Index Cond: (user_pos.sex = '1'::bpchar)
Order By: (user_pos.pos <-> '(5000,5000)'::point)
Buffers: shared hit=107
Planning time: 0.105 ms
Execution time: 0.494 ms
(10 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos order by pos <-> point(5000,5000) limit 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..5.14 rows=100 width=30) (actual time=0.142..0.363 rows=100 loops=1)
Output: id, pos, sex, ((pos <-> '(5000,5000)'::point))
Buffers: shared hit=109
-> Index Scan using idx_user_pos_0 on public.user_pos (cost=0.42..472151.42 rows=10000000 width=30) (actual time=0.141..0.350 rows=100 loops=1)
Output: id, pos, sex, (pos <-> '(5000,5000)'::point)
Order By: (user_pos.pos <-> '(5000,5000)'::point)
Buffers: shared hit=109
Planning time: 0.087 ms
Execution time: 0.399 ms
(9 rows)
问题2,搜索激活的用户,附加其他搜索条件。
假设业务一定不会对未激活的用户进行检索。那么激活就是必选条件。
可以使用 激活 作为部分索引的条件,避免对不必要的数据进行索引。
create table test(id int, actived boolean, colx int);
create index idx_test_1 on test(id) where actived;
insert into test select generate_series(1,1000000), random()::int::boolean, 1 ;
postgres=# explain (analyze,verbose,timing,costs,buffers) select id from test where actived limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.67 rows=10 width=4) (actual time=0.033..0.036 rows=10 loops=1)
Output: id
Buffers: shared hit=4
-> Index Only Scan using idx_test_1 on public.test (cost=0.42..12341.42 rows=503700 width=4) (actual time=0.032..0.034 rows=10 loops=1)
Output: id
Heap Fetches: 10
Buffers: shared hit=4
Planning time: 0.075 ms
Execution time: 0.054 ms
(9 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test where actived limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.67 rows=10 width=9) (actual time=0.027..0.033 rows=10 loops=1)
Output: id, actived, colx
Buffers: shared hit=4
-> Index Scan using idx_test_1 on public.test (cost=0.42..12341.42 rows=503700 width=9) (actual time=0.025..0.029 rows=10 loops=1)
Output: id, actived, colx
Buffers: shared hit=4
Planning time: 0.101 ms
Execution time: 0.058 ms
(8 rows)
问题3,根据纠偏坐标搜索近距离用户
例如在某些时候,可能由于设备的问题,或者各国地理位置编码的问题使得经纬度需要纠偏,因此查询时都是通过纠偏后的数据进行查询。
这个需求,需要用到表达式索引,因为纠偏是需要计算的,我们需要将纠偏这个计算作为索引表达式。
纠偏函数例子
postgres=# create or replace function fix_point(point) returns point as $$
select $1 + point(0.9, 12);
$$ language sql strict immutable;
CREATE FUNCTION
postgres=# select fix_point(point(1,1));
fix_point
-----------
(1.9,13)
(1 row)
测试数据
create table test(id int, pos point, info text);
insert into test select generate_series(1,100000), point(random()*10000, random()*10000), 'test';
纠偏表达式索引
create index idx_test_1 on test using gist( fix_point(pos) );
纠偏查询
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from test order by fix_point(pos) <-> point (1,10000) limit 10;;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..0.75 rows=10 width=33) (actual time=0.062..0.078 rows=10 loops=1)
Output: id, pos, info, (((pos + '(0.9,12)'::point) <-> '(1,10000)'::point))
Buffers: shared hit=13
-> Index Scan using idx_test_1 on public.test (cost=0.28..4717.78 rows=100000 width=33) (actual time=0.061..0.076 rows=10 loops=1)
Output: id, pos, info, ((pos + '(0.9,12)'::point) <-> '(1,10000)'::point)
Order By: ((test.pos + '(0.9,12)'::point) <-> '(1,10000)'::point)
Buffers: shared hit=13
Planning time: 0.262 ms
Execution time: 0.113 ms
(9 rows)
问题4,条件表达式索引 - 全量部分索引
通常我们在建立索引时,需要制定在哪个列、哪些列、哪个表达式上面。
还有一种方法,实际上也是表达式索引,只是表达式看起来比较奇特,例如建立女性索引。
postgres=# create table user_pos(
id int primary key, -- 主键
pos point, -- 位置
sex char(1) -- 性别
);
CREATE TABLE
postgres=# insert into user_pos select generate_series(1,10000000), point(random()*10000,random()*10000), (random())::int::text;
INSERT 0 10000000
条件表达式索引
create index idx1 on user_pos ( (sex='0') );
这个索引的内容是: sex=’0’的结果 + 行号
测试
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos where sex='0';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.user_pos (cost=46663.43..182485.09 rows=4983333 width=22) (actual time=212.844..1420.697 rows=5000542 loops=1)
Output: id, pos, sex
Recheck Cond: (user_pos.sex = '0'::bpchar)
Rows Removed by Index Recheck: 2244498
Heap Blocks: exact=40505 lossy=33025
Buffers: shared hit=87195
-> Bitmap Index Scan on idx_user_pos_0 (cost=0.00..45417.60 rows=4983333 width=0) (actual time=204.879..204.879 rows=5000542 loops=1)
Buffers: shared hit=13665
Planning time: 0.395 ms
Execution time: 1688.563 ms
(10 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from user_pos where sex='0' limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..0.90 rows=10 width=22) (actual time=0.058..0.063 rows=10 loops=1)
Output: id, pos, sex
Buffers: shared hit=1 read=3
-> Index Scan using idx1 on public.user_pos (cost=0.43..230861.93 rows=4983333 width=22) (actual time=0.057..0.060 rows=10 loops=1)
Output: id, pos, sex
Index Cond: ((user_pos.sex = '0'::bpchar) = true)
Filter: (user_pos.sex = '0'::bpchar)
Buffers: shared hit=1 read=3
Planning time: 0.146 ms
Execution time: 0.085 ms
(10 rows)