PostgreSQL 模糊查询最佳实践 - (含单字、双字、多字模糊查询方法)
背景
前模糊(有前缀的模糊),后模糊(有后缀的模糊),前后模糊(无前后缀的模糊),正则匹配都属于文本搜索领域常见的需求。
PostgreSQL拥有很强的文本搜索能力,除了支持全文检索,还支持模糊查询、正则查询。内置的pg_trgm插件是一般数据库没有的,可能很多人没有听说过。同时还内置了表达式索引、GIN索引的功能。
不同的模糊查询需求,有不同的优化方法。
对于前模糊和后模糊,PostgreSQL则与其他数据库一样,可以使用btree来加速。后模糊可以使用反转函数的函数索引来加速。
对于前后模糊和正则匹配,一种方法是使用pg_trgm插件,利用GIN索引加速模糊和正则查询(输入3个或3个以上字符的模糊查询效果很好)。另一种方法是自定义GIN表达式索引的方法,适合于定制的模糊查询。
一、前模糊与后模糊的优化
1. 前模糊(有前缀的模糊)优化方法
使用b-tree可以支持前模糊的查询。
1.1 当使用类型默认的index ops class时,仅适合于collate=”C”的查询(当数据库默认的lc_collate<>C时,索引和查询都需要明确指定collate “C”)。
索引、查询条件的collate必须一致才能使用索引。
例子
test=# create table test(id int, info text);
CREATE TABLE
test=# insert into test select generate_series(1,1000000),md5(random()::text);
INSERT 0 1000000
test=# create index idx on test(info collate "C");
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1)
Output: id, info
Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "C")
Buffers: shared hit=18 read=3
Planning time: 0.424 ms
Execution time: 0.124 ms
(7 rows)
1.2 当数据库默认的lc_collate<>C时,还有一种方法让b-tree索引支持模糊查询。使用对应类型的pattern ops,使用pattern ops将使用字符的查询方式而非binary的搜索方式。
文档中有如下解释
https://www.postgresql.org/docs/9.6/static/indexes-opclass.html
The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops
support B-tree indexes on the types text, varchar, and char respectively.
The difference from the default operator classes is that the values are compared strictly
character by character rather than according to the locale-specific collation rules.
This makes these operator classes suitable for use by queries involving pattern
matching expressions (LIKE or POSIX regular expressions) when the database
does not use the standard "C" locale.
例子
test=# drop table test;
DROP TABLE
test=# create table test(id int, info text);
CREATE TABLE
test=# insert into test select generate_series(1,1000000),md5(random()::text);
INSERT 0 1000000
test=# create index idx on test(info text_pattern_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "zh_CN";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.038..0.059 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "zh_CN")
Buffers: shared hit=12 read=3
Planning time: 0.253 ms
Execution time: 0.081 ms
(7 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.027..0.050 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "C")
Buffers: shared hit=15
Planning time: 0.141 ms
Execution time: 0.072 ms
(7 rows)
使用类型对应的pattern ops,索引搜索不仅支持LIKE的写法,还支持规则表达式的写法,如下:
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '^abcd';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.031..0.061 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~ '^abcd'::text)
Buffers: shared hit=15
Planning time: 0.213 ms
Execution time: 0.083 ms
(7 rows)
2. 后模糊(有后缀的模糊)的优化方法
使用反转函数(reverse)索引,可以支持后模糊的查询。
2.1 当使用类型默认的index ops class时,仅适合于collate=”C”的查询(当数据库默认的lc_collate<>C时,索引和查询都需要明确指定collate “C”)。
索引、查询条件的collate必须一致才能使用索引。
例子
test=# create index idx1 on test(reverse(info) collate "C");
CREATE INDEX
test=# select * from test limit 1;
id | info
----+----------------------------------
1 | b3275976cdd437a033d4329775a52514
(1 row)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%' collate "C";
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.061..0.097 rows=18 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) >= '4152'::text) AND (reverse(test.info) < '4153'::text))
Filter: (reverse(test.info) ~~ '4152%'::text COLLATE "C")
Buffers: shared hit=18 read=3
Planning time: 0.128 ms
Execution time: 0.122 ms
(7 rows)
test=# select * from test where reverse(info) like '4152%' collate "C";
id | info
--------+----------------------------------
847904 | abe2ecd90393b5275df8e34a39702514
414702 | 97f66d26545329321164042657d02514
191232 | 7820972c6220c2b01d46c11ebb532514
752742 | 93232ac39c6632e2540df44627c42514
217302 | 39e518893a1a7b1e691619bd1fc42514
1 | b3275976cdd437a033d4329775a52514
615718 | 4948f94c484c13dc6c4fae8a3db52514
308815 | fc2918ceff7c7a4dafd2e04031062514
149521 | 546d963842ea5ca593e622c810262514
811093 | 4b6eca2eb6b665af67b2813e91a62514
209000 | 1dfd0d4e326715c1739f031cca992514
937616 | 8827fd81f5b673fb5afecbe3e11b2514
419553 | bd6e01ce360af16137e8b6abc8ab2514
998324 | 7dff51c19dc5e5d9979163e7d14c2514
771518 | 8a54e30003a48539fff0aedc73ac2514
691566 | f90368348e3b6bf983fcbe10db2d2514
652274 | 8bf4a97b5f122a5540a21fa85ead2514
233437 | 739ed715fc203d47e37e79b5bcbe2514
(18 rows)
2.2 当数据库默认的lc_collate<>C时,还有一种方法让b-tree索引支持模糊查询。使用对应类型的pattern ops,使用pattern ops将使用字符的查询方式而非binary的搜索方式。
使用类型对应的pattern ops,索引搜索不仅支持LIKE的写法,还支持规则表达式的写法。
例子
test=# create index idx1 on test(reverse(info) text_pattern_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.026..0.049 rows=12 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text))
Filter: (reverse(test.info) ~~ '4152%'::text)
Buffers: shared hit=15
Planning time: 0.102 ms
Execution time: 0.072 ms
(7 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) ~ '^4152';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.031..0.063 rows=12 loops=1)
Output: id, info
Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text))
Filter: (reverse(test.info) ~ '^4152'::text)
Buffers: shared hit=15
Planning time: 0.148 ms
Execution time: 0.087 ms
(7 rows)
3. 前、后模糊的合体优化方法
使用pg_trgm索引,可以支持前、后模糊的查询。
注意:
(有前缀的模糊)至少输入1个字符,(有后缀的模糊)至少输入2个字符,才有好的索引过滤效果。
如果要高效支持多字节字符(例如中文),数据库lc_ctype不能为”C”,只有TOKEN分割正确效果才是OK的。(lc_ctype设置正确,才能够正确的逐一分割多字节字符串中的文字: LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?))。
索引、查询条件的collate必须一致才能使用索引。
test=# \l+ test
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
------+----------+----------+------------+------------+-------------------+--------+------------+-------------
test | postgres | UTF8 | zh_CN.utf8 | zh_CN.utf8 | | 245 MB | pg_default |
(1 row)
test=# create extension pg_trgm;
test=# create table test001(c1 text);
CREATE TABLE
生成随机中文字符串的函数
test=# create or replace function gen_hanzi(int) returns text as $$
declare
res text;
begin
if $1 >=1 then
select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
return res;
end if;
return null;
end;
$$ language plpgsql strict;
CREATE FUNCTION
生成随机数据
test=# insert into test001 select gen_hanzi(20) from generate_series(1,100000);
INSERT 0 100000
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
CREATE INDEX
test=# select * from test001 limit 5;
c1
------------------------------------------
埳噪办甾讷昃碇玾陧箖燋邢賀浮媊踮菵暔谉橅
秌橑籛鴎拟倶敤麁鼋醠轇坙騉鏦纗蘛婃坹娴儅
蔎緾鎧爪鵬二悲膼朠麻鸂鋬楨窷違繇糭嘓索籓
馳泅薬鐗愅撞窍浉渗蛁灎厀攚摐瞪拡擜詜隝緼
襳铺煃匶瀌懲荼黹樆惺箧搔羾憯墆鋃硍蔓恧顤
(5 rows)
模糊查询
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.030..0.034 rows=3 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '你%'::text)
Heap Blocks: exact=3
Buffers: shared hit=7
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=3 loops=1)
Index Cond: (test001.c1 ~~ '你%'::text)
Buffers: shared hit=4
Planning time: 0.119 ms
Execution time: 0.063 ms
(10 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%恧顤';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.031..0.034 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%恧顤'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=2
Buffers: shared hit=6
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=2 loops=1)
Index Cond: (test001.c1 ~~ '%恧顤'::text)
Buffers: shared hit=4
Planning time: 0.136 ms
Execution time: 0.062 ms
(11 rows)
二、前后均模糊的优化
使用pg_trgm插件,支持前后模糊的查询。
注意:
如果要让pg_trgm高效支持多字节字符(例如中文),数据库lc_ctype不能为”C”,只有TOKEN分割正确效果才是OK的。(lc_ctype设置正确,才能够正确的逐一分割多字节字符串中的文字: Character classification (What is a letter? Its upper-case equivalent?))。
建议输入3个或3个以上字符,否则效果不佳(后面会分析原因)。
例子
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢賀%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.038..0.038 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%燋邢賀%'::text)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.025..0.025 rows=1 loops=1)
Index Cond: (test001.c1 ~~ '%燋邢賀%'::text)
Buffers: shared hit=4
Planning time: 0.170 ms
Execution time: 0.076 ms
(10 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=147.524..178.232 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%燋邢%'::text)
Rows Removed by Index Recheck: 99999
Heap Blocks: exact=1137
Buffers: shared hit=14429
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=147.377..147.377 rows=100000 loops=1)
Index Cond: (test001.c1 ~~ '%燋邢%'::text)
Buffers: shared hit=13292
Planning time: 0.133 ms
Execution time: 178.265 ms
(11 rows)
三、正则匹配的优化
PostgreSQL 正则匹配的语法为 字符串 ~ 'pattern'
或 字符串 ~* 'pattern'
https://www.postgresql.org/docs/9.6/static/functions-matching.html
例子
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '12[0-9]{3,9}';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=65.08..75.20 rows=10 width=61) (actual time=0.196..0.196 rows=0 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~ '12[0-9]{3,9}'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=1
Buffers: shared hit=50
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..65.08 rows=10 width=0) (actual time=0.183..0.183 rows=1 loops=1)
Index Cond: (test001.c1 ~ '12[0-9]{3,9}'::text)
Buffers: shared hit=49
Planning time: 0.452 ms
Execution time: 0.221 ms
(11 rows)
test01=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '宸朾啣' collate "zh_CN";
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=6.58..19.42 rows=10 width=61) (actual time=0.061..0.061 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~ '宸朾啣'::text COLLATE "zh_CN")
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..6.58 rows=10 width=0) (actual time=0.049..0.049 rows=1 loops=1)
Index Cond: (test001.c1 ~ '宸朾啣'::text COLLATE "zh_CN")
Buffers: shared hit=4
Planning time: 0.238 ms
Execution time: 0.082 ms
(10 rows)
正则匹配索引原理参考contrib/pg_trgm/trgm_regexp.c
pg_trgm模糊查询的原理
首先,pg_trgm将字符串的前端添加2个空格,末尾添加1个空格。
然后,每连续的3个字符为一个TOKEN,拆开。
最后,对TOKEN建立GIN倒排索引。
查看字符串的TOKEN,可以使用如下方法。
test=# select show_trgm('123');
show_trgm
-------------------------
{" 1"," 12",123,"23 "}
(1 row)
pg_trgm前后模糊字符个数要求的原因
使用pg_trgm时,如果要获得最好的效果,最好满足这些条件。
1. 有前缀的模糊查询,例如a%,至少需要提供1个字符。( 搜索的是token=’ a’ )
2. 有后缀的模糊查询,例如%ab,至少需要提供2个字符。( 搜索的是token=’ab ‘ )
3. 前后模糊查询,例如%abcd%,至少需要提供3个字符。( 这个使用数组搜索,搜索的是token(s) 包含 {“ a”,” ab”,abc,bcd,”cd “} )
原因是什么呢?
因为pg_trgm生成的TOKEN是三个字符,只有在以上三个条件下,才能匹配到对应的TOKEN。
test=# select show_trgm('123');
show_trgm
-------------------------
{" 1"," 12",123,"23 "}
(1 row)
四、小于3个输入字符的模糊查询的优化
当需要前后模糊搜索1个或者2个字符时,pg_trgm无法满足需求,但是我们可以使用表达式GIN索引。
使用表达式,将字符串拆成1个单字,两个连续的字符的数组,对数组建立GIN索引即可。
例子
test=# create or replace function split001(text) returns text[] as $$
declare
res text[];
begin
select regexp_split_to_array($1,'') into res;
for i in 1..length($1)-1 loop
res := array_append(res, substring($1,i,2));
end loop;
return res;
end;
$$ language plpgsql strict immutable;
CREATE FUNCTION
test=# create index idx_test001_2 on test001 using gin (split001(c1));
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你好'];
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.041..0.041 rows=0 loops=1)
Output: c1
Recheck Cond: (split001(test001.c1) @> '{你好}'::text[])
Buffers: shared hit=4
-> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.039..0.039 rows=0 loops=1)
Index Cond: (split001(test001.c1) @> '{你好}'::text[])
Buffers: shared hit=4
Planning time: 0.104 ms
Execution time: 0.068 ms
(9 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你'];
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.063..0.183 rows=86 loops=1)
Output: c1
Recheck Cond: (split001(test001.c1) @> '{你}'::text[])
Heap Blocks: exact=80
Buffers: shared hit=84
-> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.048..0.048 rows=86 loops=1)
Index Cond: (split001(test001.c1) @> '{你}'::text[])
Buffers: shared hit=4
Planning time: 0.101 ms
Execution time: 0.217 ms
(10 rows)
test=# select * from test001 where split001(c1) @> array['你'];
c1
------------------------------------------
殐踨洪冨垓丩贤閚偉垢胸鍘崩你萭隡劭芛雫袰
靅慨热脸罆淓寘鰻总襎戍謸枨陪丼倫柆套你仮
......
五、相似查询优化
模糊查询和正则匹配都是找出完全符合条件的记录,还有一种需求是相似查询。
例如postgresql字符串,输入 p0stgresgl 也能根据相似度匹配到。
这里同样用到了pg_trgm插件,如果要支持中文,同样有这样的要求:
如果需要让pg_trgm支持中文相似查询,数据库lc_ctype不能为”C”,只有TOKEN分割正确效果才是OK的。(lc_ctype设置正确,才能够正确的逐一分割多字节字符串中的文字: Character classification (What is a letter? Its upper-case equivalent?))。
建议输入3个或3个以上字符,否则效果不佳(后面会分析原因)。
例子
test=# create index idx_test001_3 on test001 using gist (c1 gist_trgm_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) SELECT t, c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡' AS dist
FROM test001 t
ORDER BY dist LIMIT 5;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..0.52 rows=5 width=89) (actual time=37.462..37.639 rows=5 loops=1)
Output: t.*, ((c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text))
Buffers: shared hit=1631
-> Index Scan using idx_test001_3 on public.test001 t (cost=0.28..4763.28 rows=100000 width=89) (actual time=37.461..37.636 rows=5 loops=1)
Output: t.*, (c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text)
Order By: (t.c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text)
Buffers: shared hit=1631
Planning time: 0.089 ms
Execution time: 37.668 ms
(9 rows)
test=# SELECT t, c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡' AS dist
FROM test001 t
ORDER BY dist LIMIT 5;
t | dist
--------------------------------------------+----------
(癷磛鹚蠌鳃蠲你鶡埀婎鳊苿奶垨惸溴蔻筴熝憡) | 0.307692
(坆桻悁斾耾瑚豌腏炁悿隖轲盃挜稐睟礓蜮铅湆) | 0.976744
(癷鉜餯祂鼃恫蝅瓟顡廕梍蛸歡僷贊敔欓侑韌鐹) | 0.976744
(癷嚯鳬戚蹪熼胘檙佌欔韜挹樷覄惶蹝顼鑜鞖媗) | 0.976744
(癷饎瞲餿堒歃峽盾豼擔禞嵪豦咢脉馄竨济隘缄) | 0.976744
(5 rows)
六、小结
1. 如果只有前模糊查询需求(字符串 like ‘xx%’),使用collate “C”的b-tree索引;当collate不为”C”时,可以使用类型对应的pattern ops(例如text_pattern_ops)建立b-tree索引。
2. 如果只有后模糊的查询需求(字符串 like ‘%abc’ 等价于 reverse(字符串) like ‘cba%’),使用collate “C”的reverse()表达式的b-tree索引;当collate不为”C”时,可以使用类型对应的pattern ops(例如text_pattern_ops)建立b-tree索引。
3. 如果有前后模糊查询需求,并且包含中文,请使用lc_ctype <> “C”的数据库,同时使用pg_trgm插件的gin索引。(只有TOKEN分割正确效果才是OK的。(lc_ctype设置正确,才能够正确的逐一分割多字节字符串中的文字: Character classification (What is a letter? Its upper-case equivalent?))。)
4. 如果有前后模糊查询需求,并且不包含中文,请使用pg_trgm插件的gin索引。
5. 如果有正则表达式查询需求,请使用pg_trgm插件的gin索引。
6. 如果有输入条件少于3个字符的模糊查询需求,可以使用GIN表达式索引,通过数组包含的方式进行搜索,性能一样非常好。
七、性能
1亿条记录,每条记录15个随机中文。测试前后模糊查询性能。
1. 生成测试数据
vi test.sql
insert into test001 select gen_hanzi(15) from generate_series(1,2500000);
pgbench -n -r -P 1 -f ./test.sql -c 40 -j 40 -t 1 test
test=# select count(*) from test001;
count
-----------
100000000
(1 row)
test=# select * from test001 limit 10;
c1
--------------------------------
釾笉皜鰈确艄騚馺腃彊釲忰采汦擇
槮搮圮墔婂蹾飘孡鶒镇赀聵線麯櫕
孨鄈韞萅赫炧暤蟠檼駧餪崉娲譌筯
烸喖醝稦怩鷟棾奜妛曫仾飛饡绘韦
撑豁襉峊炠眏罱襄彊鰮莆壏妒辷阛
蜁愊鶱磹贰帵眲嚉榑苍潵檐簄椰魨
瑄翁蠃巨躋壾蛸湗鑂顂櫟砣八癱栵
馇巍笿鞒装棊嘢恓煓熴锠鋈蹃煿屓
訆韄踔牤嘇糺絢軿鵑燿螛梋鰢謇郼
撲蓨伤釱糕觩嬖蓷鰼繩圆醷熌靉掑
(10 rows)
2. 创建索引
test=# set maintenance_work_mem ='32GB';
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
表和索引大小
test=# \di+
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+---------------+-------+----------+---------+-------+-------------
public | idx_test001_1 | index | postgres | test001 | 12 GB |
(1 row)
test=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+----------+---------+-------------
public | test001 | table | postgres | 7303 MB |
(1 row)
3. 模糊查询性能测试
3.1 前模糊
响应时间:9毫秒
返回4701行
select * from test001 where c1 like '你%';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=1.546..8.868 rows=4701 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '你%'::text)
Rows Removed by Index Recheck: 85
Heap Blocks: exact=4776
Buffers: shared hit=4784
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.879..0.879 rows=4786 loops=1)
Index Cond: (test001.c1 ~~ '你%'::text)
Buffers: shared hit=8
Planning time: 0.099 ms
Execution time: 9.166 ms
(11 rows)
3.2 后模糊
响应时间:0.25毫秒
返回2行
select * from test001 where c1 like '%靉掑';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%靉掑';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.049..0.223 rows=2 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%靉掑'::text)
Rows Removed by Index Recheck: 87
Heap Blocks: exact=89
Buffers: shared hit=94
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.031..0.031 rows=89 loops=1)
Index Cond: (test001.c1 ~~ '%靉掑'::text)
Buffers: shared hit=5
Planning time: 0.113 ms
Execution time: 0.249 ms
(11 rows)
3.3 前后模糊
响应时间:0.2毫秒
返回1行
select * from test001 where c1 like '%螛梋鰢%';
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%螛梋鰢%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.044..0.175 rows=1 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%螛梋鰢%'::text)
Rows Removed by Index Recheck: 81
Heap Blocks: exact=82
Buffers: shared hit=87
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.027..0.027 rows=82 loops=1)
Index Cond: (test001.c1 ~~ '%螛梋鰢%'::text)
Buffers: shared hit=5
Planning time: 0.112 ms
Execution time: 0.201 ms
(11 rows)
小结
三字或以上模糊查询,使用pg_trgm可以很好的解决。
1,2个字的模糊查询,使用表达式索引也可以实现很好的性能。
postgres=# create or replace function split_12(text) returns text[] as $$
declare
res text[];
begin
select regexp_split_to_array($1, '') into res;
for i in 1..length($1)-1 loop
res := array_append(res, substring($1, i, 2));
end loop;
return res;
end;
$$ language plpgsql strict immutable;
CREATE FUNCTION
postgres=# select split_12('abc你好');
split_12
------------------------------
{a,b,c,你,好,ab,bc,c你,你好}
(1 row)
create index idx2 on tbl using gin (split_12(col));
select * from tbl where split_12(col) @> array['单字或双字'];
建议应用在查询的时候,判断一下词的个数,选择正确的SQL写法。