PostgreSQL 9.3 pg_trgm imporve support multi-bytes char and gist,gin index for reg-exp search
背景
PostgreSQL 9.3的pg_trgm模块主要有2方面的增强, 支持多字节字符, 同时支持gist, gin索引的规则表达式查询.
Note: pg_trgm ignores non-word characters (non-alphanumerics) when extracting trigrams from a string.
Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string.
For example, the set of trigrams in the string "cat" is " c", " ca", "cat", and "at ".
The set of trigrams in the string "foo|bar" is " f", " fo", "foo", "oo ", " b", " ba", "bar", and "ar ".
多字节字符支持测试
前提条件:
如果要让 pg_trgm 支持宽字符, 数据库的collate,ctype 不能是 C.
例如
postgres=# create database test with template template0 encoding 'UTF8' lc_collate 'zh_CN.utf8' lc_ctype 'zh_CN.utf8';
CREATE DATABASE
postgres=# \c test
You are now connected to database "test" as user "postgres".
test=# create extension pg_trgm;
CREATE EXTENSION
test=# select show_trgm('刘德华abc, hello world. hello china.');
show_trgm
----------------------------------------------------------------------------------------------------------------------------------------------------
{0xb207ac,0xd2efc5," c"," h"," w"," ch"," he"," wo",0x304695,0x403652,abc,"bc ",chi,ell,hel,hin,ina,"ld ",llo,"lo ","na ",0x6ff95f,orl,rld,wor}
(1 row)
否则依旧忽略.
PostgreSQL 9.3
digoal=# select show_trgm('刘德华abc, hello world. hello china.');
show_trgm
------------------------------------------------------------------------------------------------------------------------------------
----------------
{0xb207ac,0xd2efc5," c"," h"," w"," ch"," he"," wo",0x304695,0x403652,abc,"bc ",chi,ell,hel,hin,ina,"ld ",llo,"lo ","na ",0x6ff9
5f,orl,rld,wor}
(1 row)
按照pg_trgm的拆分规则, 只保留了数字, 字母, 以及wchar. 标点空格等其他字符都不在组合的范围内.
得到的组合如下 :
刘
刘德
刘德华
德华a
华ab
abc
bc
h
he
hel
ell
llo
lo
w
wo
wor
orl
rld
ld
c
ch
chi
hin
ina
na
规则表达式查询走gist或gin索引.
digoal=# create table trgm_test(id serial4 primary key, info text);
CREATE TABLE
digoal=# create index trgm_test_gin on trgm_test using gin(info gin_trgm_ops);
CREATE INDEX
digoal=# insert into trgm_test (info) values ('abc刘德华def, hello world, hello china.');
INSERT 0 1
Time: 0.504 ms
digoal=# insert into trgm_test (info) values ('abc张学友def, hello world, hello china.');
INSERT 0 1
Time: 0.267 ms
digoal=# insert into trgm_test (info) select md5(random()::text) from generate_series(1,100000);
INSERT 0 100000
Time: 6003.028 ms
digoal=# select show_trgm(info), info from trgm_test where id=2;
show_trgm
| info
------------------------------------------------------------------------------------------------------------------------------------
--------------------------------+-----------------------------------------
{0x9212d6,0x09c328," a"," c"," h"," w"," ab"," ch"," he"," wo",0x480682,0x4c470a,abc,chi,def,"ef ",ell,hel,hin,ina,"ld ",llo,"l
o ","na ",orl,rld,wor,0x7c9196} | abc张学友def, hello world, hello china.
(1 row)
Time: 0.310 ms
digoal=# select show_trgm('el');
show_trgm
---------------------
{" e"," el","el "}
(1 row)
Time: 0.164 ms
digoal=# select info from (select unnest(show_trgm(info)) as info from trgm_test where id=2) t where info=any(show_trgm('el'));
info
------
(0 rows)
Time: 0.280 ms
对照以下结果以及以上结果, gin,gist索引不是简单的trgm拆分后的array gin索引. 否则el将检索不到.
digoal=# explain analyze select * from trgm_test where info ~ 'el';
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on trgm_test (cost=0.00..2084.03 rows=20000 width=36) (actual time=0.014..97.282 rows=2 loops=1)
Filter: (info ~ 'el'::text)
Rows Removed by Filter: 100000
Total runtime: 97.309 ms
(4 rows)
Time: 97.874 ms
digoal=# select * from trgm_test where info ~ 'el';
id | info
----+-----------------------------------------
1 | abc刘德华def, hello world, hello china.
2 | abc张学友def, hello world, hello china.
(2 rows)
Time: 95.281 ms
digoal=# set enable_seqscan=off;
SET
digoal=# explain analyze select * from trgm_test where info ~ 'el';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
---
Bitmap Heap Scan on trgm_test (cost=27143.00..28227.00 rows=20000 width=36) (actual time=222.531..312.936 rows=2 loops=1)
Recheck Cond: (info ~ 'el'::text)
Rows Removed by Index Recheck: 100000
-> Bitmap Index Scan on trgm_test_gin (cost=0.00..27138.00 rows=20000 width=0) (actual time=222.401..222.401 rows=100002 loops=
1)
Index Cond: (info ~ 'el'::text)
Total runtime: 312.988 ms
(6 rows)
digoal=# select * from trgm_test where info ~ 'el';
id | info
----+-----------------------------------------
1 | abc刘德华def, hello world, hello china.
2 | abc张学友def, hello world, hello china.
(2 rows)
和trgm_limit也无关. 因为以下查询的匹配度小于limit 0.3, 但是走索引依然匹配了.
postgres=# explain analyze select *,similarity(info,'刘德华') from trgm_test where info ~ '刘德华';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on trgm_test (cost=6.08..25.13 rows=10 width=36) (actual time=0.063..0.064 rows=1 loops=1)
Recheck Cond: (info ~ '刘德华'::text)
-> Bitmap Index Scan on trgm_test_gin (cost=0.00..6.08 rows=10 width=0) (actual time=0.046..0.046 rows=1 loops=1)
Index Cond: (info ~ '刘德华'::text)
Total runtime: 0.077 ms
(5 rows)
postgres=# select *,similarity(info,'刘德华') from trgm_test where info ~ '刘德华';
id | info | similarity
----+-----------------------------------------+------------
1 | abc刘德华def, hello world, hello china. | 0.0322581
(1 row)
postgres=# select show_limit();
show_limit
------------
0.3
(1 row)
匹配度越高, 速度越快.
digoal=# select * from trgm_test where info ~ 'hello';
id | info
----+-----------------------------------------
1 | abc刘德华def, hello world, hello china.
2 | abc张学友def, hello world, hello china.
(2 rows)
Time: 0.637 ms
digoal=# explain analyze select * from trgm_test where info ~ 'hello';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on trgm_test (cost=19.24..252.15 rows=160 width=36) (actual time=0.049..0.051 rows=2 loops=1)
Recheck Cond: (info ~ 'hello'::text)
-> Bitmap Index Scan on trgm_test_gin (cost=0.00..19.20 rows=160 width=0) (actual time=0.043..0.043 rows=2 loops=1)
Index Cond: (info ~ 'hello'::text)
Total runtime: 0.071 ms
(5 rows)
Time: 0.312 ms
gist 索引
digoal=# drop index trgm_test_gin ;
DROP INDEX
digoal=# create index trgm_test_gist on trgm_test using gist(info gist_trgm_ops);
CREATE INDEX
digoal=# explain analyze select * from trgm_test where info ~ 'hello';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on trgm_test (cost=9.52..242.43 rows=160 width=36) (actual time=18.155..18.159 rows=2 loops=1)
Recheck Cond: (info ~ 'hello'::text)
-> Bitmap Index Scan on trgm_test_gist (cost=0.00..9.48 rows=160 width=0) (actual time=18.126..18.126 rows=2 loops=1)
Index Cond: (info ~ 'hello'::text)
Total runtime: 18.204 ms
(5 rows)
索引大小
digoal=# \dt+ trgm_test
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-----------+-------+----------+---------+-------------
public | trgm_test | table | postgres | 6704 kB |
(1 row)
digoal=# \di+ trgm_test_gist
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+---------------+-------+----------+-----------+-------+-------------
public | trgm_test_gist | index | postgres | trgm_test | 18 MB |
(1 row)
digoal=# \di+ trgm_test_gin
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+----------------+-------+----------+-----------+-------+-------------
public | trgm_test_gin | index | postgres | trgm_test | 35 MB |
(1 row)
小结
1. gin 索引比gist索引占用空间大, 但是gin查询速度快.
2. PostgreSQL又多了一个模糊查询的选择. 以前使用分词的方法有些可能检索不到, 但是用trgm是可以完全检测到的.
digoal=# explain analyze select * from trgm_test where info ~ '刘德华';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on trgm_test (cost=6.08..25.11 rows=10 width=36) (actual time=0.043..0.044 rows=1 loops=1)
Recheck Cond: (info ~ '刘德华'::text)
-> Bitmap Index Scan on trgm_test_gin (cost=0.00..6.08 rows=10 width=0) (actual time=0.037..0.037 rows=1 loops=1)
Index Cond: (info ~ '刘德华'::text)
Total runtime: 0.055 ms
(5 rows)
Time: 0.273 ms
digoal=# select * from trgm_test where info ~ '刘德华';
id | info
----+-----------------------------------------
1 | abc刘德华def, hello world, hello china.
(1 row)
Time: 0.242 ms
但是gin索引的写性能比gist差很多, gin索引优化方法见如下(设置GIN索引参数 pending list和fast update即可), 可以达到gist 的性能 :
http://blog.163.com/digoal@126/blog/static/163877040201341625735128/
pg93@db-172-16-3-33-> vi ins.sql
insert into trgm_test (info) values(md5(random()::text));
pg93@db-172-16-3-33-> pgbench -M prepared -r -n -f ./ins.sql -c 16 -j 4 -T 60
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 60 s
number of transactions actually processed: 52152
tps = 824.689788 (including connections establishing)
tps = 824.946710 (excluding connections establishing)
statement latencies in milliseconds:
19.267069 insert into trgm_test (info) values(md5(random()::text));
postgres=# drop index trgm_test_gin ;
DROP INDEX
postgres=# create index trgm_test_gist on trgm_test using gist(info gist_trgm_ops);
CREATE INDEX
postgres=# truncate trgm_test ;
TRUNCATE TABLE
postgres=# checkpoint;
CHECKPOINT
pg93@db-172-16-3-33-> pgbench -M prepared -r -n -f ./ins.sql -c 16 -j 4 -T 60
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 60 s
number of transactions actually processed: 2129577
tps = 35463.124921 (including connections establishing)
tps = 35477.153945 (excluding connections establishing)
statement latencies in milliseconds:
0.449334 insert into trgm_test (info) values(md5(random()::text));
参考
1. http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ccae48f44d993351e1f881761bd6c556ebd6638
2. http://blog.163.com/digoal@126/blog/static/16387704020132254838733/
3. http://blog.163.com/digoal@126/blog/static/163877040201191882553803/
4. contrib/pg_trgm/trgm_regexp.c
797 /*
798 * Convert pg_wchar to multibyte format.
799 * Returns false if the character should be ignored completely.
800 */
801 static bool
802 convertPgWchar(pg_wchar c, trgm_mb_char *result)
803 {
804 /* "s" has enough space for a multibyte character and a trailing NUL */
805 char s[MAX_MULTIBYTE_CHAR_LEN + 1];
806
807 /*
808 * We can ignore the NUL character, since it can never appear in a PG text
809 * string. This avoids the need for various special cases when
810 * reconstructing trigrams.
811 */
812 if (c == 0)
813 return false;
814
815 /* Do the conversion, making sure the result is NUL-terminated */
816 memset(s, 0, sizeof(s));
817 pg_wchar2mb_with_len(&c, s, 1);
818
819 /*
820 * In IGNORECASE mode, we can ignore uppercase characters. We assume that
821 * the regex engine generated both uppercase and lowercase equivalents
822 * within each color, since we used the REG_ICASE option; so there's no
823 * need to process the uppercase version.
824 *
825 * XXX this code is dependent on the assumption that lowerstr() works the
826 * same as the regex engine's internal case folding machinery. Might be
827 * wiser to expose pg_wc_tolower and test whether c == pg_wc_tolower(c).
828 * On the other hand, the trigrams in the index were created using
829 * lowerstr(), so we're probably screwed if there's any incompatibility
830 * anyway.
831 */
832 #ifdef IGNORECASE
833 {
834 char *lowerCased = lowerstr(s);
835
836 if (strcmp(lowerCased, s) != 0)
837 {
838 pfree(lowerCased);
839 return false;
840 }
841 pfree(lowerCased);
842 }
843 #endif
844
845 /* Fill result with exactly MAX_MULTIBYTE_CHAR_LEN bytes */
846 strncpy(result->bytes, s, MAX_MULTIBYTE_CHAR_LEN);
847 return true;
848 }
5. gist, gin index operator.
/* contrib/pg_trgm/pg_trgm--1.0--1.1.sql */
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "ALTER EXTENSION pg_trgm UPDATE TO '1.1'" to load this file. \quit
ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
OPERATOR 5 pg_catalog.~ (text, text),
OPERATOR 6 pg_catalog.~* (text, text);
ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
OPERATOR 5 pg_catalog.~ (text, text),
OPERATOR 6 pg_catalog.~* (text, text);