PostgreSQL 9.3 pg_trgm imporve support multi-bytes char and gist,gin index for reg-exp search

7 minute read

背景

PostgreSQL 9.3的pg_trgm模块主要有2方面的增强, 支持多字节字符, 同时支持gist, gin索引的规则表达式查询.

Note: pg_trgm ignores non-word characters (non-alphanumerics) when extracting trigrams from a string.   
  
Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string.   
  
For example, the set of trigrams in the string "cat" is " c", " ca", "cat", and "at ".   
  
The set of trigrams in the string "foo|bar" is " f", " fo", "foo", "oo ", " b", " ba", "bar", and "ar ".  

多字节字符支持测试

前提条件:

如果要让 pg_trgm 支持宽字符, 数据库的collate,ctype 不能是 C.

例如

postgres=# create database test with template template0 encoding 'UTF8' lc_collate 'zh_CN.utf8' lc_ctype 'zh_CN.utf8';
CREATE DATABASE

postgres=# \c test
You are now connected to database "test" as user "postgres".
test=# create extension pg_trgm;
CREATE EXTENSION
test=# select show_trgm('刘德华abc, hello world. hello china.');
                                                                     show_trgm                                                                      
----------------------------------------------------------------------------------------------------------------------------------------------------
 {0xb207ac,0xd2efc5,"  c","  h","  w"," ch"," he"," wo",0x304695,0x403652,abc,"bc ",chi,ell,hel,hin,ina,"ld ",llo,"lo ","na ",0x6ff95f,orl,rld,wor}
(1 row)

否则依旧忽略.

PostgreSQL 9.3

digoal=# select show_trgm('刘德华abc, hello world. hello china.');  
                                                                     show_trgm                                                        
                  
------------------------------------------------------------------------------------------------------------------------------------  
----------------  
 {0xb207ac,0xd2efc5,"  c","  h","  w"," ch"," he"," wo",0x304695,0x403652,abc,"bc ",chi,ell,hel,hin,ina,"ld ",llo,"lo ","na ",0x6ff9  
5f,orl,rld,wor}  
(1 row)  

按照pg_trgm的拆分规则, 只保留了数字, 字母, 以及wchar. 标点空格等其他字符都不在组合的范围内.

得到的组合如下 :

  刘  
 刘德  
刘德华  
德华a  
华ab  
abc  
bc   
  h  
 he  
hel  
ell  
llo  
lo   
  w  
 wo  
wor  
orl  
rld  
ld   
  c  
 ch  
chi  
hin  
ina  
na   

规则表达式查询走gist或gin索引.

digoal=# create table trgm_test(id serial4 primary key, info text);  
CREATE TABLE  
digoal=# create index trgm_test_gin on trgm_test using gin(info gin_trgm_ops);  
CREATE INDEX  
digoal=# insert into trgm_test (info) values ('abc刘德华def, hello world, hello china.');  
INSERT 0 1  
Time: 0.504 ms  
digoal=# insert into trgm_test (info) values ('abc张学友def, hello world, hello china.');  
INSERT 0 1  
Time: 0.267 ms  
digoal=# insert into trgm_test (info) select md5(random()::text) from generate_series(1,100000);  
INSERT 0 100000  
Time: 6003.028 ms  
digoal=# select show_trgm(info), info from trgm_test where id=2;  
                                                                             show_trgm                                                
                                |                  info                     
------------------------------------------------------------------------------------------------------------------------------------  
--------------------------------+-----------------------------------------  
 {0x9212d6,0x09c328,"  a","  c","  h","  w"," ab"," ch"," he"," wo",0x480682,0x4c470a,abc,chi,def,"ef ",ell,hel,hin,ina,"ld ",llo,"l  
o ","na ",orl,rld,wor,0x7c9196} | abc张学友def, hello world, hello china.  
(1 row)  
  
Time: 0.310 ms  
digoal=# select show_trgm('el');  
      show_trgm        
---------------------  
 {"  e"," el","el "}  
(1 row)  
Time: 0.164 ms  
digoal=# select info from (select unnest(show_trgm(info)) as info from trgm_test where id=2) t where info=any(show_trgm('el'));  
 info   
------  
(0 rows)  
Time: 0.280 ms  

对照以下结果以及以上结果, gin,gist索引不是简单的trgm拆分后的array gin索引. 否则el将检索不到.

digoal=# explain analyze select * from trgm_test where info ~ 'el';  
                                                 QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------  
 Seq Scan on trgm_test  (cost=0.00..2084.03 rows=20000 width=36) (actual time=0.014..97.282 rows=2 loops=1)  
   Filter: (info ~ 'el'::text)  
   Rows Removed by Filter: 100000  
 Total runtime: 97.309 ms  
(4 rows)  
Time: 97.874 ms  
digoal=# select * from trgm_test where info ~ 'el';  
 id |                  info                     
----+-----------------------------------------  
  1 | abc刘德华def, hello world, hello china.  
  2 | abc张学友def, hello world, hello china.  
(2 rows)  
Time: 95.281 ms  
  
digoal=# set enable_seqscan=off;  
SET  
digoal=# explain analyze select * from trgm_test where info ~ 'el';  
                                                              QUERY PLAN                                                              
     
------------------------------------------------------------------------------------------------------------------------------------  
---  
 Bitmap Heap Scan on trgm_test  (cost=27143.00..28227.00 rows=20000 width=36) (actual time=222.531..312.936 rows=2 loops=1)  
   Recheck Cond: (info ~ 'el'::text)  
   Rows Removed by Index Recheck: 100000  
   ->  Bitmap Index Scan on trgm_test_gin  (cost=0.00..27138.00 rows=20000 width=0) (actual time=222.401..222.401 rows=100002 loops=  
1)  
         Index Cond: (info ~ 'el'::text)  
 Total runtime: 312.988 ms  
(6 rows)  
digoal=# select * from trgm_test where info ~ 'el';  
 id |                  info                     
----+-----------------------------------------  
  1 | abc刘德华def, hello world, hello china.  
  2 | abc张学友def, hello world, hello china.  
(2 rows)  

和trgm_limit也无关. 因为以下查询的匹配度小于limit 0.3, 但是走索引依然匹配了.

postgres=# explain analyze select *,similarity(info,'刘德华') from trgm_test where info ~ '刘德华';  
                                                      QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on trgm_test  (cost=6.08..25.13 rows=10 width=36) (actual time=0.063..0.064 rows=1 loops=1)  
   Recheck Cond: (info ~ '刘德华'::text)  
   ->  Bitmap Index Scan on trgm_test_gin  (cost=0.00..6.08 rows=10 width=0) (actual time=0.046..0.046 rows=1 loops=1)  
         Index Cond: (info ~ '刘德华'::text)  
 Total runtime: 0.077 ms  
(5 rows)  
  
postgres=# select *,similarity(info,'刘德华') from trgm_test where info ~ '刘德华';  
 id |                  info                   | similarity   
----+-----------------------------------------+------------  
  1 | abc刘德华def, hello world, hello china. |  0.0322581  
(1 row)  
  
postgres=# select show_limit();  
 show_limit   
------------  
        0.3  
(1 row)  

匹配度越高, 速度越快.

digoal=# select * from trgm_test where info ~ 'hello';  
 id |                  info                     
----+-----------------------------------------  
  1 | abc刘德华def, hello world, hello china.  
  2 | abc张学友def, hello world, hello china.  
(2 rows)  
Time: 0.637 ms  
digoal=# explain analyze select * from trgm_test where info ~ 'hello';  
                                                        QUERY PLAN                                                          
--------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on trgm_test  (cost=19.24..252.15 rows=160 width=36) (actual time=0.049..0.051 rows=2 loops=1)  
   Recheck Cond: (info ~ 'hello'::text)  
   ->  Bitmap Index Scan on trgm_test_gin  (cost=0.00..19.20 rows=160 width=0) (actual time=0.043..0.043 rows=2 loops=1)  
         Index Cond: (info ~ 'hello'::text)  
 Total runtime: 0.071 ms  
(5 rows)  
  
Time: 0.312 ms  

gist 索引

digoal=# drop index trgm_test_gin ;  
DROP INDEX  
digoal=# create index trgm_test_gist on trgm_test using gist(info gist_trgm_ops);  
CREATE INDEX  
digoal=# explain analyze select * from trgm_test where info ~ 'hello';  
                                                        QUERY PLAN                                                           
---------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on trgm_test  (cost=9.52..242.43 rows=160 width=36) (actual time=18.155..18.159 rows=2 loops=1)  
   Recheck Cond: (info ~ 'hello'::text)  
   ->  Bitmap Index Scan on trgm_test_gist  (cost=0.00..9.48 rows=160 width=0) (actual time=18.126..18.126 rows=2 loops=1)  
         Index Cond: (info ~ 'hello'::text)  
 Total runtime: 18.204 ms  
(5 rows)  

索引大小

digoal=# \dt+ trgm_test  
                       List of relations  
 Schema |   Name    | Type  |  Owner   |  Size   | Description   
--------+-----------+-------+----------+---------+-------------  
 public | trgm_test | table | postgres | 6704 kB |   
(1 row)  
digoal=# \di+ trgm_test_gist   
                              List of relations  
 Schema |     Name      | Type  |  Owner   |   Table   | Size  | Description   
--------+---------------+-------+----------+-----------+-------+-------------  
 public | trgm_test_gist | index | postgres | trgm_test | 18 MB |   
(1 row)  
digoal=# \di+ trgm_test_gin  
                              List of relations  
 Schema |      Name      | Type  |  Owner   |   Table   | Size  | Description   
--------+----------------+-------+----------+-----------+-------+-------------  
 public | trgm_test_gin | index | postgres | trgm_test | 35 MB |   
(1 row)  

小结

1. gin 索引比gist索引占用空间大, 但是gin查询速度快.

2. PostgreSQL又多了一个模糊查询的选择. 以前使用分词的方法有些可能检索不到, 但是用trgm是可以完全检测到的.

digoal=# explain analyze select * from trgm_test where info ~ '刘德华';  
                                                      QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on trgm_test  (cost=6.08..25.11 rows=10 width=36) (actual time=0.043..0.044 rows=1 loops=1)  
   Recheck Cond: (info ~ '刘德华'::text)  
   ->  Bitmap Index Scan on trgm_test_gin  (cost=0.00..6.08 rows=10 width=0) (actual time=0.037..0.037 rows=1 loops=1)  
         Index Cond: (info ~ '刘德华'::text)  
 Total runtime: 0.055 ms  
(5 rows)  
Time: 0.273 ms  
digoal=# select * from trgm_test where info ~ '刘德华';  
 id |                  info                     
----+-----------------------------------------  
  1 | abc刘德华def, hello world, hello china.  
(1 row)  
Time: 0.242 ms  

但是gin索引的写性能比gist差很多, gin索引优化方法见如下(设置GIN索引参数 pending list和fast update即可), 可以达到gist 的性能 :

http://blog.163.com/digoal@126/blog/static/163877040201341625735128/

pg93@db-172-16-3-33-> vi ins.sql  
insert into trgm_test (info) values(md5(random()::text));  
  
pg93@db-172-16-3-33-> pgbench -M prepared -r -n -f ./ins.sql -c 16 -j 4 -T 60  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 60 s  
number of transactions actually processed: 52152  
tps = 824.689788 (including connections establishing)  
tps = 824.946710 (excluding connections establishing)  
statement latencies in milliseconds:  
        19.267069       insert into trgm_test (info) values(md5(random()::text));  
postgres=# drop index trgm_test_gin ;  
DROP INDEX  
postgres=# create index trgm_test_gist on trgm_test using gist(info gist_trgm_ops);  
CREATE INDEX  
postgres=# truncate trgm_test ;  
TRUNCATE TABLE  
postgres=# checkpoint;  
CHECKPOINT  
pg93@db-172-16-3-33-> pgbench -M prepared -r -n -f ./ins.sql -c 16 -j 4 -T 60  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 60 s  
number of transactions actually processed: 2129577  
tps = 35463.124921 (including connections establishing)  
tps = 35477.153945 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.449334        insert into trgm_test (info) values(md5(random()::text));  

参考

1. http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ccae48f44d993351e1f881761bd6c556ebd6638

2. http://blog.163.com/digoal@126/blog/static/16387704020132254838733/

3. http://blog.163.com/digoal@126/blog/static/163877040201191882553803/

4. contrib/pg_trgm/trgm_regexp.c

 797 /*  
 798  * Convert pg_wchar to multibyte format.  
 799  * Returns false if the character should be ignored completely.  
 800  */  
 801 static bool  
 802 convertPgWchar(pg_wchar c, trgm_mb_char *result)  
 803 {  
 804     /* "s" has enough space for a multibyte character and a trailing NUL */  
 805     char        s[MAX_MULTIBYTE_CHAR_LEN + 1];  
 806   
 807     /*  
 808      * We can ignore the NUL character, since it can never appear in a PG text  
 809      * string.  This avoids the need for various special cases when  
 810      * reconstructing trigrams.  
 811      */  
 812     if (c == 0)  
 813         return false;  
 814   
 815     /* Do the conversion, making sure the result is NUL-terminated */  
 816     memset(s, 0, sizeof(s));  
 817     pg_wchar2mb_with_len(&c, s, 1);  
 818   
 819     /*  
 820      * In IGNORECASE mode, we can ignore uppercase characters.  We assume that  
 821      * the regex engine generated both uppercase and lowercase equivalents  
 822      * within each color, since we used the REG_ICASE option; so there's no  
 823      * need to process the uppercase version.  
 824      *  
 825      * XXX this code is dependent on the assumption that lowerstr() works the  
 826      * same as the regex engine's internal case folding machinery.  Might be  
 827      * wiser to expose pg_wc_tolower and test whether c == pg_wc_tolower(c).  
 828      * On the other hand, the trigrams in the index were created using  
 829      * lowerstr(), so we're probably screwed if there's any incompatibility  
 830      * anyway.  
 831      */  
 832 #ifdef IGNORECASE  
 833     {  
 834         char       *lowerCased = lowerstr(s);  
 835   
 836         if (strcmp(lowerCased, s) != 0)  
 837         {  
 838             pfree(lowerCased);  
 839             return false;  
 840         }  
 841         pfree(lowerCased);  
 842     }  
 843 #endif  
 844   
 845     /* Fill result with exactly MAX_MULTIBYTE_CHAR_LEN bytes */  
 846     strncpy(result->bytes, s, MAX_MULTIBYTE_CHAR_LEN);  
 847     return true;  
 848 }  

5. gist, gin index operator.

/* contrib/pg_trgm/pg_trgm--1.0--1.1.sql */  
  
-- complain if script is sourced in psql, rather than via CREATE EXTENSION  
\echo Use "ALTER EXTENSION pg_trgm UPDATE TO '1.1'" to load this file. \quit  
  
ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD  
        OPERATOR        5       pg_catalog.~ (text, text),  
        OPERATOR        6       pg_catalog.~* (text, text);  
  
ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD  
        OPERATOR        5       pg_catalog.~ (text, text),  
        OPERATOR        6       pg_catalog.~* (text, text);  

Flag Counter

digoal’s 大量PostgreSQL文章入口