如何加快PostgreSQL结巴分词pg_jieba加载速度
背景
PostgreSQL的全文检索接口是开放API的,所以中文分词的插件也非常多,例如常用的scws分词插件,还有结巴分词的插件。
但是你在使用结巴分词插件的时候,有没有遇到这样的问题。
每个会话,第一次查询会比较慢,接下来的查询就快了。
例如
psql (9.5.3)
Type "help" for help.
postgres=# \timing
Timing is on.
postgres=# select * from ts_debug('jiebacfg', '子远e5a1cbb8');
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------+----------+--------------+------------+------------
n | noun | 子远 | {jieba_stem} | jieba_stem | {子远}
n | noun | e5a1cbb8 | {jieba_stem} | jieba_stem | {e5a1cbb8}
(2 rows)
Time: 863.777 ms
postgres=# select * from ts_debug('jiebacfg', '子远e5a1cbb8');
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------+----------+--------------+------------+------------
n | noun | 子远 | {jieba_stem} | jieba_stem | {子远}
n | noun | e5a1cbb8 | {jieba_stem} | jieba_stem | {e5a1cbb8}
(2 rows)
Time: 1.342 ms
原因分析
第一次加载pg_jieba模块时,需要调用加载字典的动作。
/*
* Module load callback
*/
void
_PG_init(void)
{
if (jieba_ctx)
return;
{
const char* dict_path = jieba_get_tsearch_config_filename(DICT_PATH, EXT);
const char* hmm_path = jieba_get_tsearch_config_filename(HMM_PATH, EXT);
const char* user_dict_path = jieba_get_tsearch_config_filename(USER_DICT, EXT);
/*
init will take a few seconds to load dicts.
*/
jieba_ctx = Jieba_New(dict_path, hmm_path, user_dict_path);
}
}
如果pg_jieba.so没有放在shared_preload_libraries或session_preload_libraries中,那么每个会话启动时,都需要load pg_jieba.so,从而导致了第一次查询速度非常慢。
例子
psql (9.5.3)
Type "help" for help.
postgres=# \timing
Timing is on.
postgres=# load 'pg_jieba';
LOAD
Time: 857.098 ms
postgres=# select * from ts_debug('jiebacfg', '子远e5a1cbb8');
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------+----------+--------------+------------+------------
n | noun | 子远 | {jieba_stem} | jieba_stem | {子远}
n | noun | e5a1cbb8 | {jieba_stem} | jieba_stem | {e5a1cbb8}
(2 rows)
Time: 4.952 ms
如何解决
知道问题在哪里了,就好解决。
可以将pg_jieba.so配置在shared_preload_libraries或session_preload_libraries中,就能解决以上问题。
vi postgresql.conf
shared_preload_libraries = 'pg_jieba.so'
or
session_preload_libraries = 'pg_jieba.so'
重启数据库
pg_ctl restart -m fast
内存开销比对
1. 未配置
shared_preload_libraries = 'pg_jieba.so'
or
session_preload_libraries = 'pg_jieba.so'
session A :
psql (9.5.3)
Type "help" for help.
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
12254
(1 row)
session B :
psql (9.5.3)
Type "help" for help.
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
12261
(1 row)
backend process内存使用情况
# smem|grep 12261
PID User Command Swap USS PSS RSS
12261 digoal postgres: postgres postgres 0 812 1677 3780
# smem|grep 12254
PID User Command Swap USS PSS RSS
12254 digoal postgres: postgres postgres 0 812 1682 3788
在未使用pg_jieba时,通过/proc/12261/smaps 也可以看到没有加载pg_jieba.so。
分别执行加载pg_jieba的模块或执行pg_jieba词法解析后
postgres=# load 'pg_jieba';
LOAD
Time: 872.095 ms
内存飙升
# smem|grep 12254
PID User Command Swap USS PSS RSS
12254 digoal postgres: postgres postgres 0 114404 116326 120272
# smem|grep 12261
PID User Command Swap USS PSS RSS
12261 digoal postgres: postgres postgres 0 114404 116321 120260
2. 已配置
shared_preload_libraries = 'pg_jieba.so'
or
session_preload_libraries = 'pg_jieba.so'
分别执行QUERY后,backend process进程内存没有独占加载pg_jieba.so的内存,算在共享内存中。
[root@iZ28tqoemgtZ ~]# smem|grep 12410
PID User Command Swap USS PSS RSS
12410 digoal postgres: postgres postgres 0 3696 17754 118988
[root@iZ28tqoemgtZ ~]# smem|grep 12412
PID User Command Swap USS PSS RSS
12412 digoal postgres: postgres postgres 0 3124 17115 118296
通过/proc/12410/smaps 也可以看到,只是用到pg_jieba.so时算了少量的Pss。
7fb68fe40000-7fb68fe55000 r-xp 00000000 fd:01 1052111 /home/digoal/pgsql9.5/lib/pg_jieba.so
Size: 84 kB
Rss: 48 kB
Pss: 16 kB
Shared_Clean: 48 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 48 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd ex mr mw me
7fb68fe55000-7fb690054000 ---p 00015000 fd:01 1052111 /home/digoal/pgsql9.5/lib/pg_jieba.so
Size: 2044 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: mr mw me
7fb690054000-7fb690055000 r--p 00014000 fd:01 1052111 /home/digoal/pgsql9.5/lib/pg_jieba.so
Size: 4 kB
Rss: 4 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 4 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 4 kB
Anonymous: 4 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd mr mw me ac
7fb690055000-7fb690056000 rw-p 00015000 fd:01 1052111 /home/digoal/pgsql9.5/lib/pg_jieba.so
...
参考
-
https://github.com/jaiminpan/pg_jieba
-
另外要提一点,结巴分词没有逗号的问题
https://yq.aliyun.com/articles/58007
-
效率,每CPU核 约处理56.4万字/s。
postgres=# alter function to_tsvector(regconfig,text) volatile;
ALTER FUNCTION
postgres=# explain (buffers,timing,costs,verbose,analyze) select to_tsvector('jiebacfg','中华人民共和国万岁,如何加快PostgreSQL结巴分词加载速度') from generate_series(1,1000000);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Function Scan on pg_catalog.generate_series (cost=0.00..260.00 rows=1000 width=0) (actual time=100.054..13943.166 rows=1000000 loops=1)
Output: to_tsvector('jiebacfg'::regconfig, '中华人民共和国万岁,如何加快PostgreSQL结巴分词加载速度'::text)
Function Call: generate_series(1, 1000000)
Buffers: temp read=1710 written=1709
Planning time: 0.040 ms
Execution time: 14175.527 ms
(6 rows)
Time: 14176.044 ms
postgres=# select to_tsvector('jiebacfg','中华人民共和国万岁,如何加快PostgreSQL结巴分词加载速度');
to_tsvector
------------------------------------------------------------------------------------------
'postgresql':6 '万岁':2 '中华人民共和国':1 '分词':8 '加快':5 '加载':9 '结巴':7 '速度':10
(1 row)
Time: 0.522 ms
postgres=# select 8*1000000/14.175527;
?column?
---------------------
564352.916120860974
(1 row)
Time: 0.743 ms
小结
-
为了提高结巴分词插件的装载速度,应该将so文件配置为数据库启动时自动加载。
-
使用数据库启动时自动加载,还有一个好处,内存使用量也大大减少。
祝大家玩得开心,欢迎随时来 阿里云促膝长谈 业务需求 ,恭候光临。
阿里云的小伙伴们加油,努力做 最贴地气的云数据库 。