PostgreSQL Greenplum 结巴分词(by plpython)

1 minute read

背景

除了数据库内置的中文分词，使用plpython数据库存储过程语言，也能实现方便的分词能力。在greenplum中是一个很好的选择。

结合PostgreSQL plpython和language transform可以很方便的实现中文分词。

https://github.com/fxsjy/jieba

http://www.postgresql.org/docs/9.5/static/sql-createtransform.html

http://www.postgresql.org/docs/9.5/static/plpython.html

例子

postgres=# create language plpythonu;  
CREATE LANGUAGE  
postgres=# select * from pg_language ;  
  lanname  | lanowner | lanispl | lanpltrusted | lanplcallfoid | laninline | lanvalidator | lanacl   
-----------+----------+---------+--------------+---------------+-----------+--------------+--------  
 internal  |       10 | f       | f            |             0 |         0 |         2246 |   
 c         |       10 | f       | f            |             0 |         0 |         2247 |   
 sql       |       10 | f       | t            |             0 |         0 |         2248 |   
 plpgsql   |       10 | t       | t            |         12724 |     12725 |        12726 |   
 plpythonu |       10 | t       | f            |         24177 |     24178 |        24179 |   
(5 rows)  
  
postgres=# create or replace function fenci(i_text text) returns tsvector as $$  
  import jieba  
  seg_list = jieba.cut(i_text, cut_all=False)  
  return(" ".join(seg_list))  
$$ language plpythonu;  
CREATE FUNCTION  
  
postgres=# select fenci('小明硕士毕业于中国科学院计算所，后在日本京都大学深造');  
                                        fenci                                           
--------------------------------------------------------------------------------------  
 '中国科学院' '于' '后' '在' '小明' '日本京都大学' '毕业' '深造' '硕士' '计算所' '，'  
(1 row)  
postgres=# select fenci('结婚的和尚未结婚的');  
          fenci            
-------------------------  
 '和' '尚未' '的' '结婚'  
(1 row)  
postgres=# \timing  
Timing is on.  
postgres=# do language plpgsql $$ declare begin for i in 1..100000  loop perform fenci('结婚的和尚未结婚的'); end loop; end $$;  
DO  
Time: 9848.447 ms  
postgres=# select 9848.0/100000;  
        ?column?          
------------------------  
 0.09848000000000000000  
(1 row)  
Time: 1.972 ms  

虚拟机，单核，每秒约处理一万次请求。

缺点是动态加载，第一次使用时需要1秒左右。每个新建的连接都需要动态加载它。对于短连接的用户如果每次都要使用这个函数是会崩溃的。

postgres=# \timing  
Timing is on.  
postgres=# select fenci('周正中');  
    fenci      
-------------  
 '周' '正中'  
(1 row)  
Time: 1585.619 ms  
postgres=# select fenci('周正中');  
    fenci      
-------------  
 '周' '正中'  
(1 row)  
Time: 1.008 ms  

支持动态添加WORD，重新连接后，需要重新添加。

postgres=# select fenci('阿里巴巴阿里妈妈');  
          fenci             
--------------------------  
 '妈妈' '阿里' '阿里巴巴'  
(1 row)  
  
postgres=# do language plpythonu $$  
import jieba  
jieba.add_word('阿里妈妈')  
$$;  
DO  
postgres=# select fenci('阿里巴巴阿里妈妈');  
         fenci           
-----------------------  
 '阿里妈妈' '阿里巴巴'  
(1 row)  
  
postgres=# \q  
postgres@digoal-> psql  
psql (9.4.4)  
Type "help" for help.  
postgres=# select fenci('阿里巴巴阿里妈妈');  
          fenci             
--------------------------  
 '妈妈' '阿里' '阿里巴巴'  
(1 row)  
  
discard all不影响已经载入的字典，可以放心在连接池使用。  
postgres=# discard all;  
DISCARD ALL  
postgres=# select fenci1('阿里巴巴,阿里妈妈');  
             fenci1               
--------------------------------  
 阿里 巴巴 阿里巴巴 , 阿里 妈妈  
(1 row)  
Time: 1.237 ms  

相比PostgreSQL 提供的full text search功能，这种用法比较入门。

这种用法建议在应用端实施。

另外一个结巴分词插件，是志铭同学提供的，已经转成了C的扩展，性能很棒。

https://github.com/jaiminpan/pg_jieba

《如何加快PostgreSQL结巴分词加载速度》

《使用阿里云PostgreSQL zhparser时不可不知的几个参数》

《PostgreSQL 行级全文检索》

《PostgreSQL 如何高效解决按任意字段分词检索的问题 - case 1》

《聊一聊双十一背后的技术 - 分词和搜索》

《多国语言字符串的加密、全文检索、模糊查询的支持》