PostgreSQL 使用 nlpbamboo chinesecfg 中文分词

4 minute read

背景

环境 :

CentOS 5.x 64bit  
PostgreSQL 9.1.3  
nlpbamboo-1.1.2  
cmake-2.8.8  
CRF++-0.57  

安装 :

-- cmake  
tar -zxvf cmake-2.8.8.tar.gz  
cd cmake-2.8.8  
./bootstrap --prefix=/opt/cmake2.8.8  
gmake  
gmake install  
  
vi ~/.bash_profile  
export PATH=/opt/cmake2.8.8/bin:$PATH  
. ~/.bash_profile  
  
-- crf  
tar -zxvf CRF++-0.57.tar.gz  
cd CRF++-0.57  
./configure  
gmake  
gmake install  
  
-- nlpbamboo  
vi ~/.bash_profile  
export PGHOME=/opt/pgsql  
export PATH=$PGHOME/bin:/opt/bamboo/bin:/opt/cmake2.8.8/bin:$PATH:.  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:.  
. ~/.bash_profile  
  
tar -jxvf nlpbamboo-1.1.2.tar.bz2  
cd nlpbamboo-1.1.2  
mkdir build  
cd build  
cmake .. -DCMAKE_BUILD_TYPE=release  
gmake all  
gmake install  
-- 加入lib库链接.  
echo "/usr/lib" >>/etc/ld.so.conf (这个命令是bamboo对应的动态链接库)  
echo "/usr/local/lib" >>/etc/ld.so.conf (这个命令是CRF对应的动态链接库)  
ldconfig -f /etc/ld.so.conf  
  
-- 测试是否加入正常  
ldconfig -p|grep bambo  
        libbamboo.so.2 (libc6,x86-64) => /usr/lib/libbamboo.so.2  
        libbamboo.so (libc6,x86-64) => /usr/lib/libbamboo.so  
ldconfig -p|grep crf  
        libcrfpp.so.0 (libc6,x86-64) => /usr/local/lib/libcrfpp.so.0  
        libcrfpp.so (libc6,x86-64) => /usr/local/lib/libcrfpp.so  
-- 加入索引  
cd /opt/bamboo  
wget http://nlpbamboo.googlecode.com/files/index.tar.bz2  
tar -jxvf index.tar.bz2  
  
-- 编译PostgreSQL支持模块.  
cd /opt/bamboo/exts/postgres/chinese_parser  
make  
make install  
touch $PGHOME/share/tsearch_data/chinese_utf8.stop  
  
cd /opt/bamboo/exts/postgres/pg_tokenize  
make  
make install  
  
-- 安装PostgreSQL支持模块  
su - postgres  
cd $PGHOME/share/contrib/  
psql -h 127.0.0.1 postgres postgres -f chinese_parser.sql  
psql -h 127.0.0.1 postgres postgres -f pg_tokenize.sql  

查看全文检索配置中加入了chinesecfg的配置.

postgres=# select * from pg_ts_config;  
  cfgname   | cfgnamespace | cfgowner | cfgparser   
------------+--------------+----------+-----------  
 simple     |           11 |       10 |      3722  
 danish     |           11 |       10 |      3722  
 dutch      |           11 |       10 |      3722  
 english    |           11 |       10 |      3722  
 finnish    |           11 |       10 |      3722  
 french     |           11 |       10 |      3722  
 german     |           11 |       10 |      3722  
 hungarian  |           11 |       10 |      3722  
 italian    |           11 |       10 |      3722  
 norwegian  |           11 |       10 |      3722  
 portuguese |           11 |       10 |      3722  
 romanian   |           11 |       10 |      3722  
 russian    |           11 |       10 |      3722  
 spanish    |           11 |       10 |      3722  
 swedish    |           11 |       10 |      3722  
 turkish    |           11 |       10 |      3722  
 chinesecfg |           11 |       10 |     33463  
(17 rows)  

测试tokenize分词函数

postgres=# select * from tokenize('你好我是中国人');  
      tokenize         
---------------------  
 你好 我 是 中国 人   
(1 row)  
postgres=# select * from tokenize('中华人民共和国');  
    tokenize       
-----------------  
 中华人民共和国   
(1 row)  
postgres=# select * from tokenize('百度');  
 tokenize   
----------  
 百度   
(1 row)  
postgres=# select * from tokenize('谷歌');  
 tokenize   
----------  
 谷歌   
(1 row)  
postgres=# select * from tokenize('今年是龙年');  
   tokenize      
---------------  
 今年 是 龙年   
(1 row)  

测试全文检索类型转换函数

postgres=# select * from to_tsvector('chinesecfg','你好,我是中国人.目前在杭州斯凯做数据库相关的工作.');  
                                                                to_tsvector                                                           
         
------------------------------------------------------------------------------------------------------------------------------------  
-------  
 ',':2 '.':7,17 '中国':5 '人':6 '你好':1 '做':12 '在':9 '工作':16 '我':3 '数据库':13 '斯凯':11 '是':4 '杭州':10 '的':15 '目前':8 '相  
关':14  
(1 row)  

全文检索类型适合场景, 大内容的模糊查询, 非精确的模糊匹配 :

例如, title和content分别保存标题和正文的原始内容, ts_title和ts_content用来保存分词后的全文检索内容.

postgres=# create table blog (id serial primary key, user_id int8, title text, content text, ts_title tsvector, ts_content tsvector);  
NOTICE:  CREATE TABLE will create implicit sequence "blog_id_seq" for serial column "blog.id"  
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "blog_pkey" for table "blog"  
CREATE TABLE  

在tsvector的字段上可以使用gist索引, 或者gin索引, 加速检索.

postgres=# create index idx_blog_ts1 on blog using gist(ts_title);  
CREATE INDEX  
postgres=# create index idx_blog_ts2 on blog using gist(ts_content);  
CREATE INDEX  

插入测试内容 :

postgres=# insert into blog (user_id,title,content,ts_title,ts_content) values (1,'PostgreSQL QQ群 FAQ贴 - 1','QQ群里一些网友问到的 问题，收集如下 :   
目录 :   
PostgreSQL存储过程中自定义异常怎么弄?  
PostgreSQL9.1的同步事务在某些情况下用户主动cancel等待sync replication standby 的acknowledge,实际本地已提交.  
PostgreSQL如何满足已经存在则更新, 不存在则插入的需求.  
copy和insert哪个效率高?  
PostgreSQL能不能限制数据库的大小?  
怎样给一个用户授予只读角色?  
不想让数据插入到某个表应该怎么做?  
PostgreSQL 中有没有rownum这样的，显示结果集的序号?  
PostgreSQL 函数中如何使用savepoint?  
请问, pg脚本有宏替换, 计算字符串公式的能力? 类似 a=2 ; evaluate(5-a) ; 如果将这个值赋值给这个变量呢? zresult = evaluate(5-a) ;  
UPDATE A表 FROM B表 ?  
hex转decimal  
PostgreSQL 时区.',to_tsvector('chinesecfg','PostgreSQL QQ群 FAQ贴 - 1'),to_tsvector('chinesecfg','QQ群里一些网友问到的问题，收集如下 :   
目录 :   
PostgreSQL存储过程中自定义异常怎么弄?  
PostgreSQL9.1的同步事务在某些情况下用户主动cancel等待sync replication standby 的acknowledge,实际本地已提交.  
PostgreSQL如何满足已经存在则更新, 不存在则插入的需求.  
copy和insert哪个效率高?  
PostgreSQL能不能限制数据库的大小?  
怎样给一个用户授予只读角色?  
不想让数据插入到某个表应该怎么做?  
PostgreSQL 中有没有rownum这样的，显示结果集的序号?  
PostgreSQL 函数中如何使用savepoint?  
请问, pg脚本有宏替换, 计算字符串公式的能力? 类似 a=2 ; evaluate(5-a) ; 如果将这个值赋值给这个变量呢? zresult = evaluate(5-a) ;  
UPDATE A表 FROM B表 ?  
hex转decimal  
PostgreSQL 时区.'));  

分词查询测试 :

postgres=# select ts_content from blog;  
'(':155,171 ')':157,173 ',':46,60,135,142 '.':28,51,67,189 '1':29 '1.':15 '10.':133 '11.update':175 '12.':182 '13.':186 '2':152 '2.  
':26 '3.':52 '4.':68 '5-a':156,172 '5.':76 '6.':86 '7.':96 '8.':109 '9.':125 ':':12,14 ';':153,158,174 '=':151,169 '?':25,75,85,95,1  
08,124,132,148,167,181 'a':150,176 'acknowledge':45 'b':179 'cancel':39 'copy':69 'decimal':185 'evaluate':154,170 'from':178 'hex':  
183 'insert':71 'pg':136 'postgresql':16,53,77,110,126,187 'postgresql9':27 'qq':1 'replication':42 'rownum':115 'savepoint':131 'st  
andby':43 'sync':41 'zresult':168 '一个':89 '一些':3 '下':36 '不':61,79,97 '中':19,111,128 '主动':38 '事务':32 '使用':130 '值赋值':1  
62 '做':107 '公式':145 '函数':127 '则':58,63 '到':6,102 '变量':165 '只':92 '同步':31 '呢':166 '和':70 '哪个':72 '在':33 '大小':84 '  
下':11 '如何':54,129 '如果':159 '字符串':144 '存储':17 '存在':57,62 '宏替':140 '定义':21 '实际':47 '将':160 '已':49 '已经':56 '序号'  
:123 '应该':105 '异常':22 '弄':24 '怎么':23,106 '怎样':87 '情况':35 '想':98 '换':141 '授予':91 '提交':50 '插入':64,101 '收集':10 '效  
率':73 '数据':100 '数据库':82 '时区':188 '显示':119 '更新':59 '有':112,114,139 '本':138 '本地':48 '某个':103 '某些':34 '没':113 '满  
':55 '用户':37,90 '的':7,30,44,65,83,117,122,146 '目录':13 '等待':40 '类似':149 '结果':120 '给':88,163 '网友':4 '群里':2 '能':78,80   
'能力':147 '脚':137 '自':20 '表':104,177,180 '角色':94 '计算':143 '让':99 '请问':134 '读':93 '转':184 '过程':18 '这个':161,164 '这样  
':116 '问':5 '问题':8 '限制':81 '集':121 '需求':66 '高':74 '，':9,118  

从上面的全文检索类型字段ts_content可以看出, 函数和表在这分词里面, 使用函数和表作为匹配条件查询时将返回结果, 换个不存在的查询则没有结果 :

postgres=# select user_id,title from blog where ts_content @@ to_tsquery('函数 & 中国');  
 user_id | title   
---------+-------  
(0 rows)  
  
postgres=# select user_id,title from blog where ts_content @@ to_tsquery('函数 & 表');  
 user_id |           title             
---------+---------------------------  
       1 | PostgreSQL QQ群 FAQ贴 - 1  
(1 row)  

查看执行计划 :

postgres=# explain select user_id,title from blog where ts_content @@ to_tsquery('函数 & 中国');  
                                QUERY PLAN                                  
--------------------------------------------------------------------------  
 Index Scan using idx_blog_ts2 on blog  (cost=0.00..4.27 rows=1 width=40)  
   Index Cond: (ts_content @@ to_tsquery('函数 & 中国'::text))  
(2 rows)  
  
postgres=# explain select user_id,title from blog where ts_content @@ to_tsquery('函数 & 表');  
                                QUERY PLAN                                  
--------------------------------------------------------------------------  
 Index Scan using idx_blog_ts2 on blog  (cost=0.00..4.27 rows=1 width=40)  
   Index Cond: (ts_content @@ to_tsquery('函数 & 表'::text))  
(2 rows)  

参考

http://code.google.com/p/nlpbamboo/

http://crfpp.googlecode.com/svn/trunk/doc/index.html#download

http://www.cmake.org/

http://www.postgresql.org/docs/9.2/static/datatype-textsearch.html

http://www.postgresql.org/docs/9.2/static/functions-textsearch.html

其他分词工具

1. http://bbs.pgsqldb.com/client/post_show.php?zt_auto_bh=57211

2. http://amutu.com/blog/zhparser/

3. https://github.com/amutu/zhparser

digoal’s 大量PostgreSQL文章入口

Twitter Facebook Google+ LinkedIn

Digoal.zhou

PostgreSQL 使用 nlpbamboo chinesecfg 中文分词

背景

参考

其他分词工具

digoal’s 大量PostgreSQL文章入口

You May Also Enjoy

PostgreSQL(PPAS 兼容Oracle) 从零开始入门手册 - 珍藏版

PostgreSQL pipelinedb 流计算插件 - IoT应用 - 实时轨迹聚合

PostgreSQL plpgsql 存储过程、函数 - 状态、异常变量打印、异常捕获… - GET [STACKED] DIAGNOSTICS

PostgreSQL datediff 日期间隔（单位转换）兼容SQL用法