PostgreSQL汉字转拼音或拼音首字母的应用
背景
能不能把在ORACLE里实现的汉字转拼音首字母的功能在PG里面实现以下。
比如股票软件,五粮液这个股,输入wly就可以出来了。
其实用到的就是汉字的拼音首字母,提高了用户体验。
在手机上打字比电脑上打字更慢,如果能实现这种类似的缩写可以大大提高用户体验,比如在搜索人名或者歌曲名,城市名时使用可以比较广泛。当然一般程序上就可以有类似的库可以调用,实现起来也比数据库效率要高。
我这里只是举个数据库实现的例子。
首先要找到汉字对应的拼音的码表,这个网上很多,我以gbk为例, 共计6727个汉字. 其他的编码如18030汉字更多,但是常用的GBK其实也够了.
1、汉字和拼音以及拼音首字母的对照表
create table pinyin (hz varchar(1),py varchar(6),zm varchar(1));
2、索引以及唯一约束
create index idx_pinyin_hz on pinyin(hz);
create unique index idx_pinyin_hz_py on pinyin(hz,py);
-- create unique index idx_pinyin_hz_zm on pinyin(hz,zm); -- 这个可能不能唯一,例如多音字,可能首字母重复。
3、未收录汉字将插入以下表
create table new_discover (hz varchar(1) primary key,py varchar(6),zm varchar(1));
4、码表的SQL详见附件
接下来创建字符串转拼音和拼音首字母的函数, 注意有些是多音字, 所以这个函数以数组的形式输出, 组合了所有的多音字.
仅支持UTF8编码,因为ascii函数转超过255的码时仅支持UTF8.
如果要通用,可以把ascii的判断去掉.把ascii码表写入pinyin表即可.
5、转换函数如下
CREATE OR REPLACE FUNCTION get_py_zm(i_hz text) returns setof record AS $$
DECLARE
v_hz text;
i int;
v_sql1 text;
v_sql2 text;
v_sql3 text;
v_sql4 text;
v_sql text;
v_max_id int;
v_id int;
BEGIN
-- 存储字符串中每个汉字的拼音和拼音首字母的临时表
CREATE TEMPORARY TABLE IF NOT EXISTS tmp_get_py_zm (id int,py varchar(6),zm varchar(1)) ON COMMIT DELETE ROWS;
-- 如果要在一个事务中多次调用这个函数, 需要先清除临时表的数据.
truncate table tmp_get_py_zm;
-- 把字符串拆分成一个个字符,并转成拼音和拼音首字母存储到临时表中.
i := 0;
for v_hz in select regexp_split_to_table(i_hz,'') loop
if ascii(v_hz) > 255 then
insert into tmp_get_py_zm select i,py,zm from pinyin where hz=v_hz;
else
insert into tmp_get_py_zm values(i,v_hz,v_hz);
end if;
if not found then
perform 1 from new_discover where hz = v_hz;
if not found then
insert into new_discover(hz) values(v_hz);
end if;
return;
end if;
i := i+1;
end loop;
-- 拼装输出数组的SQL
select max(id) into v_max_id from tmp_get_py_zm;
if v_max_id > 0 then
v_sql1 := '';
v_sql3 := '';
v_sql4 := '';
v_id := 0;
for v_id in select generate_series(0,v_max_id) loop
if v_id <> v_max_id then
v_sql1 := v_sql1||'(select py,zm from tmp_get_py_zm where id='||v_id||') as t'||v_id||',';
v_sql3 := v_sql3||'t'||v_id||'.py::text||';
v_sql4 := v_sql4||'t'||v_id||'.zm::text||';
else
v_sql1 := v_sql1||'(select py,zm from tmp_get_py_zm where id='||v_id||') as t'||v_id;
v_sql3 := v_sql3||'t'||v_id||'.py::text';
v_sql4 := v_sql4||'t'||v_id||'.zm::text';
v_sql := 'select array_agg('||v_sql3||'),array_agg('||v_sql4||') from '||v_sql1;
end if;
end loop;
else
v_sql := 'select array_agg(py::text),array_agg(zm::text) from tmp_get_py_zm';
end if;
-- 输出数组
return query execute v_sql;
return;
END;
$$ language plpgsql;
测试, 长是多音字, 所以输出了两个元素的数组
digoal=# select * from get_py_zm('你好a 长安, 我很好!') as (c1 text[],c2 text[]);
NOTICE: relation "tmp_get_py_zm" already exists, skipping
CONTEXT: SQL statement "CREATE TEMPORARY TABLE IF NOT EXISTS tmp_get_py_zm (id int,py varchar(6),zm varchar(1)) ON COMMIT DELETE ROWS"
PL/pgSQL function "get_py_zm" line 13 at SQL statement
c1 | c2
-----------------------------------------------------------+---------------------------------
{"nihaoa changan, wohenhao!","nihaoa zhangan, wohenhao!"} | {"nha ca, whh!","nha za, whh!"}
测试无多音字的情况
digoal=# select * from get_py_zm('刘德华') as (c1 text[],c2 text[]);
NOTICE: relation "tmp_get_py_zm" already exists, skipping
CONTEXT: SQL statement "CREATE TEMPORARY TABLE IF NOT EXISTS tmp_get_py_zm (id int,py varchar(6),zm varchar(1)) ON COMMIT DELETE ROWS"
PL/pgSQL function "get_py_zm" line 13 at SQL statement
c1 | c2
------------+-------
{liudehua} | {ldh}
digoal=# select * from get_py_zm('五粮液') as (c1 text[],c2 text[]);
NOTICE: relation "tmp_get_py_zm" already exists, skipping
CONTEXT: SQL statement "CREATE TEMPORARY TABLE IF NOT EXISTS tmp_get_py_zm (id int,py varchar(6),zm varchar(1)) ON COMMIT DELETE ROWS"
PL/pgSQL function "get_py_zm" line 13 at SQL statement
c1 | c2
-------------+-------
{wuliangye} | {wly}
digoal=# select * from get_py_zm('五粮液 茅台') as (c1 text[],c2 text[]);
NOTICE: relation "tmp_get_py_zm" already exists, skipping
CONTEXT: SQL statement "CREATE TEMPORARY TABLE IF NOT EXISTS tmp_get_py_zm (id int,py varchar(6),zm varchar(1)) ON COMMIT DELETE ROWS"
PL/pgSQL function "get_py_zm" line 13 at SQL statement
c1 | c2
----------------------+------------
{"wuliangye maotai"} | {"wly mt"}
输入一个码表中不存在的汉字
digoal=# select * from get_py_zm('爲') as (c1 text[],c2 text[]);
NOTICE: relation "tmp_get_py_zm" already exists, skipping
CONTEXT: SQL statement "CREATE TEMPORARY TABLE IF NOT EXISTS tmp_get_py_zm (id int,py varchar(6),zm varchar(1)) ON COMMIT DELETE ROWS"
PL/pgSQL function "get_py_zm" line 13 at SQL statement
c1 | c2
----+----
(0 rows)
在码表中不存在的汉字将记录到new_discover 表
digoal=# select * from new_discover ;
hz | py | zm
----+----+----
爲 | |
码表中存在和不存在的汉字混合使用时也无输出
digoal=# select * from get_py_zm('这是繁體') as (c1 text[],c2 text[]);
NOTICE: relation "tmp_get_py_zm" already exists, skipping
CONTEXT: SQL statement "CREATE TEMPORARY TABLE IF NOT EXISTS tmp_get_py_zm (id int,py varchar(6),zm varchar(1)) ON COMMIT DELETE ROWS"
PL/pgSQL function "get_py_zm" line 13 at SQL statement
c1 | c2
----+----
(0 rows)
digoal=# select * from new_discover ;
hz | py | zm
----+----+----
爲 | |
體 | |
所以在发现new_discover中有记录后应该尽快完善到码表中.
例如
digoal=# begin;
BEGIN
digoal=# insert into pinyin values('爲','wei','w');
INSERT 0 1
digoal=# insert into pinyin values('體','ti','t');
INSERT 0 1
digoal=# delete from new_discover where hz in ('體','爲');
DELETE 2
digoal=# end;
COMMIT
再次执行以上查询时有返回结果
digoal=# select * from get_py_zm('这是繁體') as (c1 text[],c2 text[]);
NOTICE: relation "tmp_get_py_zm" already exists, skipping
CONTEXT: SQL statement "CREATE TEMPORARY TABLE IF NOT EXISTS tmp_get_py_zm (id int,py varchar(6),zm varchar(1)) ON COMMIT DELETE ROWS"
PL/pgSQL function "get_py_zm" line 13 at SQL statement
c1 | c2
---------------+--------
{zheshifanti} | {zsft}
由于汉字在码表中不存在时, 将无转换输出, 所以需要改进一下.
6、改进, 码表中没有的使用问号替代.
创建返回类型以及改进函数, 由于函数输入参数类型一致, 返回结果不一致, 需要将原函数删除再创建
CREATE TYPE t_py_zm as (c1 text[],c2 text[]);
CREATE OR REPLACE FUNCTION get_py_zm(i_hz text) returns setof t_py_zm AS $$
DECLARE
v_hz text;
i int;
v_sql1 text;
v_sql2 text;
v_sql3 text;
v_sql4 text;
v_sql text;
v_max_id int;
v_id int;
BEGIN
set client_min_messages = warning;
CREATE TEMPORARY TABLE IF NOT EXISTS tmp_get_py_zm (id int,py varchar(6),zm varchar(1)) ON COMMIT DELETE ROWS;
truncate table tmp_get_py_zm;
i := 0;
for v_hz in select regexp_split_to_table(i_hz,'') loop
if ascii(v_hz) > 255 then
insert into tmp_get_py_zm select i,py,zm from pinyin where hz=v_hz;
else
insert into tmp_get_py_zm values(i,v_hz,v_hz);
end if;
if not found then
perform 1 from new_discover where hz = v_hz;
if not found then
insert into new_discover(hz) values(v_hz);
end if;
insert into tmp_get_py_zm values(i,'?','?');
end if;
i := i+1;
end loop;
select max(id) into v_max_id from tmp_get_py_zm;
if v_max_id > 0 then
v_sql1 := '';
v_sql3 := '';
v_sql4 := '';
v_id := 0;
for v_id in select generate_series(0,v_max_id) loop
if v_id <> v_max_id then
v_sql1 := v_sql1||'(select py,zm from tmp_get_py_zm where id='||v_id||') as t'||v_id||',';
v_sql3 := v_sql3||'t'||v_id||'.py::text||';
v_sql4 := v_sql4||'t'||v_id||'.zm::text||';
else
v_sql1 := v_sql1||'(select py,zm from tmp_get_py_zm where id='||v_id||') as t'||v_id;
v_sql3 := v_sql3||'t'||v_id||'.py::text';
v_sql4 := v_sql4||'t'||v_id||'.zm::text';
v_sql := 'select array_agg('||v_sql3||'),array_agg('||v_sql4||') from '||v_sql1;
end if;
end loop;
else
v_sql := 'select array_agg(py::text),array_agg(zm::text) from tmp_get_py_zm';
end if;
return query execute v_sql;
return;
END;
$$ language plpgsql;
修改后的测试, 點選两个汉字在码表中不存在. 将以问号代替.
digoal=# select * from get_py_zm('这是繁體,點選');
c1 | c2
--------------------+-------------
{"zheshifanti,??"} | {"zsft,??"}
以问号代替不存在的汉字并插入new_discover表
digoal=# select * from new_discover ;
hz | py | zm
----+----+----
點 | |
選 | |
接下来测试可能使用的场景
创建包含歌手,歌曲信息的表.
create table music (id serial primary key, song text, singer text, songpy text[], songzm text[], singerpy text[], singerzm text[]);
插入一条测试数据
insert into music(song,singer,songpy,songzm,singerpy,singerzm) select '刘德华','中国人',t1.c1,t1.c2,t2.c1,t2.c2 from get_py_zm('刘德华') t1,get_py_zm('中国人') t2;
查看转码是否成功
digoal=# select * from music;
id | song | singer | songpy | songzm | singerpy | singerzm
----+--------+--------+------------+--------+---------------+----------
1 | 刘德华 | 中国人 | {liudehua} | {ldh} | {zhongguoren} | {zgr}
(1 row)
插入10W条测试记录
digoal=# insert into music(song,singer,songpy,songzm,singerpy,singerzm)
select t.info, t.info, (t.t1).c1, (t.t1).c2, (t.t2).c1, (t.t2).c2
from
(select t.info,get_py_zm(t.info) t1,get_py_zm(t.info) t2
from
(select '刘德华'||generate_series(1,100000) as info) as t
)
as t;
创建索引, 加快按字母检索的速度
digoal=# create index idx_music_songpy on music using gin (songpy);
CREATE INDEX
digoal=# create index idx_music_songzm on music using gin (songzm);
CREATE INDEX
digoal=# create index idx_music_singerpy on music using gin (singerpy);
CREATE INDEX
digoal=# create index idx_music_singerzm on music using gin (singerzm);
CREATE INDEX
检索测试
digoal=# explain analyze select * from music where singerzm @> array['ldh1'];
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on music (cost=8.78..188.78 rows=100 width=176) (actual time=0.024..0.024 rows=1 loops=1)
Recheck Cond: (singerzm @> '{ldh1}'::text[])
-> Bitmap Index Scan on idx_music_singerzm (cost=0.00..8.75 rows=100 width=0) (actual time=0.018..0.018 rows=1 loops=1)
Index Cond: (singerzm @> '{ldh1}'::text[])
Total runtime: 0.050 ms
(5 rows)
Time: 0.396 ms
digoal=# select * from music where singerzm @> array['ldh1'];
id | song | singer | songpy | songzm | singerpy | singerzm
-------+---------+---------+-------------+--------+-------------+----------
40947 | 刘德华1 | 刘德华1 | {liudehua1} | {ldh1} | {liudehua1} | {ldh1}
(1 row)
Time: 0.308 ms
速度还不错.