PostgreSQL sql文件编码引起的数据导入乱码或查询字符集异常报错(invalid byte sequence)

1 minute read

背景

当用户客户端字符集与服务端字符集不匹配时,写入的多字节字符(例如中文)可能出现乱码。

例子

数据库字符集为sql_ascii,允许存储任意编码字符。

digoal@pg11-320tb-zfs-> psql  
psql (11beta4)  
Type "help" for help.  
  
postgres=# \l+  
                                                                 List of databases  
   Name    |  Owner   | Encoding  | Collate |   Ctype    |   Access privileges   |  Size  | Tablespace |                Description                   
-----------+----------+-----------+---------+------------+-----------------------+--------+------------+--------------------------------------------  
 postgres  | postgres | SQL_ASCII | C       | en_US.utf8 |                       | 140 TB | pg_default | default administrative connection database  
 template0 | postgres | SQL_ASCII | C       | en_US.utf8 | =c/postgres          +| 15 MB  | pg_default | unmodifiable empty database  
           |          |           |         |            | postgres=CTc/postgres |        |            |   
 template1 | postgres | SQL_ASCII | C       | en_US.utf8 | =c/postgres          +| 15 MB  | pg_default | default template for new databases  
           |          |           |         |            | postgres=CTc/postgres |        |            |   
(3 rows)  

客户端为utf8编码

digoal@pg11-320tb-zfs-> echo $LANG  
en_US.utf8  

编辑一个文件,以UTF8编码

vi test.sql  
  
insert into tbl values (1, '你好');  

内容如下

digoal@pg11-320tb-zfs-> cat test.sql  
insert into tbl values (1, '你好');  

编码如下

digoal@pg11-320tb-zfs-> file test.sql  
test.sql: UTF-8 Unicode text  

转换为GBK,写入数据库

digoal@pg11-320tb-zfs-> iconv --help  
Usage: iconv [OPTION...] [FILE...]  
Convert encoding of given files from one encoding to another.  
  
 Input/Output format specification:  
  -f, --from-code=NAME       encoding of original text  
  -t, --to-code=NAME         encoding for output  
  
 Information:  
  -l, --list                 list all known coded character sets  
  
 Output control:  
  -c                         omit invalid characters from output  
  -o, --output=FILE          output file  
  -s, --silent               suppress warnings  
      --verbose              print progress information  
  
  -?, --help                 Give this help list  
      --usage                Give a short usage message  
  -V, --version              Print program version  
  
Mandatory or optional arguments to long options are also mandatory or optional  
for any corresponding short options.  
  
For bug reporting instructions, please see:  
<http://www.gnu.org/software/libc/bugs.html>.  
  
  
  
  
digoal@pg11-320tb-zfs-> iconv -f UTF8 -t GBK test.sql|psql -f -  
INSERT 0 1  

或者这样写入(client_encoding和server都设置为sql_ascii时,不检查编码合法性,直接存入数据库)

digoal@pg11-320tb-zfs-> iconv -f UTF8 -t GBK test.sql -o test.sql.gbk  
  
digoal@pg11-320tb-zfs-> psql  
psql (11beta4)  
Type "help" for help.  
  
postgres=# set client_encoding =sql_ascii;  
SET  
postgres=# \i ./test.sql.gbk  
INSERT 0 1  

当设置客户端client_encoding为utf8编码时,由于存入的数据编码不合法,导致查询异常

digoal@pg11-320tb-zfs-> psql  
psql (11beta4)  
Type "help" for help.  
  
  
postgres=# set client_encoding =utf8;  
SET  
postgres=# select * from tbl;  
ERROR:  invalid byte sequence for encoding "UTF8": 0xc4 0xe3  

当client_encoding设置为GBK编码,查询为乱码

postgres=# set client_encoding =gbk;  
SET  
postgres=# select * from tbl;  
 id | info   
----+------  
  1 | ?oí  
          (1 row)  

参考

《PostgreSQL 多字节字符集合法性检测》

《[转] SqlServe到PG迁移错误:无效的编码序列”UTF8”: 0x00》

《PostgreSQL UTF8 和 GB18030编码map文件不完整的问题》

《PostgreSQL WHY ERROR: invalid byte sequence for encoding “UTF8”》

《PostgreSQL SQL_ASCII encoding introduce》

《PostgreSQL Server Encoding sql_ascii attention》

Flag Counter

digoal’s 大量PostgreSQL文章入口