PostgreSQL sql文件编码引起的数据导入乱码或查询字符集异常报错(invalid byte sequence)
背景
当用户客户端字符集与服务端字符集不匹配时,写入的多字节字符(例如中文)可能出现乱码。
例子
数据库字符集为sql_ascii,允许存储任意编码字符。
digoal@pg11-320tb-zfs-> psql
psql (11beta4)
Type "help" for help.
postgres=# \l+
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
-----------+----------+-----------+---------+------------+-----------------------+--------+------------+--------------------------------------------
postgres | postgres | SQL_ASCII | C | en_US.utf8 | | 140 TB | pg_default | default administrative connection database
template0 | postgres | SQL_ASCII | C | en_US.utf8 | =c/postgres +| 15 MB | pg_default | unmodifiable empty database
| | | | | postgres=CTc/postgres | | |
template1 | postgres | SQL_ASCII | C | en_US.utf8 | =c/postgres +| 15 MB | pg_default | default template for new databases
| | | | | postgres=CTc/postgres | | |
(3 rows)
客户端为utf8编码
digoal@pg11-320tb-zfs-> echo $LANG
en_US.utf8
编辑一个文件,以UTF8编码
vi test.sql
insert into tbl values (1, '你好');
内容如下
digoal@pg11-320tb-zfs-> cat test.sql
insert into tbl values (1, '你好');
编码如下
digoal@pg11-320tb-zfs-> file test.sql
test.sql: UTF-8 Unicode text
转换为GBK,写入数据库
digoal@pg11-320tb-zfs-> iconv --help
Usage: iconv [OPTION...] [FILE...]
Convert encoding of given files from one encoding to another.
Input/Output format specification:
-f, --from-code=NAME encoding of original text
-t, --to-code=NAME encoding for output
Information:
-l, --list list all known coded character sets
Output control:
-c omit invalid characters from output
-o, --output=FILE output file
-s, --silent suppress warnings
--verbose print progress information
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
digoal@pg11-320tb-zfs-> iconv -f UTF8 -t GBK test.sql|psql -f -
INSERT 0 1
或者这样写入(client_encoding和server都设置为sql_ascii时,不检查编码合法性,直接存入数据库)
digoal@pg11-320tb-zfs-> iconv -f UTF8 -t GBK test.sql -o test.sql.gbk
digoal@pg11-320tb-zfs-> psql
psql (11beta4)
Type "help" for help.
postgres=# set client_encoding =sql_ascii;
SET
postgres=# \i ./test.sql.gbk
INSERT 0 1
当设置客户端client_encoding为utf8编码时,由于存入的数据编码不合法,导致查询异常
digoal@pg11-320tb-zfs-> psql
psql (11beta4)
Type "help" for help.
postgres=# set client_encoding =utf8;
SET
postgres=# select * from tbl;
ERROR: invalid byte sequence for encoding "UTF8": 0xc4 0xe3
当client_encoding设置为GBK编码,查询为乱码
postgres=# set client_encoding =gbk;
SET
postgres=# select * from tbl;
id | info
----+------
1 | ?oí
(1 row)
参考
《[转] SqlServe到PG迁移错误:无效的编码序列”UTF8”: 0x00》
《PostgreSQL UTF8 和 GB18030编码map文件不完整的问题》
《PostgreSQL WHY ERROR: invalid byte sequence for encoding “UTF8”》
《PostgreSQL SQL_ASCII encoding introduce》
《PostgreSQL Server Encoding sql_ascii attention》