PostgreSQL 多字节字符集合法性检测

9 minute read

背景

前面转了一篇关于字符编码的文章,有兴趣的朋友可以阅读一下。

http://blog.163.com/digoal@126/blog/static/163877040201581133017855/

本文介绍一下PostgreSQL是如何检查多字节字符的合法性的,以及在字符集之间是如何转换的?

我们以UTF8为例,先看看它的编码规则。

UTF-8是unicode编码的实现方式之一,相比UTF-16的好处是解决了兼容ASCII码时有一个字节为0x00的问题,UTF-8它是变长编码,所以必须要有方法能够知道一个字符的字节数。

规则见下图,

当首位为0时,兼容ASCII码,即1个字节。

当首位是110时,即2个1一个0,表示这个字符占用2个字节,并且第二个字节一定是10开头。

当首位是1110时,即3个1一个0,表示这个字符占用3个字节,并且第2,3个字节一定是10开头。

当首位是11110时,即3个1一个0,表示这个字符占用4个字节,并且第2,3,4个字节一定是10开头。

   Char. number range  |        UTF-8 octet sequence  
      (hexadecimal)    |              (binary)  
   --------------------+---------------------------------------------  
   0000 0000-0000 007F | 0xxxxxxx  
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx  
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx  
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  

pic

实际上UTF-8最多可以占用6个字节,不过目前应该实际在用的是4个字节。并且PG也是这么来检测的,后面会看到。

UTF-8的编码规则(占用1,2,3,4个字节时的取值范围)就决定了它的合法性检测方法。

首先是当前字符编码长度的判断,方法如下:

src/backend/utils/mb/wchar.c

/*  
 * Return the byte length of a UTF8 character pointed to by s  
 *  
 * Note: in the current implementation we do not support UTF8 sequences  
 * of more than 4 bytes; hence do NOT return a value larger than 4.  
 * We return "1" for any leading byte that is either flat-out illegal or  
 * indicates a length larger than we support.  
 *  
 * pg_utf2wchar_with_len(), utf8_to_unicode(), pg_utf8_islegal(), and perhaps  
 * other places would need to be fixed to change this.  
 */  
int  
pg_utf_mblen(const unsigned char *s)  
{  
        int                     len;  
  
        if ((*s & 0x80) == 0)    //  即10000000 & ? == 0  
                len = 1;  
        else if ((*s & 0xe0) == 0xc0)  //  即11100000 & ? == 11000000  
                len = 2;  
        else if ((*s & 0xf0) == 0xe0)  //  即11110000 & ? == 11100000  
                len = 3;  
        else if ((*s & 0xf8) == 0xf0)  //  即11111000 & ? == 11110000  
                len = 4;  
#ifdef NOT_USED  // 6个字节不可用,在PostgreSQL中没有定义NOT_USED宏,所以以下代码忽略。  
        else if ((*s & 0xfc) == 0xf8)  
                len = 5;  
        else if ((*s & 0xfe) == 0xfc)  
                len = 6;  
#endif  
        else  //  其他则返回长度1,让下面的校验程序去完成校验。  
                len = 1;  
        return len;  
}  

下面是校验合法性的程序。

static int  
pg_utf8_verifier(const unsigned char *s, int len)  
{  
        int                     l = pg_utf_mblen(s);  
  
        if (len < l)  
                return -1;  
  
        if (!pg_utf8_islegal(s, l))  
                return -1;  
  
        return l;  
}  

合法性的检查需要结合字符长度。

/*  
 * Check for validity of a single UTF-8 encoded character  
 *  
 * This directly implements the rules in RFC3629.  The bizarre-looking  
 * restrictions on the second byte are meant to ensure that there isn't  
 * more than one encoding of a given Unicode character point; that is,  
 * you may not use a longer-than-necessary byte sequence with high order  
 * zero bits to represent a character that would fit in fewer bytes.  
 * To do otherwise is to create security hazards (eg, create an apparent  
 * non-ASCII character that decodes to plain ASCII).  
 *  
 * length is assumed to have been obtained by pg_utf_mblen(), and the  
 * caller must have checked that that many bytes are present in the buffer.  
 */  
bool  
pg_utf8_islegal(const unsigned char *source, int length)  
{  
        unsigned char a;  
  
        switch (length)  
        {  
                default:  
                        /* reject lengths 5 and 6 for now */  
                        return false;  
                case 4:    //  当前字符的编码长度是4时,最后一个字节的取值范围0x80 ~ 0xBF (即10xxxxxx的取值区间)  
                        a = source[3];  
                        if (a < 0x80 || a > 0xBF)  
                                return false;  
                        /* FALL THRU */  
                case 3:    //  当前字符的编码长度是3时,最后一个字节的取值范围0x80 ~ 0xBF (即10xxxxxx的取值区间)  
                        a = source[2];  
                        if (a < 0x80 || a > 0xBF)  
                                return false;  
                        /* FALL THRU */  
                case 2:    //  当前字符的编码长度是2时,需根据最后一个字节的值来判断第一个字节的取值范围。  
                        a = source[1];  
                        switch (*source)  
                        {  
                                case 0xE0:  
                                        if (a < 0xA0 || a > 0xBF)  
                                                return false;  
                                        break;  
                                case 0xED:  
                                        if (a < 0x80 || a > 0x9F)  
                                                return false;  
                                        break;  
                                case 0xF0:  
                                        if (a < 0x90 || a > 0xBF)  
                                                return false;  
                                        break;  
                                case 0xF4:  
                                        if (a < 0x80 || a > 0x8F)  
                                                return false;  
                                        break;  
		default:      
                                        if (a < 0x80 || a > 0xBF)  
                                                return false;  
                                        break;  
                        }  
                        /* FALL THRU */  
                case 1:  
                        a = *source;  
                        if (a >= 0x80 && a < 0xC2)  
                                return false;  
                        if (a > 0xF4)  
                                return false;  
                        break;  
        }  
        return true;  
}  

每种字符集的编码规则不同,所以校验方法也不同。这里不一一列举。

PostgreSQL字符集的转换方法。

postgres=# \df convert*  
                              List of functions  
   Schema   |     Name     | Result data type | Argument data types |  Type    
------------+--------------+------------------+---------------------+--------  
 pg_catalog | convert      | bytea            | bytea, name, name   | normal  
 pg_catalog | convert_from | text             | bytea, name         | normal  
 pg_catalog | convert_to   | bytea            | text, name          | normal  
(3 rows)  

我们使用bytea和转换函数来验证以上检测程序。

postgres=# \set VERBOSITY verbose  
postgres=# select convert_from(bytea '\x80', 'UTF8');  
ERROR:  22021: invalid byte sequence for encoding "UTF8": 0x80  
LOCATION:  report_invalid_encoding, wchar.c:2011  
postgres=# select convert_from(bytea '\xE28080', 'UTF8');  
 convert_from   
--------------  
 ?  
(1 row)  
postgres=# select convert_from(bytea '\xE08080', 'UTF8');  
ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xe0 0x80 0x80  
LOCATION:  report_invalid_encoding, wchar.c:2011  

最终调用的是以下函数:

 pg_catalog | ascii_to_mic                   | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | ascii_to_utf8                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | big5_to_euc_tw                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | big5_to_mic                    | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | big5_to_utf8                   | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_cn_to_mic                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_cn_to_utf8                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_jis_2004_to_shift_jis_2004 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_jis_2004_to_utf8           | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_jp_to_mic                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_jp_to_sjis                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_jp_to_utf8                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_kr_to_mic                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_kr_to_utf8                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_tw_to_big5                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_tw_to_mic                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | euc_tw_to_utf8                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | gb18030_to_utf8                | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | gbk_to_utf8                    | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | iso8859_1_to_utf8              | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | iso8859_to_utf8                | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | iso_to_koi8r                   | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | iso_to_mic                     | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | iso_to_win1251                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | iso_to_win866                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | johab_to_utf8                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | koi8r_to_iso                   | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | koi8r_to_mic                   | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | koi8r_to_utf8                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | koi8r_to_win1251               | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | koi8r_to_win866                | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | koi8u_to_utf8                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | latin1_to_mic                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | latin2_to_mic                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | latin2_to_win1250              | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | latin3_to_mic                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | latin4_to_mic                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_ascii                   | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_big5                    | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_euc_cn                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_euc_jp                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_euc_kr                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_euc_tw                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_iso                     | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_koi8r                   | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_latin1                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_latin2                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_latin3                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_latin4                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_sjis                    | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_win1250                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_win1251                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | mic_to_win866                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | shift_jis_2004_to_euc_jis_2004 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | shift_jis_2004_to_utf8         | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | sjis_to_euc_jp                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | sjis_to_mic                    | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | sjis_to_utf8                   | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | uhc_to_utf8                    | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_ascii                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_big5                   | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_euc_cn                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_euc_jis_2004           | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_euc_jp                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_euc_kr                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_euc_tw                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_gb18030                | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_gbk                    | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_iso8859                | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_iso8859_1              | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_johab                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_koi8r                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_koi8u                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_shift_jis_2004         | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_sjis                   | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_uhc                    | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | utf8_to_win                    | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win1250_to_latin2              | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win1250_to_mic                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win1251_to_iso                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win1251_to_koi8r               | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win1251_to_mic                 | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win1251_to_win866              | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win866_to_iso                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win866_to_koi8r                | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win866_to_mic                  | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win866_to_win1251              | void             | integer, integer, cstring, internal, integer                                       | normal  
 pg_catalog | win_to_utf8                    | void             | integer, integer, cstring, internal, integer                                       | normal  

代码在 :

src/backend/utils/mb/conversion_procs

PostgreSQL服务端支持的字符集,以及客户端支持的字符集。

http://www.postgresql.org/docs/9.5/static/multibyte.html

参考

src/backend/utils/mb/conversion_procs

src/include/mb/pg_wchar.h

src/backend/utils/mb/wchar.c

http://blog.jobbole.com/86813/

http://blog.jobbole.com/74109/

http://www.postgresql.org/docs/9.5/static/multibyte.html

https://tools.ietf.org/html/rfc3629

https://en.wikipedia.org/wiki/UTF-8

https://msdn.microsoft.com/en-us/library/cc194916.aspx

https://msdn.microsoft.com/en-us/library/gg638873

Flag Counter

digoal’s 大量PostgreSQL文章入口