PostgreSQL varlena field upto 1GB and large object upto 4TB(8KB block_size)(>=9.3) 2GB(<=9.2)

3 minute read

背景

我们知道PostgreSQL变长类型的字段有4种存储类型可以选择, 如下.

catalog/pg_type.h

        /* ----------------  
         * typstorage tells if the type is prepared for toasting and what  
         * the default strategy for attributes of this type should be.  
         *  
         * 'p' PLAIN      type not prepared for toasting  
         * 'e' EXTERNAL   external storage possible, don't try to compress  
         * 'x' EXTENDED   try to compress and store external if required  
         * 'm' MAIN               like 'x' but try to keep in main tuple  
         * ----------------  
         */  
        char            typstorage;  

定长字段没有外部存储方式可以选择,都是存储为plain模式.

当变长字段选择p或者m时, 数据在基表block中存储. 那么由于行不能跨block存储的限制, 行的最大size受到block size的限制.

当选择x或者e存储时, 当字段长度超过一定的阈值则会存储到基表以外的TOAST表中,(每个基表可能对应1个TOAST表. pg_class.reltoastrelid).

阈值代码如下.

/*  
 * When we store an oversize datum externally, we divide it into chunks  
 * containing at most TOAST_MAX_CHUNK_SIZE data bytes.  This number *must*  
 * be small enough that the completed toast-table tuple (including the  
 * ID and sequence fields and all overhead) will fit on a page.  
 * The coding here sets the size on the theory that we want to fit  
 * EXTERN_TUPLES_PER_PAGE tuples of maximum size onto a page.  
 *  
 * NB: Changing TOAST_MAX_CHUNK_SIZE requires an initdb.  
 */  
#define EXTERN_TUPLES_PER_PAGE  4               /* tweak only this */  
  
#define EXTERN_TUPLE_MAX_SIZE   MaximumBytesPerTuple(EXTERN_TUPLES_PER_PAGE)  
  
#define TOAST_MAX_CHUNK_SIZE    \  
        (EXTERN_TUPLE_MAX_SIZE -                                                        \  
         MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) -      \  
         sizeof(Oid) -                                                                          \  
         sizeof(int32) -                                                                        \  
         VARHDRSZ)  

通过调整阈值可以调整数据存储方式, 有兴趣的朋友参考我以前写的一篇blog.

http://blog.163.com/digoal@126/blog/static/16387704020130108132117/

那么为什么变长字段的数据存储在TOAST长度限制是1GB呢?

正文

代码如下 :

src/include/postgres.h

/*  
 * These structs describe the header of a varlena object that may have been  
 * TOASTed.  Generally, don't reference these structs directly, but use the  
 * macros below.  
 *  
 * We use separate structs for the aligned and unaligned cases because the  
 * compiler might otherwise think it could generate code that assumes  
 * alignment while touching fields of a 1-byte-header varlena.  
 */  
typedef union  
{  
        struct                                          /* Normal varlena (4-byte length) */  
        {  
                uint32          va_header;  
                char            va_data[1];  
        }                       va_4byte;  
        struct                                          /* Compressed-in-line format */  
        {  
                uint32          va_header;  
                uint32          va_rawsize; /* Original data size (excludes header) */  
                char            va_data[1]; /* Compressed data */  
        }                       va_compressed;  
} varattrib_4b;  
  
typedef struct  
{  
        uint8           va_header;  
        char            va_data[1];             /* Data begins here */  
} varattrib_1b;  
  
/* inline portion of a short varlena pointing to an external resource */  
typedef struct  
{  
        uint8           va_header;              /* Always 0x80 or 0x01 */  
        uint8           va_tag;                 /* Type of datum */  
        char            va_data[1];             /* Data (of the type indicated by va_tag) */  
} varattrib_1b_e;  
  
/*  
 * Bit layouts for varlena headers on big-endian machines:  
 *  
 * 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)  
 * 01xxxxxx 4-byte length word, aligned, *compressed* data (up to 1G)  
 * 10000000 1-byte length word, unaligned, TOAST pointer  
 * 1xxxxxxx 1-byte length word, unaligned, uncompressed data (up to 126b)  
 *  
 * Bit layouts for varlena headers on little-endian machines:  
 *  
 * xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)  
 * xxxxxx10 4-byte length word, aligned, *compressed* data (up to 1G)  
 * 00000001 1-byte length word, unaligned, TOAST pointer  
 * xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)  
 *  
 * The "xxx" bits are the length field (which includes itself in all cases).  
 * In the big-endian case we mask to extract the length, in the little-endian  
 * case we shift.  Note that in both cases the flag bits are in the physically  
 * first byte.  Also, it is not possible for a 1-byte length word to be zero;  
 * this lets us disambiguate alignment padding bytes from the start of an  
 * unaligned datum.  (We now *require* pad bytes to be filled with zero!)  
 *  
 * In TOAST datums the tag field in varattrib_1b_e is used to discern whether  
 * its an indirection pointer or more commonly an on-disk tuple.  
 */  

4字节的头除去2个比特位剩余的30个比特位用来表示该字段存储的实际长度.

所以最大的长度是2^30-1=1GB.

如果要存储单字段超过1GB的数据, 可以选择large object, 在9.2以及更早的版本中大对象支持2GB的存储.

在9.3以及后续的版本, 加入了64位访问接口, 支持单字段最大4TB的存储.

有兴趣的朋友可参考如下BLOG :

http://blog.163.com/digoal@126/blog/static/163877040201341082422954/

参考

1. http://blog.163.com/digoal@126/blog/static/16387704020130108132117/

2. http://blog.163.com/digoal@126/blog/static/16387704020130108132117/

3. http://www.postgresql.org/docs/9.3/static/storage-toast.html

4. http://www.postgresql.org/docs/9.3/static/lo-intro.html

5. http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=461ef73f0977c95c9452680495bc161618db9227

6. http://blog.163.com/digoal@126/blog/static/163877040201341082422954/

7. 数组结构参考如下

src/include/utils/array.h

Flag Counter

digoal’s 大量PostgreSQL文章入口