PostgreSQL varlena field upto 1GB and large object upto 4TB(8KB block_size)(>=9.3) 2GB(<=9.2)
背景
我们知道PostgreSQL变长类型的字段有4种存储类型可以选择, 如下.
catalog/pg_type.h
/* ----------------
* typstorage tells if the type is prepared for toasting and what
* the default strategy for attributes of this type should be.
*
* 'p' PLAIN type not prepared for toasting
* 'e' EXTERNAL external storage possible, don't try to compress
* 'x' EXTENDED try to compress and store external if required
* 'm' MAIN like 'x' but try to keep in main tuple
* ----------------
*/
char typstorage;
定长字段没有外部存储方式可以选择,都是存储为plain模式.
当变长字段选择p或者m时, 数据在基表block中存储. 那么由于行不能跨block存储的限制, 行的最大size受到block size的限制.
当选择x或者e存储时, 当字段长度超过一定的阈值则会存储到基表以外的TOAST表中,(每个基表可能对应1个TOAST表. pg_class.reltoastrelid).
阈值代码如下.
/*
* When we store an oversize datum externally, we divide it into chunks
* containing at most TOAST_MAX_CHUNK_SIZE data bytes. This number *must*
* be small enough that the completed toast-table tuple (including the
* ID and sequence fields and all overhead) will fit on a page.
* The coding here sets the size on the theory that we want to fit
* EXTERN_TUPLES_PER_PAGE tuples of maximum size onto a page.
*
* NB: Changing TOAST_MAX_CHUNK_SIZE requires an initdb.
*/
#define EXTERN_TUPLES_PER_PAGE 4 /* tweak only this */
#define EXTERN_TUPLE_MAX_SIZE MaximumBytesPerTuple(EXTERN_TUPLES_PER_PAGE)
#define TOAST_MAX_CHUNK_SIZE \
(EXTERN_TUPLE_MAX_SIZE - \
MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) - \
sizeof(Oid) - \
sizeof(int32) - \
VARHDRSZ)
通过调整阈值可以调整数据存储方式, 有兴趣的朋友参考我以前写的一篇blog.
http://blog.163.com/digoal@126/blog/static/16387704020130108132117/
那么为什么变长字段的数据存储在TOAST长度限制是1GB呢?
正文
代码如下 :
src/include/postgres.h
/*
* These structs describe the header of a varlena object that may have been
* TOASTed. Generally, don't reference these structs directly, but use the
* macros below.
*
* We use separate structs for the aligned and unaligned cases because the
* compiler might otherwise think it could generate code that assumes
* alignment while touching fields of a 1-byte-header varlena.
*/
typedef union
{
struct /* Normal varlena (4-byte length) */
{
uint32 va_header;
char va_data[1];
} va_4byte;
struct /* Compressed-in-line format */
{
uint32 va_header;
uint32 va_rawsize; /* Original data size (excludes header) */
char va_data[1]; /* Compressed data */
} va_compressed;
} varattrib_4b;
typedef struct
{
uint8 va_header;
char va_data[1]; /* Data begins here */
} varattrib_1b;
/* inline portion of a short varlena pointing to an external resource */
typedef struct
{
uint8 va_header; /* Always 0x80 or 0x01 */
uint8 va_tag; /* Type of datum */
char va_data[1]; /* Data (of the type indicated by va_tag) */
} varattrib_1b_e;
/*
* Bit layouts for varlena headers on big-endian machines:
*
* 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
* 01xxxxxx 4-byte length word, aligned, *compressed* data (up to 1G)
* 10000000 1-byte length word, unaligned, TOAST pointer
* 1xxxxxxx 1-byte length word, unaligned, uncompressed data (up to 126b)
*
* Bit layouts for varlena headers on little-endian machines:
*
* xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)
* xxxxxx10 4-byte length word, aligned, *compressed* data (up to 1G)
* 00000001 1-byte length word, unaligned, TOAST pointer
* xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)
*
* The "xxx" bits are the length field (which includes itself in all cases).
* In the big-endian case we mask to extract the length, in the little-endian
* case we shift. Note that in both cases the flag bits are in the physically
* first byte. Also, it is not possible for a 1-byte length word to be zero;
* this lets us disambiguate alignment padding bytes from the start of an
* unaligned datum. (We now *require* pad bytes to be filled with zero!)
*
* In TOAST datums the tag field in varattrib_1b_e is used to discern whether
* its an indirection pointer or more commonly an on-disk tuple.
*/
4字节的头除去2个比特位剩余的30个比特位用来表示该字段存储的实际长度.
所以最大的长度是2^30-1=1GB.
如果要存储单字段超过1GB的数据, 可以选择large object, 在9.2以及更早的版本中大对象支持2GB的存储.
在9.3以及后续的版本, 加入了64位访问接口, 支持单字段最大4TB的存储.
有兴趣的朋友可参考如下BLOG :
http://blog.163.com/digoal@126/blog/static/163877040201341082422954/
参考
1. http://blog.163.com/digoal@126/blog/static/16387704020130108132117/
2. http://blog.163.com/digoal@126/blog/static/16387704020130108132117/
3. http://www.postgresql.org/docs/9.3/static/storage-toast.html
4. http://www.postgresql.org/docs/9.3/static/lo-intro.html
5. http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=461ef73f0977c95c9452680495bc161618db9227
6. http://blog.163.com/digoal@126/blog/static/163877040201341082422954/
7. 数组结构参考如下
src/include/utils/array.h