TOAST,The Oversized-Attribute Storage Technique - 暨存储格式main, extended, external, plain介绍

15 minute read

背景

PostgreSQL的数据类型有哪些存储格式？

如何存储超过一个数据块大小的数据呢？

正文

toast直接翻译的话就是切片面包(sliced bread)的意思，PostgreSQL中TOAST是一个缩写，全称是The Oversized-Attribute Storage Technique，为什么会有OVERSIZED-ATTRIBUTE呢？

原因很简单，因为在PostgreSQL,一条记录不能跨PAGE存储，跨越PAGE的话必须使用TOAST（即unaligned,与原表分开存储）存储。

TOAST表不能独立创建，只有当普通表包含了main,extended或external存储格式的字段时，系统会自动创建一个和普通表关联的TOAST表。

当一条记录(tuple)在存储时(如果压缩的话算压缩后的大小)大于TOAST_TUPLE_THRESHOLD(通常是2kB)这个值时，会存储到TOAST表。

而此时在普通表的该字段处包含了一个指向TOAST的tableoid和chunk_id的数据，从而能够找到该字段的记录。

TOAST和普通的TABLE存储不太一样，TOAST表一般包含如下字段 :

 tableoid  -- TOAST表的OID  
 cmax  
 xmax  
 cmin  
 xmin  
 ctid  
 chunk_id  -- 普通表通过TOAST pointer把一个被TOAST的列关联到这里  
 chunk_seq -- 同一个chunk_id如果大于TOAST_MAX_CHUNK_SIZE，将被切片存储。这里存储切片后的序号。  
 chunk_data -- 真实的数据  
  
chunk_id + chunk_seq = primary key  

并不是所有的字段都可以使用TOAST来存储，在PostgreSQL中字段(field)类型有定长的如int4,int8等,也有变长的字段类型如varchar,text等，变长的数据类型最大可以存储1GB的数据，这个可以从PostgreSQL的源代码得到证实，变长字段最多可以选择使用32BIT的列头,预留2BIT作为FLAG，还有30BIT存储字段长度2^30刚好等于1GB。

具体的代码后面会有解释。下面来看一下存储格式。

对于定长的字段类型，存储格式如下：

PLAIN   
prevents either compression or out-of-line storage; furthermore it disables use of single-byte headers for varlena types. This is the only possible strategy for columns of non-TOAST-able data types.  

对于变长的字段类型，除了可以使用PLAIN格式，还可以使用如下存储格式：

EXTENDED   
allows both compression and out-of-line storage.   
This is the default for most TOAST-able data types.   
Compression will be attempted first, then out-of-line storage if the row is still too big.  
  
EXTERNAL   
allows out-of-line storage but not compression.   
Use of EXTERNAL will make substring operations on wide text and bytea columns faster (at the penalty of increased storage space) because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed.  
  
MAIN   
allows compression but not out-of-line storage.   
(Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough to fit on a page.)  

例如我们来创建一个测试表：

根据前面的介绍，TOAST表只在有变长字段，并且存储为main,extended或external时才会创建。

digoal=> create table tbl_user (id int);  -- 这样的表是没有TOAST的。  
digoal=> select relname,reltoastrelid from pg_class where relname='tbl_user';  
 relname  | reltoastrelid   
----------+---------------  
 tbl_user |             0  

原因很简单，定长类型以PLAIN存储。

当其中有一个字段存储格式为main时，系统字段创建了一个OID=2066068的TOAST表。

digoal=> \d+ tbl_user  
               Table "digoal.tbl_user"  
 Column |  Type   | Modifiers | Storage | Description   
--------+---------+-----------+---------+-------------  
 id     | integer |           | plain   |   
digoal=> create table tbl_user (id numeric);  
digoal=> \d+ tbl_user  
               Table "digoal.tbl_user"  
 Column |  Type   | Modifiers | Storage | Description   
--------+---------+-----------+---------+-------------  
 id     | numeric |           | main    |   
digoal=> select relname,reltoastrelid from pg_class where relname='tbl_user';  
 relname  | reltoastrelid   
----------+---------------  
 tbl_user |       2066068  

如何查找关联的TOAST表？

这个前面已经介绍了。根据reltoastrelid 就可以关联到toast表.

digoal=> select relname from pg_class where oid=2066068;  
     relname        
------------------  
 pg_toast_2066065  

如何获得字段大小？

digoal=> create table tbl_article (id int,author name,title varchar(256),content text);  
digoal=> \d+ tbl_article   
                      Table "digoal.tbl_article"  
 Column  |          Type          | Modifiers | Storage  | Description   
---------+------------------------+-----------+----------+-------------  
 id      | integer                |           | plain    |   
 author  | name                   |           | plain    |   
 title   | character varying(256) |           | extended |   
 content | text                   |           | extended |   
  
digoal=> select relname,reltoastrelid from pg_class where relname='tbl_article';  
   relname   | reltoastrelid   
-------------+---------------  
 tbl_article |       2066074  

下面来插入一条测试记录(你可以使用repeat, generate_series, random() md5等函数产生较大记录，或者使用大对象接口导入大文件产生大字段) :

insert into tbl_article (id,author,title,content) values (1,'digoal.zhou','test','51.2. Index Access Method FunctionsThe index construction and maintenance functions that an index access method must provide are:IndexBuildResult *ambuild (Relation heapRelation,         Relation indexRelation,         IndexInfo *indexInfo);Build a new index. The index relation has been physically created, but is empty. It must be filled in with whatever fixed data the access method requires, plus entries for all tuples already existing in the table. Ordinarily the ambuild function will call IndexBuildHeapScan() to scan the table for existing tuples and compute the keys that need to be inserted into the index. The function must return a pallocd struct containing statistics about the new index.boolaminsert (Relation indexRelation,          Datum *values,          bool *isnull,          ItemPointer heap_tid,          Relation heapRelation,          IndexUniqueCheck checkUnique);Insert a new tuple into an existing index. The values and isnull arrays give the key values to be indexed, and heap_tid is the TID to be indexed. If the access method supports unique indexes (its pg_am.amcanunique flag is true) then checkUnique indicates the type of uniqueness check to perform. This varies depending on whether the unique constraint is deferrable; see Section 51.5 for details. Normally the access method only needs the heapRelation parameter when performing uniqueness checking (since then it will have to look into the heap to verify tuple liveness).The functions Boolean result value is significant only when checkUnique is UNIQUE_CHECK_PARTIAL. In this case a TRUE result means the new entry is known unique, whereas FALSE means it might be non-unique (and a deferred uniqueness check must be scheduled). For other cases a constant FALSE result is recommended.Some indexes might not index all tuples. If the tuple is not to be indexed, aminsert should just return without doing anything.IndexBulkDeleteResult *ambulkdelete (IndexVacuumInfo *info,              IndexBulkDeleteResult *stats,              IndexBulkDeleteCallback callback,              void *callback_state);Delete tuple(s) from the index. This is a "bulk delete" operation that is intended to be implemented by scanning the whole index and checking each entry to see if it should be deleted. The passed-in callback function must be called, in the style callback(TID, callback_state) returns bool, to determine whether any particular index entry, as identified by its referenced TID, is to be deleted. Must return either NULL or a pallocd struct containing statistics about the effects of the deletion operation. It is OK to return NULL if no information needs to be passed on to amvacuumcleanup.Because of limited maintenance_work_mem, ambulkdelete might need to be called more than once when many tuples are to be deleted. The stats argument is the result of the previous call for this index (it is NULL for the first call within a VACUUM operation). This allows the AM to accumulate statistics across the whole operation. Typically, ambulkdelete will modify and return the same struct if the passed stats is not null.IndexBulkDeleteResult *amvacuumcleanup (IndexVacuumInfo *info,                 IndexBulkDeleteResult *stats);Clean up after a VACUUM operation (zero or more ambulkdelete calls). This does not have to do anything beyond returning index statistics, but it might perform bulk cleanup such as reclaiming empty index pages. stats is whatever the last ambulkdelete call returned, or NULL if ambulkdelete was not called because no tuples needed to be deleted. If the result is not NULL it must be a pallocd struct. The statistics it contains will be used to update pg_class, and will be reported by VACUUM if VERBOSE is given. It is OK to return NULL if the index was not changed at all during the VACUUM operation, but otherwise correct stats should be returned.As of PostgreSQL 8.4, amvacuumcleanup will also be called at completion of an ANALYZE operation. In this case stats is always NULL and any return value will be ignored. This case can be distinguished by checking info->analyze_only. It is recommended that the access method do nothing except post-insert cleanup in such a call, and that only in an autovacuum worker process.voidamcostestimate (PlannerInfo *root,                IndexOptInfo *index,                List *indexQuals,                RelOptInfo *outer_rel,                Cost *indexStartupCost,                Cost *indexTotalCost,                Selectivity *indexSelectivity,                double *indexCorrelation);Estimate the costs of an index scan. This function is described fully in Section 51.6, below.bytea *amoptions (ArrayType *reloptions,           bool validate);');  

digoal=> select oid,relname,reltoastrelid from pg_class where relname='tbl_article';  
   oid   |   relname   | reltoastrelid   
---------+-------------+---------------  
 2066071 | tbl_article |       2066074  
digoal=> select pg_relation_size(2066071);  
 pg_relation_size   
             8192  
digoal=> select pg_relation_size(2066074);  
 pg_relation_size   
             8192  

注意到，此时已经使用了TOAST，下面来从COLUMN_SIZE来证实一下（2485>TOAST_TUPLE_THRESHOLD 2KB）

digoal=> select pg_column_size(id),pg_column_size(author),pg_column_size(title),pg_column_size(content) from tbl_article;  
 pg_column_size | pg_column_size | pg_column_size | pg_column_size   
----------------+----------------+----------------+----------------  
              4 |             64 |              5 |           2485  

换插一条比较短的记录，看看是否会用到TOAST？

digoal=> truncate table tbl_article;  
digoal=> select pg_relation_size(2066074);  
 pg_relation_size   
                0  
digoal=> select pg_relation_size(2066071);  
 pg_relation_size   
                0  
digoal=> insert into tbl_article (id,author,title,content) values (1,'digoal.zhou','test','test');  
digoal=> select pg_relation_size(2066071);  
 pg_relation_size   
             8192  
digoal=> select pg_relation_size(2066074);  
 pg_relation_size   
                0  

很明显，TOAST未被使用，因为此时没有一个变长字段的长度超过2kB。

digoal=> select pg_column_size(id),pg_column_size(author),pg_column_size(title),pg_column_size(content) from tbl_article;  
 pg_column_size | pg_column_size | pg_column_size | pg_column_size   
----------------+----------------+----------------+----------------  
              4 |             64 |              5 |              5  
pg_column_size : bytes required to store the value, perhaps with compression  

测试超长字段索引 :

insert into tbl_article (id,author,title,content) values (1,'digoal.zhou','test','此处省略1W字');  
digoal=> select pg_column_size(id),pg_column_size(author),pg_column_size(title),pg_column_size(content) from tbl_article;  
 pg_column_size | pg_column_size | pg_column_size | pg_column_size   
----------------+----------------+----------------+----------------  
              4 |             64 |              5 |              5  
              4 |             64 |              5 |           5075  
digoal=> create index idx_content on tbl_article (content);  
ERROR:  index row size 5088 exceeds maximum 2712 for index "idx_content"  
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.  
Consider a function index of an MD5 hash of the value, or use full text indexing.  

索引无法在超长字段上创建，有两处函数会报这个错。

src/backend/access/nbtree/nbtsort.c

src/backend/access/nbtree/nbtinsert.c

        /*  
         * Check whether the item can fit on a btree page at all. (Eventually, we  
         * ought to try to apply TOAST methods if not.) We actually need to be  
         * able to fit three items on every page, so restrict any one item to 1/3  
         * the per-page available space. Note that at this point, itupsz doesn't  
         * include the ItemId.  
         *  
         * NOTE: similar code appears in _bt_insertonpg() to defend against  
         * oversize items being inserted into an already-existing index. But  
         * during creation of an index, we don't go through there.  
         */  
        if (itupsz > BTMaxItemSize(npage))  
                ereport(ERROR,  
                                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),  
                        errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",  
                                   itupsz, BTMaxItemSize(npage),  
                                   RelationGetRelationName(wstate->index)),  
                errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"  
                                "Consider a function index of an MD5 hash of the value, "  
                                "or use full text indexing."),  
                                 errtableconstraint(wstate->heap,  
                                                                        RelationGetRelationName(wstate->index))));  

BTREE索引长度限制，约为BLOCK_SIZE的1/3：

src/include/access/nbtree.h

/*  
 * Maximum size of a btree index entry, including its tuple header.  
 *  
 * We actually need to be able to fit three items on every page,  
 * so restrict any one item to 1/3 the per-page available space.  
 */  
#define BTMaxItemSize(page) \  
        MAXALIGN_DOWN((PageGetPageSize(page) - \  
                                   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \  
                                   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)  

如何查看类型的默认存储格式？

digoal=> select typname,typstorage from pg_type order by typname;  
p : plain  
m : main  
e : external  
x : extended  

如何修改类型的存储格式？

alter table tbl_article alter column {$column_name} set storage { PLAIN | MAIN | EXTERNAL | EXTENDED } ;  

TOAST表的好处

1. UPDATE一个普通表时，当该表的TOAST表存储的数据没有修改时，TOAST表不需要更新。

2. 由于TOAST在物理存储上和普通表分开，所以当SELECT时没有查询被TOAST的列数据时，不需要把这些TOAST的PAGE加载到内存，从而加快了检索速度并且节约了使用空间。

3. 在排序时，由于TOAST和普通表存储分开，当针对非TOAST字段排序时大大提高了排序速度。

使用TOAST存储格式注意事项：

1. 当变长字段上需要使用索引时，权衡CPU和存储的开销，考虑是否需要压缩或非压缩存储。（压缩节约磁盘空间，但是带来CPU的开销）

2. 对于经常要查询或UPDATE的变长字段，如果字段长度不是太大，可以考虑使用MAIN存储。

3. 在超长字段，或者将来会插入超长值的字段上建索引的话需要注意，因为索引最大不能超过三分之一的PAGE，所以超长字段上可能建索引不成功，或者有索引的情况下，超长字段插入值将不成功。解决办法一般可以使用MD5值来建，当然看你的需求了。

小结

1. 哪些tuple会触发TOAST？

当行的长度超过TOAST_TUPLE_THRESHOLD时，会调用toast_insert_or_update，即触发TOAST。(TOAST_TUPLE_THRESHOLD的计算方法参考本文末尾)

见源码：

src/backend/access/heap/heapam.c

/*  
 * Subroutine for heap_insert(). Prepares a tuple for insertion. This sets the  
 * tuple header fields, assigns an OID, and toasts the tuple if necessary.  
 * Returns a toasted version of the tuple if it was toasted, or the original  
 * tuple if not. Note that in any case, the header fields are also set in  
 * the original tuple.  
 */  
static HeapTuple  
heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,  
                                        CommandId cid, int options)  
{  
....  
        else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)  
                return toast_insert_or_update(relation, tup, NULL, options);    
....  
}  
...........................................  
HTSU_Result  
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,  
                        CommandId cid, Snapshot crosscheck, bool wait,  
                        HeapUpdateFailureData *hufd, LockTupleMode *lockmode)  
{  
......  
                need_toast = (HeapTupleHasExternal(&oldtup) ||  
                                          HeapTupleHasExternal(newtup) ||  
                                          newtup->t_len > TOAST_TUPLE_THRESHOLD);  
......  
}  

2. 哪些字段的内容会存储到TOAST，生成TOAST的规则。

这个其实就是toast_insert_or_update函数的处理逻辑：

src/backend/access/heap/tuptoaster.c

        /* ----------  
         * Compress and/or save external until data fits into target length  
         *  
         *      1: Inline compress attributes with attstorage 'x', and store very  
         *         large attributes with attstorage 'x' or 'e' external immediately  
         *      2: Store attributes with attstorage 'x' or 'e' external  
         *      3: Inline compress attributes with attstorage 'm'  
         *      4: Store attributes with attstorage 'm' external  
         * ----------  
         */  
        /* compute header overhead --- this should match heap_form_tuple() */  
        hoff = offsetof(HeapTupleHeaderData, t_bits);  
        if (has_nulls)  
                hoff += BITMAPLEN(numAttrs);  
        if (newtup->t_data->t_infomask & HEAP_HASOID)  
                hoff += sizeof(Oid);  
        hoff = MAXALIGN(hoff);  
        /* now convert to a limit on the tuple data size */  
        maxDataLen = TOAST_TUPLE_TARGET - hoff;  

第一步，依字段顺序压缩存储属性为x(extended)的字段直到TUPLE长度<=maxDataLen，如果压缩（x属性）后或（e属性不压缩）字段长度>maxDataLen，则直接将该字段切到TOAST。

（如果循环结束前TUPLE长度<=maxDataLen，则有可能某些字段就不需要被压缩，跳出循环）

第二步，上一步处理完后如果TUPLE长度还是>maxDataLen，依字段顺序将属性为x,e还未切入TOAST的字段切入TOAST直到TUPLE长度<=maxDataLen。

（如果循环结束前TUPLE长度<=maxDataLen，则有可能某些字段就不需要切入TOAST，跳出循环）

第三步，上一步处理完后如果TUPLE长度还是>maxDataLen，压缩属性为m(main)的字段。

（如果循环结束前TUPLE长度<=maxDataLen，则有可能某些字段就不需要被压缩，跳出循环）

第四步，上一步处理完后如果TUPLE长度还是>maxDataLen，依字段顺序将属性为m(main)的字段切入TOAST直到TUPLE长度<=maxDataLen。

（如果循环结束前TUPLE长度<=maxDataLen，则有可能某些属性为m的字段依旧保留在HEAP TUPLE中，跳出循环）

3. 字段存储到TOAST后，在HEAP TUPLE中存储什么，如何指向TOAST？

切入TOAST的字段在HEAP TUPLE中会存储一个类似指针的结构，结构如下：

struct varatt_external  
{  
    int32        va_rawsize;        /* Original data size (includes header) */  
    int32        va_extsize;        /* External saved size (doesn't) */  
    Oid            va_valueid;        /* Unique ID of value within TOAST table */  
    Oid            va_toastrelid;    /* RelID of TOAST table containing it */  
};  

本文使用到的函数：

                                   List of functions  
   Schema   |          Name          | Result data type | Argument data types |  Type    
------------+------------------------+------------------+---------------------+--------  
 pg_catalog | pg_column_size         | integer          | "any"               | normal  
 pg_catalog | pg_database_size       | bigint           | name                | normal  
 pg_catalog | pg_database_size       | bigint           | oid                 | normal  
 pg_catalog | pg_indexes_size        | bigint           | regclass            | normal  
 pg_catalog | pg_relation_size       | bigint           | regclass            | normal  
 pg_catalog | pg_relation_size       | bigint           | regclass, text      | normal  
 pg_catalog | pg_size_pretty         | text             | bigint              | normal  
 pg_catalog | pg_table_size          | bigint           | regclass            | normal  
 pg_catalog | pg_tablespace_size     | bigint           | name                | normal  
 pg_catalog | pg_tablespace_size     | bigint           | oid                 | normal  
 pg_catalog | pg_total_relation_size | bigint           | regclass            | normal  
(11 rows)  

参考

1. postgres.h

/*  
 * struct varatt_external is a "TOAST pointer", that is, the information  
 * needed to fetch a stored-out-of-line Datum.    The data is compressed  
 * if and only if va_extsize < va_rawsize - VARHDRSZ.  This struct must not  
 * contain any padding, because we sometimes compare pointers using memcmp.  
 *  
 * Note that this information is stored unaligned within actual tuples, so  
 * you need to memcpy from the tuple into a local struct variable before  
 * you can look at these fields!  (The reason we use memcmp is to avoid  
 * having to do that just to detect equality of two TOAST pointers...)  
 */  
struct varatt_external  
{  
    int32        va_rawsize;        /* Original data size (includes header) */  
    int32        va_extsize;        /* External saved size (doesn't) */  
    Oid            va_valueid;        /* Unique ID of value within TOAST table */  
    Oid            va_toastrelid;    /* RelID of TOAST table containing it */  
};  
  
/*  
 * These structs describe the header of a varlena object that may have been  
 * TOASTed.  Generally, don't reference these structs directly, but use the  
 * macros below.  
 *  
 * We use separate structs for the aligned and unaligned cases because the  
 * compiler might otherwise think it could generate code that assumes  
 * alignment while touching fields of a 1-byte-header varlena.  
 */  
typedef union  
{  
    struct                        /* Normal varlena (4-byte length) */  
    {  
        uint32        va_header;  
        char        va_data[1];  
    }            va_4byte;  
    struct                        /* Compressed-in-line format */  
    {  
        uint32        va_header;  
        uint32        va_rawsize; /* Original data size (excludes header) */  
        char        va_data[1]; /* Compressed data */  
    }            va_compressed;  
} varattrib_4b;  
  
typedef struct  
{  
    uint8        va_header;  
    char        va_data[1];        /* Data begins here */  
} varattrib_1b;  
  
typedef struct  
{  
    uint8        va_header;        /* Always 0x80 or 0x01 */  
    uint8        va_len_1be;        /* Physical length of datum */  
    char        va_data[1];        /* Data (for now always a TOAST pointer) */  
} varattrib_1b_e;  
  
/*  
 * Bit layouts for varlena headers on big-endian machines:  
 *  
 * 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)  
 * 01xxxxxx 4-byte length word, aligned, *compressed* data (up to 1G)  
 * 10000000 1-byte length word, unaligned, TOAST pointer  
 * 1xxxxxxx 1-byte length word, unaligned, uncompressed data (up to 126b)  
 *  
 * Bit layouts for varlena headers on little-endian machines:  
 *  
 * xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)  
 * xxxxxx10 4-byte length word, aligned, *compressed* data (up to 1G)  
 * 00000001 1-byte length word, unaligned, TOAST pointer  
 * xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)  
 *  
 * The "xxx" bits are the length field (which includes itself in all cases).  
 * In the big-endian case we mask to extract the length, in the little-endian  
 * case we shift.  Note that in both cases the flag bits are in the physically  
 * first byte.    Also, it is not possible for a 1-byte length word to be zero;  
 * this lets us disambiguate alignment padding bytes from the start of an  
 * unaligned datum.  (We now *require* pad bytes to be filled with zero!)  
 */  

2. tuptoaster.h

/*  
 * The code will also consider moving MAIN data out-of-line, but only as a  
 * last resort if the previous steps haven't reached the target tuple size.  
 * In this phase we use a different target size, currently equal to the  
 * largest tuple that will fit on a heap page.    This is reasonable since  
 * the user has told us to keep the data in-line if at all possible.  
 */  
  
/*  
 * If an index value is larger than TOAST_INDEX_TARGET, we will try to  
 * compress it (we can't move it out-of-line, however).  Note that this  
 * number is per-datum, not per-tuple, for simplicity in index_form_tuple().  
 */  
#define TOAST_INDEX_TARGET        (MaxHeapTupleSize / 16)  
  
/*  
 * When we store an oversize datum externally, we divide it into chunks  
 * containing at most TOAST_MAX_CHUNK_SIZE data bytes.    This number *must*  
 * be small enough that the completed toast-table tuple (including the  
 * ID and sequence fields and all overhead) will fit on a page.  
 * The coding here sets the size on the theory that we want to fit  
 * EXTERN_TUPLES_PER_PAGE tuples of maximum size onto a page.  
 *  
 * NB: Changing TOAST_MAX_CHUNK_SIZE requires an initdb.  
 */  

3. src/include/access/tuptoaster.h

/*  
 * Find the maximum size of a tuple if there are to be N tuples per page.  
 */  
#define MaximumBytesPerTuple(tuplesPerPage) \  
        MAXALIGN_DOWN((BLCKSZ - \  
                                   MAXALIGN(SizeOfPageHeaderData + (tuplesPerPage) * sizeof(ItemIdData))) \  
                                  / (tuplesPerPage))  
  
/*  
 * These symbols control toaster activation.  If a tuple is larger than  
 * TOAST_TUPLE_THRESHOLD, we will try to toast it down to no more than  
 * TOAST_TUPLE_TARGET bytes through compressing compressible fields and  
 * moving EXTENDED and EXTERNAL data out-of-line.  
 *  
 * The numbers need not be the same, though they currently are.  It doesn't  
 * make sense for TARGET to exceed THRESHOLD, but it could be useful to make  
 * it be smaller.  
 *  
 * Currently we choose both values to match the largest tuple size for which  
 * TOAST_TUPLES_PER_PAGE tuples can fit on a heap page.  
 *  
 * XXX while these can be modified without initdb, some thought needs to be  
 * given to needs_toast_table() in toasting.c before unleashing random  
 * changes.  Also see LOBLKSIZE in large_object.h, which can *not* be  
 * changed without initdb.  
 */  
#define TOAST_TUPLES_PER_PAGE   4  
  
#define TOAST_TUPLE_THRESHOLD   MaximumBytesPerTuple(TOAST_TUPLES_PER_PAGE)  
  
#define TOAST_TUPLE_TARGET              TOAST_TUPLE_THRESHOLD  

4. src/include/storage/itemid.h

/*  
 * An item pointer (also called line pointer) on a buffer page  
 *  
 * In some cases an item pointer is "in use" but does not have any associated  
 * storage on the page.  By convention, lp_len == 0 in every item pointer  
 * that does not have storage, independently of its lp_flags state.  
 */  
typedef struct ItemIdData  
{  
        unsigned        lp_off:15,              /* offset to tuple (from start of page) */  
                                lp_flags:2,             /* state of item pointer, see below */  
                                lp_len:15;              /* byte length of tuple */  
} ItemIdData;  

5. src/include/pg_config.h

/* Size of a disk block --- this also limits the size of a tuple. You can set  
   it bigger if you need bigger tuples (although TOAST should reduce the need  
   to have large tuples, since fields can be spread across multiple tuples).  
   BLCKSZ must be a power of 2. The maximum possible value of BLCKSZ is  
   currently 2^15 (32768). This is determined by the 15-bit widths of the  
   lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h).  
   Changing BLCKSZ requires an initdb. */  
#define BLCKSZ 32768  

6. src/include/storage/bufpage.h

/*  
 * For historical reasons, the 64-bit LSN value is stored as two 32-bit  
 * values.  
 */  
typedef struct  
{  
        uint32          xlogid;                 /* high bits */  
        uint32          xrecoff;                /* low bits */  
} PageXLogRecPtr;  
  
typedef uint16 LocationIndex;  
  
typedef struct PageHeaderData  
{  
        /* XXX LSN is member of *any* block, not only page-organized ones */  
        PageXLogRecPtr pd_lsn;          /* LSN: next byte after last byte of xlog  
                                                                 * record for last change to this page */  
        uint16          pd_checksum;    /* checksum */  
        uint16          pd_flags;               /* flag bits, see below */  
        LocationIndex pd_lower;         /* offset to start of free space */  
        LocationIndex pd_upper;         /* offset to end of free space */  
        LocationIndex pd_special;       /* offset to start of special space */  
        uint16          pd_pagesize_version;  
        TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */  
        ItemIdData      pd_linp[1];             /* beginning of line pointer array */  
} PageHeaderData;  
  
/*  
 * line pointer(s) do not count as part of header  
 */  
#define SizeOfPageHeaderData (offsetof(PageHeaderData, pd_linp))  

7. src/include/c.h

typedef uint32 TransactionId;

计算，当数据库集群的BLOCK SIZE=32KB时：

TOAST_TUPLE_THRESHOLD = MaximumBytesPerTuple(TOAST_TUPLES_PER_PAGE) =   
MaximumBytesPerTuple(4) =   
MAXALIGN_DOWN((BLCKSZ - MAXALIGN(SizeOfPageHeaderData + (4) * sizeof(ItemIdData))) / (4)) =   
MAXALIGN_DOWN((32768 - MAXALIGN(192 + (4) * 32)) / (4)) ~= 8KB  
~= BLCKSZ *(1/4)  

digoal’s 大量PostgreSQL文章入口

Twitter Facebook Google+ LinkedIn

Digoal.zhou

TOAST,The Oversized-Attribute Storage Technique - 暨存储格式main, extended, external, plain介绍

背景

正文

TOAST表的好处

小结

参考

digoal’s 大量PostgreSQL文章入口

You May Also Enjoy

PostgreSQL(PPAS 兼容Oracle) 从零开始入门手册 - 珍藏版

PostgreSQL pipelinedb 流计算插件 - IoT应用 - 实时轨迹聚合

PostgreSQL plpgsql 存储过程、函数 - 状态、异常变量打印、异常捕获… - GET [STACKED] DIAGNOSTICS

PostgreSQL datediff 日期间隔（单位转换）兼容SQL用法