PostgreSQL 支持压缩吗?压缩比如何?
背景
PostgreSQL支持压缩,没有错,只要是变长字段类型,你就可以选择是否要压缩存储。
存储格式支持
The TOAST management code recognizes four different strategies for storing TOAST-able columns on disk:
不压缩 PLAIN prevents either compression or out-of-line storage; furthermore it disables use of single-byte headers for varlena types. This is the only possible strategy for columns of non-TOAST-able data types.
压缩 EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.
不压缩 EXTERNAL allows out-of-line storage but not compression. Use of EXTERNAL will make substring operations on wide text and bytea columns faster (at the penalty of increased storage space) because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed.
压缩 MAIN allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough to fit on a page.)
Each TOAST-able data type specifies a default strategy for columns of that data type, but the strategy for a given table column can be altered with ALTER TABLE SET STORAGE.
This scheme has a number of advantages compared to a more straightforward approach such as allowing row values to span pages. Assuming that queries are usually qualified by comparisons against relatively small key values, most of the work of the executor will be done using the main row entry. The big values of TOASTed attributes will only be pulled out (if selected at all) at the time the result set is sent to the client. Thus, the main table is much smaller and more of its rows fit in the shared buffer cache than would be the case without any out-of-line storage. Sort sets shrink also, and sorts will more often be done entirely in memory. A little test showed that a table containing typical HTML pages and their URLs was stored in about half of the raw data size including the TOAST table, and that the main table contained only about 10% of the entire data (the URLs and some small HTML pages). There was no run time difference compared to an un-TOASTed comparison table, in which all the HTML pages were cut down to 7 kB to fit.
PostgreSQL是字段级压缩。
PostgreSQL will do this automatically for you when they go above a certain size. Compression is applied at each individual data value though - not at the full table level. Meaning that if you have a billion rows that are very narrow, they won't get compressed. Or if you have very many columns each with only a small value in it, they won't get compressed.
If you need it on the full table level, a solution is to create a TABLESPACE for those tables that you want compressed, and point it to a compressed filesystem. As long as the filesystem still obeys fsync() and standard POSIX semantics, this should be perfectly safe. Like ZFS, Btrfs.
如果你需要更高的压缩比,可以选择文件系统级的压缩功能,例如ZFS, Btrfs。
或者使用列式数据存储,例如cstore。
下面测试一下PostgreSQL自带的压缩功能,以及压缩比:
postgres=# create table test_plain(c text);
CREATE TABLE
postgres=# alter table test_plain alter column c set storage external;
ALTER TABLE
postgres=# create table test_compress(c text);
CREATE TABLE
postgres=# alter table test_compress alter column c set storage extended;
ALTER TABLE
postgres=# insert into test_plain select repeat(md5(clock_timestamp()::text),10000) from generate_series(1,1000);
INSERT 0 1000
postgres=# insert into test_compress select repeat(md5(clock_timestamp()::text),10000) from generate_series(1,1000);
INSERT 0 1000
postgres=# select pg_size_pretty(pg_total_relation_size('test_compress'));
pg_size_pretty
----------------
4168 kB
(1 row)
postgres=# select pg_size_pretty(pg_total_relation_size('test_plain'));
pg_size_pretty
----------------
317 MB
(1 row)
这个CASE压缩比还是不错的,但是实用性不强,因为这个CASE中,字段是由很多个重复值组成,换一个接近现实的CASE。
postgres=# truncate test_compress ;
TRUNCATE TABLE
postgres=# truncate test_plain ;
TRUNCATE TABLE
postgres=# do language plpgsql $$
declare
begin
for i in 1..100 loop
insert into test_compress select string_agg(md5(random()::text),' ') v from generate_series(1,10000);
end loop;
end;
$$;
DO
postgres=# do language plpgsql $$ declare begin for i in 1..100 loop insert into test_plain select string_agg(md5(random()::text),' ') v from generate_series(1,10000); end loop; end; $$;
DO
这个CASE,PG自带的压缩功能没有起到任何作用。
postgres=# select pg_size_pretty(pg_total_relation_size('test_compress'));
pg_size_pretty
----------------
33 MB
(1 row)
postgres=# select pg_size_pretty(pg_total_relation_size('test_plain'));
pg_size_pretty
----------------
33 MB
(1 row)
是否压缩成功,可以在代码中跟踪,
toast_compress_datum@src/backend/access/heap/tuptoaster.c
if (pglz_compress(VARDATA_ANY(DatumGetPointer(value)), valsize,
(PGLZ_Header *) tmp, PGLZ_strategy_default) &&
VARSIZE(tmp) < valsize - 2)
{
/* successful compression */
elog(LOG, "compress success."); // 在这里加一个输出, 压缩起到效果。
return PointerGetDatum(tmp);
}
else
{
/* incompressible data */
pfree(tmp);
elog(LOG, "incompressible data."); // 在这里加一个输出, 压缩没有起到效果。
return PointerGetDatum(NULL);
}
改完后,我们可以观察到,某些数据压缩了,某些数据不能压缩:
postgres=# insert into test_compress select string_agg(md5(random()::text),' ') v from generate_series(1,100000);
LOG: statement: insert into test_compress select string_agg(md5(random()::text),' ') v from generate_series(1,100000);
LOG: incompressible data.
LOG: duration: 305.480 ms
INSERT 0 1
postgres=# insert into test_compress select repeat(md5(random()::text), 100000) ;
LOG: statement: insert into test_compress select repeat(md5(random()::text), 100000) ;
LOG: compress success.
LOG: duration: 70.072 ms
INSERT 0 1
另外可以关注一下列存储。
https://commitfest.postgresql.org/6/354/
参考
https://www.citusdata.com/citus-products/products-overview
http://www.postgresql.org/docs/9.5/static/storage-toast.html
“utils/pg_lzcompress.h”
src/backend/access/heap/tuptoaster.c
src/backend/utils/adt/pg_lzcompress.c