Greenplum: Row Store vs. Column Store for Massive Data in Large Wide Tables


Background

GPDB supports both row-oriented and column-oriented storage. We have analyzed the principles behind choosing between them in detail in earlier articles, and have published a fair number of performance comparisons as well.

What we had not compared before is how wide the gap can get on a very wide table. The difference comes mainly from tuple deforming and from the amount of storage a query has to scan: in theory, the more data there is and the fewer columns a query touches, the bigger the column store's advantage; likewise, the later the accessed columns sit in the row, the bigger the advantage.
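As a minimal illustration of the deform effect (a sketch only; it uses the 1000-column test tables created in the test below, so run it after those exist): on a row store, aggregating a late column must deform every attribute in front of it for each tuple, while a column store reads a single column file regardless of the column's position.

-- sketch: deform cost grows with attribute position on the row store  
select count(c1)    from tbl_1000_row_nonc;   -- early attribute: cheap to deform  
select count(c1000) from tbl_1000_row_nonc;   -- late attribute: walks 999 attributes per row  
-- ...but is flat on the column store, which reads one column file either way  
select count(c1000) from tbl_1000_column_nonc;  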

Background reading on the principles:

《PostgreSQL VOPS 向量计算 + DBLINK异步并行 - 单实例 10亿 聚合计算跑进2秒》

《PostgreSQL 向量化执行插件(瓦片式实现-vops) 10x提速OLAP》

For guidance on choosing between the two, see the articles linked at the end of this post.

Test

1. Create a function that builds a 1000-column table (every column of type INT8)

create or replace function crt_tbl(name, text, text) returns void as $$  
declare  
  sql text;  
begin  
  -- $1: table name, $2: ORIENTATION (row|column), $3: COMPRESSTYPE (none|zlib|...)  
  sql := 'create table '||$1||'(';  
  -- 1000 INT8 columns; every column except those explicitly inserted  
  -- is filled by its random() default  
  for i in 1..1000 loop  
    sql := sql||'c'||i||' int8 default random()*10000,';  
  end loop;  
  sql := rtrim(sql, ',');  -- drop the trailing comma  
  sql := sql||') with (APPENDONLY=true, ORIENTATION='||$2||', COMPRESSTYPE='||$3||')';  
  execute sql;  
end;  
$$ language plpgsql strict;  

2. Create the row-oriented tables (uncompressed and zlib-compressed)

select crt_tbl('tbl_1000_row_nonc', 'row', 'none');  
  
select crt_tbl('tbl_1000_row_c', 'row', 'zlib');  

3. Create the column-oriented tables (uncompressed and zlib-compressed)

select crt_tbl('tbl_1000_column_nonc', 'column', 'none');  
  
select crt_tbl('tbl_1000_column_c', 'column', 'zlib');  
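As a quick sanity check (a sketch; the pg_appendonly layout shown here matches GPDB 4.x/5.x), you can confirm the orientation and compression of the four tables just created:

-- columnstore=true marks the column-oriented AO tables  
select c.relname, a.columnstore, a.compresstype  
from pg_appendonly a join pg_class c on c.oid = a.relid  
where c.relname like 'tbl_1000_%';  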

4. Insert 1,000,000 rows of test data (note that the first INSERT also fills the other 999 columns through their random() defaults, which is why it is far slower than the subsequent table-to-table copies)

postgres=# insert into tbl_1000_row_nonc (c1) select generate_series(1,1000000);  
INSERT 0 1000000  
Time: 221003.467 ms  
postgres=# insert into tbl_1000_row_c select * from tbl_1000_row_nonc;  
INSERT 0 1000000  
Time: 12298.931 ms  
postgres=# insert into tbl_1000_column_nonc select * from tbl_1000_row_nonc;  
INSERT 0 1000000  
Time: 23332.039 ms  
postgres=# insert into tbl_1000_column_c select * from tbl_1000_row_nonc;  
INSERT 0 1000000  
Time: 17017.119 ms  
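Since all four tables were created without a DISTRIBUTED BY clause, GPDB defaults to distributing by the first column. A sketch to confirm the rows landed evenly across the 48 segments (gp_segment_id is a system column available on every GPDB table):

-- per-segment row counts; skew here would distort the timings below  
select gp_segment_id, count(*)  
from tbl_1000_row_nonc  
group by 1 order by 2 desc limit 5;  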

5. Compare table sizes

postgres=# select pg_size_pretty(pg_total_relation_size('tbl_1000_row_nonc'));  
 pg_size_pretty   
----------------  
 7653 MB  
(1 row)  
  
Time: 282.180 ms  
postgres=# select pg_size_pretty(pg_total_relation_size('tbl_1000_column_nonc'));  
 pg_size_pretty   
----------------  
 7647 MB  
(1 row)  
  
Time: 55.315 ms  
  
postgres=# select pg_size_pretty(pg_total_relation_size('tbl_1000_row_c'));  
 pg_size_pretty   
----------------  
 2522 MB  
(1 row)  
  
Time: 56.017 ms  
postgres=# select pg_size_pretty(pg_total_relation_size('tbl_1000_column_c'));  
 pg_size_pretty   
----------------  
 2520 MB  
(1 row)  
  
Time: 55.557 ms  
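Both zlib variants compress to roughly a third of the raw size. If you want the ratio directly rather than dividing the two sizes, Greenplum provides get_ao_compression_ratio() for append-only tables (a sketch, assuming the GPDB 4.x/5.x signature):

select get_ao_compression_ratio('tbl_1000_row_c');  
select get_ao_compression_ratio('tbl_1000_column_c');  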

6. Compare queries on the first column

postgres=# explain analyze select c1,count(*) from tbl_1000_row_nonc group by c1;  
                                                                 QUERY PLAN                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice1; segments: 48)  (cost=259714.00..272214.00 rows=1000000 width=16)  
   Rows out:  1000000 rows at destination with 352 ms to end, start offset by 1.483 ms.  
   ->  HashAggregate  (cost=259714.00..272214.00 rows=20834 width=16)  
         Group By: c1  
         Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 0.005 ms to first row, 101 ms to end, start offset by 63 ms.  
         ->  Append-only Scan on tbl_1000_row_nonc  (cost=0.00..254714.00 rows=20834 width=8)  
               Rows out:  0 rows (seg0) with 19 ms to end, start offset by 82 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 405K bytes.  
   (slice1)    Executor memory: 500K bytes avg x 48 workers, 500K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  enable_bitmapscan=off; enable_seqscan=off; optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 353.793 ms  
(15 rows)  
  
postgres=# explain analyze select c1,count(*) from tbl_1000_row_c group by c1;  
                                                                 QUERY PLAN                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice1; segments: 48)  (cost=95537.00..108037.00 rows=1000000 width=16)  
   Rows out:  1000000 rows at destination with 1128 ms to end, start offset by 1.360 ms.  
   ->  HashAggregate  (cost=95537.00..108037.00 rows=20834 width=16)  
         Group By: c1  
         Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 0.003 ms to first row, 699 ms to end, start offset by 42 ms.  
         ->  Append-only Scan on tbl_1000_row_c  (cost=0.00..90537.00 rows=20834 width=8)  
               Rows out:  0 rows (seg0) with 13 ms to end, start offset by 42 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 405K bytes.  
   (slice1)    Executor memory: 500K bytes avg x 48 workers, 500K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  enable_bitmapscan=off; enable_seqscan=off; optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 1130.006 ms  
(15 rows)  
  
  
postgres=# explain analyze select c1,count(*) from tbl_1000_column_nonc group by c1;  
                                                                  QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice1; segments: 48)  (cost=259518.00..272018.00 rows=1000000 width=16)  
   Rows out:  1000000 rows at destination with 245 ms to end, start offset by 1.365 ms.  
   ->  HashAggregate  (cost=259518.00..272018.00 rows=20834 width=16)  
         Group By: c1  
         Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 0.005 ms to first row, 9.968 ms to end, start offset by 60 ms.  
         ->  Append-only Columnar Scan on tbl_1000_column_nonc  (cost=0.00..254518.00 rows=20834 width=8)  
               Rows out:  0 rows (seg0) with 16 ms to end, start offset by 60 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 405K bytes.  
   (slice1)    Executor memory: 821K bytes avg x 48 workers, 821K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  enable_bitmapscan=off; enable_seqscan=off; optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 246.967 ms  
(15 rows)  
  
  
postgres=# explain analyze select c1,count(*) from tbl_1000_column_c group by c1;  
                                                                  QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice1; segments: 48)  (cost=259518.00..272018.00 rows=1000000 width=16)  
   Rows out:  1000000 rows at destination with 249 ms to end, start offset by 1.450 ms.  
   ->  HashAggregate  (cost=259518.00..272018.00 rows=20834 width=16)  
         Group By: c1  
         Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 0.004 ms to first row, 8.861 ms to end, start offset by 46 ms.  
         ->  Append-only Columnar Scan on tbl_1000_column_c  (cost=0.00..254518.00 rows=20834 width=8)  
               Rows out:  0 rows (seg0) with 16 ms to end, start offset by 45 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 405K bytes.  
   (slice1)    Executor memory: 853K bytes avg x 48 workers, 853K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  enable_bitmapscan=off; enable_seqscan=off; optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 250.583 ms  
(15 rows)  

7. Compare queries on the last column

postgres=# explain analyze select c1000,count(*) from tbl_1000_row_nonc group by c1000;  
                                                                      QUERY PLAN                                                                         
-------------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=260048.04..260167.34 rows=9544 width=16)  
   Rows out:  10001 rows at destination with 296 ms to end, start offset by 1.806 ms.  
   ->  HashAggregate  (cost=260048.04..260167.34 rows=199 width=16)  
         Group By: tbl_1000_row_nonc.c1000  
         Rows out:  Avg 208.4 rows x 48 workers.  Max 223 rows (seg17) with 0.002 ms to first row, 133 ms to end, start offset by 21 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=259714.00..259904.88 rows=199 width=16)  
               Hash Key: tbl_1000_row_nonc.c1000  
               Rows out:  Avg 8749.4 rows x 48 workers at destination.  Max 9360 rows (seg46) with 141 ms to end, start offset by 21 ms.  
               ->  HashAggregate  (cost=259714.00..259714.00 rows=199 width=16)  
                     Group By: tbl_1000_row_nonc.c1000  
                     Rows out:  Avg 8749.4 rows x 48 workers.  Max 8814 rows (seg18) with 0.003 ms to first row, 209 ms to end, start offset by 63 ms.  
                     ->  Append-only Scan on tbl_1000_row_nonc  (cost=0.00..254714.00 rows=20834 width=8)  
                           Rows out:  0 rows (seg0) with 16 ms to end, start offset by 64 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 417K bytes.  
   (slice1)    Executor memory: 1818K bytes avg x 48 workers, 1818K bytes max (seg0).  
   (slice2)    Executor memory: 434K bytes avg x 48 workers, 434K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  enable_bitmapscan=off; enable_seqscan=off; optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 298.222 ms  
(22 rows)  
  
postgres=# explain analyze select c1000,count(*) from tbl_1000_row_c group by c1000;  
                                                                       QUERY PLAN                                                                         
--------------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=95886.65..96011.53 rows=9990 width=16)  
   Rows out:  10001 rows at destination with 1244 ms to end, start offset by 1.760 ms.  
   ->  HashAggregate  (cost=95886.65..96011.53 rows=209 width=16)  
         Group By: tbl_1000_row_c.c1000  
         Rows out:  Avg 208.4 rows x 48 workers.  Max 223 rows (seg17) with 0.002 ms to first row, 432 ms to end, start offset by 32 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=95537.00..95736.80 rows=209 width=16)  
               Hash Key: tbl_1000_row_c.c1000  
               Rows out:  Avg 8749.4 rows x 48 workers at destination.  Max 9360 rows (seg46) with 755 ms to end, start offset by 57 ms.  
               ->  HashAggregate  (cost=95537.00..95537.00 rows=209 width=16)  
                     Group By: tbl_1000_row_c.c1000  
                     Rows out:  Avg 8749.4 rows x 48 workers.  Max 8814 rows (seg18) with 0.004 ms to first row, 1085 ms to end, start offset by 63 ms.  
                     ->  Append-only Scan on tbl_1000_row_c  (cost=0.00..90537.00 rows=20834 width=8)  
                           Rows out:  0 rows (seg0) with 22 ms to end, start offset by 59 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 417K bytes.  
   (slice1)    Executor memory: 1818K bytes avg x 48 workers, 1818K bytes max (seg0).  
   (slice2)    Executor memory: 434K bytes avg x 48 workers, 434K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  enable_bitmapscan=off; enable_seqscan=off; optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 1246.276 ms  
(22 rows)  
  
postgres=# explain analyze select c1000,count(*) from tbl_1000_column_c group by c1000;  
                                                                       QUERY PLAN                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=259870.80..259996.80 rows=10080 width=16)  
   Rows out:  10001 rows at destination with 78 ms to end, start offset by 1.783 ms.  
   ->  HashAggregate  (cost=259870.80..259996.80 rows=210 width=16)  
         Group By: tbl_1000_column_c.c1000  
         Rows out:  Avg 208.4 rows x 48 workers.  Max 223 rows (seg17) with 0.004 ms to first row, 24 ms to end, start offset by 26 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=259518.00..259719.60 rows=210 width=16)  
               Hash Key: tbl_1000_column_c.c1000  
               Rows out:  Avg 8749.4 rows x 48 workers at destination.  Max 9360 rows (seg46) with 31 ms to end, start offset by 23 ms.  
               ->  HashAggregate  (cost=259518.00..259518.00 rows=210 width=16)  
                     Group By: tbl_1000_column_c.c1000  
                     Rows out:  Avg 8749.4 rows x 48 workers.  Max 8814 rows (seg18) with 0.004 ms to first row, 5.962 ms to end, start offset by 42 ms.  
                     ->  Append-only Columnar Scan on tbl_1000_column_c  (cost=0.00..254518.00 rows=20834 width=8)  
                           Rows out:  0 rows (seg0) with 28 ms to end, start offset by 30 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 417K bytes.  
   (slice1)    Executor memory: 1787K bytes avg x 48 workers, 1787K bytes max (seg0).  
   (slice2)    Executor memory: 434K bytes avg x 48 workers, 434K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  enable_bitmapscan=off; enable_seqscan=off; optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 80.152 ms  
(22 rows)  
  
postgres=# explain analyze select c1000,count(*) from tbl_1000_column_nonc group by c1000;  
                                                                       QUERY PLAN                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=259856.48..259977.37 rows=9671 width=16)  
   Rows out:  10001 rows at destination with 77 ms to end, start offset by 2.019 ms.  
   ->  HashAggregate  (cost=259856.48..259977.37 rows=202 width=16)  
         Group By: tbl_1000_column_nonc.c1000  
         Rows out:  Avg 208.4 rows x 48 workers.  Max 223 rows (seg17) with 0.003 ms to first row, 26 ms to end, start offset by 35 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=259518.00..259711.42 rows=202 width=16)  
               Hash Key: tbl_1000_column_nonc.c1000  
               Rows out:  Avg 8749.4 rows x 48 workers at destination.  Max 9360 rows (seg46) with 20 ms to end, start offset by 30 ms.  
               ->  HashAggregate  (cost=259518.00..259518.00 rows=202 width=16)  
                     Group By: tbl_1000_column_nonc.c1000  
                     Rows out:  Avg 8749.4 rows x 48 workers.  Max 8814 rows (seg18) with 0.006 ms to first row, 5.476 ms to end, start offset by 25 ms.  
                     ->  Append-only Columnar Scan on tbl_1000_column_nonc  (cost=0.00..254518.00 rows=20834 width=8)  
                           Rows out:  0 rows (seg0) with 27 ms to end, start offset by 25 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 417K bytes.  
   (slice1)    Executor memory: 1755K bytes avg x 48 workers, 1755K bytes max (seg0).  
   (slice2)    Executor memory: 1330K bytes avg x 48 workers, 1330K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  enable_bitmapscan=off; enable_seqscan=off; optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 78.976 ms  
(22 rows)  
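Note the extra Redistribute Motion in the c1000 plans that the c1 plans did not need: c1 is the default distribution key, so grouping by it requires no data movement, while grouping by c1000 does. A sketch to confirm the key (gp_distribution_policy is the GPDB catalog for this; the attrnums column shown matches GPDB 4.x/5.x):

-- attrnums = {1} means the table is hash-distributed by its first column  
select localoid::regclass, attrnums  
from gp_distribution_policy  
where localoid = 'tbl_1000_row_nonc'::regclass;  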

8. Compare full-table updates

postgres=# update tbl_1000_row_nonc set c1000=c1000+1;  
UPDATE 1000000  
Time: 7085.645 ms  
postgres=# update tbl_1000_row_c set c1000=c1000+1;  
UPDATE 1000000  
Time: 6734.279 ms  
postgres=# update tbl_1000_column_nonc set c1000=c1000+1;  
UPDATE 1000000  
Time: 13514.749 ms  
postgres=# update tbl_1000_column_c set c1000=c1000+1;  
UPDATE 1000000  
Time: 10629.104 ms  
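Keep in mind that an UPDATE on an append-only table writes new row versions rather than updating in place, so the tables grow until vacuumed. A sketch to observe and reclaim the space:

select pg_size_pretty(pg_total_relation_size('tbl_1000_column_c'));  -- larger than before the update  
vacuum tbl_1000_column_c;                                            -- compacts the AO segment files  
select pg_size_pretty(pg_total_relation_size('tbl_1000_column_c'));  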

Summary

Table | Size | Query first column | Query last column | Full-table update
---|---|---|---|---
AO row store, no compression | 7653 MB | 353.793 ms | 298.222 ms | 7085.645 ms
AO row store, zlib | 2522 MB | 1130.006 ms | 1246.276 ms | 6734.279 ms
AO column store, no compression | 7647 MB | 246.967 ms | 80.152 ms | 13514.749 ms
AO column store, zlib | 2520 MB | 250.583 ms | 78.976 ms | 10629.104 ms

For a large wide table, query performance is clearly far better with column storage, and it stays flat whether the first or the last column is queried. Insert and update performance is somewhat worse for the column store, because it has to operate on many more data files (one per column).

The remaining gap between the first-column and last-column queries on the column store is caused by the number of rows returned, not by the storage layout: c1 holds 1,000,000 distinct values (from generate_series), while every other column draws from a value space of only about 10,000 (random()*10000).
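A one-line sketch to verify that cardinality difference:

select count(distinct c1), count(distinct c1000) from tbl_1000_column_c;  -- ~1,000,000 vs ~10,000  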

If you are interested, you can also test other storage parameters, such as compression level, block size, or a plain heap table (see the sketch after this list):

where storage_parameter is:
   APPENDONLY={TRUE|FALSE}
   BLOCKSIZE={8192-2097152}
   ORIENTATION={COLUMN|ROW}
   COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|NONE}
   COMPRESSLEVEL={0-9}
   CHECKSUM={TRUE|FALSE}
   FILLFACTOR={10-100}
   OIDS[=TRUE|FALSE]
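For example, a minimal sketch of one such variant, using the parameters listed above: a column-oriented AO table with the maximum block size and a higher zlib level (the table name is just a placeholder):

create table tbl_1000_column_z5 (c1 int8 default random()*10000)  
with (APPENDONLY=true, ORIENTATION=column,  
      COMPRESSTYPE=zlib, COMPRESSLEVEL=5, BLOCKSIZE=2097152);  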

References

《Greenplum 行存、列存,堆表、AO表性能对比 - 阿里云HDB for PostgreSQL最佳实践》

《Greenplum 行存、列存,堆表、AO表的原理和选择》

《Greenplum 最佳实践 - 行存与列存的选择以及转换方法》

《PostgreSQL VOPS 向量计算 + DBLINK异步并行 - 单实例 10亿 聚合计算跑进2秒》

《PostgreSQL 向量化执行插件(瓦片式实现-vops) 10x提速OLAP》
