AWS redshift->hdb pg(Greenplum),DDL语法转换规则
背景
redshift 和 Greenplum都是源自PostgreSQL的MPP数据库:
https://wiki.postgresql.org/wiki/PostgreSQL_derived_databases
语法上有一些差异,从redshift切换到greenplum(或阿里云hdb pg)时,DDL需要做一定的转换。
redshift create table
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
CREATE [ [LOCAL ] { TEMPORARY | TEMP } ] TABLE
[ IF NOT EXISTS ] table_name
( { column_name data_type [column_attributes] [ column_constraints ]
| table_constraints
| LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ] }
[, ... ] )
[ BACKUP { YES | NO } ] -- gpdb 不支持,(指定是否自动快照备份)
[table_attribute]
where column_attributes are:
[ DEFAULT default_expr ]
[ IDENTITY ( seed, step ) ] -- gpdb 不支持该语法,使用SEQUENCE+DEFAULT代替
[ ENCODE encoding ] -- 指定列压缩算法,gpdb 不支持该语法,使用全局压缩算法代替
[ DISTKEY ] -- gpdb 不支持该语法,但是redshift只支持一列作为分布键,GPDB支持多列作为分布键,gpdb使用distributed by(colname1, ...)代替
[ SORTKEY ] -- gpdb 不支持,阿里云hdb pg支持,语法参考后面的转换规则。
and column_constraints are:
[ { NOT NULL | NULL } ]
[ { UNIQUE | PRIMARY KEY } ]
[ REFERENCES reftable [ ( refcolumn ) ] ]
and table_constraints are:
[ UNIQUE ( column_name [, ... ] ) ]
[ PRIMARY KEY ( column_name [, ... ] ) ]
[ FOREIGN KEY (column_name [, ... ] ) REFERENCES reftable [ ( refcolumn ) ]
and table_attributes are:
[ DISTSTYLE { EVEN | KEY | ALL } ] -- gpdb 不支持该语法,even使用distributed randomly代替,key使用distributed by (colname1,...) 代替,ALL不支持(维度表,所有SEGMENT都有一份)。
[ DISTKEY ( column_name ) ] -- gpdb 不支持该语法,但是redshift只支持一列作为分布键,GPDB支持多列作为分布键,gpdb使用distributed by(colname1, ...)代替
[ [COMPOUND | INTERLEAVED ] SORTKEY ( column_name [, ...] ) ] -- gpdb 不支持,阿里云HDB PG支持sortkey,语法参考后面的转换规则
greenplum(阿里云hdb pg) create table
https://gpdb.docs.pivotal.io/43300/ref_guide/sql_commands/CREATE_TABLE.html
Synopsis
CREATE [[GLOBAL | LOCAL] {TEMPORARY | TEMP}] TABLE table_name (
[ { column_name data_type [ DEFAULT default_expr ]
[column_constraint [ ... ]
[ ENCODING ( storage_directive [,...] ) ]
]
| table_constraint
| LIKE other_table [{INCLUDING | EXCLUDING}
{DEFAULTS | CONSTRAINTS}] ...}
[, ... ] ]
)
[ INHERITS ( parent_table [, ... ] ) ]
[ WITH ( storage_parameter=value [, ... ] )
[ ON COMMIT {PRESERVE ROWS | DELETE ROWS | DROP} ]
[ TABLESPACE tablespace ]
[ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]
[ PARTITION BY partition_type (column)
[ SUBPARTITION BY partition_type (column) ]
[ SUBPARTITION TEMPLATE ( template_spec ) ]
[...]
( partition_spec )
| [ SUBPARTITION BY partition_type (column) ]
[...]
( partition_spec
[ ( subpartition_spec
[(...)]
) ]
)
where column_constraint is:
[CONSTRAINT constraint_name]
NOT NULL | NULL
| UNIQUE [USING INDEX TABLESPACE tablespace]
[WITH ( FILLFACTOR = value )]
| PRIMARY KEY [USING INDEX TABLESPACE tablespace]
[WITH ( FILLFACTOR = value )]
| CHECK ( expression )
| REFERENCES table_name [ ( column_name [, ... ] ) ]
[ key_match_type ]
[ key_action ]
where storage_directive for a column is:
COMPRESSTYPE={ZLIB | QUICKLZ | RLE_TYPE | NONE}
[COMPRESSLEVEL={0-9} ]
[BLOCKSIZE={8192-2097152} ]
where storage_parameter for the table is:
APPENDONLY={TRUE|FALSE}
BLOCKSIZE={8192-2097152}
ORIENTATION={COLUMN|ROW}
CHECKSUM={TRUE|FALSE}
COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|NONE}
COMPRESSLEVEL={0-9}
FILLFACTOR={10-100}
OIDS[=TRUE|FALSE]
and table_constraint is:
[CONSTRAINT constraint_name]
UNIQUE ( column_name [, ... ] )
[USING INDEX TABLESPACE tablespace]
[WITH ( FILLFACTOR=value )]
| PRIMARY KEY ( column_name [, ... ] )
[USING INDEX TABLESPACE tablespace]
[WITH ( FILLFACTOR=value )]
| CHECK ( expression )
| FOREIGN KEY ( column_name [, ... ] )
REFERENCES table_name [ ( column_name [, ... ] ) ]
[ key_match_type ]
[ key_action ]
[ key_checking_mode ]
where key_match_type is:
MATCH FULL
| SIMPLE
where key_action is:
ON DELETE
| ON UPDATE
| NO ACTION
| RESTRICT
| CASCADE
| SET NULL
| SET DEFAULT
where key_checking_mode is:
DEFERRABLE
| NOT DEFERRABLE
| INITIALLY DEFERRED
| INITIALLY IMMEDIATE
where partition_type is:
LIST
| RANGE
where partition_specification is:
partition_element [, …]
and partition_element is:
DEFAULT PARTITION name
| [PARTITION name] VALUES (list_value [,...] )
| [PARTITION name]
START ([datatype] 'start_value') [INCLUSIVE | EXCLUSIVE]
[ END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE] ]
[ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]
| [PARTITION name]
END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE]
[ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]
[ WITH ( partition_storage_parameter=value [, ... ] ) ]
[ TABLESPACE tablespace ]
where subpartition_spec or template_spec is:
subpartition_element [, …]
and subpartition_element is:
DEFAULT SUBPARTITION name
| [SUBPARTITION name] VALUES (list_value [,...] )
| [SUBPARTITION name]
START ([datatype] 'start_value') [INCLUSIVE | EXCLUSIVE]
[ END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE] ]
[ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]
| [SUBPARTITION name]
END ([datatype] 'end_value') [INCLUSIVE | EXCLUSIVE]
[ EVERY ([datatype] [number | INTERVAL] 'interval_value') ]
[ WITH ( partition_storage_parameter=value [, ... ] ) ]
[ TABLESPACE tablespace ]
where storage_parameter for a partition is:
APPENDONLY={TRUE|FALSE}
BLOCKSIZE={8192-2097152}
ORIENTATION={COLUMN|ROW}
CHECKSUM={TRUE|FALSE}
COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|NONE}
COMPRESSLEVEL={1-9}
FILLFACTOR={10-100}
OIDS[=TRUE|FALSE]
阿里云 hdb pg sortkey 语法
https://help.aliyun.com/knowledge_detail/59195.html
[ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]
[ SORTKEY (column, [ ... ] )]
例子
create table test(date text, time text, open float, high float, low float, volume int) with(APPENDONLY=true,ORIENTATION=column) sortkey (volume);
ALTER [[GLOBAL | LOCAL] {TEMPORARY | TEMP}] TABLE table_name SET SORTKEY (column, [ ... ] )
转换规则
1、去除完全不支持的语法
[ BACKUP { YES | NO } ] -- gpdb 不支持,(指定是否自动快照备份)
2、转换不支持但兼容的语法
where column_attributes are:
2.1、
[ IDENTITY ( seed, step ) ] -- gpdb 不支持该语法,使用SEQUENCE+DEFAULT代替
去掉并转换为:
create sequence seq1 start with seed increment by step;
create table test (
id int default nextval('seq1')
);
例如
postgres=# create sequence seq1 start with 100 increment by 2;
CREATE SEQUENCE
postgres=# select nextval('seq1');
nextval
---------
100
(1 row)
postgres=# select nextval('seq1');
nextval
---------
102
(1 row)
2.2、
[ ENCODE encoding ] -- 指定列压缩算法,gpdb 不支持该语法,使用全局压缩算法代替
BYTEDICT
DELTA
DELTA32K
LZO
MOSTLY8
MOSTLY16
MOSTLY32
RAW (no compression)
RUNLENGTH
TEXT255
TEXT32K
ZSTD
去掉并转换为:
create table test (id int)
with (APPENDONLY=true, COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|NONE}) -- 选择一个压缩算法
;
2.3、
[ DISTKEY ] -- gpdb 不支持该语法,但是redshift只支持一列作为分布键,GPDB支持多列作为分布键,gpdb使用distributed by(colname1, ...)代替
去掉并转换为:
create table test (id int, info text)
distributed by (id); -- 分布列名与DISTKEY对应列名一致
2.4、
[ SORTKEY ] -- gpdb 不支持,阿里云hdb pg支持,语法参考后面的转换规则。
去掉并转换为:
create table test(date text, time text, open float, high float, low float, volume int)
with(APPENDONLY=true,ORIENTATION=column)
sortkey (volume); -- 这里指定sortkey
ALTER [[GLOBAL | LOCAL] {TEMPORARY | TEMP}] TABLE table_name
SET SORTKEY (column, [ ... ] ); -- 这里指定sortkey
如果设置了sort key,在数据导入完成后,执行如下SQL,(堵塞DDL,DML,DSL,重新对数据排序)
VACUUM SORT ONLY [tablename]
and table_attributes are:
2.5、
[ DISTSTYLE { EVEN | KEY | ALL } ] -- gpdb 不支持该语法,even使用distributed randomly代替,key使用distributed by (colname1,...) 代替,ALL不支持(维度表,所有SEGMENT都有一份)。
去掉并转换为:
distributed randomly
distributed by (colname1, ...)
2.6、
[ DISTKEY ( column_name ) ] -- gpdb 不支持该语法,但是redshift只支持一列作为分布键,GPDB支持多列作为分布键,gpdb使用distributed by(colname1, ...)代替
去掉并转换为:
distributed by(colname1, ...)
2.7、
[ [COMPOUND | INTERLEAVED ] SORTKEY ( column_name [, ...] ) ] -- gpdb 不支持,阿里云HDB PG支持sortkey,语法参考后面的转换规则
同2.4。
COMPOUND表示完全按用户指定的字段排序,类似PG的CLUSTER。
INTERLEAVED表示按维度排序,任意列的顺序都是公平的,类似多维空间聚集存放。
参考
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
https://gpdb.docs.pivotal.io/43300/ref_guide/sql_commands/CREATE_TABLE.html
https://help.aliyun.com/knowledge_detail/59195.html