Greenplum 最佳实践 - 多维分析的使用(CUBE, ROLLUP, GROUPING SETS in GreenPlum and Oracle)
背景
cube, rollup, grouping sets是GROUP BY中使用的, 可用于多种组合分组. 而不局限于group by 中指出的字段.
例如 :
group by col1, col2, col3 : 1. 常规的分组
(col1, col2, col3)
group by cube(col1, col2, col3) : 2^3. 相当于以下分组的union all
(col1, col2, col3)
(col1, col2)
(col1, col3)
(col2, col3)
(col1)
(col2)
(col3)
()
group by rollup(col1, col2, col3) : 3+1. 相当于以下分组的union all
(col1, col2, col3)
(col1, col2)
(col1)
()
group by grouping sets((col1, col2), (col1, col3, col4)) : 2. 相当于以下分组的union all
(col1, col2)
(col1, col3, col4)
group by col4, cube(col1, col2, col3) : 2^3. 相当于以下分组的union all
col4, (col1, col2, col3)
col4, (col1, col2)
col4, (col1, col3)
col4, (col2, col3)
col4, (col1)
col4, (col2)
col4, (col3)
col4, ()
group by rollup((col1, col2, col4), col3) : 2+1. 相当于以下分组的union all
(col1, col2, col4), col3
(col1, col2, col4)
()
目前PostgreSQL (9.5开始支持了)不支持这些用法, GreenPlum 3.3.4支持.
Oracle支持.
下面在GreenPlum以及Oracle中演示一下这几种用法.
一、Oracle
创建测试表
CREATE TABLE dimension_tab (
fact_1_id NUMBER NOT NULL,
fact_2_id NUMBER NOT NULL,
fact_3_id NUMBER NOT NULL,
fact_4_id NUMBER NOT NULL,
sales_value NUMBER(10,2) NOT NULL
);
插入测试数据
INSERT INTO dimension_tab
SELECT TRUNC(DBMS_RANDOM.value( 1, 3)) AS fact_1_id,
TRUNC(DBMS_RANDOM.value( 1, 6)) AS fact_2_id,
TRUNC(DBMS_RANDOM.value( 1, 11)) AS fact_3_id,
TRUNC(DBMS_RANDOM.value( 1, 11)) AS fact_4_id,
ROUND(DBMS_RANDOM.value( 1, 100), 2) AS sales_value
FROM dual
CONNECT BY level <= 1000;
COMMIT;
ROLLUP 测试
In addition to the regular aggregation results we expect from the GROUP BY clause,
the ROLLUP extension produces group subtotals from right to left and a grand total.
If "n" is the number of columns listed in the ROLLUP, there will be n+1 levels of subtotals.
SELECT fact_1_id,
fact_2_id,
fact_3_id,
SUM(sales_value) AS sales_value
FROM dimension_tab
GROUP BY ROLLUP (fact_1_id, fact_2_id, fact_3_id)
ORDER BY fact_1_id, fact_2_id, fact_3_id;
FACT_1_ID FACT_2_ID FACT_3_ID SALES_VALUE
---------- ---------- ---------- -----------
1 1 1 391.03
1 1 2 292.54
1 1 3 306.14
1 1 4 517.02
1 1 5 426.57
1 1 6 525.23
1 1 7 318.49
1 1 8 417.48
1 1 9 441.98
1 1 10 263.05
1 1 3899.53
1 2 1 239.45
1 2 2 470.51
1 2 3 253.29
1 2 4 203.97
1 2 5 635.41
1 2 6 549.31
1 2 7 506.27
1 2 8 353.12
1 2 9 394.07
1 2 10 490.13
1 2 4095.53
1 3 1 973.27
1 3 2 532.92
1 3 3 718.65
1 3 4 676.31
1 3 5 357.85
1 3 6 854.94
1 3 7 398.28
1 3 8 417.47
1 3 9 421.62
1 3 10 764.42
1 3 6115.73
1 4 1 227.24
1 4 2 852.9
1 4 3 414.8
1 4 4 434.01
1 4 5 537.11
1 4 6 559.52
1 4 7 469.66
1 4 8 572.12
1 4 9 684.96
1 4 10 661.73
1 4 5414.05
1 5 1 813.1
1 5 2 634.41
1 5 3 301.85
1 5 4 618.07
1 5 5 629.55
1 5 6 377.24
1 5 7 219.54
1 5 8 724.27
1 5 9 216.57
1 5 10 429.87
1 5 4964.47
1 24489.31
2 1 1 252.79
2 1 2 669.02
2 1 3 773.11
2 1 4 393.42
2 1 5 693.87
2 1 6 734.42
2 1 7 904.45
2 1 8 448.28
2 1 9 629.31
2 1 10 195.91
2 1 5694.58
2 2 1 569.6
2 2 2 434.58
2 2 3 571.17
2 2 4 434.79
2 2 5 145.03
2 2 6 570.25
2 2 7 593.2
2 2 8 936.59
2 2 9 569.44
2 2 10 311.33
2 2 5135.98
2 3 1 686.04
2 3 2 331.67
2 3 3 659.12
2 3 4 593.77
2 3 5 588.32
2 3 6 559.81
2 3 7 588.53
2 3 8 375.92
2 3 9 677.74
2 3 10 395.91
2 3 5456.83
2 4 1 399.2
2 4 2 882.01
2 4 3 59.21
2 4 4 349.9
2 4 5 598.39
2 4 6 431.06
2 4 7 340.95
2 4 8 420.85
2 4 9 493.9
2 4 10 517.31
2 4 4492.78
2 5 1 184.45
2 5 2 481.18
2 5 3 382.22
2 5 4 849.66
2 5 5 343.34
2 5 6 460.22
2 5 7 770.76
2 5 8 297.37
2 5 9 441.46
2 5 10 339.33
2 5 4549.99
2 25330.16
49819.47
113 rows selected.
CUBE 测试
In addition to the subtotals generated by the ROLLUP extension, the CUBE extension will generate subtotals for all combinations of the dimensions specified.
If "n" is the number of columns listed in the CUBE, there will be 2^n subtotal combinations.
SELECT fact_1_id,
fact_2_id,
fact_3_id,
SUM(sales_value) AS sales_value
FROM dimension_tab
GROUP BY fact_1_id, CUBE (fact_2_id, fact_3_id)
ORDER BY fact_1_id, fact_2_id, fact_3_id;
FACT_1_ID FACT_2_ID FACT_3_ID SALES_VALUE
---------- ---------- ---------- -----------
1 1 1 391.03
1 1 2 292.54
1 1 3 306.14
1 1 4 517.02
1 1 5 426.57
1 1 6 525.23
1 1 7 318.49
1 1 8 417.48
1 1 9 441.98
1 1 10 263.05
1 1 3899.53
1 2 1 239.45
1 2 2 470.51
1 2 3 253.29
1 2 4 203.97
1 2 5 635.41
1 2 6 549.31
1 2 7 506.27
1 2 8 353.12
1 2 9 394.07
1 2 10 490.13
1 2 4095.53
1 3 1 973.27
1 3 2 532.92
1 3 3 718.65
1 3 4 676.31
1 3 5 357.85
1 3 6 854.94
1 3 7 398.28
1 3 8 417.47
1 3 9 421.62
1 3 10 764.42
1 3 6115.73
1 4 1 227.24
1 4 2 852.9
1 4 3 414.8
1 4 4 434.01
1 4 5 537.11
1 4 6 559.52
1 4 7 469.66
1 4 8 572.12
1 4 9 684.96
1 4 10 661.73
1 4 5414.05
1 5 1 813.1
1 5 2 634.41
1 5 3 301.85
1 5 4 618.07
1 5 5 629.55
1 5 6 377.24
1 5 7 219.54
1 5 8 724.27
1 5 9 216.57
1 5 10 429.87
1 5 4964.47
1 1 2644.09
1 2 2783.28
1 3 1994.73
1 4 2449.38
1 5 2586.49
1 6 2866.24
1 7 1912.24
1 8 2484.46
1 9 2159.2
1 10 2609.2
1 24489.31
2 1 1 252.79
2 1 2 669.02
2 1 3 773.11
2 1 4 393.42
2 1 5 693.87
2 1 6 734.42
2 1 7 904.45
2 1 8 448.28
2 1 9 629.31
2 1 10 195.91
2 1 5694.58
2 2 1 569.6
2 2 2 434.58
2 2 3 571.17
2 2 4 434.79
2 2 5 145.03
2 2 6 570.25
2 2 7 593.2
2 2 8 936.59
2 2 9 569.44
2 2 10 311.33
2 2 5135.98
2 3 1 686.04
2 3 2 331.67
2 3 3 659.12
2 3 4 593.77
2 3 5 588.32
2 3 6 559.81
2 3 7 588.53
2 3 8 375.92
2 3 9 677.74
2 3 10 395.91
2 3 5456.83
2 4 1 399.2
2 4 2 882.01
2 4 3 59.21
2 4 4 349.9
2 4 5 598.39
2 4 6 431.06
2 4 7 340.95
2 4 8 420.85
2 4 9 493.9
2 4 10 517.31
2 4 4492.78
2 5 1 184.45
2 5 2 481.18
2 5 3 382.22
2 5 4 849.66
2 5 5 343.34
2 5 6 460.22
2 5 7 770.76
2 5 8 297.37
2 5 9 441.46
2 5 10 339.33
2 5 4549.99
2 1 2092.08
2 2 2798.46
2 3 2444.83
2 4 2621.54
2 5 2368.95
2 6 2755.76
2 7 3197.89
2 8 2479.01
2 9 2811.85
2 10 1759.79
2 25330.16
132 rows selected.
GROUPING 测试
It can be quite easy to visually identify subtotals generated by rollups and cubes,
but to do it programatically you really need something more accurate than the presence of null values in the grouping columns.
This is where the GROUPING function comes in.
It accepts a single column as a parameter and returns "1" if the column contains a null value generated as part of a subtotal by a ROLLUP or CUBE operation or "0" for any other value, including stored null values.
SELECT fact_1_id,
fact_2_id,
SUM(sales_value) AS sales_value,
GROUPING(fact_1_id) AS f1g,
GROUPING(fact_2_id) AS f2g
FROM dimension_tab
GROUP BY CUBE (fact_1_id, fact_2_id)
HAVING GROUPING(fact_1_id) = 1 OR GROUPING(fact_2_id) = 1
ORDER BY GROUPING(fact_1_id), GROUPING(fact_2_id);
FACT_1_ID FACT_2_ID SALES_VALUE F1G F2G
---------- ---------- ----------- ---------- ----------
1 24489.31 0 1
2 25330.16 0 1
5 9514.46 1 0
3 11572.56 1 0
2 9231.51 1 0
1 9594.11 1 0
4 9906.83 1 0
49819.47 1 1
8 rows selected.
解释一下GROUPING的作用 :
F1G=0表示这条输出的记录GROUPING(fact_1_id) AS f1g=0 也就是fact_1_id 非空.
F2G=1表示这条输出的记录GROUPING(fact_2_id) AS f2g=1 也就是fact_2_id 是空值. (包括子分组中不包含fact_2_id列是的置空以及本身记录的空).
GROUPING_ID 测试 :
The GROUPING_ID function provides an alternate and more compact way to identify subtotal rows. Passing the dimension columns as arguments, it returns a number indicating the GROUP BY level.
SELECT fact_1_id,
fact_2_id,
SUM(sales_value) AS sales_value,
GROUPING_ID(fact_1_id, fact_2_id) AS grouping_id
FROM dimension_tab
GROUP BY CUBE (fact_1_id, fact_2_id)
ORDER BY fact_1_id, fact_2_id;
FACT_1_ID FACT_2_ID SALES_VALUE GROUPING_ID
---------- ---------- ----------- -----------
1 1 3899.53 0
1 2 4095.53 0
1 3 6115.73 0
1 4 5414.05 0
1 5 4964.47 0
1 24489.31 1
2 1 5694.58 0
2 2 5135.98 0
2 3 5456.83 0
2 4 4492.78 0
2 5 4549.99 0
2 25330.16 1
1 9594.11 2
2 9231.51 2
3 11572.56 2
4 9906.83 2
5 9514.46 2
49819.47 3
18 rows selected.
解释一下GROUPING_ID :
GROUPING_ID(fact_1_id, fact_2_id) AS grouping_id 和 CUBE (fact_1_id, fact_2_id) 对应, 指CUBE (fact_1_id, fact_2_id)中的subgroup id.
本文开始的时候已经列出了CUBE的subgroup多少的计算方法. 这里总共两列放在CUBE中, 所以组就是2^2=4个. 因此我们看到grouping_id是0,1,2,3.
更通俗的看下面 :
(fact_1_id, fact_2_id) : grouping_id=0
(fact_1_id) : grouping_id=1
(fact_2_id) : grouping_id=2
() : grouping_id=3
GROUP_ID 测试 :
It's possible to write queries that return the duplicate subtotals, which can be a little confusing.
The GROUP_ID function assigns the value "0" to the first set, and all subsequent sets get assigned a higher number.
The following query forces duplicates to show the GROUP_ID function in action.
SELECT fact_1_id,
fact_2_id,
SUM(sales_value) AS sales_value,
GROUPING_ID(fact_1_id, fact_2_id) AS grouping_id,
GROUP_ID() AS group_id
FROM dimension_tab
GROUP BY GROUPING SETS(fact_1_id, CUBE (fact_1_id, fact_2_id))
ORDER BY fact_1_id, fact_2_id;
FACT_1_ID FACT_2_ID SALES_VALUE GROUPING_ID GROUP_ID
---------- ---------- ----------- ----------- ----------
1 1 3899.53 0 0
1 2 4095.53 0 0
1 3 6115.73 0 0
1 4 5414.05 0 0
1 5 4964.47 0 0
1 24489.31 1 0
1 24489.31 1 1
2 1 5694.58 0 0
2 2 5135.98 0 0
2 3 5456.83 0 0
2 4 4492.78 0 0
2 5 4549.99 0 0
2 25330.16 1 1
2 25330.16 1 0
1 9594.11 2 0
2 9231.51 2 0
3 11572.56 2 0
4 9906.83 2 0
5 9514.46 2 0
49819.47 3 0
20 rows selected.
GROUP_ID 解释 :
group_id是用来去重的, 什么重复呢, subgroup重复, 例如本例的
GROUP BY GROUPING SETS(fact_1_id, CUBE (fact_1_id, fact_2_id))
分解后如下 :
1. (fact_1_id) :
2. CUBE (fact_1_id, fact_2_id) 再分解 :
(fact_1_id, fact_2_id)
(fact_1_id)
(fact_2_id)
()
所以(fact_1_id)重复了. 这个用group_id来区分.
要去除subgroup的重复可以使用条件 : HAVING GROUP_ID() = 0
SELECT fact_1_id,
fact_2_id,
SUM(sales_value) AS sales_value,
GROUPING_ID(fact_1_id, fact_2_id) AS grouping_id,
GROUP_ID() AS group_id
FROM dimension_tab
GROUP BY GROUPING SETS(fact_1_id, CUBE (fact_1_id, fact_2_id))
HAVING GROUP_ID() = 0
ORDER BY fact_1_id, fact_2_id;
FACT_1_ID FACT_2_ID SALES_VALUE GROUPING_ID GROUP_ID
---------- ---------- ----------- ----------- ----------
1 1 3899.53 0 0
1 2 4095.53 0 0
1 3 6115.73 0 0
1 4 5414.05 0 0
1 5 4964.47 0 0
1 24489.31 1 0
2 1 5694.58 0 0
2 2 5135.98 0 0
2 3 5456.83 0 0
2 4 4492.78 0 0
2 5 4549.99 0 0
2 25330.16 1 0
1 9594.11 2 0
2 9231.51 2 0
3 11572.56 2 0
4 9906.83 2 0
5 9514.46 2 0
49819.47 3 0
18 rows selected.
GROUPING SETS 测试 :
Calculating all possible subtotals in a cube, especially those with many dimensions, can be quite an intensive process.
If you don't need all the subtotals, this can represent a considerable amount of wasted effort.
The following cube with three dimensions gives 8 levels of subtotals (GROUPING_ID: 0-7)
SELECT fact_1_id,
fact_2_id,
fact_3_id,
SUM(sales_value) AS sales_value,
GROUPING_ID(fact_1_id, fact_2_id, fact_3_id) AS grouping_id
FROM dimension_tab
GROUP BY GROUPING SETS((fact_1_id, fact_2_id), (fact_1_id, fact_3_id))
ORDER BY fact_1_id, fact_2_id, fact_3_id;
FACT_1_ID FACT_2_ID FACT_3_ID SALES_VALUE GROUPING_ID
---------- ---------- ---------- ----------- -----------
1 1 3899.53 1
1 2 4095.53 1
1 3 6115.73 1
1 4 5414.05 1
1 5 4964.47 1
1 1 2644.09 2
1 2 2783.28 2
1 3 1994.73 2
1 4 2449.38 2
1 5 2586.49 2
1 6 2866.24 2
1 7 1912.24 2
1 8 2484.46 2
1 9 2159.2 2
1 10 2609.2 2
2 1 5694.58 1
2 2 5135.98 1
2 3 5456.83 1
2 4 4492.78 1
2 5 4549.99 1
2 1 2092.08 2
2 2 2798.46 2
2 3 2444.83 2
2 4 2621.54 2
2 5 2368.95 2
2 6 2755.76 2
2 7 3197.89 2
2 8 2479.01 2
2 9 2811.85 2
2 10 1759.79 2
30 rows selected.
grouping sets 解释 :
GROUP BY GROUPING SETS((fact_1_id, fact_2_id), (fact_1_id, fact_3_id)) 分解如下 :
(fact_1_id, fact_2_id)
(fact_1_id, fact_3_id)
二、GreenPlum
创建测试表
CREATE TABLE dimension_tab (
fact_1_id numeric NOT NULL,
fact_2_id numeric NOT NULL,
fact_3_id numeric NOT NULL,
fact_4_id numeric NOT NULL,
sales_value numeric(10,2) NOT NULL
) distributed randomly;
插入测试数据 :
INSERT INTO dimension_tab
SELECT TRUNC(3*random()) AS fact_1_id,
TRUNC(6*random()) AS fact_2_id,
TRUNC(11*random()) AS fact_3_id,
TRUNC(11*random()) AS fact_4_id,
ROUND(100*random()::numeric, 2) AS sales_value
from generate_series(1,1000) ;
测试结果略 :
1. CUBE : 支持
2. ROLLUP : 支持
3. GROUPING : 支持
4. GROUPING_ID : 不支持
5. GROUP_ID : 支持
6. GROUPING SETS : 支持
注意
1. GreenPlum 3.3.4中没有grouping_id函数. 其他版本是否支持未验证.
2. 从PostgreSQL的wiki来看, PostgreSQL应该在考虑加入cube, rollup, grouping sets的功能. 9.5已加入.
参考
1. http://www.oracle-base.com/articles/misc/rollup-cube-grouping-functions-and-grouping-sets.php
2. http://wiki.postgresql.org/wiki/Grouping_Sets
3. http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/Greenplum_CE_Database/documentation/4.2.2/greenplum_database_4.2_administrator_guide.pdf
greenplum 4.2.2 手册上的描述 :
The GROUP BY Clause
The optional GROUP BY clause has the general form:
GROUP BY grouping_element [, ...]
where grouping_element can be one of:
()
expression
ROLLUP ( expression [,...])
CUBE ( expression [,...])
GROUPING SETS (( grouping_element [, ...]))
GROUP BY will condense into a single row all selected rows that shar e the same values
for the grouped expressions. expression can be an input column name, or the name
or ordinal number of an output column (SELECT list item), or an arbitrary expression
formed from input-co lumn values. In case of ambiguity, a GROUP BY name will be
interpreted as an input-column name ra ther than an output column name.
Aggregate functions, if any are used, are computed across all rows making up each
group, producing a separate value for each group (whereas without GROUP BY , an
aggregate produces a single value computed across all the selected rows). When
GROUP BY is present, it is not valid for the SELECT list expressions to refer to
ungrouped columns except within aggregate functions, since there would be more
than one possible value to return for an ungrouped column.
Greenplum Database has the following additional OLAP grouping extensions (often
referred to as supergroups ):
ROLLUP
A ROLLUP grouping is an extension to the GROUP BY clause that creates aggregate
subtotals that roll up from the most detailed level to a grand total, following a list of
grouping columns (or expressions). ROLLUP takes an ordered list of grouping
columns, calculates the standard aggregate values specified in the GROUP BY clause,
then creates progressively higher-level subt otals, moving from right to left through
the list. Finally, it creates a grand total. A ROLLUP grouping can be thought of as a
series of grouping sets. For example:
GROUP BY ROLLUP (a,b,c)
is equivalent to:
GROUP BY GROUPING SETS( (a,b,c), (a,b), (a), () )
Notice that the n elements of a ROLLUP translate to n+1 grouping sets. Also, the
order in which the grouping expressions are specified is significant in a ROLLUP.
CUBE
A CUBE grouping is an extension to the GROUP BY clause that creates subtotals for all
of the possible combinations of the given list of grouping columns (or expressions).
In terms of multidimensional analysis, CUBE generates all the subtotals that could be
calculated for a data cube with the specified dimensions. For example:
GROUP BY CUBE (a,b,c)
is equivalent to:
GROUP BY GROUPING SETS( (a,b,c), (a,b), (a,c), (b,c), (a),
(b), (c), () )
Notice that n elements of a CUBE translate to 2
n
grouping sets. Consider using CUBE
in any situation requiring cross-tabular reports. CUBE is typically most suitable in
queries that use columns from multiple dime nsions rather than columns representing
different levels of a single dimension. For instance, a commonly requested
cross-tabulation might need subtotals for all the combinations of month, state, and
product.
GROUPING SETS
You can selectively specify the set of gr oups that you want to create using a
GROUPING SETS expression within a GROUP BY clause. This allows precise
specification across multiple dimensions without computing a whole ROLLUP or
CUBE. For example:
GROUP BY GROUPING SETS( (a,c), (a,b) )
If using the grouping extension clauses ROLLUP, CUBE, or GROUPING SETS , two
challenges arise. First, how do you dete rmine which result rows are subtotals, and
then the exact level of aggregation for a given subtotal. Or, how do you differentiate
between result rows that contain both stored NULL values and “NULL” values
created by the ROLLUP or CUBE. Secondly, when duplicate grouping sets are
specified in the GROUP BY clause, how do you determine which result rows are
duplicates? There are two additional gr ouping functions you can use in the SELECT
list to help with this:
grouping(column [, ...]) The grouping function can be applied to one
or more grouping attributes to distinguish super-aggregated rows from regular
grouped rows. This can be helpful in distinguishing a “NULL” representing
the set of all values in a super-aggregated row from a NULL value in a regular
row. Each argument in this function produces a bit — either 1 or 0, where 1
means the result row is super-aggregated, and 0 means the result row is from a
regular grouping. The grouping function returns an in teger by treating these
bits as a binary number and then converting it to a base-10 integer.
group_id() For grouping extension queries that contain duplicate grouping
sets, the group_id function is used to identify duplicate rows in the output.
All unique grouping set output rows will have a group_id value of 0. For each
duplicate grouping set detected, the group_id function assigns a group_id
number greater than 0. All output rows in a particular duplicate grouping set
are identified by the same group_id number.