PostgreSQL Oracle 兼容性 - WITHIN GROUP (sql standard ordered set aggregate functions)
背景
WITHIN GROUP 可以认为是PostgreSQL 聚合函数的功能性扩展. Oracle 对此支持比较完善. PostgreSQL 9.4 新增了这个功能.
本文末尾对Oracle的用法做了详细的介绍和举例说明 :
对聚合或者自定义聚合函数不了解到朋友可以先参考以下文章 :
《Postgres-XC customized aggregate introduction》
《PostgreSQL aggregate function customize》
《performance tuning about multi-rows query aggregated to single-row query》
本文将要提到的这个PostgreSQL within group功能补丁实现方法如下 :
This is our current work-in-progress patch for WITHIN GROUP.
What mostly works:
- percentile_cont(float8) within group (order by float8)
- percentile_cont(float8) within group (order by interval)
- percentile_disc(float8) within group (order by float8)
What doesn't work:
- supporting other types in percentile_disc (want polymorphism to
work first)
- no commands yet to add new ordered set functions (want to nail
down the catalog representation first)
- no hypothetical set functions yet (need to resolve the above two
points first)
- some rough edges
- probably some bugs
- docs
Implementation details:
For execution, we repurpose the existing aggregate-orderby mechanics.
Given func(directargs) WITHIN GROUP (ORDER BY args), we process the
(ORDER BY args) into a tuplesort in the same way currently done for
agg(args ORDER BY args). Rather than using a transfn, we then call the
finalfn as finalfn(directargs), providing an API by which the finalfn
can access the tuplesort. (This is somewhat inspired by the window
function API, but unfortunately has nothing in common with it in terms
of requirements, so we couldn't just reuse it.)
-- 上面的片段介绍有点问题, 应该是常规调用func(directargs) WITHIN GROUP (ORDER BY args)
-- 最后再调用finalfn(directargs)
func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...) is represented in the
catalog with two pg_proc rows:
func(p1,p2,...,q1,q2,...) (proisagg=true)
func_final(p1,p2,...)
with the usual pg_aggregate row linking them, though aggtransfn is set
to InvalidOid (as is aggtranstype) and an additional flag indicates
that this is an ordered set function.
(This representation is inadequate for a number of reasons; it does not
handle polymorphism well and would require special-case coding for
hypothetical set functions, which we have not yet tackled. See our other
post.)
函数 :
digoal=# select oid,* from pg_proc where proname='percentile_disc';
-[ RECORD 1 ]---+----------------
oid | 3931
proname | percentile_disc
pronamespace | 11
proowner | 10
prolang | 12
procost | 1
prorows | 0
provariadic | 0
protransform | -
proisagg | t
proiswindow | f
prosecdef | f
proleakproof | f
proisstrict | t
proretset | f
provolatile | i
pronargs | 2
pronargdefaults | 0
prorettype | 701
proargtypes | 701 701
proallargtypes |
proargmodes |
proargnames |
proargdefaults |
prosrc | aggregate_dummy
probin |
proconfig |
proacl |
digoal=# select * from pg_aggregate where aggfnoid=3931;
-[ RECORD 1 ]---+----------------------
aggfnoid | percentile_disc
aggtransfn | -
aggfinalfn | percentile_disc_final
aggsortop | 0
aggtranstype | 0
aggisordsetfunc | t
agginitval |
digoal=# \df PERCENTILE_cont
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+-----------------+------------------+------------------------------------+------
pg_catalog | percentile_cont | double precision | double precision, double precision | agg
聚合函数PERCENTILE_cont参数为2个双精, 分别表示func(directargs) WITHIN GROUP (ORDER BY args)里的directargs和args.
finalfn为percentile_disc_final一个双精.表示directargs.
digoal=# \df percentile_disc_final
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+-----------------------+------------------+---------------------+--------
pg_catalog | percentile_disc_final | double precision | double precision | normal
(1 row)
补丁测试 :
[root@db-172-16-3-33 pg94]# cd postgresql-10a509d
[root@db-172-16-3-33 postgresql-10a509d]# patch -p1 < ../patch19713context.patch
patching file src/backend/executor/execQual.c
patching file src/backend/executor/functions.c
patching file src/backend/executor/nodeAgg.c
patching file src/backend/nodes/copyfuncs.c
patching file src/backend/nodes/equalfuncs.c
patching file src/backend/nodes/makefuncs.c
patching file src/backend/nodes/nodeFuncs.c
Hunk #1 succeeded at 1631 (offset 59 lines).
Hunk #2 succeeded at 2102 (offset 1 line).
patching file src/backend/nodes/outfuncs.c
patching file src/backend/nodes/readfuncs.c
patching file src/backend/optimizer/util/clauses.c
Hunk #4 succeeded at 3827 (offset -56 lines).
patching file src/backend/parser/gram.y
patching file src/backend/parser/parse_agg.c
patching file src/backend/parser/parse_expr.c
patching file src/backend/parser/parse_func.c
patching file src/backend/parser/parse_oper.c
patching file src/backend/utils/adt/Makefile
patching file src/backend/utils/adt/inversedistribution.c
patching file src/backend/utils/adt/ruleutils.c
Hunk #2 succeeded at 7405 (offset 19 lines).
Hunk #4 succeeded at 7562 (offset 19 lines).
patching file src/backend/utils/sort/tuplesort.c
patching file src/include/catalog/pg_aggregate.h
patching file src/include/catalog/pg_proc.h
Hunk #1 succeeded at 4729 (offset 2 lines).
patching file src/include/executor/nodeAgg.h
patching file src/include/fmgr.h
patching file src/include/nodes/execnodes.h
patching file src/include/nodes/nodes.h
patching file src/include/nodes/parsenodes.h
patching file src/include/nodes/primnodes.h
patching file src/include/parser/kwlist.h
patching file src/include/parser/parse_agg.h
patching file src/include/parser/parse_func.h
patching file src/test/regress/expected/aggregates.out
patching file src/test/regress/expected/opr_sanity.out
patching file src/test/regress/sql/aggregates.sql
patching file src/test/regress/sql/opr_sanity.sql
重新编译, 重启数据库 :
gmake
sudo gmake install
pg_ctl restart -m fast
测试 :
pg94@db-172-16-3-33-> psql
psql (9.4devel)
Type "help" for help.
digoal=# CREATE TABLE aggtest ( a int2, b float4);
digoal=# copy aggtest from '/home/pg94/postgresql-10a509d/src/test/regress/data/agg.data';
COPY 4
digoal=# CREATE TABLE tenk1 (
unique1 int4,
unique2 int4,
two int4,
four int4,
ten int4,
twenty int4,
hundred int4,
thousand int4,
twothousand int4,
fivethous int4,
tenthous int4,
odd int4,
even int4,
stringu1 name,
stringu2 name,
string4 name
);
digoal=# copy tenk1 from '/home/pg94/postgresql-10a509d/src/test/regress/data/tenk.data';
COPY 10000
select p, percentile_cont(p order by p) within group (order by x::float8)
from generate_series(1,5) x, (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p)
group by p order by x;
ERROR: Cannot have multiple ORDER BY clauses with WITHIN GROUP
LINE 1: select p, percentile_cont(p order by p) within group (order ...
^
select p, sum() within group (order by x::float8)
from generate_series(1,5) x,
(values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) group by p order by p;
ERROR: sum(double precision) is not an ordered set function
select p, percentile_cont(p,p) from generate_series(1,5) x,
(values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) group by p order by p;
ERROR: Ordered set function specified, but WITHIN GROUP not present
LINE 1: select p, percentile_cont(p,p) from generate_series(1,5) x,
^
select percentile_cont(0.5) within group (order by b) from aggtest;
percentile_cont
------------------
53.4485001564026
(1 row)
select percentile_cont(0.5) within group (order by b),sum(b) from aggtest;
percentile_cont | sum
------------------+---------
53.4485001564026 | 431.773
(1 row)
select percentile_cont(0.5) within group (order by thousand) from tenk1;
percentile_cont
-----------------
499.5
(1 row)
select percentile_disc(0.5) within group (order by thousand) from tenk1;
percentile_disc
-----------------
499
下面在对比Oracle和PostgreSQL中的percentile_disc和percentile_cont; 结果一致.
Oracle测试在本文参考部分.
PostgreSQL测试如下 :
digoal=# create table test (id numeric);
CREATE TABLE
digoal=# insert into test values (1.1),(2.2),(3.3),(4.4),(5.5),(6.6),(null);
INSERT 0 7
digoal=# select * from test;
id
-----
1.1
2.2
3.3
4.4
5.5
6.6
(7 rows)
digoal=# select PERCENTILE_cont(0.5) within group (order by id) from test;
percentile_cont
-----------------
3.85
(1 row)
digoal=# insert into test select 6.6 from generate_series(1,8);
INSERT 0 8
digoal=# select PERCENTILE_disc(.285714286) within group (order by id) from test;
percentile_disc
-----------------
5.5
(1 row)
digoal=# select PERCENTILE_disc(.285714285) within group (order by id) from test;
percentile_disc
-----------------
4.4
(1 row)
参考
《Postgres-XC customized aggregate introduction》
《PostgreSQL aggregate function customize》
《performance tuning about multi-rows query aggregated to single-row query》
http://www.postgresql.org/message-id/flat/CAOeZVieLrQZGb23vtS26YyrU11NcGsEN59eUaK43nHXiKpF-CA@mail.gmail.com#CAOeZVieLrQZGb23vtS26YyrU11NcGsEN59eUaK43nHXiKpF-CA@mail.gmail.com
Oracle? Database SQL Language Reference 11g Release 2 (11.2)
http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions003.htm#i89203
oracle within group
Aggregate Functions :
CUME_DIST, DENSE_RANK, LISTAGG, PERCENT_RANK,
PERCENTILE_CONT, PERCENTILE_DISC, RANK ...
Oracle MEDIAN function, 用来获取一组数据中排序后得到的中间值. 举例如下 :
SQL> create table test(id numeric(8,3));
SQL> insert into test values (1.1);
SQL> insert into test values (2.2);
SQL> insert into test values (3.3);
SQL> insert into test values (4.4);
SQL> insert into test values (5.5);
SQL> insert into test values (6.6);
SQL> insert into test values (null);
SQL> select median(id) from test;
MEDIAN(ID)
----------
3.85
这个结果怎么得到的?
N=6(忽略id is null的行), 因为未使用partition by, 也为使用group by, 所有记录为1个分组. 所以分组共6行.
RN=(1+0.5*(N-1))=3.5
FRN=floor(3.5)=3
CRN=ceiling(3.5)=4
if (CRN = FRN = RN) then
(value of expression from row at RN)
else
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
value of expression for row at FRN, CRN. (按照id排序, 然后取对应行号上的值.)
最终结果median(id)等于
(4-3.5)*3.3+(3.5-3)*4.4=3.85
等同于如下PERCENTILE_CONT用法 :
SQL> select PERCENTILE_CONT(0.5) within group (order by id) from test;
PERCENTILE_CONT(0.5)WITHINGROUP(ORDERBYID)
------------------------------------------
3.85
PERCENTILE_DISC用法 :
SQL> select row_number() over(order by id),test.* from test order by id;
ROW_NUMBER()OVER(ORDERBYID) ID
--------------------------- ----------
1 1.1
2 2.2
3 3.3
4 4.4
5 5.5
6 6.6
7 6.6
8 6.6
9 6.6
10 6.6
11 6.6
12 6.6
13 6.6
14 6.6
15 NULL
15 rows selected.
id有效的行数为14, 第四行的连续性位置值为.285714286
SQL> select 4/14.0 from dual;
4/14.0
----------
.285714286
PERCENTILE_disc的参数为numeric类型, 结果为该分组内按id排序的大于或等于该连续性位置的id值.
因此取0.286则为4.4的下一个值.
SQL> select PERCENTILE_disc(0.286) within group (order by id) from test;
PERCENTILE_DISC(0.286)WITHINGROUP(ORDERBYID)
--------------------------------------------
5.5
取0.285则为第四行的id值.
SQL> select PERCENTILE_disc(0.285) within group (order by id) from test;
PERCENTILE_DISC(0.285)WITHINGROUP(ORDERBYID)
--------------------------------------------
4.4
下面的更加精确的说明这个结果 :
SQL> select PERCENTILE_disc(.285714286) within group (order by id) from test;
PERCENTILE_DISC(.285714286)WITHINGROUP(ORDERBYID)
-------------------------------------------------
5.5
SQL> select PERCENTILE_disc(.285714285) within group (order by id) from test;
PERCENTILE_DISC(.285714285)WITHINGROUP(ORDERBYID)
-------------------------------------------------
4.4
PERCENTILE_DISC函数解释 :
The first expr must evaluate to a numeric value between 0 and 1, because it is a percentile value.
This expression must be constant within each aggregate group.
The ORDER BY clause takes a single expression that can be of any type that can be sorted.
For a given percentile value P,
PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause
and returns the value with the smallest CUME_DIST value
(with respect to the same sort specification) that is greater than or equal to P.
PERCENTILE_CONT函数解释 :
The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, you can compute the row number you are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+(P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).
The final result will be:
If (CRN = FRN = RN) then the result is
(value of expression from row at RN)
Otherwise the result is
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
MEDIAN函数详解 :
MEDIAN is an inverse distribution function that assumes a continuous distribution model.
It takes a numeric or datetime value and returns the middle value or an interpolated
value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.
This function takes as arguments any numeric data type or any nonnumeric data type that can be implicitly converted to a numeric data type.
If you specify only expr, then the function returns the same data type as the numeric data type of the argument.
If you specify the OVER clause, then Oracle Database determines the argument with the highest numeric precedence,
implicitly converts the remaining arguments to that data type, and returns that data type.
The result of MEDIAN is computed by first ordering the rows. Using N as the number of rows in the group,
Oracle calculates the row number (RN) of interest with the formula RN = (1 + (0.5*(N-1)).
The final result of the aggregate function is computed by linear interpolation between the values from rows
at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).
The final result will be:
if (CRN = FRN = RN) then
(value of expression from row at RN)
else
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
摘录oracle一个rank的within group 用法.
select rank(1500) within group (order by salary desc) "rank of 1500" from employees;
实际得到的结果就是:
如果存在一条记录,这条记录的salary字段值为1500。
那么将该条记录插入employees表中后,按照salary字段降序排列后,该条记录的序号为多少?
比如原表employees内容如下
SQL> select * from employees;
EMP_ID EMP_NAME SALARY
---------- -------------------- ----------
10001 ZhangSan 500
10002 LiSi 1000
10003 WangWu 1500
10004 MaLiu 2000
10005 NiuQi 2500
则如果一个员工的薪水为1500,那么该员工在员工表中的薪水排名应与WangWu相同,并列排名第三。
通过聚合函数RANK() WITHIN GROUP验证一下:
SQL> select rank(1500) within group (order by salary) as "rank number" from employees;
rank number
-----------
3
若原表内容如下
SQL> select * from employees;
EMP_ID EMP_NAME SALARY
---------- -------------------- --------------
10001 ZhangSan 500
10004 MaLiu 2000
10005 NiuQi 2500
则排名应为第2,验证如下
SQL> select rank(1500) within group (order by salary) as "rank number" from employees;
rank number
-----------
2
Oracle RANK聚合用法语法 :
RANK(expr [, expr ]...) WITHIN GROUP
(ORDER BY
expr [ DESC | ASC ]
[ NULLS { FIRST | LAST } ]
[, expr [ DESC | ASC ]
[ NULLS { FIRST | LAST } ]
]...
)
Oracle RANK分析函数用法语法 :
RANK( )
OVER ([ query_partition_clause ] order_by_clause)
Oracle PERCENTILE_CONT 用法 :
PERCENTILE_CONT(expr) WITHIN GROUP
(ORDER BY expr [ DESC | ASC ])
[ OVER (query_partition_clause) ]