PostgreSQL Oracle 兼容性 - WITHIN GROUP (sql standard ordered set aggregate functions)

9 minute read

背景

WITHIN GROUP 可以认为是PostgreSQL 聚合函数的功能性扩展. Oracle 对此支持比较完善. PostgreSQL 9.4 新增了这个功能.

本文末尾对Oracle的用法做了详细的介绍和举例说明 :

对聚合或者自定义聚合函数不了解到朋友可以先参考以下文章 :

《Postgres-XC customized aggregate introduction》

《PostgreSQL aggregate function customize》

《performance tuning about multi-rows query aggregated to single-row query》

本文将要提到的这个PostgreSQL within group功能补丁实现方法如下 :

This is our current work-in-progress patch for WITHIN GROUP.  
  
What mostly works:  
  - percentile_cont(float8) within group (order by float8)  
  - percentile_cont(float8) within group (order by interval)  
  - percentile_disc(float8) within group (order by float8)  
  
What doesn't work:  
  - supporting other types in percentile_disc (want polymorphism to  
    work first)  
  - no commands yet to add new ordered set functions (want to nail  
    down the catalog representation first)  
  - no hypothetical set functions yet (need to resolve the above two  
    points first)  
  - some rough edges  
  - probably some bugs  
  - docs  
  
Implementation details:  
  
For execution, we repurpose the existing aggregate-orderby mechanics.  
Given func(directargs) WITHIN GROUP (ORDER BY args), we process the  
(ORDER BY args) into a tuplesort in the same way currently done for  
agg(args ORDER BY args). Rather than using a transfn, we then call the  
finalfn as finalfn(directargs), providing an API by which the finalfn  
can access the tuplesort. (This is somewhat inspired by the window  
function API, but unfortunately has nothing in common with it in terms  
of requirements, so we couldn't just reuse it.)  
  
-- 上面的片段介绍有点问题, 应该是常规调用func(directargs) WITHIN GROUP (ORDER BY args)  
-- 最后再调用finalfn(directargs)  
  
func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...) is represented in the  
catalog with two pg_proc rows:  
  
func(p1,p2,...,q1,q2,...)  (proisagg=true)  
func_final(p1,p2,...)  
  
with the usual pg_aggregate row linking them, though aggtransfn is set  
to InvalidOid (as is aggtranstype) and an additional flag indicates  
that this is an ordered set function.  
  
(This representation is inadequate for a number of reasons; it does not  
handle polymorphism well and would require special-case coding for  
hypothetical set functions, which we have not yet tackled. See our other  
post.)  

函数 :

digoal=# select oid,* from pg_proc where proname='percentile_disc';  
-[ RECORD 1 ]---+----------------  
oid             | 3931  
proname         | percentile_disc  
pronamespace    | 11  
proowner        | 10  
prolang         | 12  
procost         | 1  
prorows         | 0  
provariadic     | 0  
protransform    | -  
proisagg        | t  
proiswindow     | f  
prosecdef       | f  
proleakproof    | f  
proisstrict     | t  
proretset       | f  
provolatile     | i  
pronargs        | 2  
pronargdefaults | 0  
prorettype      | 701  
proargtypes     | 701 701  
proallargtypes  |   
proargmodes     |   
proargnames     |   
proargdefaults  |   
prosrc          | aggregate_dummy  
probin          |   
proconfig       |   
proacl          |   
  
digoal=# select * from pg_aggregate where aggfnoid=3931;  
-[ RECORD 1 ]---+----------------------  
aggfnoid        | percentile_disc  
aggtransfn      | -  
aggfinalfn      | percentile_disc_final  
aggsortop       | 0  
aggtranstype    | 0  
aggisordsetfunc | t  
agginitval      |   
digoal=# \df PERCENTILE_cont  
                                      List of functions  
   Schema   |      Name       | Result data type |        Argument data types         | Type   
------------+-----------------+------------------+------------------------------------+------  
 pg_catalog | percentile_cont | double precision | double precision, double precision | agg  

聚合函数PERCENTILE_cont参数为2个双精, 分别表示func(directargs) WITHIN GROUP (ORDER BY args)里的directargs和args.

finalfn为percentile_disc_final一个双精.表示directargs.

digoal=# \df percentile_disc_final  
                                  List of functions  
   Schema   |         Name          | Result data type | Argument data types |  Type    
------------+-----------------------+------------------+---------------------+--------  
 pg_catalog | percentile_disc_final | double precision | double precision    | normal  
(1 row)  

补丁测试 :

[root@db-172-16-3-33 pg94]# cd postgresql-10a509d  
[root@db-172-16-3-33 postgresql-10a509d]# patch -p1 < ../patch19713context.patch   
patching file src/backend/executor/execQual.c  
patching file src/backend/executor/functions.c  
patching file src/backend/executor/nodeAgg.c  
patching file src/backend/nodes/copyfuncs.c  
patching file src/backend/nodes/equalfuncs.c  
patching file src/backend/nodes/makefuncs.c  
patching file src/backend/nodes/nodeFuncs.c  
Hunk #1 succeeded at 1631 (offset 59 lines).  
Hunk #2 succeeded at 2102 (offset 1 line).  
patching file src/backend/nodes/outfuncs.c  
patching file src/backend/nodes/readfuncs.c  
patching file src/backend/optimizer/util/clauses.c  
Hunk #4 succeeded at 3827 (offset -56 lines).  
patching file src/backend/parser/gram.y  
patching file src/backend/parser/parse_agg.c  
patching file src/backend/parser/parse_expr.c  
patching file src/backend/parser/parse_func.c  
patching file src/backend/parser/parse_oper.c  
patching file src/backend/utils/adt/Makefile  
patching file src/backend/utils/adt/inversedistribution.c  
patching file src/backend/utils/adt/ruleutils.c  
Hunk #2 succeeded at 7405 (offset 19 lines).  
Hunk #4 succeeded at 7562 (offset 19 lines).  
patching file src/backend/utils/sort/tuplesort.c  
patching file src/include/catalog/pg_aggregate.h  
patching file src/include/catalog/pg_proc.h  
Hunk #1 succeeded at 4729 (offset 2 lines).  
patching file src/include/executor/nodeAgg.h  
patching file src/include/fmgr.h  
patching file src/include/nodes/execnodes.h  
patching file src/include/nodes/nodes.h  
patching file src/include/nodes/parsenodes.h  
patching file src/include/nodes/primnodes.h  
patching file src/include/parser/kwlist.h  
patching file src/include/parser/parse_agg.h  
patching file src/include/parser/parse_func.h  
patching file src/test/regress/expected/aggregates.out  
patching file src/test/regress/expected/opr_sanity.out  
patching file src/test/regress/sql/aggregates.sql  
patching file src/test/regress/sql/opr_sanity.sql  

重新编译, 重启数据库 :

gmake  
sudo gmake install  
pg_ctl restart -m fast  

测试 :

pg94@db-172-16-3-33-> psql  
psql (9.4devel)  
Type "help" for help.  
  
digoal=# CREATE TABLE aggtest ( a     int2, b       float4);  
digoal=# copy aggtest from '/home/pg94/postgresql-10a509d/src/test/regress/data/agg.data';  
COPY 4  
  
digoal=# CREATE TABLE tenk1 (                                                               
        unique1         int4,  
        unique2         int4,  
        two                     int4,  
        four            int4,  
        ten                     int4,  
        twenty          int4,  
        hundred         int4,  
        thousand        int4,  
        twothousand     int4,  
        fivethous       int4,  
        tenthous        int4,  
        odd                     int4,  
        even            int4,  
        stringu1        name,  
        stringu2        name,  
        string4         name  
);  
digoal=# copy tenk1 from '/home/pg94/postgresql-10a509d/src/test/regress/data/tenk.data';  
COPY 10000  
  
select p, percentile_cont(p order by p) within group (order by x::float8)   
        from generate_series(1,5) x, (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p)   
        group by p order by x;  
ERROR:  Cannot have multiple ORDER BY clauses with WITHIN GROUP  
LINE 1: select p, percentile_cont(p order by p) within group (order ...  
                                                ^  
select p, sum() within group (order by x::float8)   
        from generate_series(1,5) x,   
        (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) group by p order by p;  
ERROR:  sum(double precision) is not an ordered set function  
select p, percentile_cont(p,p) from generate_series(1,5) x,   
        (values (0::float8),(0.1),(0.25),(0.4),(0.5),(0.6),(0.75),(0.9),(1)) v(p) group by p order by p;  
ERROR:  Ordered set function specified, but WITHIN GROUP not present  
LINE 1: select p, percentile_cont(p,p) from generate_series(1,5) x,   
                  ^  
select percentile_cont(0.5) within group (order by b) from aggtest;  
 percentile_cont    
------------------  
 53.4485001564026  
(1 row)  
  
select percentile_cont(0.5) within group (order by b),sum(b) from aggtest;  
 percentile_cont  |   sum     
------------------+---------  
 53.4485001564026 | 431.773  
(1 row)  
  
select percentile_cont(0.5) within group (order by thousand) from tenk1;  
 percentile_cont   
-----------------  
           499.5  
(1 row)  
  
select percentile_disc(0.5) within group (order by thousand) from tenk1;  
 percentile_disc   
-----------------  
             499  

下面在对比Oracle和PostgreSQL中的percentile_disc和percentile_cont; 结果一致.

Oracle测试在本文参考部分.

PostgreSQL测试如下 :

digoal=# create table test (id numeric);  
CREATE TABLE  
digoal=# insert into test values  (1.1),(2.2),(3.3),(4.4),(5.5),(6.6),(null);  
INSERT 0 7  
digoal=# select * from test;  
 id    
-----  
 1.1  
 2.2  
 3.3  
 4.4  
 5.5  
 6.6  
(7 rows)  
digoal=# select PERCENTILE_cont(0.5) within group (order by id) from test;  
 percentile_cont   
-----------------  
            3.85  
(1 row)  
digoal=# insert into test select 6.6 from generate_series(1,8);  
INSERT 0 8  
digoal=# select PERCENTILE_disc(.285714286) within group (order by id) from test;  
 percentile_disc   
-----------------  
             5.5  
(1 row)  
  
digoal=# select PERCENTILE_disc(.285714285) within group (order by id) from test;  
 percentile_disc   
-----------------  
             4.4  
(1 row)  

参考

《Postgres-XC customized aggregate introduction》

《PostgreSQL aggregate function customize》

《performance tuning about multi-rows query aggregated to single-row query》

http://www.postgresql.org/message-id/flat/CAOeZVieLrQZGb23vtS26YyrU11NcGsEN59eUaK43nHXiKpF-CA@mail.gmail.com#CAOeZVieLrQZGb23vtS26YyrU11NcGsEN59eUaK43nHXiKpF-CA@mail.gmail.com

Oracle? Database SQL Language Reference 11g Release 2 (11.2)

http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions003.htm#i89203

oracle within group

Aggregate Functions :

CUME_DIST, DENSE_RANK, LISTAGG, PERCENT_RANK,   
PERCENTILE_CONT, PERCENTILE_DISC, RANK ...  

Oracle MEDIAN function, 用来获取一组数据中排序后得到的中间值. 举例如下 :

SQL> create table test(id numeric(8,3));  
SQL> insert into test values (1.1);  
SQL> insert into test values (2.2);  
SQL> insert into test values (3.3);  
SQL> insert into test values (4.4);  
SQL> insert into test values (5.5);  
SQL> insert into test values (6.6);  
SQL>  insert into test values (null);  
SQL> select median(id) from test;  
MEDIAN(ID)  
----------  
       3.85  

这个结果怎么得到的?

N=6(忽略id is null的行), 因为未使用partition by, 也为使用group by, 所有记录为1个分组. 所以分组共6行.  
RN=(1+0.5*(N-1))=3.5  
FRN=floor(3.5)=3  
CRN=ceiling(3.5)=4  
if (CRN = FRN = RN) then  
      (value of expression from row at RN)  
   else  
      (CRN - RN) * (value of expression for row at FRN) +  
      (RN - FRN) * (value of expression for row at CRN)  
value of expression for row at FRN, CRN. (按照id排序, 然后取对应行号上的值.)  

最终结果median(id)等于

(4-3.5)*3.3+(3.5-3)*4.4=3.85  

等同于如下PERCENTILE_CONT用法 :

SQL> select PERCENTILE_CONT(0.5) within group (order by id) from test;  
PERCENTILE_CONT(0.5)WITHINGROUP(ORDERBYID)  
------------------------------------------  
                                      3.85  

PERCENTILE_DISC用法 :

SQL> select row_number() over(order by id),test.* from test order by id;  
ROW_NUMBER()OVER(ORDERBYID)         ID  
--------------------------- ----------  
                          1        1.1  
                          2        2.2  
                          3        3.3  
                          4        4.4  
                          5        5.5  
                          6        6.6  
                          7        6.6  
                          8        6.6  
                          9        6.6  
                         10        6.6  
                         11        6.6  
                         12        6.6  
                         13        6.6  
                         14        6.6  
                         15 NULL  
15 rows selected.  

id有效的行数为14, 第四行的连续性位置值为.285714286

SQL> select 4/14.0 from dual;  
    4/14.0  
----------  
.285714286  

PERCENTILE_disc的参数为numeric类型, 结果为该分组内按id排序的大于或等于该连续性位置的id值.

因此取0.286则为4.4的下一个值.

SQL> select PERCENTILE_disc(0.286) within group (order by id) from test;  
PERCENTILE_DISC(0.286)WITHINGROUP(ORDERBYID)  
--------------------------------------------  
                                         5.5  

取0.285则为第四行的id值.

SQL> select PERCENTILE_disc(0.285) within group (order by id) from test;  
PERCENTILE_DISC(0.285)WITHINGROUP(ORDERBYID)  
--------------------------------------------  
                                         4.4  

下面的更加精确的说明这个结果 :

SQL> select PERCENTILE_disc(.285714286) within group (order by id) from test;  
PERCENTILE_DISC(.285714286)WITHINGROUP(ORDERBYID)  
-------------------------------------------------  
                                              5.5  
SQL> select PERCENTILE_disc(.285714285) within group (order by id) from test;  
PERCENTILE_DISC(.285714285)WITHINGROUP(ORDERBYID)  
-------------------------------------------------  
                                              4.4  

PERCENTILE_DISC函数解释 :

The first expr must evaluate to a numeric value between 0 and 1, because it is a percentile value.   
  
This expression must be constant within each aggregate group.   
  
The ORDER BY clause takes a single expression that can be of any type that can be sorted.  
  
For a given percentile value P,   
PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause   
and returns the value with the smallest CUME_DIST value   
(with respect to the same sort specification) that is greater than or equal to P.  

PERCENTILE_CONT函数解释 :

The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, you can compute the row number you are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = (1+(P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).  
  
The final result will be:  
  
  If (CRN = FRN = RN) then the result is  
    (value of expression from row at RN)  
  Otherwise the result is  
    (CRN - RN) * (value of expression for row at FRN) +  
    (RN - FRN) * (value of expression for row at CRN)  

MEDIAN函数详解 :

MEDIAN is an inverse distribution function that assumes a continuous distribution model.   
It takes a numeric or datetime value and returns the middle value or an interpolated   
value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.  
  
This function takes as arguments any numeric data type or any nonnumeric data type that can be implicitly converted to a numeric data type.   
If you specify only expr, then the function returns the same data type as the numeric data type of the argument.   
If you specify the OVER clause, then Oracle Database determines the argument with the highest numeric precedence,   
implicitly converts the remaining arguments to that data type, and returns that data type.  
  
The result of MEDIAN is computed by first ordering the rows. Using N as the number of rows in the group,   
Oracle calculates the row number (RN) of interest with the formula RN = (1 + (0.5*(N-1)).   
The final result of the aggregate function is computed by linear interpolation between the values from rows   
at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).  
  
The final result will be:  
  
if (CRN = FRN = RN) then  
      (value of expression from row at RN)  
   else  
      (CRN - RN) * (value of expression for row at FRN) +  
      (RN - FRN) * (value of expression for row at CRN)  

摘录oracle一个rank的within group 用法.

select rank(1500) within group (order by salary desc) "rank of 1500" from employees;  

实际得到的结果就是:

如果存在一条记录,这条记录的salary字段值为1500。

那么将该条记录插入employees表中后,按照salary字段降序排列后,该条记录的序号为多少?

比如原表employees内容如下

SQL> select * from employees;  
  
EMP_ID     EMP_NAME         SALARY  
---------- -------------------- ----------  
10001      ZhangSan              500  
10002      LiSi                        1000  
10003      WangWu               1500  
10004      MaLiu                     2000  
10005      NiuQi                      2500  

则如果一个员工的薪水为1500,那么该员工在员工表中的薪水排名应与WangWu相同,并列排名第三。

通过聚合函数RANK() WITHIN GROUP验证一下:

SQL> select rank(1500) within group (order by salary) as "rank number" from employees;  
rank number  
-----------  
          3  

若原表内容如下

SQL> select * from employees;  
  
EMP_ID     EMP_NAME         SALARY  
---------- -------------------- --------------  
10001      ZhangSan             500  
10004      MaLiu                    2000  
10005      NiuQi                    2500  

则排名应为第2,验证如下

SQL> select rank(1500) within group (order by salary) as "rank number" from employees;  
rank number  
-----------  
          2  

Oracle RANK聚合用法语法 :

RANK(expr [, expr ]...) WITHIN GROUP  
   (ORDER BY  
    expr [ DESC | ASC ]  
         [ NULLS { FIRST | LAST } ]  
    [, expr [ DESC | ASC ]  
            [ NULLS { FIRST | LAST } ]  
    ]...  
   )  

Oracle RANK分析函数用法语法 :

RANK( )  
   OVER ([ query_partition_clause ] order_by_clause)  
Oracle PERCENTILE_CONT 用法 :   
PERCENTILE_CONT(expr) WITHIN GROUP  
   (ORDER BY expr [ DESC | ASC ])  
   [ OVER (query_partition_clause) ]  

Flag Counter

digoal’s 大量PostgreSQL文章入口