快速入门PostgreSQL应用开发与管理 - 4 高级SQL用法

33 minute read

背景

本章大纲

1. 聚集函数

常用聚合函数

统计类的聚合函数

分组排序聚合

Hypothetical-Set聚合函数

多维分析

2. 子查询

3. union\union all\except\intersect

4. 自连接

5. 内连接

优化器如何强制连接顺序?

6. 外连接

左外连接

右外连接

全外连接

7. 窗口查询

第二章:高级SQL用法

1. 聚集函数

https://www.postgresql.org/docs/9.6/static/functions-aggregate.html

常用聚合函数

Function Argument Type(s) Return Type Description
array_agg(expression) any array of the argument type input values, including nulls, concatenated into an array
avg(expression) smallint, int, bigint, real, double precision, numeric, or interval numeric for any integer-type argument, double precision for a floating-point argument, otherwise the same as the argument data type the average (arithmetic mean) of all input values
bit_and(expression) smallint, int, bigint, or bit same as argument data type the bitwise AND of all non-null input values, or null if none
bit_or(expression) smallint, int, bigint, or bit same as argument data type the bitwise OR of all non-null input values, or null if none
bool_and(expression) bool bool true if all input values are true, otherwise false
bool_or(expression) bool bool true if at least one input value is true, otherwise false
count(*) - bigint number of input rows
count(expression) any bigint number of input rows for which the value of expression is not null
every(expression) bool bool equivalent to bool_and
json_agg(expression) any json aggregates values as a JSON array
json_object_agg(name,value) (any, any) json aggregates name/value pairs as a JSON object
max(expression) any array, numeric, string, or date/time type same as argument type maximum value of expression across all input values
min(expression) any array, numeric, string, or date/time type same as argument type minimum value of expression across all input values
string_agg(expression,delimiter) (text, text) or (bytea, bytea) same as argument types input values concatenated into a string, separated by delimiter
sum(expression) smallint, int, bigint, real, double precision, numeric, interval, or money bigint for smallint or int arguments, numeric for bigint arguments, otherwise the same as the argument data type sum of expression across all input values
xmlagg(expression) xml xml concatenation of XML values (see alsoSection 9.14.1.7)

上图中所有聚合函数, 当没有行输入时, 除了count返回0, 其他都返回null.

使用sum, array_agg时, 当没有行输入, 返回NULL可能有点别扭, 可以使用coalesce来替代NULL, 如coalesce(sum(x), 0)

coalesce(array_agg(x), ‘{}’::int[])

例子 :

聚合后得到数组, null将计入数组元素

postgres=# select array_agg(id) from (values(null),(1),(2)) as t(id);    
 array_agg      
------------    
 {NULL,1,2}    
(1 row)    

算平均值时不计算null

postgres=# select avg(id) from (values(null),(1),(2)) as t(id);    
        avg             
--------------------    
 1.5000000000000000    
(1 row)    
算bit与 或 时也不计算NULL
postgres=# select bit_and(id) from (values(null),(1),(2)) as t(id);    
 bit_and     
---------    
       0    
(1 row)    
postgres=# select bit_or(id) from (values(null),(1),(2)) as t(id);    
 bit_or     
--------    
      3    
(1 row)    

算布尔逻辑时也不计算NULL

postgres=# select bool_and(id) from (values(null),(true),(false)) as t(id);    
 bool_and     
----------    
 f    
(1 row)    

every是bool_and的别名, 实际上是SQL标准中定义的.

postgres=# select every(id) from (values(null),(true),(false)) as t(id);    
 every     
-------    
 f    
(1 row)    

SQL标准中还定义了any和some为bool_or的别名, 但是因为any和some还可以被解释为子查询, 所以在PostgreSQL中any和some的布尔逻辑聚合不可用.

postgres=# select any(id) from (values(null),(true),(false)) as t(id);    
ERROR:  syntax error at or near "any"    
LINE 1: select any(id) from (values(null),(true),(false)) as t(id);    
               ^    
postgres=# select some(id) from (values(null),(true),(false)) as t(id);    
ERROR:  syntax error at or near "some"    
LINE 1: select some(id) from (values(null),(true),(false)) as t(id);    
               ^    

bool_or的例子

postgres=# select bool_or(id) from (values(null),(true),(false)) as t(id);    
 bool_or     
---------    
 t    
(1 row)    

计算非空的表达式个数, count带表达式时, 不计算null

postgres=# select count(id) from (values(null),(1),(2)) as t(id);    
 count     
-------    
     2    
(1 row)    

计算表达式(含空值)的个数, count(*)计算null, 注意count(*)是一个独立的聚合函数. 请和count(express)区分开来.

postgres=# select count(*) from (values(null),(1),(2)) as t(id);    
 count     
-------    
     3    
(1 row)    
postgres=# select count(*) from (values(null),(null),(1),(2)) as t(id);    
 count     
-------    
     4    
(1 row)    

聚合后得到json, 不带key的json聚合

postgres=# select json_agg(id) from (values(null),(true),(false)) as t(id);    
      json_agg           
---------------------    
 [null, true, false]    
(1 row)    

聚合后得到json, 带key的json聚合, 注意key不能为null, 否则报错.

postgres=# select json_object_agg(c1,c2) from (values('a',null),('b',true),('c',false)) as t(c1,c2);    
             json_object_agg                 
-----------------------------------------    
 { "a" : null, "b" : true, "c" : false }    
(1 row)    
postgres=# select json_object_agg(c1,c2) from (values(null,null),('b',true),('c',false)) as t(c1,c2);    
ERROR:  22023: field name must not be null    
LOCATION:  json_object_agg_transfn, json.c:1959    

计算最大最小值, max, min都不计算null

postgres=# select max(id) from (values(null),(1),(2)) as t(id);    
 max     
-----    
   2    
(1 row)    
postgres=# select min(id) from (values(null),(1),(2)) as t(id);    
 min     
-----    
   1    
(1 row)    

聚合后得到字符串, 字符串聚合

postgres=# select string_agg(c1,'***') from (values('a',null),('b',true),('c',false)) as t(c1,c2);    
 string_agg     
------------    
 a***b***c    
(1 row)    
postgres=# select string_agg(id,'***') from (values(null),('digoal'),('zhou')) as t(id);    
  string_agg       
---------------    
 digoal***zhou    
(1 row)    

计算总和, sum不计算null, 当所有行都是null时, 即没有任何行输入, 返回null.

postgres=# select sum(id) from (values(null),(1),(2)) as t(id);    
 sum     
-----    
   3    
(1 row)    
postgres=# select sum(id::int) from (values(null),(null),(null)) as t(id);    
 sum     
-----    
        
(1 row)    
(1 row)    

聚合后得到xml

postgres=# select xmlagg(id::xml) from (values(null),('<foo>digoal</foo>'),('<bar/>')) as t(id);    
         xmlagg              
-------------------------    
 <foo>digoal</foo><bar/>    
(1 row)    

某些聚合函数得到的结果可能和行的输入顺序有关, 例如array_agg, json_agg, json_object_agg, string_agg, and xmlagg, 以及某些自定义聚合函数. 如何来实现呢?

支持聚合函数中使用order by的PostgreSQL版本可以用如下语法 :

postgres=# select string_agg(id,'***' order by id) from (values(null),('digoal'),('zhou')) as t(id);    
  string_agg       
---------------    
 digoal***zhou    
(1 row)    
postgres=# select string_agg(id,'***' order by id desc) from (values(null),('digoal'),('zhou')) as t(id);    
  string_agg       
---------------    
 zhou***digoal    
(1 row)    

统计类的聚合函数

Function Argument Type Return Type Description
corr(Y, X) double precision double precision correlation coefficient
covar_pop(Y, X) double precision double precision population covariance
covar_samp(Y, X) double precision double precision sample covariance
regr_avgx(Y, X) double precision double precision average of the independent variable (sum(X)/N)
regr_avgy(Y, X) double precision double precision average of the dependent variable (sum(Y)/N)
regr_count(Y, X) double precision bigint number of input rows in which both expressions are nonnull
regr_intercept(Y, X) double precision double precision y-intercept of the least-squares-fit linear equation determined by the (X, Y) pairs
regr_r2(Y, X) double precision double precision square of the correlation coefficient
regr_slope(Y, X) double precision double precision slope of the least-squares-fit linear equation determined by the (X, Y) pairs
regr_sxx(Y, X) double precision double precision sum(X^2) - sum(X)^2/N (“sum of squares” of the independent variable)
regr_sxy(Y, X) double precision double precision sum(X*Y) - sum(X) * sum(Y)/N (“sum of products” of independent times dependent variable)
regr_syy(Y, X) double precision double precision sum(Y^2) - sum(Y)^2/N (“sum of squares” of the dependent variable)
stddev(expression) smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric historical alias for stddev_samp
stddev_pop(expression) smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric population standard deviation of the input values
stddev_samp(expression) smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric sample standard deviation of the input values
variance(expression) smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric historical alias for var_samp
var_pop(expression) smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric population variance of the input values (square of the population standard deviation)
var_samp(expression) smallint, int, bigint, real, double precision, or numeric double precision for floating-point arguments, otherwise numeric sample variance of the input values (square of the sample standard deviation)

相关性统计 :

corr, regr_r2

总体 样本 方差, 标准方差 :

variance, var_pop, var_samp

stddev, stddev_pop, stddev_samp

总体协方差, 样本协方差 :

covar_pop, covar_samp

线性回归 :

regr_avgx, regr_avgy, regr_count, regr_intercept(截距), regr_r2(相关度corr的平方), regr_slope(斜率), regr_sxx, regr_sxy, regr_syy.

分组排序聚合

Function| Direct Argument Type(s)| Aggregated Argument Type(s)| Return Type| Description
—|—|—|—|—
mode() WITHIN GROUP (ORDER BYsort_expression)| - | any sortable type| same as sort expression| returns the most frequent input value (arbitrarily choosing the first one if there are multiple equally-frequent results)
percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)| double precision| double precisionor interval| same as sort expression| continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed
percentile_cont(fractions) WITHIN GROUP (ORDER BY sort_expression)| double precision[]| double precisionor interval| array of sort expression’s type| multiple continuous percentile: returns an array of results matching the shape of the fractionsparameter, with each non-null element replaced by the value corresponding to that percentile
percentile_disc(fraction) WITHIN GROUP (ORDER BY sort_expression)| double precision| any sortable type| same as sort expression| discrete percentile: returns the first input value whose position in the ordering equals or exceeds the specified fraction
percentile_disc(fractions) WITHIN GROUP (ORDER BY sort_expression)| double precision[]| any sortable type| array of sort expression’s type| multiple discrete percentile: returns an array of results matching the shape of the fractionsparameter, with each non-null element replaced by the input value corresponding to that percentile

mode比较好理解, 就是取分组中出现频率最高的值或表达式, 如果最高频率的值有多个, 则随机取一个.

postgres=# create table test(id int, info text);    
CREATE TABLE    
postgres=# insert into test values (1,'test1');    
INSERT 0 1    
postgres=# insert into test values (1,'test1');    
INSERT 0 1    
postgres=# insert into test values (1,'test2');    
INSERT 0 1    
postgres=# insert into test values (1,'test3');    
INSERT 0 1    
postgres=# insert into test values (2,'test1');    
INSERT 0 1    
postgres=# insert into test values (2,'test1');    
INSERT 0 1    
postgres=# insert into test values (2,'test1');    
INSERT 0 1    
postgres=# insert into test values (3,'test4');    
INSERT 0 1    
postgres=# insert into test values (3,'test4');    
INSERT 0 1    
postgres=# insert into test values (3,'test4');    
INSERT 0 1    
postgres=# insert into test values (3,'test4');    
INSERT 0 1    
postgres=# insert into test values (3,'test4');    
INSERT 0 1    
postgres=# select * from test;    
 id | info      
----+-------    
  1 | test1    
  1 | test1    
  1 | test2    
  1 | test3    
  2 | test1    
  2 | test1    
  2 | test1    
  3 | test4    
  3 | test4    
  3 | test4    
  3 | test4    
  3 | test4    
(12 rows)    

取出所有数据中, 出现频率最高的info, 有可能是test1也有可能是test4, 因为他们的出现频率一致.

mode的返回结果数据类型和order by后面的表达式一致.

postgres=# select mode() within group (order by info) from test;    
 mode      
-------    
 test1    
(1 row)    

如果按INFO来分组的话, 取出出现频率最高的info, 实际上这个操作是没有任何意义的, 返回值就是所有记录的info的唯一值.

postgres=# select mode() within group (order by info) from test group by info;    
 mode      
-------    
 test1    
 test2    
 test3    
 test4    
(4 rows)    

按id来分组, 取出组内出现频率最高的info值, 这个是有意义的.

postgres=# select mode() within group (order by info) from test group by id;    
 mode      
-------    
 test1    
 test1    
 test4    
(3 rows)    

id=1 , 出现频率最高的info是test1. 出现2次.

如下 :

postgres=# select id,info,count(*) from test group by id,info;    
 id | info  | count     
----+-------+-------    
  1 | test1 |     2    
  1 | test3 |     1    
  3 | test4 |     5    
  1 | test2 |     1    
  2 | test1 |     3    
(5 rows)    

如果要返回mode()并返回频率次数. 可以使用row_number()窗口来实现. 如下.

postgres=# select id,info,cnt from (select id,info,cnt,row_number() over(partition by id order by cnt desc) as rn from (select id,info,count(*) cnt from test group by id,info) t) t where t.rn=1;    
 id | info  | cnt     
----+-------+-----    
  1 | test1 |   2    
  2 | test1 |   3    
  3 | test4 |   5    
(3 rows)    

其他, mode的返回结果数据类型和order by后面的表达式一致.

postgres=# select mode() within group (order by id) from test;    
 mode     
------    
    3    
(1 row)    
postgres=# select mode() within group (order by id+1) from test;    
 mode     
------    
    4    
(1 row)    

另外还有4个函数是和数据分布有关的, 需要指定从0到1的分布位置. 返回排序后, 在指定分布位置的值或表达式的值.

src/backend/utils/adt/orderedsetaggs.c    
    
    
        if (percentile < 0 || percentile > 1 || isnan(percentile))    
                ereport(ERROR,    
                                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),    
                                 errmsg("percentile value %g is not between 0 and 1",    
                                                percentile)));    

同时还需要注意区分连续分布和离散分布.

postgres=# create table test(id int, info text);    
CREATE TABLE    
postgres=# insert into test values (1,'test1');    
INSERT 0 1    
postgres=# insert into test values (2,'test2');    
INSERT 0 1    
postgres=# insert into test values (3,'test2');    
INSERT 0 1    
postgres=# insert into test values (4,'test2');    
INSERT 0 1    
postgres=# insert into test values (5,'test2');    
INSERT 0 1    
postgres=# insert into test values (6,'test2');    
INSERT 0 1    
postgres=# insert into test values (7,'test2');    
INSERT 0 1    
postgres=# insert into test values (8,'test3');    
INSERT 0 1    
postgres=# insert into test values (100,'test3');    
INSERT 0 1    
postgres=# insert into test values (1000,'test4');    
INSERT 0 1    
postgres=# select * from test;    
  id  | info      
------+-------    
    1 | test1    
    2 | test2    
    3 | test2    
    4 | test2    
    5 | test2    
    6 | test2    
    7 | test2    
    8 | test3    
  100 | test3    
 1000 | test4    
(10 rows)    

取连续分布的中位数可以用percentile_cont(0.5)来获得.

postgres=# select percentile_cont(0.5) within group (order by id) from test;    
 percentile_cont     
-----------------    
             5.5    
(1 row)    

这个5.5是怎么计算来的呢? 参考本文末尾 :

  If (CRN = FRN = RN) then the result is    
    (value of expression from row at RN)    
  Otherwise the result is    
    (CRN - RN) * (value of expression for row at FRN) +    
    (RN - FRN) * (value of expression for row at CRN)    

解释 :

N = 当前分组的行数 = 10

RN = (1+传入参数(N-1)) = (1+0.5(10-1)) = 5.5

CRN = ceiling(RN) = 6

FRN = floor(RN) = 5

value of expression for row at FRN : 当前分组内第FRN行的值 = 5

value of expression for row at CRN : 当前分组内第CRN行的值 = 6

所以最终中位数值 :

    (CRN - RN) * (value of expression for row at FRN) +    
    (RN - FRN) * (value of expression for row at CRN) =     
(6-5.5)*(5) + (5.5 - 5)*(6) = 5.5;    

使用info分组 :

postgres=# select percentile_cont(0.5) within group (order by id),info from test group by info;    
 percentile_cont | info      
-----------------+-------    
               1 | test1    
             4.5 | test2    
              54 | test3    
            1000 | test4    
(4 rows)    
验证这个值 4.5 test2 :
    2 | test2    
    3 | test2    
    4 | test2    
    5 | test2    
    6 | test2    
    7 | test2    

N = 当前分组的行数 = 6

RN = (1+传入参数(N-1)) = (1+0.5(6-1)) = 3.5

CRN = ceiling(RN) = 4

FRN = floor(RN) = 3

value of expression for row at FRN : 当前分组内第FRN行的值 = 4

value of expression for row at CRN : 当前分组内第CRN行的值 = 5

所以最终中位数值 :

    (CRN - RN) * (value of expression for row at FRN) +    
    (RN - FRN) * (value of expression for row at CRN) =     
(4-3.5)*(4) + (3.5 - 3)*(5) = 4.5;    

当输入参数为数组时, 返回值也是数组, 如下 :

postgres=# select percentile_cont(array[0.5, 1]) within group (order by id) from test;    
 percentile_cont     
-----------------    
 {5.5,1000}    
(1 row)    

接下来看一下稀疏分布 :

返回行号大于等于指定百分比的值或表达式值.

例如 :

postgres=# select id from test;    
  id      
------    
    1    
    2    
    3    
    4    
    5    
    6    
    7    
    8    
  100    
 1000    
(10 rows)    

当前组一共10行, 取位置在0.5的.即行号>=0.5*10的第一行的值或表达式的值.

postgres=# select percentile_disc(0.5) within group (order by id) from test;    
 percentile_disc     
-----------------    
               5    
(1 row)    
postgres=# select percentile_disc(0.5) within group (order by id^2) from test;    
 percentile_disc     
-----------------    
              25    
(1 row)    

输入0.11, 表示行号返回>=1.1的第一行的值.

postgres=# select percentile_disc(0.11) within group (order by id) from test;    
 percentile_disc     
-----------------    
               2    
(1 row)    

再看个例子

postgres=# select id,info,count(*) over (partition by info) from test;    
  id  | info  | count     
------+-------+-------    
    1 | test1 |     1    
    2 | test2 |     6    
    3 | test2 |     6    
    4 | test2 |     6    
    5 | test2 |     6    
    6 | test2 |     6    
    7 | test2 |     6    
    8 | test3 |     2    
  100 | test3 |     2    
 1000 | test4 |     1    
(10 rows)    

取分组的数据, 主要看test2 这个组一共有6行, 0.3*6=1.8, 所以它需要取第二行的数据.

postgres=# select info,percentile_disc(0.3) within group (order by id) from test group by info;    
 info  | percentile_disc     
-------+-----------------    
 test1 |               1    
 test2 |               3    
 test3 |               8    
 test4 |            1000    
(4 rows)    

注意

最终计算的是表达式的分布数, 而不是计算列值的分布数后再计算表达式.

验证如下 :

或者你可以看代码 :

postgres=# select percentile_cont(0.5) within group (order by id^2),info from test group by info;    
 percentile_cont | info      
-----------------+-------    
               1 | test1    
            20.5 | test2    
            5032 | test3    
         1000000 | test4    
(4 rows)    
    
postgres=# select percentile_cont(0.5) within group (order by id),info from test group by info;    
 percentile_cont | info      
-----------------+-------    
               1 | test1    
             4.5 | test2    
              54 | test3    
            1000 | test4    
(4 rows)    
    
postgres=# select 4.5^2;    
      ?column?           
---------------------    
 20.2500000000000000    
(1 row)    
    
postgres=# select 54^2;    
 ?column?     
----------    
     2916    
(1 row)    

Hypothetical-Set聚合函数

函数 | 返回类型 | 描述
—|—|—
rank()| bigint| rank of the current row with gaps; same as row_number of its first peer
dense_rank()| bigint| rank of the current row without gaps; this function counts peer groups
percent_rank()| double precision| relative rank of the current row: (rank - 1) / (total rows - 1)
cume_dist()| double precision| relative rank of the current row: (number of rows preceding or peer with current row) / (total rows)

the four ranking functions are defined so that they give the same answer for any two peer rows.

rank 返回值在分组内的等级, 如果值有重复的话, 跳级处理.

dense_rank 返回值在分组内的等级, 如果值有重复的话, 不跳级处理.

percent_rank 返回 (rank - 1) / (total rows - 1), rank指当前rank值, rows指当前组的记录数

cume_dist 返回(number of rows preceding or peer with current row) / (total rows), 即截至当前记录等级一共有多少行除以本组的总行数.

看一个例子比较明白.

postgres=# select *,row_number() over(partition by info order by id),rank() over(partition by info order by id),dense_rank() over(partition by info order by id),percent_rank() over(partition by info order by id),cume_dist() over(partition by info order by id) from test;    
  id  | info  | row_number | rank | dense_rank | percent_rank |     cume_dist         
------+-------+------------+------+------------+--------------+-------------------    
    1 | test1 |          1 |    1 |          1 |            0 |                 1    
    2 | test2 |          1 |    1 |          1 |            0 | 0.444444444444444    
    2 | test2 |          2 |    1 |          1 |            0 | 0.444444444444444    
    2 | test2 |          3 |    1 |          1 |            0 | 0.444444444444444    
    2 | test2 |          4 |    1 |          1 |            0 | 0.444444444444444    
    3 | test2 |          5 |    5 |          2 |          0.5 | 0.555555555555556    
    4 | test2 |          6 |    6 |          3 |        0.625 | 0.666666666666667    
    5 | test2 |          7 |    7 |          4 |         0.75 | 0.777777777777778    
    6 | test2 |          8 |    8 |          5 |        0.875 | 0.888888888888889    
    7 | test2 |          9 |    9 |          6 |            1 |                 1    
    8 | test3 |          1 |    1 |          1 |            0 |               0.5    
  100 | test3 |          2 |    2 |          2 |            1 |                 1    
 1000 | test4 |          1 |    1 |          1 |            0 |                 1    
(13 rows)    

算法 :

以info=’test2’这个组为例 :

    2 | test2 |          1 |    1 |          1 |            0 | 0.444444444444444    
    2 | test2 |          2 |    1 |          1 |            0 | 0.444444444444444    
    2 | test2 |          3 |    1 |          1 |            0 | 0.444444444444444    
    2 | test2 |          4 |    1 |          1 |            0 | 0.444444444444444    

id=2 的 rank和dense_rank都是1.

percent_rank 返回 (rank - 1) / (total rows - 1), rank指当前rank值, rows指当前组的记录数

cume_dist 返回(number of rows preceding or peer with current row) / (total rows), 截至当前记录等级一共有多少行除以本组的总行数.

所以

percent_rank = (1-1)/(9-1)=0    
cume_dist = (4)/(9) = 0.444444444444444    
    3 | test2 |          5 |    5 |          2 |          0.5 | 0.555555555555556    

rank = 5, 跳级

dense_rank = 2, 不跳级

percent_rank = (5-1)/(9-1)=0.5    
cume_dist = (5)/(9) = 0.555555555555556    

这些窗口函数的另一种用法, 聚合用法.

Function Direct Argument Type(s) Aggregated Argument Type(s) Return Type Description
rank(args) WITHIN GROUP (ORDER BY sorted_args) VARIADIC “any” VARIADIC “any” bigint rank of the hypothetical row, with gaps for duplicate rows
dense_rank(args) WITHIN GROUP (ORDER BY sorted_args) VARIADIC “any” VARIADIC “any” bigint rank of the hypothetical row, without gaps
percent_rank(args) WITHIN GROUP (ORDER BYsorted_args) VARIADIC “any” VARIADIC “any” double precision relative rank of the hypothetical row, ranging from 0 to 1
cume_dist(args) WITHIN GROUP (ORDER BY sorted_args) VARIADIC “any” VARIADIC “any” double precision relative rank of the hypothetical row, ranging from 1/N to 1

这些用法比较奇特, 其实是要返回给定参数在集合中的位置.

例如 :

1    
2    
3    
4    
5    

如果我们给一个参数值是2.2, 应该排在以上数据中的第三行.

例子 :

postgres=# select * from test order by info,id;    
  id  | info      
------+-------    
    1 | test1    
    2 | test2    
    2 | test2    
    2 | test2    
    2 | test2    
    3 | test2    
    4 | test2    
    5 | test2    
    6 | test2    
    7 | test2    
    8 | test3    
  100 | test3    
 1000 | test4    
(13 rows)    
    
postgres=# select info,rank(4.9) within group (order by id),dense_rank(4.9) within group (order by id) from test group by info;    
 info  | rank | dense_rank     
-------+------+------------    
 test1 |    2 |          2    
 test2 |    7 |          4    
 test3 |    1 |          1    
 test4 |    1 |          1    
(4 rows)    

4.9在test1这个分组, 排名第2, 并且这个分组只有1个值, 所以没有gap.

重点关注test2这个组, 这个组有9个值, 其中有4个重复值2, 所以4.9在这里排名需要考虑gap.

rank 返回7, 即4.9在这里考虑GAP排名第7

dense_rank 返回4, 即4.9在这里不考虑GAP排名第4.

又如 :

postgres=# select info,rank(5) within group (order by id),dense_rank(5) within group (order by id) from test group by info;    
 info  | rank | dense_rank     
-------+------+------------    
 test1 |    2 |          2    
 test2 |    7 |          4    
 test3 |    1 |          1    
 test4 |    1 |          1    
(4 rows)    
postgres=# select info,rank(5.1) within group (order by id),dense_rank(5.1) within group (order by id) from test group by info;    
 info  | rank | dense_rank     
-------+------+------------    
 test1 |    2 |          2    
 test2 |    8 |          5    
 test3 |    1 |          1    
 test4 |    1 |          1    
(4 rows)    

最后要看计算0~1代表位置的聚合函数percent_rank和cume_dist.

算法

percent_rank 返回 (rank - 1) / (total rows - 1), rank指当前rank值, rows指当前组的记录数

cume_dist 返回(number of rows preceding or peer with current row) / (total rows), 截至当前记录等级一共有多少行除以本组的总行数.

例子1 :

postgres=# select info,rank(4.9) within group (order by id),dense_rank(4.9) within group (order by id),percent_rank(4.9) within group (order by id),cume_dist(4.9) within group (order by id) from test group by info;    
 info  | rank | dense_rank |   percent_rank    |     cume_dist         
-------+------+------------+-------------------+-------------------    
 test1 |    2 |          2 |                 1 |                 1    
 test2 |    7 |          4 | 0.666666666666667 |               0.7    
 test3 |    1 |          1 |                 0 | 0.333333333333333    
 test4 |    1 |          1 |                 0 |               0.5    
(4 rows)    

同样以test2为分组, 讲解算法. 把4.9插入到这个分组后. 数据应该变成 :

    2 | test2    
    2 | test2    
    2 | test2    
    2 | test2    
    3 | test2    
    4 | test2    
    4.9 | test2  # 计算位置    
    5 | test2    
    6 | test2    
    7 | test2    

一共10行.

percent_rank 返回 (rank - 1) / (total rows - 1), rank指当前rank值, rows指当前组的记录数

cume_dist 返回(number of rows preceding or peer with current row) / (total rows), 截至当前记录等级一共有多少行除以本组的总行数.

所以4.9对应的percent_rank 和 cume_dist 分别为 :

percent_rank = (rank - 1) / (total rows - 1) = (7-1)/(10-1) = 0.666666666666667     
cume_dist = (7)/10 = 0.7    

例子2 :

postgres=# select info,rank(5) within group (order by id),dense_rank(5) within group (order by id),percent_rank(5) within group (order by id),cume_dist(5) within group (order by id) from test group by info;    
 info  | rank | dense_rank |   percent_rank    |     cume_dist         
-------+------+------------+-------------------+-------------------    
 test1 |    2 |          2 |                 1 |                 1    
 test2 |    7 |          4 | 0.666666666666667 |               0.8    
 test3 |    1 |          1 |                 0 | 0.333333333333333    
 test4 |    1 |          1 |                 0 |               0.5    
(4 rows)    

插入计算值5后, 数据变成

    2 | test2    
    2 | test2    
    2 | test2    
    2 | test2    
    3 | test2    
    4 | test2    
    5 | test2  # 计算位置, 即参数值    
    5 | test2    
    6 | test2    
    7 | test2    

依旧10行. 但是截至当前记录等级一共有多少行? 注意是8了.

percent_rank = (rank - 1) / (total rows - 1) = (7-1)/(10-1) = 0.666666666666667     
cume_dist = (8)/10 = 0.8    

例子3 :

postgres=# select info,rank(5.1) within group (order by id),dense_rank(5.1) within group (order by id),percent_rank(5.1) within group (order by id),cume_dist(5.1) within group (order by id) from test group by info;    
 info  | rank | dense_rank |   percent_rank    |     cume_dist         
-------+------+------------+-------------------+-------------------    
 test1 |    2 |          2 |                 1 |                 1    
 test2 |    8 |          5 | 0.777777777777778 |               0.8    
 test3 |    1 |          1 |                 0 | 0.333333333333333    
 test4 |    1 |          1 |                 0 |               0.5    
(4 rows)    

插入计算值5.1后, 数据变成 :

    2 | test2    
    2 | test2    
    2 | test2    
    2 | test2    
    3 | test2    
    4 | test2    
    5 | test2    
    5.1 | test2  # 计算位置, 即参数值    
    6 | test2    
    7 | test2    

例子4 :

postgres=# select info,rank(5) within group (order by id desc),dense_rank(5) within group (order by id desc),percent_rank(5) within group (order by id desc),cume_dist(5) within group (order by id desc) from test group by info;    
 info  | rank | dense_rank |   percent_rank    | cume_dist     
-------+------+------------+-------------------+-----------    
 test1 |    1 |          1 |                 0 |       0.5    
 test2 |    3 |          3 | 0.222222222222222 |       0.4    
 test3 |    3 |          3 |                 1 |         1    
 test4 |    2 |          2 |                 1 |         1    
(4 rows)    

插入计算值5后, 数据变成 :

    7 | test2    
    6 | test2    
    5 | test2  # 注意, 这才是计算位置, 即插入位置.    
    5 | test2    
    4 | test2    
    3 | test2    
    2 | test2    
    2 | test2    
    2 | test2    
    2 | test2    

多维分析

=> SELECT * FROM items_sold;    
 brand | size | sales    
-------+------+-------    
 Foo   | L    |  10    
 Foo   | M    |  20    
 Bar   | M    |  15    
 Bar   | L    |  5    
(4 rows)    
    
=> SELECT brand, size, sum(sales) FROM items_sold GROUP BY GROUPING SETS ((brand), (size), ());    
 brand | size | sum    
-------+------+-----    
 Foo   |      |  30    
 Bar   |      |  20    
       | L    |  15    
       | M    |  35    
       |      |  50    
(5 rows)    

其中GROUP BY GROUPING SETS ((brand), (size), ());

相当于以下三个group by的union all(未在分组的列以NULL代替。):

group by brand    
group by size    
group by ()    

分组集合除了可以用GROUPING SETS来指定,另外还提供了两个特殊的写法rollup和cube.

ROLLUP ( e1, e2, e3, ... )    

代表递减分组,一般用于异构结构的分组如国家,省份,城市,乡镇这样的结构查询。

逐级分组汇总结果,它相当于如下写法:

GROUPING SETS (    
    ( e1, e2, e3, ... ),    
    ...    
    ( e1, e2 )    
    ( e1 )    
    ( )  -- 注意包含全集    
)    

还有一种写法是CUBE

CUBE ( a, b, c )    

cube是任意组合,相当于:

GROUPING SETS (    
    ( a, b, c ),    
    ( a, b    ),    
    ( a,    c ),    
    ( a       ),    
    (    b, c ),    
    (    b    ),    
    (       c ),    
    (         )    -- 注意包含全集    
)    

在cube和rollup中使用括号可以将多个表达式作为单个表达式来处理:

ROLLUP ( a, (b,c), d )    

递减,相当于

GROUPING SETS (    
    ( a, b, c, d ),    
    ( a, b, c    ),    
    ( a          ),    
    (            )    
)    
CUBE ( (a,b), (c,d) )    

相当于:

GROUPING SETS (    
    ( a, b, c, d ),    
    ( a, b       ),    
    (       c, d ),    
    (            )    
)    

同时cube,rollup,grouping sets还可以混合使用:

GROUP BY a, CUBE(b,c), GROUPING SETS ((d), (e))    

相当于:

GROUP BY GROUPING SETS (    
  (a,b,c,d), (a,b,c,e),    
  (a,b,d),   (a,b,e),    
  (a,c,d),   (a,c,e),    
  (a,d),     (a,e)    
)    

既然分组聚合相当于多个group by 的union all,那么我们如何区分当前输出的记录是哪个分组group by的输出呢?

用grouping(cols)可以表示未参与聚合的表达式的比特位,并转换为INT输出。

例如:

=> SELECT * FROM items_sold;    
 make  | model | sales    
-------+-------+-------    
 Foo   | GT    |  10    
 Foo   | Tour  |  20    
 Bar   | City  |  15    
 Bar   | Sport |  5    
(4 rows)    

grouping()中必须包含group by后面的任意或所有列。

=> SELECT make, model, GROUPING(make,model), sum(sales) FROM items_sold GROUP BY ROLLUP(make,model);    
 make  | model | grouping | sum    
-------+-------+----------+-----    
 Foo   | GT    |        0 | 10    
 Foo   | Tour  |        0 | 20    
 Bar   | City  |        0 | 15    
 Bar   | Sport |        0 | 5    
 Foo   |       |        1 | 30    
 Bar   |       |        1 | 20    
       |       |        3 | 50    
(7 rows)    

grouping()中必须包含group by后面的任意或所有列,不能包含未参与聚合的列。

grouping()中的每个表达式用1个比特位表示。

postgres=# create table tbl(c1 int,c2 int,c3 int, c4 int, c5 int);    
CREATE TABLE    
postgres=# select c1,count(*),grouping(c1,c2,c3) from tbl group by cube(c1,c2,c3);    
 c1 | count | grouping     
----+-------+----------    
    |     0 |        7   -- b'111'::int = 7    
(1 row)    
    
postgres=# select c1,count(*),grouping(c1,c2) from tbl group by cube(c1,c2,c3);    
 c1 | count | grouping     
----+-------+----------    
    |     0 |        3  b'11'::int = 3    
(1 row)    
    
postgres=# select c1,count(*),grouping(c2) from tbl group by cube(c1,c2,c3);    
 c1 | count | grouping     
----+-------+----------    
    |     0 |        1  b'1'::int = 1    
(1 row)    
    
postgres=# insert into tbl values (1,2,3,4,5);    
INSERT 0 1    
postgres=# insert into tbl values (1,2,3,4,6);    
INSERT 0 1    
postgres=# insert into tbl values (2,3,4,5,6);    
INSERT 0 1    
postgres=# select c1,count(*),grouping(c2) from tbl group by cube(c1,c2,c3);    
 c1 | count | grouping     
----+-------+----------    
  1 |     2 |        0    
  1 |     2 |        0    
  1 |     2 |        1    
  2 |     1 |        0    
  2 |     1 |        0    
  2 |     1 |        1    
    |     3 |        1    
  1 |     2 |        1    
    |     2 |        1    
  2 |     1 |        1    
    |     1 |        1    
    |     2 |        0    
    |     2 |        0    
    |     1 |        0    
    |     1 |        0    
(15 rows)    

grouping中包含未参与聚合的列将报错:

postgres=# select c1,count(*),grouping(c4) from tbl group by cube(c1,c2,c3);    
ERROR:  arguments to GROUPING must be grouping expressions of the associated query level    
LINE 1: select c1,count(*),grouping(c4) from tbl group by cube(c1,c2...    
                                    ^    
postgres=# select c1,count(*),grouping(c1,c2,c3,c4) from tbl group by cube(c1,c2,c3);    
ERROR:  arguments to GROUPING must be grouping expressions of the associated query level    
LINE 1: select c1,count(*),grouping(c1,c2,c3,c4) from tbl group by c...    
                                             ^    
postgres=# select c1,count(*),grouping(c1,c2,c3) from tbl group by cube(c1,c2,c3);    
 c1 | count | grouping     
----+-------+----------    
  1 |     2 |        0    
  1 |     2 |        1    
  1 |     2 |        3    
  2 |     1 |        0    
  2 |     1 |        1    
  2 |     1 |        3    
    |     3 |        7    
  1 |     2 |        2    
    |     2 |        6    
  2 |     1 |        2    
    |     1 |        6    
    |     2 |        4    
    |     2 |        5    
    |     1 |        4    
    |     1 |        5    
(15 rows)    

2. 子查询

select子查询只能返回一列

postgres=# select (select * from (values (1,2),(2,3)) as t(c1,c2)) , relname, relkind from pg_class;    
ERROR:  subquery must return only one column    
LINE 1: select (select * from (values (1,2),(2,3)) as t(c1,c2)) , re...    
               ^    

select子查询只能返回一条记录

postgres=# select (select * from (values (1),(2)) as t(c1)) , relname, relkind from pg_class;    
ERROR:  more than one row returned by a subquery used as an expression    

子查询可以用在select子句也可以用在源中,还可以用在with, update from语句

postgres=# select (select * from (values (1),(2)) as t(c1) limit 1) , relname, relkind from pg_class;    
 c1 |                    relname                    | relkind     
----+-----------------------------------------------+---------    
  1 | pg_type                                       | r    
  1 | pg_toast_187550                               | t    
  1 | new_type                                      | c    
  1 | pg_toast_187550_index                         | i    
  1 | test                                          | r    
  1 | pg_toast_187556                               | t    
    
postgres=# select t.relname from (select * from pg_class limit 1) t , pg_class where t.relname=pg_class.relname;    
 relname     
---------    
 pg_type    
(1 row)    

3. union\union all\except\intersect

两条或者多条查询结果的合并、去重合并、相交、求差。

union all(不去重)

postgres=# select * from (values (1),(1)) t(id) union all select * from (values (2),(2)) t(id);  
 id   
----  
  1  
  1  
  2  
  2  
(4 rows)  

union(去重)

postgres=# select * from (values (1),(1)) t(id) union select * from (values (2),(2)) t(id);  
 id   
----  
  1  
  2  
(2 rows)  

except(去重)

postgres=# select * from (values (1),(1),(2)) t(id) except select * from (values (2),(2)) t(id);  
 id   
----  
  1  
(1 row)  

intersect(去重)

postgres=# select * from (values (1),(1),(2)) t(id) intersect select * from (values (2),(2)) t(id);  
 id   
----  
  2  
(1 row)  
  

4. 自连接

当某个表或者子句的A字段要和B字段进行关联时,可以使用自关联。

postgres=# create table tbl8(c1 int, c2 int, info text);  
CREATE TABLE  
postgres=# insert into tbl8 select generate_series(1,100), generate_series(2,101),md5(random()::text) ;  
INSERT 0 100  
  
postgres=# select t1.* from tbl8 t1, tbl8 t2 where t1.c1=t2.c2 and t2.c2<10 limit 10;  
 c1 | c2 |               info                 
----+----+----------------------------------  
  2 |  3 | b54fbfd843c2343330ceaa1758882ee4  
  3 |  4 | df999680d87435f2d2d8d5604aca2f1e  
  4 |  5 | 497ad2cfa8acd1f7062bb1bd7aa7a646  
  5 |  6 | ef55bcdc85dfa8d54978e4c2085ec55a  
  6 |  7 | 45f60117d6d577389707b22823a513e5  
  7 |  8 | ebde16ccaeced9a400a1608d591a2bf0  
  8 |  9 | fdeb1505dd02aca33abed20531592302  
  9 | 10 | 010d07f3b9b05b2c36b46440b1dd92aa  
(8 rows)  

5. 内连接

内连接,仅仅输出符合连接条件的记录。

digoal=# create table tbl_join_1(id int primary key,info text);  
CREATE TABLE  
digoal=# create table tbl_join_2(id int primary key,info text);  
CREATE TABLE  
digoal=# create table tbl_join_3(id int primary key,info text);  
CREATE TABLE  
digoal=# create table tbl_join_4(id int primary key,info text);  
CREATE TABLE  
digoal=# create table tbl_join_5(id int primary key,info text);  
CREATE TABLE  
digoal=# create table tbl_join_6(id int primary key,info text);  
CREATE TABLE  
digoal=# create table tbl_join_7(id int primary key,info text);  
CREATE TABLE  
digoal=# create table tbl_join_8(id int primary key,info text);  
CREATE TABLE  
digoal=# create table tbl_join_9(id int primary key,info text);  
CREATE TABLE  
  
digoal=# insert into tbl_join_1 select generate_series(1,10),md5(random()::text);  
digoal=# insert into tbl_join_2 select generate_series(1,100),md5(random()::text);  
digoal=# insert into tbl_join_3 select generate_series(1,1000),md5(random()::text);  
digoal=# insert into tbl_join_4 select generate_series(1,10000),md5(random()::text);  
digoal=# insert into tbl_join_5 select generate_series(1,100000),md5(random()::text);  
digoal=# insert into tbl_join_6 select generate_series(1,1000000),md5(random()::text);  
digoal=# insert into tbl_join_7 select generate_series(1,2000000),md5(random()::text);  
digoal=# insert into tbl_join_8 select generate_series(1,3000000),md5(random()::text);  
digoal=# insert into tbl_join_9 select generate_series(1,4000000),md5(random()::text);  

非显示JOIN

postgres=# select t1.* from tbl_join_1 t1, tbl_join_2 t2 where t1.id=t2.id and t2.id=1;  
 id |               info                 
----+----------------------------------  
  1 | 5b6049ec7b94d17c0bd92dc4da436311  
(1 row)  

显示JOIN

postgres=# select t1.* from tbl_join_1 t1 join tbl_join_2 t2 on(t1.id=t2.id and t2.id=1);  
 id |               info                 
----+----------------------------------  
  1 | 5b6049ec7b94d17c0bd92dc4da436311  
(1 row)  

优化器如何强制连接顺序?

当join_collapse_limit=1时, 显示JOIN的关联按SQL写法进行关联. 例如 :

以下写法非显示关联,无法强制JOIN顺序

digoal=# set join_collapse_limit=1;  
  
digoal=# explain select t1.info, t5.info from tbl_join_1 t1,  
tbl_join_2 t2,  
tbl_join_3 t3,  
tbl_join_4 t4,  
tbl_join_5 t5,  
tbl_join_6 t6,  
tbl_join_7 t7,  
tbl_join_8 t8,  
tbl_join_9 t9  
where   
t1.id=t2.id and  
t2.id=t3.id and  
t3.id=t4.id and  
t4.id=t5.id and  
t5.id=t6.id and  
t6.id=t7.id and  
t7.id=t8.id and  
t8.id=t9.id and  
t9.id=10000;  
  
                                                              QUERY PLAN                                                                
--------------------------------------------------------------------------------------------------------------------------------------  
 Nested Loop  (cost=2.72..25.99 rows=1 width=65)  
   ->  Nested Loop  (cost=2.29..22.93 rows=1 width=69)  
         ->  Nested Loop  (cost=1.86..19.87 rows=1 width=69)  
               ->  Nested Loop  (cost=1.43..16.82 rows=1 width=69)  
                     ->  Nested Loop  (cost=1.00..13.76 rows=1 width=69)  
                           ->  Nested Loop  (cost=0.71..10.84 rows=1 width=36)  
                                 ->  Nested Loop  (cost=0.43..7.93 rows=1 width=36)  
                                       ->  Nested Loop  (cost=0.15..5.03 rows=1 width=36)  
                                             ->  Index Scan using tbl_join_1_pkey on tbl_join_1 t1  (cost=0.15..2.77 rows=1 width=36)  
                                                   Index Cond: (id = 10000)  
                                             ->  Seq Scan on tbl_join_2 t2  (cost=0.00..2.25 rows=1 width=4)  
                                                   Filter: (id = 10000)  
                                       ->  Index Only Scan using tbl_join_3_pkey on tbl_join_3 t3  (cost=0.28..2.89 rows=1 width=4)  
                                             Index Cond: (id = 10000)  
                                 ->  Index Only Scan using tbl_join_4_pkey on tbl_join_4 t4  (cost=0.29..2.90 rows=1 width=4)  
                                       Index Cond: (id = 10000)  
                           ->  Index Scan using tbl_join_5_pkey on tbl_join_5 t5  (cost=0.29..2.91 rows=1 width=37)  
                                 Index Cond: (id = 10000)  
                     ->  Index Only Scan using tbl_join_6_pkey on tbl_join_6 t6  (cost=0.42..3.04 rows=1 width=4)  
                           Index Cond: (id = 10000)  
               ->  Index Only Scan using tbl_join_7_pkey on tbl_join_7 t7  (cost=0.43..3.04 rows=1 width=4)  
                     Index Cond: (id = 10000)  
         ->  Index Only Scan using tbl_join_8_pkey on tbl_join_8 t8  (cost=0.43..3.05 rows=1 width=4)  
               Index Cond: (id = 10000)  
   ->  Index Only Scan using tbl_join_9_pkey on tbl_join_9 t9  (cost=0.43..3.05 rows=1 width=4)  
         Index Cond: (id = 10000)  
(26 rows)  
  
Time: 8.398 ms  

以下写法显示关联,可以强制JOIN顺序

digoal=# set join_collapse_limit=1;  
  
digoal=# explain select t1.info, t5.info from   
tbl_join_1 t1 join tbl_join_2 t2 on (t1.id=t2.id)  
join tbl_join_3 t3 on (t2.id=t3.id)  
join tbl_join_4 t4 on (t3.id=t4.id)  
join tbl_join_5 t5 on (t4.id=t5.id)  
join tbl_join_6 t6 on (t5.id=t6.id)  
join tbl_join_7 t7 on (t6.id=t7.id)  
join tbl_join_8 t8 on (t7.id=t8.id)  
join tbl_join_9 t9 on (t8.id=t9.id)  
where t9.id=10000;  
  
                                                              QUERY PLAN                                                                
--------------------------------------------------------------------------------------------------------------------------------------  
 Nested Loop  (cost=2.72..25.99 rows=1 width=65)  
   ->  Nested Loop  (cost=2.29..22.93 rows=1 width=69)  
         ->  Nested Loop  (cost=1.86..19.87 rows=1 width=69)  
               ->  Nested Loop  (cost=1.43..16.82 rows=1 width=69)  
                     ->  Nested Loop  (cost=1.00..13.76 rows=1 width=69)  
                           ->  Nested Loop  (cost=0.71..10.84 rows=1 width=36)  
                                 ->  Nested Loop  (cost=0.43..7.93 rows=1 width=36)  
                                       ->  Nested Loop  (cost=0.15..5.03 rows=1 width=36)  
                                             ->  Index Scan using tbl_join_1_pkey on tbl_join_1 t1  (cost=0.15..2.77 rows=1 width=36)  
                                                   Index Cond: (id = 10000)  
                                             ->  Seq Scan on tbl_join_2 t2  (cost=0.00..2.25 rows=1 width=4)  
                                                   Filter: (id = 10000)  
                                       ->  Index Only Scan using tbl_join_3_pkey on tbl_join_3 t3  (cost=0.28..2.89 rows=1 width=4)  
                                             Index Cond: (id = 10000)  
                                 ->  Index Only Scan using tbl_join_4_pkey on tbl_join_4 t4  (cost=0.29..2.90 rows=1 width=4)  
                                       Index Cond: (id = 10000)  
                           ->  Index Scan using tbl_join_5_pkey on tbl_join_5 t5  (cost=0.29..2.91 rows=1 width=37)  
                                 Index Cond: (id = 10000)  
                     ->  Index Only Scan using tbl_join_6_pkey on tbl_join_6 t6  (cost=0.42..3.04 rows=1 width=4)  
                           Index Cond: (id = 10000)  
               ->  Index Only Scan using tbl_join_7_pkey on tbl_join_7 t7  (cost=0.43..3.04 rows=1 width=4)  
                     Index Cond: (id = 10000)  
         ->  Index Only Scan using tbl_join_8_pkey on tbl_join_8 t8  (cost=0.43..3.05 rows=1 width=4)  
               Index Cond: (id = 10000)  
   ->  Index Only Scan using tbl_join_9_pkey on tbl_join_9 t9  (cost=0.43..3.05 rows=1 width=4)  
         Index Cond: (id = 10000)  
(26 rows)  
  
Time: 0.829 ms  

设置join_collapse_limit=1后, 按照SQL写法进行关联. 执行计划的时间也缩短了.

6. 外连接

左外连接

左边的表,即使没有匹配,也输出

postgres=# create table tab1(id int, info text, crt_time timestamp);  
CREATE TABLE  
Time: 26.056 ms  
postgres=# create table tab2(id int, info text, crt_time timestamp);  
CREATE TABLE  
Time: 36.215 ms  
postgres=# insert into tab1 values (1,'test',now());  
INSERT 0 1  
Time: 0.520 ms  
postgres=# insert into tab1 values (2,'test',now());  
INSERT 0 1  
Time: 0.297 ms  
postgres=# insert into tab2 values (2,'test',now());  
INSERT 0 1  
Time: 11.325 ms  
postgres=# insert into tab2 values (3,'test',now());  
INSERT 0 1  
Time: 0.352 ms  
postgres=# select * from tab1;  
 id | info |          crt_time            
----+------+----------------------------  
  1 | test | 2017-04-11 17:48:29.37083  
  2 | test | 2017-04-11 17:48:32.742795  
(2 rows)  
  
Time: 0.506 ms  
postgres=# select * from tab2;  
 id | info |          crt_time            
----+------+----------------------------  
  2 | test | 2017-04-11 17:48:39.722821  
  3 | test | 2017-04-11 17:48:41.901834  
(2 rows)  
  
Time: 0.335 ms  

右表没有被匹配时,输出NULL

postgres=# select * from tab1 left join tab2 on (tab1.id=tab2.id);  
 id | info |          crt_time          | id | info |          crt_time            
----+------+----------------------------+----+------+----------------------------  
  1 | test | 2017-04-11 17:48:29.37083  |    |      |   
  2 | test | 2017-04-11 17:48:32.742795 |  2 | test | 2017-04-11 17:48:39.722821  
(2 rows)  

过滤在A表但是不在B表的记录。

postgres=# select tab1.* from tab1 left join tab2 on (tab1.id=tab2.id) where tab2.* is null;  
 id | info |         crt_time            
----+------+---------------------------  
  1 | test | 2017-04-11 17:48:29.37083  
(1 row)  
  
postgres=# select * from tab1 where id in (select id from tab1 except select id from tab2);  
 id | info |         crt_time            
----+------+---------------------------  
  1 | test | 2017-04-11 17:48:29.37083  
(1 row)  

通过外部where filter,可以过滤连接=false的记录

postgres=# select * from tab1 left join tab2 on (tab1.id=tab2.id) where tab1.id=1;  
 id | info |         crt_time          | id | info | crt_time   
----+------+---------------------------+----+------+----------  
  1 | test | 2017-04-11 17:48:29.37083 |    |      |   
(1 row)  
  
postgres=# explain select * from tab1 left join tab2 on (tab1.id=tab2.id) where tab1.id=1;  
                            QUERY PLAN                              
------------------------------------------------------------------  
 Nested Loop Left Join  (cost=0.00..48.81 rows=36 width=88)  
   Join Filter: (tab1.id = tab2.id)  
   ->  Seq Scan on tab1  (cost=0.00..24.12 rows=6 width=44)  
         Filter: (id = 1)  
   ->  Materialize  (cost=0.00..24.16 rows=6 width=44)  
         ->  Seq Scan on tab2  (cost=0.00..24.12 rows=6 width=44)  
               Filter: (id = 1)  
(7 rows)  

join内的条件,不会过滤未连接的记录

postgres=# select * from tab1 left join tab2 on (tab1.id=tab2.id and tab1.id=1);  
 id | info |          crt_time          | id | info | crt_time   
----+------+----------------------------+----+------+----------  
  1 | test | 2017-04-11 17:48:29.37083  |    |      |   
  2 | test | 2017-04-11 17:48:32.742795 |    |      |   
(2 rows)  
  
postgres=# explain select * from tab1 left join tab2 on (tab1.id=tab2.id and tab1.id=1);  
                             QUERY PLAN                                
---------------------------------------------------------------------  
 Hash Left Join  (cost=35.42..298.96 rows=1130 width=88)  
   Hash Cond: (tab1.id = tab2.id)  
   Join Filter: (tab1.id = 1)  
   ->  Seq Scan on tab1  (cost=0.00..21.30 rows=1130 width=44)  
   ->  Hash  (cost=21.30..21.30 rows=1130 width=44)  
         ->  Seq Scan on tab2  (cost=0.00..21.30 rows=1130 width=44)  
(6 rows)  

右外连接

左表没有被匹配时,输出NULL

postgres=# select * from tab1 right join tab2 on (tab1.id=tab2.id);  
 id | info |          crt_time          | id | info |          crt_time            
----+------+----------------------------+----+------+----------------------------  
  2 | test | 2017-04-11 17:48:32.742795 |  2 | test | 2017-04-11 17:48:39.722821  
    |      |                            |  3 | test | 2017-04-11 17:48:41.901834  
(2 rows)  

全外连接

没有被匹配时,输出NULL

postgres=# select * from tab1 full join tab2 on (tab1.id=tab2.id);  
 id | info |          crt_time          | id | info |          crt_time            
----+------+----------------------------+----+------+----------------------------  
  1 | test | 2017-04-11 17:48:29.37083  |    |      |   
  2 | test | 2017-04-11 17:48:32.742795 |  2 | test | 2017-04-11 17:48:39.722821  
    |      |                            |  3 | test | 2017-04-11 17:48:41.901834  
(3 rows)  

7. 窗口查询

https://www.postgresql.org/docs/9.6/static/functions-window.html

常用窗口函数

Function Return Type Description
row_number() bigint number of the current row within its partition, counting from 1
rank() bigint rank of the current row with gaps; same as row_number of its first peer
dense_rank() bigint rank of the current row without gaps; this function counts peer groups
percent_rank() double precision relative rank of the current row: (rank - 1) / (total rows - 1)
cume_dist() double precision relative rank of the current row: (number of rows preceding or peer with current row) / (total rows)
ntile(num_buckets integer) integer integer ranging from 1 to the argument value, dividing the partition as equally as possible
lag(value anyelement [, offset integer [, default anyelement ]]) same type as value returns value evaluated at the row that is offset rows before the current row within the partition; if there is no such row, instead return default (which must be of the same type as value). Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to null
lead(value anyelement [, offset integer [, default anyelement ]]) same type as value returns value evaluated at the row that is offset rows after the current row within the partition; if there is no such row, instead return default (which must be of the same type as value). Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to null
first_value(value any) same type as value returns value evaluated at the row that is the first row of the window frame
last_value(value any) same type as value returns value evaluated at the row that is the last row of the window frame
nth_value(value any, nth integer) same type as value returns value evaluated at the row that is the nth row of the window frame (counting from 1); null if no such row

其他函数也可用在窗口语法中

例子

CREATE TABLE window_test(id int, name text, subject text, score numeric);  
INSERT INTO window_test VALUES(1,'digoal','数学',99.5),  
(2,'digoal','语文',89.5),  
(3,'digoal','英语',79.5),  
(4,'digoal','物理',99.5),  
(5,'digoal','化学',98.5),  
(6,'刘德华','数学',89.5),  
(7,'刘德华','语文',99.5),  
(8,'刘德华','英语',79.5),  
(9,'刘德华','物理',89.5),  
(10,'刘德华','化学',69.5),  
(11,'张学友','数学',89.5),  
(12,'张学友','语文',91.5),  
(13,'张学友','英语',92.5),  
(14,'张学友','物理',93.5),  
(15,'张学友','化学',94.5);  

– 取出每门课程的第一名.

SELECT id,name,subject,score FROM   
  (SELECT row_number() OVER (PARTITION BY subject ORDER BY score DESC) AS rn,* FROM window_test) AS t   
WHERE rn=1 ORDER BY SUBJECT;  
 id |  name  | subject | score   
----+--------+---------+-------  
  5 | digoal | 化学    |  98.5  
  1 | digoal | 数学    |  99.5  
  4 | digoal | 物理    |  99.5  
 13 | 张学友 | 英语    |  92.5  
  7 | 刘德华 | 语文    |  99.5  

每个人每门成绩与第一名的差距,第一名,排名

postgres=# SELECT row_number() over(partition by subject order by score desc) as ord,
id,name,subject,score,max(score) over (partition by subject) as top1, 
max(score) over (partition by subject) - score as diff from window_test ;  
 ord | id |  name  | subject | score | top1 | diff   
-----+----+--------+---------+-------+------+------  
   1 |  5 | digoal | 化学    |  98.5 | 98.5 |  0.0  
   2 | 15 | 张学友 | 化学    |  94.5 | 98.5 |  4.0  
   3 | 10 | 刘德华 | 化学    |  69.5 | 98.5 | 29.0  
   1 |  1 | digoal | 数学    |  99.5 | 99.5 |  0.0  
   2 | 11 | 张学友 | 数学    |  89.5 | 99.5 | 10.0  
   3 |  6 | 刘德华 | 数学    |  89.5 | 99.5 | 10.0  
   1 |  4 | digoal | 物理    |  99.5 | 99.5 |  0.0  
   2 | 14 | 张学友 | 物理    |  93.5 | 99.5 |  6.0  
   3 |  9 | 刘德华 | 物理    |  89.5 | 99.5 | 10.0  
   1 | 13 | 张学友 | 英语    |  92.5 | 92.5 |  0.0  
   2 |  3 | digoal | 英语    |  79.5 | 92.5 | 13.0  
   3 |  8 | 刘德华 | 英语    |  79.5 | 92.5 | 13.0  
   1 |  7 | 刘德华 | 语文    |  99.5 | 99.5 |  0.0  
   2 | 12 | 张学友 | 语文    |  91.5 | 99.5 |  8.0  
   3 |  2 | digoal | 语文    |  89.5 | 99.5 | 10.0  
(15 rows)  

Flag Counter

digoal’s 大量PostgreSQL文章入口