population & sample covariance, standard deviation Aggregate in PostgreSQL

4 minute read

背景

PostgreSQL自带了一些常用的统计学聚合函数, 非常好用.

本文介绍一下方差和标准差的一些聚合函数.

总体方差 : population covariance

总体标准差 : population standard deviation

样本方差 : sample covariance

样本标准差 : sample standard deviation

均值 : mean

样本均值和样本方差的介绍 :

http://en.wikipedia.org/wiki/Sample_mean_and_sample_covariance

均值介绍 :

http://en.wikipedia.org/wiki/Mean

对方差, 标准差, 均值不了解的话, 建议参考网易公开课, 统计学.

浅显易懂.

http://v.163.com/special/Khan/khstatistics.html

http://v.ku6.com/playlist/index_6598382.html

PostgreSQL计算方差, 标准差的聚合函数如下 :

http://www.postgresql.org/docs/devel/static/functions-aggregate.html

函数	参数类型	返回值类型	描述
stddev(expression)	smallint, int, bigint, real, double precision, or numeric	double precision for floating-point arguments, otherwise numeric	historical alias for stddev_samp
stddev_pop(expression)	smallint, int, bigint, real, double precision, or numeric	double precision for floating-point arguments, otherwise numeric	population standard deviation of the input values
stddev_samp(expression)	smallint, int, bigint, real, double precision, or numeric	double precision for floating-point arguments, otherwise numeric	sample standard deviation of the input values
variance(expression)	smallint, int, bigint, real, double precision, or numeric	double precision for floating-point arguments, otherwise numeric	historical alias for var_samp
var_pop(expression)	smallint, int, bigint, real, double precision, or numeric	double precision for floating-point arguments, otherwise numeric	population variance of the input values (square of the population standard deviation)
var_samp(expression)	smallint, int, bigint, real, double precision, or numeric	double precision for floating-point arguments, otherwise numeric	sample variance of the input values (square of the sample standard deviation)

其中stddev和variance是stddev_samp和var_samp的别名.

这些函数用于计算数据集的总体/样本方差,总体/样本标准差.

例如 :

1,2,3,100 这组数据共4个值, 总体均值和样本均值分别为 :

(1+2+3+100)/4 = 26.5

总体方差 : ((1-26.5)^2 + (2-26.5)^2 + (3-26.5)^2 + (100-26.5)^2)/4 = 1801.25

样本方差 : ((1-26.5)^2 + (2-26.5)^2 + (3-26.5)^2 + (100-26.5)^2)/(4-1) = 2401.6666….

总体标准差 : 平方根(总体方差) = 42.4411357058220109

样本标准差 : 平方根(样本方差) = 49.0068022489395513

使用PostgreSQL计算如下 :

postgres=# select variance(id) from (values(1),(2),(3),(100)) as t(id);  
       variance          
-----------------------  
 2401.6666666666666667  
(1 row)  
postgres=# select var_pop(id) from (values(1),(2),(3),(100)) as t(id);  
        var_pop          
-----------------------  
 1801.2500000000000000  
(1 row)  
postgres=# select var_samp(id) from (values(1),(2),(3),(100)) as t(id);  
       var_samp          
-----------------------  
 2401.6666666666666667  
(1 row)  
postgres=# select stddev(id) from (values(1),(2),(3),(100)) as t(id);  
       stddev          
---------------------  
 49.0068022489395513  
(1 row)  
postgres=# select stddev_pop(id) from (values(1),(2),(3),(100)) as t(id);  
     stddev_pop        
---------------------  
 42.4411357058220109  
(1 row)  
postgres=# select stddev_samp(id) from (values(1),(2),(3),(100)) as t(id);  
     stddev_samp       
---------------------  
 49.0068022489395513  
(1 row)  

参考

1. src/backend/utils/adt/float.c

/*  
 *              =========================  
 *              FLOAT AGGREGATE OPERATORS  
 *              =========================  
 *  
 *              float8_accum            - accumulate for AVG(), variance aggregates, etc.  
 *              float4_accum            - same, but input data is float4  
 *              float8_avg                      - produce final result for float AVG()  
 *              float8_var_samp         - produce final result for float VAR_SAMP()  
 *              float8_var_pop          - produce final result for float VAR_POP()  
 *              float8_stddev_samp      - produce final result for float STDDEV_SAMP()  
 *              float8_stddev_pop       - produce final result for float STDDEV_POP()  
 *  
 * The transition datatype for all these aggregates is a 3-element array  
 * of float8, holding the values N, sum(X), sum(X*X) in that order.  
 *  
 * Note that we represent N as a float to avoid having to build a special  
 * datatype.  Given a reasonable floating-point implementation, there should  
 * be no accuracy loss unless N exceeds 2 ^ 52 or so (by which time the  
 * user will have doubtless lost interest anyway...)  
 */  
..................  
Datum  
float8_var_pop(PG_FUNCTION_ARGS)  
{  
        ArrayType  *transarray = PG_GETARG_ARRAYTYPE_P(0);  
        float8     *transvalues;  
        float8          N,  
                                sumX,  
                                sumX2,  
                                numerator;  
  
        transvalues = check_float8_array(transarray, "float8_var_pop", 3);  
        N = transvalues[0];  
        sumX = transvalues[1];  
        sumX2 = transvalues[2];  
  
        /* Population variance is undefined when N is 0, so return NULL */  
        if (N == 0.0)  
                PG_RETURN_NULL();  
  
        numerator = N * sumX2 - sumX * sumX;  
        CHECKFLOATVAL(numerator, isinf(sumX2) || isinf(sumX), true);  
  
        /* Watch out for roundoff error producing a negative numerator */  
        if (numerator <= 0.0)  
                PG_RETURN_FLOAT8(0.0);  
  
        PG_RETURN_FLOAT8(numerator / (N * N));  
}  
  
Datum  
float8_var_samp(PG_FUNCTION_ARGS)  
{  
        ArrayType  *transarray = PG_GETARG_ARRAYTYPE_P(0);  
        float8     *transvalues;  
        float8          N,  
                                sumX,  
                                sumX2,  
                                numerator;  
  
        transvalues = check_float8_array(transarray, "float8_var_samp", 3);  
        N = transvalues[0];  
        sumX = transvalues[1];  
        sumX2 = transvalues[2];  
  
        /* Sample variance is undefined when N is 0 or 1, so return NULL */  
        if (N <= 1.0)  
                PG_RETURN_NULL();  
  
        numerator = N * sumX2 - sumX * sumX;  
        CHECKFLOATVAL(numerator, isinf(sumX2) || isinf(sumX), true);  
  
        /* Watch out for roundoff error producing a negative numerator */  
        if (numerator <= 0.0)  
                PG_RETURN_FLOAT8(0.0);  
  
        PG_RETURN_FLOAT8(numerator / (N * (N - 1.0)));  
}  
  
Datum  
float8_stddev_pop(PG_FUNCTION_ARGS)  
{  
        ArrayType  *transarray = PG_GETARG_ARRAYTYPE_P(0);  
        float8     *transvalues;  
        float8          N,  
                                sumX,  
                                sumX2,  
                                numerator;  
  
        transvalues = check_float8_array(transarray, "float8_stddev_pop", 3);  
        N = transvalues[0];  
        sumX = transvalues[1];  
        sumX2 = transvalues[2];  
  
        /* Population stddev is undefined when N is 0, so return NULL */  
        if (N == 0.0)  
                PG_RETURN_NULL();  
  
        numerator = N * sumX2 - sumX * sumX;  
        CHECKFLOATVAL(numerator, isinf(sumX2) || isinf(sumX), true);  
  
        /* Watch out for roundoff error producing a negative numerator */  
        if (numerator <= 0.0)  
                PG_RETURN_FLOAT8(0.0);  
  
        PG_RETURN_FLOAT8(sqrt(numerator / (N * N)));  
}  
  
Datum  
float8_stddev_samp(PG_FUNCTION_ARGS)  
{  
        ArrayType  *transarray = PG_GETARG_ARRAYTYPE_P(0);  
        float8     *transvalues;  
        float8          N,  
                                sumX,  
                                sumX2,  
                                numerator;  
  
        transvalues = check_float8_array(transarray, "float8_stddev_samp", 3);  
        N = transvalues[0];  
        sumX = transvalues[1];  
        sumX2 = transvalues[2];  
  
        /* Sample stddev is undefined when N is 0 or 1, so return NULL */  
        if (N <= 1.0)  
                PG_RETURN_NULL();  
  
        numerator = N * sumX2 - sumX * sumX;  
        CHECKFLOATVAL(numerator, isinf(sumX2) || isinf(sumX), true);  
  
        /* Watch out for roundoff error producing a negative numerator */  
        if (numerator <= 0.0)  
                PG_RETURN_FLOAT8(0.0);  
  
        PG_RETURN_FLOAT8(sqrt(numerator / (N * (N - 1.0))));  
}  

digoal’s 大量PostgreSQL文章入口

Twitter Facebook Google+ LinkedIn

Digoal.zhou

population & sample covariance, standard deviation Aggregate in PostgreSQL

背景

参考

digoal’s 大量PostgreSQL文章入口

You May Also Enjoy

PostgreSQL(PPAS 兼容Oracle) 从零开始入门手册 - 珍藏版

PostgreSQL pipelinedb 流计算插件 - IoT应用 - 实时轨迹聚合

PostgreSQL plpgsql 存储过程、函数 - 状态、异常变量打印、异常捕获… - GET [STACKED] DIAGNOSTICS

PostgreSQL datediff 日期间隔（单位转换）兼容SQL用法