PostgreSQL Statistics - Linear Correlation Between Logical and Physical Storage Order


Background

Among PostgreSQL's statistics there is a correlation statistic, visible in pg_stats.correlation.

Its value ranges from -1 to 1: values approaching -1 indicate negative correlation, values approaching 1 indicate positive correlation, and values near 0 indicate no correlation.

postgres=# \d pg_stats  
          View "pg_catalog.pg_stats"  
         Column         |   Type   | Modifiers   
------------------------+----------+-----------  
 schemaname             | name     |   
 tablename              | name     |   
 attname                | name     |   
 inherited              | boolean  |   
 null_frac              | real     |   
 avg_width              | integer  |   
 n_distinct             | real     |   
 most_common_vals       | anyarray |   
 most_common_freqs      | real[]   |   
 histogram_bounds       | anyarray |   
 correlation            | real     |   
 most_common_elems      | anyarray |   
 most_common_elem_freqs | real[]   |   
 elem_count_histogram   | real[]   |   

What does correlation mean here?

It is the correlation between the column's physical storage order and its logical (sorted value) order.

The higher the correlation, the fewer scattered heap-block reads an index scan incurs; in other words, the higher the correlation, the lower the cost of an index scan.

Correlation also has important applications in other fields, for example the relationship between advertising spend and sales. Consider the example cited from Baidu Baike:

A software company has many resellers across the country. To study the relationship between advertising spend on its financial software and sales, statisticians randomly selected 10 resellers, collected each one's annual advertising spend and average monthly sales, and compiled the correlation table shown in Table 1:

Table 1: Advertising spend vs. average monthly sales (unit: 10,000 yuan)

Annual advertising spend (x) | 12.5  15.3  23.2  26.4  33.5  34.4  39.4  45.2  55.4  60.9  
Average monthly sales (y)    | 21.2  23.9  32.9  34.1  42.5  43.2  49.0  52.8  59.4  63.5  

Based on Table 1, the correlation coefficient can be computed as in Table 2:

Table 2: Computing the correlation coefficient

No.  | 1       2       3       4        5        6        7        8        9        10  
x    | 12.5    15.3    23.2    26.4     33.5     34.4     39.4     45.2     55.4     60.9  
y    | 21.2    23.9    32.9    34.1     42.5     43.2     49.0     52.8     59.4     63.5  
x^2  | 156.25  234.09  538.24  696.96   1122.25  1183.36  1552.36  2043.04  3069.16  3708.81  
y^2  | 449.44  571.21  1082.41 1162.81  1806.25  1866.24  2401.00  2787.84  3528.36  4032.25  
xy   | 265.00  365.67  763.28  900.24   1423.75  1486.08  1930.60  2386.56  3290.76  3867.15  

Totals: Σx = 346.2, Σy = 422.5, Σx^2 = 14304.52, Σy^2 = 19687.81, Σxy = 16679.09

r = (nΣxy - ΣxΣy) / sqrt((nΣx^2 - (Σx)^2)(nΣy^2 - (Σy)^2))  
  = (10*16679.09 - 346.2*422.5) / sqrt((10*14304.52 - 346.2^2)(10*19687.81 - 422.5^2))  
  = 0.9942  

The correlation coefficient of 0.9942 indicates a strong positive linear relationship between advertising spend and average monthly sales.

The higher the correlation, the more pronounced the relationship between advertising spend and sales.

How is the correlation computed? It is the covariance of x and y divided by the square root of the product of their variances: covariance(x, y) / sqrt(variance(x) * variance(y)).

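That definition can be sketched directly in code (a plain-Python illustration, not PostgreSQL code), using the advertising data from Table 1:

```python
from math import sqrt

# Table 1: annual advertising spend (x) and average monthly sales (y), in 10,000 yuan
x = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]
y = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# covariance(x, y) / sqrt(variance(x) * variance(y));
# the 1/n factors cancel, so plain sums of deviations suffice
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
var_x = sum((a - mean_x) ** 2 for a in x)
var_y = sum((b - mean_y) ** 2 for b in y)

r = cov / sqrt(var_x * var_y)
print(round(r, 4))  # 0.9942, matching Table 2
```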

PostgreSQL Statistics and Linear Correlation in Practice

Similar statistics can be applied in operations work: for example, a server's memory usage, load, process count, network throughput, request volume, and request response time can all be correlated against each other to observe how they relate.

Now to the main topic: how does PostgreSQL compute the correlation between a column's logical and physical order?

First, look at how the correlation column of the pg_stats view is derived:

postgres=# \d+ pg_stats  
        CASE  
            WHEN s.stakind1 = 3 THEN s.stanumbers1[1]  
            WHEN s.stakind2 = 3 THEN s.stanumbers2[1]  
            WHEN s.stakind3 = 3 THEN s.stanumbers3[1]  
            WHEN s.stakind4 = 3 THEN s.stanumbers4[1]  
            WHEN s.stakind5 = 3 THEN s.stanumbers5[1]  
            ELSE NULL::real  
        END AS correlation,  
...  
   FROM pg_statistic s  
     JOIN pg_class c ON c.oid = s.starelid  
     JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum  
     LEFT JOIN pg_namespace n ON n.oid = c.relnamespace  
  WHERE NOT a.attisdropped AND has_column_privilege(c.oid, a.attnum, 'select'::text);  

It comes from the pg_statistic table; the correlation statistic is collected during ANALYZE.

The correlation is computed by the code below. Note that it is a sampled statistic:

src/backend/commands/analyze.c

                /*  
                 * Now scan the values in order, find the most common ones, and also  
                 * accumulate ordering-correlation statistics.  
                 *  
                 * To determine which are most common, we first have to count the  
                 * number of duplicates of each value.  The duplicates are adjacent in  
                 * the sorted list, so a brute-force approach is to compare successive  
                 * datum values until we find two that are not equal. However, that  
                 * requires N-1 invocations of the datum comparison routine, which are  
                 * completely redundant with work that was done during the sort.  (The  
                 * sort algorithm must at some point have compared each pair of items  
                 * that are adjacent in the sorted order; otherwise it could not know  
                 * that it's ordered the pair correctly.) We exploit this by having  
                 * compare_scalars remember the highest tupno index that each  
                 * ScalarItem has been found equal to.  At the end of the sort, a  
                 * ScalarItem's tupnoLink will still point to itself if and only if it  
                 * is the last item of its group of duplicates (since the group will  
                 * be ordered by tupno).  
                 */  
                corr_xysum = 0;  
                ndistinct = 0;  
                nmultiple = 0;  
                dups_cnt = 0;  
                for (i = 0; i < values_cnt; i++)  
                {  
                        int                     tupno = values[i].tupno;  
  
                        corr_xysum += ((double) i) * ((double) tupno);  
                        dups_cnt++;  
                        if (tupnoLink[tupno] == tupno)  
                        {  
                                /* Reached end of duplicates of this value */  
                                ndistinct++;  
                                if (dups_cnt > 1)  
                                {  
                                        nmultiple++;  
                                        if (track_cnt < num_mcv ||  
                                                dups_cnt > track[track_cnt - 1].count)  
                                        {  
                                                /*  
                                                 * Found a new item for the mcv list; find its  
                                                 * position, bubbling down old items if needed. Loop  
                                                 * invariant is that j points at an empty/ replaceable  
                                                 * slot.  
                                                 */  
                                                int                     j;  
  
                                                if (track_cnt < num_mcv)  
                                                        track_cnt++;  
                                                for (j = track_cnt - 1; j > 0; j--)  
                                                {  
                                                        if (dups_cnt <= track[j - 1].count)  
                                                                break;  
                                                        track[j].count = track[j - 1].count;  
                                                        track[j].first = track[j - 1].first;  
                                                }  
                                                track[j].count = dups_cnt;  
                                                track[j].first = i + 1 - dups_cnt;  
                                        }  
                                }  
                                dups_cnt = 0;  
                        }  
                }  
  
.........................  
                /* Generate a correlation entry if there are multiple values */  
                if (values_cnt > 1)  
                {  
                        MemoryContext old_context;  
                        float4     *corrs;  
                        double          corr_xsum,  
                                                corr_x2sum;  
  
                        /* Must copy the target values into anl_context */  
                        old_context = MemoryContextSwitchTo(stats->anl_context);  
                        corrs = (float4 *) palloc(sizeof(float4));  
                        MemoryContextSwitchTo(old_context);  
  
                        /*----------  
                         * Since we know the x and y value sets are both  
                         *              0, 1, ..., values_cnt-1  
                         * we have sum(x) = sum(y) =  
                         *              (values_cnt-1)*values_cnt / 2  
                         * and sum(x^2) = sum(y^2) =  
                         *              (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.  
                         *----------  
                         */  
                        corr_xsum = ((double) (values_cnt - 1)) *  
                                ((double) values_cnt) / 2.0;  
                        corr_x2sum = ((double) (values_cnt - 1)) *  
                                ((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;  
  
                        /* And the correlation coefficient reduces to */  
                        corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) /  
                                (values_cnt * corr_x2sum - corr_xsum * corr_xsum);  
  
                        stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;  
                        stats->staop[slot_idx] = mystats->ltopr;  
                        stats->stanumbers[slot_idx] = corrs;  
                        stats->numnumbers[slot_idx] = 1;  
                        slot_idx++;  
                }  
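To see what the simplification above does, here is a plain-Python sketch (an illustration, not the actual server code) that mimics the loop: x is each value's rank in sorted order, y (tupno) is its physical row position, and since both sets are exactly {0, 1, ..., n-1}, their sums collapse into closed forms. The sample data is the nine-row table used later in this post:

```python
# Sample column values in physical (row) order
values = [2, 5, 8, 3, 4, 6, 9, 7, 1]
n = len(values)

# Sort (value, physical position) pairs by value, as ANALYZE sorts sampled rows
pairs = sorted((v, tupno) for tupno, v in enumerate(values))

# corr_xysum: sum over sorted rank i of i * tupno
corr_xysum = sum(i * tupno for i, (_, tupno) in enumerate(pairs))

# Both x and y are the set {0, 1, ..., n-1}, so their sums are closed-form
corr_xsum = (n - 1) * n / 2.0
corr_x2sum = (n - 1) * n * (2 * n - 1) / 6.0

# The correlation coefficient reduces to:
corr = (n * corr_xysum - corr_xsum ** 2) / (n * corr_x2sum - corr_xsum ** 2)
print(corr)  # 0.11666..., matching pg_stats.correlation below
```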

PostgreSQL also provides an aggregate function, corr, for computing correlation.

Reference

http://www.postgresql.org/docs/9.4/static/functions-aggregate.html

The corr implementation is as follows:

src/backend/utils/adt/float.c

Datum  
float8_corr(PG_FUNCTION_ARGS)  
{  
        ArrayType  *transarray = PG_GETARG_ARRAYTYPE_P(0);  
        float8     *transvalues;  
        float8          N,  
                                sumX,  
                                sumX2,  
                                sumY,  
                                sumY2,  
                                sumXY,  
                                numeratorX,  
                                numeratorY,  
                                numeratorXY;  
  
        transvalues = check_float8_array(transarray, "float8_corr", 6);  
        N = transvalues[0];  
        sumX = transvalues[1];  
        sumX2 = transvalues[2];  
        sumY = transvalues[3];  
        sumY2 = transvalues[4];  
        sumXY = transvalues[5];  
  
        /* if N is 0 we should return NULL */  
        if (N < 1.0)  
                PG_RETURN_NULL();  
  
        numeratorX = N * sumX2 - sumX * sumX;  
        CHECKFLOATVAL(numeratorX, isinf(sumX2) || isinf(sumX), true);  
        numeratorY = N * sumY2 - sumY * sumY;  
        CHECKFLOATVAL(numeratorY, isinf(sumY2) || isinf(sumY), true);  
        numeratorXY = N * sumXY - sumX * sumY;  
        CHECKFLOATVAL(numeratorXY, isinf(sumXY) || isinf(sumX) ||  
                                  isinf(sumY), true);  
        if (numeratorX <= 0 || numeratorY <= 0)  
                PG_RETURN_NULL();  
  
        PG_RETURN_FLOAT8(numeratorXY / sqrt(numeratorX * numeratorY));  
}  
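The same formula can be written from the six running sums float8_corr keeps in its transition state (a Python sketch for illustration, verified here against the advertising data from Table 1):

```python
from math import sqrt

def corr_from_sums(xs, ys):
    """Pearson correlation from the six accumulated sums float8_corr uses."""
    n = float(len(xs))
    sum_x = sum(xs)
    sum_x2 = sum(v * v for v in xs)
    sum_y = sum(ys)
    sum_y2 = sum(v * v for v in ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))

    numerator_x = n * sum_x2 - sum_x * sum_x
    numerator_y = n * sum_y2 - sum_y * sum_y
    numerator_xy = n * sum_xy - sum_x * sum_y
    if n < 1.0 or numerator_x <= 0 or numerator_y <= 0:
        return None  # float8_corr returns NULL in these cases
    return numerator_xy / sqrt(numerator_x * numerator_y)

# Advertising data from Table 1: r rounds to 0.9942
x = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]
y = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]
print(round(corr_from_sums(x, y), 4))  # 0.9942
```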

We can use corr to verify PostgreSQL's sampled statistic. Note that for the verification it is best to use a small data set, so that PG samples every row and its result matches corr exactly; with a large data set the results may diverge slightly.

postgres=# create table t(id int);  
CREATE TABLE  
postgres=# insert into t values (2),(5),(8),(3),(4),(6),(9),(7),(1);  
INSERT 0 9  

The physical row locations (ctid) and id values are:

postgres=# select ctid,* from t;  
 ctid  | id   
-------+----  
 (0,1) |  2  
 (0,2) |  5  
 (0,3) |  8  
 (0,4) |  3  
 (0,5) |  4  
 (0,6) |  6  
 (0,7) |  9  
 (0,8) |  7  
 (0,9) |  1  
(9 rows)  

Output using a window function:

postgres=# select * from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);  
 rn | id   
----+----  
  1 |  2  
  2 |  5  
  3 |  8  
  4 |  3  
  5 |  4  
  6 |  6  
  7 |  9  
  8 |  7  
  9 |  1  
(9 rows)  

Analyze:

postgres=# analyze t;  
ANALYZE  

Query the correlation from the statistics:

postgres=# select * from pg_stats where attname ='id' and tablename='t';  
-[ RECORD 1 ]----------+--------------------  
schemaname             | public  
tablename              | t  
attname                | id  
inherited              | f  
null_frac              | 0  
avg_width              | 4  
n_distinct             | -1  
most_common_vals       |   
most_common_freqs      |   
histogram_bounds       | {1,2,3,4,5,6,7,8,9}  
correlation            | 0.116667  
most_common_elems      |   
most_common_elem_freqs |   
elem_count_histogram   |   

The result matches what the corr function computes:

postgres=# select corr(rn,id) from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);  
-[ RECORD 1 ]-----------  
corr | 0.116666666666667  

If we now insert a large amount of data in random order, the correlation obtained by ANALYZE may deviate from the actual correlation because of sampling:

postgres=# insert into t select * from generate_series(1,100000) order by random();  
INSERT 0 100000  

The table now looks like this:

postgres=# select ctid,* from t limit 100;  
  ctid   |  id     
---------+-------  
 (0,1)   |     2  
 (0,2)   |     5  
 (0,3)   |     8  
 (0,4)   |     3  
 (0,5)   |     4  
 (0,6)   |     6  
 (0,7)   |     9  
 (0,8)   |     7  
 (0,9)   |     1  
 (0,10)  |  4607  
 (0,11)  | 39521  
 (0,12)  | 92869  
 (0,13)  | 80094  
 (0,14)  | 13214  
 (0,15)  | 15509  
 (0,16)  |  8380  
 (0,17)  | 22281  
 (0,18)  | 99252  
 (0,19)  | 60018  
 (0,20)  | 55716  
....  
  
postgres=# analyze t;  
ANALYZE  
postgres=# select correlation from pg_stats where attname ='id' and tablename='t';  
 correlation    
--------------  
 -0.000263469  
(1 row)  

This deviates noticeably from the actual correlation:

postgres=# select corr(rn,id) from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);  
        corr           
---------------------  
 0.00110293570728894  
(1 row)  

After raising the statistics target for the id column and re-analyzing, the reported correlation closely matches the actual value:

postgres=# alter table t alter column id SET STATISTICS 10000;  
ALTER TABLE  
postgres=# analyze t;  
ANALYZE  
postgres=# select correlation from pg_stats where attname ='id' and tablename='t';  
 correlation   
-------------  
  0.00110296  
(1 row)  

Correlation and the Optimizer

When the correlation is high, the physical storage order closely matches the logical value order, so an index scan reads relatively few heap blocks and performs few scattered reads. The planner is therefore more inclined to choose an index scan.

Version 9.5 introduced BRIN indexes, which also benefit from high correlation: the less the value ranges of adjacent block ranges overlap, the higher BRIN's precision.

References

1. http://zh.wikipedia.org/zh-cn/%E7%9B%B8%E5%85%B3

2. http://baike.baidu.com/view/172091.htm

3. http://en.wikipedia.org/wiki/Correlation_and_dependence

4. http://www.postgresql.org/docs/9.4/static/functions-aggregate.html


digoal's PostgreSQL article index