PostgreSQL 统计信息之 - 逻辑与物理存储的线性相关性
背景
PostgreSQL统计信息中, 有一个相关性的统计, 在pg_stats.correlation中可以查看到,
统计值范围从-1到1, 趋向于-1表示逆向相关, 趋向于1表示正向相关, 趋向于0表示不相关.
postgres=# \d pg_stats
View "pg_catalog.pg_stats"
Column | Type | Modifiers
------------------------+----------+-----------
schemaname | name |
tablename | name |
attname | name |
inherited | boolean |
null_frac | real |
avg_width | integer |
n_distinct | real |
most_common_vals | anyarray |
most_common_freqs | real[] |
histogram_bounds | anyarray |
correlation | real |
most_common_elems | anyarray |
most_common_elem_freqs | real[] |
elem_count_histogram | real[] |
correlation的含义是什么呢?
即列的物理顺序和列的逻辑顺序的相关性.
相关性越高, 走索引扫描的离散块扫描更少, 也就是说, 相关性越高, 走索引扫描的离散块扫描代价越低.
相关性在其他领域也有非常重要的应用, 例如广告投入和销售额的数据, 看百度提到的例子 :
软件公司在全国有许多代理商,为研究它的财务软件产品的广告投入与销售额的关系,统计人员随机选择10家代理商进行观察,搜集到年广告投入费和月平均销售额的数据,并编制成相关表,见表1:
表1 广告费与月平均销售额相关表 单位:万元
年广告费投入 | 月均销售额
12.5 15.3 23.2 26.4 33.5 34.4 39.4 45.2 55.4 60.9
21.2 23.9 32.9 34.1 42.5 43.2 49.0 52.8 59.4 63.5
参照表1,可计算相关系数如表2:
序号 | 广告投入(万元) x | 月均销售额(万元) y
1 2 3 4 5 6 7 8 9 10
12.5 15.3 23.2 26.4 33.5 34.4 39.4 45.2 55.4 60.9
21.2 23.9 32.9 34.1 42.5 43.2 49.0 52.8 59.4 63.5
156.25 234.09 538.24 696.96 1122.25 1183.36 1552.36 2043.04 3069.16 3708.81
449.44 571.21 1082.41 1162.81 1806.25 1866.24 2401.00 2787.84 3528.36 4032.25
265.00 365.67 763.28 900.24 1423.75 1486.08 1930.60 2386.56 3290.76 3867.15
合计 346.2 422.5 14304.52 19687.81 16679.09
=0.9942
相关系数为0.9942,说明广告投入费与月平均销售额之间有高度的线性正相关关系。
相关性越高, 说明广告投入和销售额的关系越明显.
相关性是如何计算的呢? 实际上是 “协方差(x,y)除以(平方根(方差(x)*方差(y)))” .
PostgreSQL 统计信息与线性相关性应用
在运维领域, 也可以做相对应的统计, 例如服务器的内存使用量, 负载, 进程数, 网络吞吐量, 用户请求量, 用户请求响应时间 等数据, 可以做相关性的统计, 观察他们之间的关系.
接下来进入正题, 看看PostgreSQL是如何计算列的逻辑和物理顺序相关性的
首选看一下pg_stats这个视图对应的correlation是怎么来的
postgres=# \d+ pg_stats
CASE
WHEN s.stakind1 = 3 THEN s.stanumbers1[1]
WHEN s.stakind2 = 3 THEN s.stanumbers2[1]
WHEN s.stakind3 = 3 THEN s.stanumbers3[1]
WHEN s.stakind4 = 3 THEN s.stanumbers4[1]
WHEN s.stakind5 = 3 THEN s.stanumbers5[1]
ELSE NULL::real
END AS correlation,
。。。
FROM pg_statistic s
JOIN pg_class c ON c.oid = s.starelid
JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum
LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT a.attisdropped AND has_column_privilege(c.oid, a.attnum, 'select'::text);
其实是来自pg_statistic这个表, corr的统计是在analyze中完成的.
相关性计算的代码如下, 注意是采样统计 :
src/backend/commands/analyze.c
/*
* Now scan the values in order, find the most common ones, and also
* accumulate ordering-correlation statistics.
*
* To determine which are most common, we first have to count the
* number of duplicates of each value. The duplicates are adjacent in
* the sorted list, so a brute-force approach is to compare successive
* datum values until we find two that are not equal. However, that
* requires N-1 invocations of the datum comparison routine, which are
* completely redundant with work that was done during the sort. (The
* sort algorithm must at some point have compared each pair of items
* that are adjacent in the sorted order; otherwise it could not know
* that it's ordered the pair correctly.) We exploit this by having
* compare_scalars remember the highest tupno index that each
* ScalarItem has been found equal to. At the end of the sort, a
* ScalarItem's tupnoLink will still point to itself if and only if it
* is the last item of its group of duplicates (since the group will
* be ordered by tupno).
*/
corr_xysum = 0;
ndistinct = 0;
nmultiple = 0;
dups_cnt = 0;
for (i = 0; i < values_cnt; i++)
{
int tupno = values[i].tupno;
corr_xysum += ((double) i) * ((double) tupno);
dups_cnt++;
if (tupnoLink[tupno] == tupno)
{
/* Reached end of duplicates of this value */
ndistinct++;
if (dups_cnt > 1)
{
nmultiple++;
if (track_cnt < num_mcv ||
dups_cnt > track[track_cnt - 1].count)
{
/*
* Found a new item for the mcv list; find its
* position, bubbling down old items if needed. Loop
* invariant is that j points at an empty/ replaceable
* slot.
*/
int j;
if (track_cnt < num_mcv)
track_cnt++;
for (j = track_cnt - 1; j > 0; j--)
{
if (dups_cnt <= track[j - 1].count)
break;
track[j].count = track[j - 1].count;
track[j].first = track[j - 1].first;
}
track[j].count = dups_cnt;
track[j].first = i + 1 - dups_cnt;
}
}
dups_cnt = 0;
}
}
.........................
/* Generate a correlation entry if there are multiple values */
if (values_cnt > 1)
{
MemoryContext old_context;
float4 *corrs;
double corr_xsum,
corr_x2sum;
/* Must copy the target values into anl_context */
old_context = MemoryContextSwitchTo(stats->anl_context);
corrs = (float4 *) palloc(sizeof(float4));
MemoryContextSwitchTo(old_context);
/*----------
* Since we know the x and y value sets are both
* 0, 1, ..., values_cnt-1
* we have sum(x) = sum(y) =
* (values_cnt-1)*values_cnt / 2
* and sum(x^2) = sum(y^2) =
* (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
*----------
*/
corr_xsum = ((double) (values_cnt - 1)) *
((double) values_cnt) / 2.0;
corr_x2sum = ((double) (values_cnt - 1)) *
((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;
/* And the correlation coefficient reduces to */
corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) /
(values_cnt * corr_x2sum - corr_xsum * corr_xsum);
stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
stats->staop[slot_idx] = mystats->ltopr;
stats->stanumbers[slot_idx] = corrs;
stats->numnumbers[slot_idx] = 1;
slot_idx++;
}
PostgreSQL 提供了相关性统计的函数, corr供用户使用.
参考
http://www.postgresql.org/docs/9.4/static/functions-aggregate.html
corr代码如下 :
src/backend/utils/adt/float.c
Datum
float8_corr(PG_FUNCTION_ARGS)
{
ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);
float8 *transvalues;
float8 N,
sumX,
sumX2,
sumY,
sumY2,
sumXY,
numeratorX,
numeratorY,
numeratorXY;
transvalues = check_float8_array(transarray, "float8_corr", 6);
N = transvalues[0];
sumX = transvalues[1];
sumX2 = transvalues[2];
sumY = transvalues[3];
sumY2 = transvalues[4];
sumXY = transvalues[5];
/* if N is 0 we should return NULL */
if (N < 1.0)
PG_RETURN_NULL();
numeratorX = N * sumX2 - sumX * sumX;
CHECKFLOATVAL(numeratorX, isinf(sumX2) || isinf(sumX), true);
numeratorY = N * sumY2 - sumY * sumY;
CHECKFLOATVAL(numeratorY, isinf(sumY2) || isinf(sumY), true);
numeratorXY = N * sumXY - sumX * sumY;
CHECKFLOATVAL(numeratorXY, isinf(sumXY) || isinf(sumX) ||
isinf(sumY), true);
if (numeratorX <= 0 || numeratorY <= 0)
PG_RETURN_NULL();
PG_RETURN_FLOAT8(numeratorXY / sqrt(numeratorX * numeratorY));
}
我们可以用corr来验证PostgreSQL的采样统计, 但是注意, 要验证的话, 数据量小一点比较好, 这样的话PG会全量采样, 和corr得到的结果一致, 如果数据量太大, 得到的结果可能有少量偏差.
postgres=# create table t(id int);
CREATE TABLE
postgres=# insert into t values (2),(5),(8),(3),(4),(6),(9),(7),(1);
INSERT 0 9
行号, ID值如下
postgres=# select ctid,* from t;
ctid | id
-------+----
(0,1) | 2
(0,2) | 5
(0,3) | 8
(0,4) | 3
(0,5) | 4
(0,6) | 6
(0,7) | 9
(0,8) | 7
(0,9) | 1
(9 rows)
使用窗口函数进行输出
postgres=# select * from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);
rn | id
----+----
1 | 2
2 | 5
3 | 8
4 | 3
5 | 4
6 | 6
7 | 9
8 | 7
9 | 1
(9 rows)
分析 :
postgres=# analyze t;
ANALYZE
查询统计信息的correlation
postgres=# select * from pg_stats where attname ='id' and tablename='t';
-[ RECORD 1 ]----------+--------------------
schemaname | public
tablename | t
attname | id
inherited | f
null_frac | 0
avg_width | 4
n_distinct | -1
most_common_vals |
most_common_freqs |
histogram_bounds | {1,2,3,4,5,6,7,8,9}
correlation | 0.116667
most_common_elems |
most_common_elem_freqs |
elem_count_histogram |
结果和corr函数计算得到的结果一致
postgres=# select corr(rn,id) from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);
-[ RECORD 1 ]-----------
corr | 0.116666666666667
如果随机插入大量数据, 因此采样的关系, 分析得到的相关性可能和实际的相关性有偏差
postgres=# insert into t select * from generate_series(1,100000) order by random();
INSERT 0 100000
如下 :
postgres=# select ctid,* from t limit 100;
ctid | id
---------+-------
(0,1) | 2
(0,2) | 5
(0,3) | 8
(0,4) | 3
(0,5) | 4
(0,6) | 6
(0,7) | 9
(0,8) | 7
(0,9) | 1
(0,10) | 4607
(0,11) | 39521
(0,12) | 92869
(0,13) | 80094
(0,14) | 13214
(0,15) | 15509
(0,16) | 8380
(0,17) | 22281
(0,18) | 99252
(0,19) | 60018
(0,20) | 55716
....
postgres=# analyze t;
ANALYZE
postgres=# select correlation from pg_stats where attname ='id' and tablename='t';
correlation
--------------
-0.000263469
(1 row)
和实际相关性偏差较大
postgres=# select corr(rn,id) from (select row_number() over(order by ctid) as rn, * from t) as t(rn,id);
corr
---------------------
0.00110293570728894
(1 row)
修改id列的采样系数, 重新分析, 得到的相关性结果和实际的相关性基本一致.
postgres=# alter table t alter column id SET STATISTICS 10000;
ALTER TABLE
postgres=# analyze t;
ANALYZE
postgres=# select correlation from pg_stats where attname ='id' and tablename='t';
correlation
-------------
0.00110296
(1 row)
相关性与优化器
当相关性很好时,说明物理存储的顺序与实际的值顺序很相似,那么在使用索引扫描时,扫描的堆表块也相对较少,同时不离散的扫描也很少。因此更加趋向于使用索引扫描。
在9.6版本中,引入了BRIN索引,当相关性很好时,BRIN的效率也越高,因为数据的交叉少了,精度自然就高了。
参考
1. http://zh.wikipedia.org/zh-cn/%E7%9B%B8%E5%85%B3
2. http://baike.baidu.com/view/172091.htm
3. http://en.wikipedia.org/wiki/Correlation_and_dependence
4. http://www.postgresql.org/docs/9.4/static/functions-aggregate.html