PostgreSQL 10.0 preview 功能增强 - 自由定义统计信息维度

1 minute read

背景

统计信息是数据库优化器用来计算SQL plan tree中每个NODE成本的依据,目前PostgreSQL的统计信息是开源数据库中最完备的。

包括继承表、以及全局的统计信息,每个字段的NULL值个数或百分比,平均字段宽度,唯一值个数或百分比,高频词,高频词的百分比,柱状图,线性相关性,数组字段的高频元素,高频元素的百分比,数组元素的柱状图。

          View "pg_catalog.pg_stats"  
         Column         |   Type   | Modifiers   
------------------------+----------+-----------  
 schemaname             | name     |   
 tablename              | name     |   
 attname                | name     |   
 inherited              | boolean  |   
 null_frac              | real     |   
 avg_width              | integer  |   
 n_distinct             | real     |   
 most_common_vals       | anyarray |   
 most_common_freqs      | real[]   |   
 histogram_bounds       | anyarray |   
 correlation            | real     |   
 most_common_elems      | anyarray |   
 most_common_elem_freqs | real[]   |   
 elem_count_histogram   | real[]   |   

实际上统计信息已经非常完备了,10.0提出了一个更大胆的统计信息,自定义统计信息。

比如我们将多个字段作为一个分组,生成分组的统计信息。有什么用呢?

例如select count(distinct a,b) from tbl, select count(*),a,b from tbl group by a,b;

以上SQL,如果有a,b分组的统计信息,有多少唯一值,线性相关性如何。优化器就可以评估group by需要多少内存,结果集大概多大。

是的,现在PostgreSQL 10.0允许你定义字段分组的统计信息了。是不是很开森呢?

你甚至可以用它来做多个字段组合的TOP N的评估呢。

Implement multivariate n-distinct coefficients  
  
Add support for explicitly declared statistic objects (CREATE  
STATISTICS), allowing collection of statistics on more complex  
combinations that individual table columns.  Companion commands DROP  
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are  
added too.  All this DDL has been designed so that more statistic types  
can be added later on, such as multivariate most-common-values and  
multivariate histograms between columns of a single table, leaving room  
for permitting columns on multiple tables, too, as well as expressions.  
  
This commit only adds support for collection of n-distinct coefficient  
on user-specified sets of columns in a single table.  This is useful to  
estimate number of distinct groups in GROUP BY and DISTINCT clauses;  
estimation errors there can cause over-allocation of memory in hashed  
aggregates, for instance, so it's a worthwhile problem to solve.  A new  
special pseudo-type pg_ndistinct is used.  
  
(num-distinct estimation was deemed sufficiently useful by itself that  
this is worthwhile even if no further statistic types are added  
immediately; so much so that another version of essentially the same  
functionality was submitted by Kyotaro Horiguchi:  
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp  
though this commit does not use that code.)  
  
Author: Tomas Vondra.  Some code rework by Álvaro.  
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,  
    Ideriha Takeshi  
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz  
    https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql  

新增的系统表

pg_statistic_ext  

新增的语法

CREATE STATISTICS  

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=doc/src/sgml/ref/create_statistics.sgml;h=60184a347bf72c4e4d075976f2959bd2b9c8e99b;hb=7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=doc/src/sgml/catalogs.sgml;h=ac39c639edcbbc05b3b18c31cc5ab6b99b93efeb;hp=c531c73aac9f661cad9e3c03e0f16197932edfd4;hb=7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b;hpb=f120b614e070aed39586d1443193738a149a90d4

例子

你的数据表包含了日期、USERID、行为等字段,你可以建立(日期、USERID)分组统计,每天凌晨对统计信息打个快照,就可以大概的知道每天的活跃用户数,历史的活跃用户数了。是不是很赞呢?

这个patch的讨论,详见邮件组,本文末尾URL。

PostgreSQL社区的作风非常严谨,一个patch可能在邮件组中讨论几个月甚至几年,根据大家的意见反复的修正,patch合并到master已经非常成熟,所以PostgreSQL的稳定性也是远近闻名的。

参考

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b

Flag Counter

digoal’s 大量PostgreSQL文章入口