PostgreSQL 9.5 new feature - table mview data sample

3 minute read

背景

PostgreSQL 9.5支持的新特性,数据采样。有了这个功能,如果能结合pg_dump使用的话,可以为比较大的生产数据库创建较小的测试环境。当然,如果是这样的话采样还需要考虑数据关联的问题。

用户可以自定义采样方法,或者使用系统自带的两种采样方法(基于所有数据块的选择性采样,以及基于全表的选择性采样)

基于数据块的采样扫描的数据块比较少,只需要扫描筛选出的数据块并返回最终的采样数据(粒度较大)。

基于全表的采样,需要扫描全表,并筛选最终的采样数据(粒度细,但是开销大)。

语法:

[ TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ] ]  

repeatable指定采样seed.

sampling_method 参数是采样函数的参数,目前指采样比例。

例子:

postgres=# create table test(id int);  
CREATE TABLE  
postgres=# insert into test select generate_series(1,100000000);  
INSERT 0 100000000  
Time: 128199.864 ms  

采样速度很快,2毫秒就完成了。

postgres=# select count(*) from test tablesample system (0.000001);  
 count   
-------  
   909  
(1 row)  
Time: 1.930 ms  

正常的话应该返回100条采样数据,但是返回了909,因为我们使用的是数据块采样方法SYSTEM,返回了1个数据块的所有记录。

postgres=# select 100000000*0.000001;  
  ?column?    
------------  
 100.000000  
(1 row)  
Time: 0.631 ms  
  
postgres=# select count(*) from (select ctid,* from test limit 100000) t where ctid::text ~ '\(1,';  
 count   
-------  
   909  
(1 row)  
Time: 196.232 ms  

这种采样方法离散度比较低,因为集中返回某些数据块的记录。

接下来使用记录采样方法BERNOULLI

postgres=# select relpages,reltuples from pg_class where relname='test';  
 relpages |  reltuples    
----------+-------------  
   110012 | 2.72691e+07  
(1 row)  
Time: 0.786 ms  
postgres=# select count(*) from test;  
   count     
-----------  
 100000000  
(1 row)  
Time: 12041.390 ms  

扫描全表的采样方法明显速度下降了

postgres=# select count(*) from test tablesample BERNOULLI (0.000001) ;  
 count   
-------  
     1  
(1 row)  
Time: 6245.059 ms  
postgres=# select count(*) from test tablesample BERNOULLI (0.00001) ;  
 count   
-------  
    13  
(1 row)  
Time: 6305.808 ms  

同样,采样结果集合也不对,算法上的问题。

postgres=# select 2.72691e+07*0.00001;  
 ?column?    
-----------  
 272.69100  
(1 row)  
postgres=# vacuum analyze test;  
VACUUM  
Time: 9880.728 ms  
postgres=# select relpages,reltuples from pg_class where relname='test';  
 relpages | reltuples   
----------+-----------  
   110012 |     1e+08  
(1 row)  
Time: 0.438 ms  
postgres=# select count(*) from test tablesample BERNOULLI (0.00001) ;  
 count   
-------  
     6  
(1 row)  
Time: 6243.548 ms  
postgres=# select count(*) from test tablesample BERNOULLI (0.00001) ;  
 count   
-------  
     9  
(1 row)  
Time: 6275.948 ms  
postgres=# select count(*) from test tablesample BERNOULLI (0.00001) ;  
 count   
-------  
     9  
(1 row)  
Time: 6243.882 ms  

这种方法的采样数据离散度很好,但是采样比较耗时。

PostgreSQL 另外还提供了两个扩展的采样方法模块:

1. http://www.postgresql.org/docs/devel/static/tsm-system-rows.html

这个采样方法和system一样,是基于数据块的,但是精确的返回指定的行数,相当于加了过滤器。

离散度较低。

postgres=# CREATE EXTENSION tsm_system_rows;  
CREATE EXTENSION  
postgres=# SELECT count(*) FROM test TABLESAMPLE SYSTEM_ROWS(100);  
 count   
-------  
   100  
(1 row)  
Time: 0.568 ms  
postgres=# SELECT distinct substring(ctid::text,'(.*),') FROM test TABLESAMPLE SYSTEM_ROWS(100);  
 substring   
-----------  
 (75816  
(1 row)  
Time: 1.265 ms  

2. http://www.postgresql.org/docs/devel/static/tsm-system-time.html

这个采样方法是基于时间的采样方法,从任意一个数据块开始扫描,当扫描时间达到指定的扫描时间时停止扫描.

所以返回的记录是某些数据块的连续数据,离散度较低。

postgres=# create extension tsm_system_time;  
CREATE EXTENSION  
  
postgres=# SELECT min(ctid),max(ctid),count(*),min(id),max(id),max(id)-min(id) FROM test TABLESAMPLE SYSTEM_TIME(1);  
   min    |    max     | count |   min   |   max   | ?column?   
----------+------------+-------+---------+---------+----------  
 (8093,1) | (8095,909) |  2727 | 6356538 | 6359264 |     2726  
(1 row)  
Time: 2.080 ms  
postgres=# SELECT min(ctid),max(ctid),count(*),min(id),max(id),max(id)-min(id) FROM test TABLESAMPLE SYSTEM_TIME(1);  
    min    |     max     | count |   min    |   max    | ?column?   
-----------+-------------+-------+----------+----------+----------  
 (35058,1) | (35083,909) | 23634 | 31867723 | 31891356 |    23633  
(1 row)  
Time: 12.863 ms  
postgres=# explain SELECT min(ctid),max(ctid),count(*),min(id),max(id),max(id)-min(id) FROM test TABLESAMPLE SYSTEM_TIME(1);  
                                     QUERY PLAN                                        
-------------------------------------------------------------------------------------  
 Aggregate  (cost=29697.06..29697.07 rows=1 width=10)  
   ->  Sample Scan (system_time) on test  (cost=0.00..12198.93 rows=999893 width=10)  
(2 rows)  
Time: 0.671 ms  

用户还可以自定义采样方法函数,参考:

http://www.postgresql.org/docs/devel/static/tablesample-method.html

参考

1. http://www.postgresql.org/docs/devel/static/tsm-system-rows.html

2. http://www.postgresql.org/docs/devel/static/tsm-system-time.html

3. http://www.postgresql.org/docs/devel/static/tablesample-method.html

4. http://www.postgresql.org/docs/devel/static/sql-select.html

TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]  
Table sample clause after table_name indicates that a sampling_method should be used to retrieve subset of rows in the table. The sampling_method can be any sampling method installed in the database. There are currently two sampling methods available in the standard PostgreSQL distribution:  
  
SYSTEM  
  
BERNOULLI  
  
Both of these sampling methods currently accept only single argument which is the percent (floating point from 0 to 100) of the rows to be returned. The SYSTEM sampling method does block level sampling with each block having the same chance of being selected and returns all rows from each selected block. The BERNOULLI scans whole table and returns individual rows with equal probability. Additional sampling methods may be installed in the database via extensions.  
  
The optional parameter REPEATABLE uses the seed parameter, which can be a number or expression producing a number, as a random seed for sampling. Note that subsequent commands may return different results even if same REPEATABLE clause was specified. This happens because DML statements and maintenance operations such as VACUUM may affect physical distribution of data. The setseed() function will not affect the sampling result when the REPEATABLE parameter is used.  

Flag Counter

digoal’s 大量PostgreSQL文章入口