Second-level analysis of a 1TB-class table across arbitrary dimensions - using sampling-based estimation for efficient TOP N and other statistical analysis
Background
Estimation is a common technique in statistics. When data volumes are huge, computing exact values consumes enormous resources, while statistical analysis rarely requires perfectly exact numbers. Estimation is therefore a practical trade-off and is widely used in analytical scenarios.
PostgreSQL is a powerful database that offers many facilities for estimated statistics.
1. In PostgreSQL, estimated UV, incremental UV, etc. (i.e., count distinct) can be computed with the HLL extension.
《Greenplum best practice - using the hll estimation extension (and optimizing hll split aggregate functions)》
《PostgreSQL hll (HyperLogLog) extension for “State of The Art Cardinality Estimation Algorithm” - 3》
《PostgreSQL hll (HyperLogLog) extension for “State of The Art Cardinality Estimation Algorithm” - 2》
《PostgreSQL hll (HyperLogLog) extension for “State of The Art Cardinality Estimation Algorithm” - 1》
Alternatively, use the count-min sketch top-n extension:
https://github.com/citusdata/cms_topn
2. The TOP values of any column (including the TOP elements of array columns), along with their counts, can be obtained from the statistics histograms.
《PostgreSQL pg_stats used to estimate top N freqs values and explain rows》
3. The total number of rows in a table can be read from pg_class.reltuples.
4. The estimated number of rows returned by an arbitrary SQL statement (e.g., for pagination), or an estimated COUNT(*) (just rewrite the SQL as select 1 from ...), can be obtained from the row estimate in EXPLAIN output, as in the article below.
《On the sins of misusing count and offset, and pagination optimization》
5. The number of distinct values across multiple columns can be estimated very accurately by defining custom (extended) statistics.
6. Estimates for filtered queries, e.g., the TOP N movie stars in a given province, can be obtained by sampling first and then computing on the sample.
A brief sketch of a few of these built-in approaches follows; the rest of this article focuses on the sampling-based estimation method.
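A minimal sketch of items 1-5 above, assuming a table tbl with an int column c1, an int column c2, and an int[] column tag1 that has already been ANALYZEd (the same shape as the test table used later in this article); the statistics object name s_tbl is a hypothetical example:
-- item 3: approximate row count maintained by ANALYZE / autovacuum
select reltuples::int8 from pg_class where relname = 'tbl';
-- item 4: estimated row count of an arbitrary query, taken from the planner
explain select 1 from tbl where c1 between 1 and 10;   -- read the rows=... figure in the plan
-- item 2: most common values / array elements and their frequencies, from the histograms
select most_common_vals, most_common_freqs from pg_stats where tablename = 'tbl' and attname = 'c1';
select most_common_elems from pg_stats where tablename = 'tbl' and attname = 'tag1';
-- item 5 (PostgreSQL 10+ extended statistics): multi-column distinct-count estimate
create statistics s_tbl (ndistinct) on c1, c2 from tbl;
analyze tbl;
-- item 1: approximate count distinct, assuming the hll extension is installed
-- select # hll_add_agg(hll_hash_integer(c1)) from tbl;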
Scenario design
I previously wrote about a scenario, pivot analysis for a general-content (audio/video/image) website, where the data volume is large and a full computation has to scan a great deal of data. Let's see whether sampling meets the requirement.
《Pivot analysis DB design for an audio/video/image (general-content) website - Alibaba Cloud (RDS, HybridDB) for PostgreSQL best practice》
1. Table structure
create table tbl (
id int8, -- sequence
tag1 int[], -- array
c1 int, -- 1-100
c2 int, -- 1-10000
c3 timestamp -- timestamp
);
2. Random-value generator function
Returns an array of $3 random values in the range $1 to $2.
create or replace function gen_rand_ints(int, int, int) returns int[] as $$
select array(select (random()*($2-$1))::int+$1 from generate_series(1,$3));
$$ language sql strict;
postgres=# select gen_rand_ints(10,25,5);
gen_rand_ints
------------------
{20,19,24,22,21}
(1 row)
3. Load test data
-- load the hot-spot arrays, 50 million rows
insert into tbl select id, gen_rand_ints(1,1000,10), random()*100, random()*10000, clock_timestamp() from generate_series(1,50000000) t(id);
-- load the non-hot-spot arrays, 100 million rows
insert into tbl select id, gen_rand_ints(1,1000000,10), random()*100, random()*10000, clock_timestamp() from generate_series(1,100000000) t(id);
The data looks like this:
postgres=# select * from tbl limit 10;
id | tag1 | c1 | c2 | c3
----------+-------------------------------------------+----+------+----------------------------
38931521 | {424,448,91,420,382,657,677,60,530,503} | 59 | 6120 | 2017-09-11 14:32:06.610512
38931522 | {66,87,468,207,79,780,307,714,520,149} | 44 | 7848 | 2017-09-11 14:32:06.610522
38931523 | {99,628,798,558,415,74,863,839,522,953} | 26 | 9032 | 2017-09-11 14:32:06.610531
38931524 | {610,935,962,140,438,551,752,503,636,220} | 71 | 7136 | 2017-09-11 14:32:06.61054
38931525 | {998,16,428,518,164,868,303,263,496,102} | 82 | 9102 | 2017-09-11 14:32:06.61055
38931526 | {175,683,749,696,637,8,599,247,942,561} | 39 | 3796 | 2017-09-11 14:32:06.610559
38931527 | {112,138,882,747,356,591,461,355,605,888} | 87 | 7684 | 2017-09-11 14:32:06.610568
38931528 | {756,175,31,252,276,850,162,450,533,910} | 15 | 1691 | 2017-09-11 14:32:06.610578
38931529 | {917,744,416,860,306,801,240,416,937,122} | 16 | 2927 | 2017-09-11 14:32:06.610587
38931530 | {712,623,647,317,511,519,86,267,693,116} | 52 | 9676 | 2017-09-11 14:32:06.610596
(10 rows)
Goal: find the TOP N elements of tag1 under arbitrary filter conditions.
4. Analyze the table to generate the statistics histograms.
postgres=# analyze tbl;
ANALYZE
Table size: 16 GB.
postgres=# \dt+ tbl
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+-------+-------------
public | tbl | table | postgres | 16 GB |
(1 row)
5. Exact TOP N elements under a given condition. There are actually 1000 hot-spot IDs, so the TOP 10 counts are all very close to each other; later, when estimation is used, the returned TOP 10 may not match exactly, but its members will certainly fall within those 1000 IDs.
-- query time with 32 parallel workers
postgres=# select unnest(tag1) tag1, count(*) from tbl where c1 between 1 and 10 group by 1 order by 2 desc limit 10;
tag1 | count
------+-------
134 | 50935
768 | 50915
663 | 50876
567 | 50821
146 | 50821
332 | 50814
450 | 50807
884 | 50789
58 | 50781
605 | 50774
(10 rows)
Time: 23441.247 ms (00:23.441)
-- query time without parallelism
postgres=# select unnest(tag1) tag1, count(*) from tbl where c1 between 1 and 10 group by 1 order by 2 desc limit 10;
tag1 | count
------+-------
134 | 50935
768 | 50915
663 | 50876
567 | 50821
146 | 50821
332 | 50814
450 | 50807
884 | 50789
58 | 50781
605 | 50774
(10 rows)
Time: 154935.686 ms (02:34.936)
6. Sampled TOP N under the same condition
The sampling methods are described at the end of this article. PostgreSQL ships two built-in sampling methods and also supports extension-provided methods; two such extensions (tsm_system_rows and tsm_system_time) come with the standard distribution, so effectively four sampling methods are available out of the box.
Here we use block-level sampling (sampling currently does not run in parallel).
postgres=# select unnest(tag1) tag1, (count(*))*20 -- multiply by 100 / sampling percentage
from
(select * from tbl TABLESAMPLE system (5)) t
where c1 between 1 and 10 group by 1 order by 2 desc limit 10;
tag1 | ?column?
------+----------
724 | 53380
798 | 52680
24 | 52640
371 | 52480
569 | 52400
531 | 52280
979 | 52160
429 | 52140
980 | 52080
350 | 51920
(10 rows)
-- 5% sample, about 7 seconds.
Time: 6887.745 ms (00:06.888)
postgres=# select unnest(tag1) tag1, (count(*))*50 -- multiply by 100 / sampling percentage
from
(select * from tbl TABLESAMPLE system (2)) t
where c1 between 1 and 10 group by 1 order by 2 desc limit 10;
tag1 | ?column?
------+----------
324 | 55450
435 | 55150
720 | 55050
943 | 54950
475 | 54750
958 | 54600
13 | 54400
742 | 54300
739 | 54100
301 | 53950
(10 rows)
-- 2% sample, about 3 seconds.
Time: 2720.140 ms (00:02.720)
The larger the sample, the higher the accuracy.
The sampled TOP N is quite accurate here because the example uses 1000 random values, each with equal probability; if we returned the TOP 1000, the member set would be exact. Since only the sampling percentage and the scaling factor change between these queries, the pattern can also be wrapped in a small helper, sketched below.
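A minimal sketch, assuming the tbl schema above; estimate_top_tags and its parameter names are hypothetical, and each sampled count is simply scaled by 100 / sample_pct:
create or replace function estimate_top_tags(sample_pct float8, lim int)
returns table (tag int, est_cnt int8) as $$
  -- block-level sample of tbl, then scale each group count back to the full table
  select unnest(t.tag1) as tag,
         (count(*) * 100 / sample_pct)::int8 as est_cnt
  from (select * from tbl TABLESAMPLE system (sample_pct)) t
  where t.c1 between 1 and 10
  group by 1
  order by 2 desc
  limit lim;
$$ language sql;
-- e.g. a 2% sample of the TOP 10 tags:
-- select * from estimate_top_tags(2, 10);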
A larger-table example
Redesign the hot spots and load 4 billion test rows in total:
Tier-1 hot spot: value 1, 500 million rows
Tier-2 hot spots: values 2-4, 500 million rows
Tier-3 hot spots: values 5-10, 500 million rows
Tier-4 hot spots: values 11-30, 500 million rows
Ordinary data: values 1-100000, 2 billion rows
1. Table structure
create table tbl1 (
id int8, -- sequence
c1 int8, -- target column
c2 int8, -- 1-100
c3 int8, -- 1-100000
c4 timestamp -- timestamp
);
2. Load test data
nohup psql -c "insert into tbl1 select id, 1, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &
nohup psql -c "insert into tbl1 select id, random()*(4-2)+2, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &
nohup psql -c "insert into tbl1 select id, random()*(10-5)+5, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &
nohup psql -c "insert into tbl1 select id, random()*(30-11)+11, random()*100, random()*100000, clock_timestamp() from generate_series(1,500000000) t(id);" >/dev/null 2>&1 &
nohup psql -c "insert into tbl1 select id, random()*100000, random()*100, random()*100000, clock_timestamp() from generate_series(1,2000000000) t(id);" >/dev/null 2>&1 &
3. Analyze the table
postgres=# analyze tbl1;
ANALYZE
Time: 502.421 ms
Table size: 254 GB.
postgres=# \dt+ tbl1
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+--------+-------------
public | tbl1 | table | postgres | 254 GB |
(1 row)
4. Exact TOP 30
-- query time with 32 parallel workers
postgres=# select c1,count(*) from tbl1 where c2 between 1 and 10 group by 1 order by 2 desc limit 30;
c1 | count
----+----------
1 | 49991259
3 | 25006580
2 | 12502559
4 | 12498741
9 | 10004285
6 | 10002597
8 | 9999530
7 | 9999215
5 | 5003219
10 | 4998870
29 | 2636193
18 | 2635457
13 | 2635344
17 | 2634693
26 | 2633965
19 | 2633690
28 | 2633526
14 | 2633512
15 | 2633363
24 | 2633260
20 | 2633014
25 | 2632926
16 | 2632779
22 | 2632508
27 | 2632288
23 | 2632216
21 | 2631443
12 | 2631315
11 | 1318483
30 | 1318451
(30 rows)
Time: 20845.738 ms (00:20.846)
-- query time without parallelism
postgres=# select c1,count(*) from tbl1 where c2 between 1 and 10 group by 1 order by 2 desc limit 30;
c1 | count
----+----------
1 | 49991259
3 | 25006580
2 | 12502559
4 | 12498741
9 | 10004285
6 | 10002597
8 | 9999530
7 | 9999215
5 | 5003219
10 | 4998870
29 | 2636193
18 | 2635457
13 | 2635344
17 | 2634693
26 | 2633965
19 | 2633690
28 | 2633526
14 | 2633512
15 | 2633363
24 | 2633260
20 | 2633014
25 | 2632926
16 | 2632779
22 | 2632508
27 | 2632288
23 | 2632216
21 | 2631443
12 | 2631315
11 | 1318483
30 | 1318451
(30 rows)
Time: 471112.827 ms (07:51.113)
5. Sampled TOP 30
select c1,(count(*))*20 from -- multiply by 100 / sampling percentage
(select * from tbl1 TABLESAMPLE system (5)) t
where c2 between 1 and 10 group by 1 order by 2 desc limit 30;
c1 | ?column?
----+----------
1 | 50068840
3 | 25108820
2 | 12558680
4 | 12513080
7 | 10009300
9 | 10006260
6 | 10005400
8 | 9987220
5 | 5008280
10 | 5007980
17 | 2652940
16 | 2648640
25 | 2646800
28 | 2646600
15 | 2642480
20 | 2642220
14 | 2641620
26 | 2640500
23 | 2639420
29 | 2637740
22 | 2637320
13 | 2636900
19 | 2636100
18 | 2635120
24 | 2634440
12 | 2631480
27 | 2629880
21 | 2624940
11 | 1330140
30 | 1316480
(30 rows)
Time: 31884.725 ms (00:31.885)
-- 5% sample, about 32 seconds.
select c1,(count(*))*50 from -- multiply by 100 / sampling percentage
(select * from tbl1 TABLESAMPLE system (2)) t
where c2 between 1 and 10 group by 1 order by 2 desc limit 30;
c1 | ?column?
----+----------
1 | 50173200
3 | 24993550
2 | 12487100
4 | 12474100
6 | 9998250
8 | 9980450
7 | 9973950
9 | 9960450
10 | 4999050
5 | 4995000
29 | 2642700
28 | 2640900
16 | 2640300
26 | 2630250
24 | 2627500
23 | 2623700
19 | 2622350
27 | 2622000
18 | 2621200
12 | 2619450
20 | 2616200
17 | 2616050
21 | 2615800
15 | 2613200
22 | 2612200
14 | 2607700
13 | 2605900
25 | 2604150
30 | 1312300
11 | 1311950
(30 rows)
Time: 12942.455 ms (00:12.942)
-- 2% sample, about 13 seconds.
postgres=# select c1,(count(*))*1000 from -- multiply by 100 / sampling percentage
(select * from tbl1 TABLESAMPLE system (0.1)) t
where c2 between 1 and 10 group by 1 order by 2 desc limit 30;
c1 | ?column?
----+----------
1 | 48077000
3 | 25061000
2 | 12762000
4 | 12262000
8 | 9851000
6 | 9789000
7 | 9718000
9 | 9654000
5 | 4971000
10 | 4885000
18 | 2731000
28 | 2727000
29 | 2710000
23 | 2697000
15 | 2687000
27 | 2681000
22 | 2672000
17 | 2672000
25 | 2670000
19 | 2637000
20 | 2632000
12 | 2628000
14 | 2628000
21 | 2622000
26 | 2618000
13 | 2601000
24 | 2522000
16 | 2513000
11 | 1406000
30 | 1301000
(30 rows)
Time: 863.604 ms
-- 0.1% sample, about 0.86 seconds.
With a 0.1% sample (only about 254 MB of data has to be scanned), the correct TOP 30 was computed in under one second, and the count estimates were remarkably close.
If Greenplum supported this feature it would be a delight: second-level drill-down and pivot analysis across arbitrary dimensions over a trillion rows would no longer be a dream.
Summary
1. Timing comparison: sampled vs. exact queries
1.1 TOP N of array elements
Query | Table size | Rows | TOP N time |
---|---|---|---|
Exact, 32 parallel workers | 16GB | 150 million | 23 s |
Exact, no parallelism | 16GB | 150 million | 155 s |
5% sample | 16GB | 150 million | 7 s |
2% sample | 16GB | 150 million | 3 s |
1.2 TOP N of a scalar column
Query | Table size | Rows | TOP N time |
---|---|---|---|
Exact, 32 parallel workers | 254GB | 4 billion | 21 s |
Exact, no parallelism | 254GB | 4 billion | 471 s |
5% sample | 254GB | 4 billion | 32 s |
2% sample | 254GB | 4 billion | 13 s |
0.1% sample | 254GB | 4 billion | 0.86 s |
2. Sampled estimation achieves very high accuracy while consuming few resources. Parallel exact computation is also fast, but it uses far more CPU and I/O, so under concurrent load the effective degree of parallelism drops sharply. Unless you have plenty of spare resources, use estimation whenever it is acceptable.
3. Efficiency of sampled estimation:
Since sampling currently cannot use multi-core parallelism, the processing rate is roughly 254 MB per second. To stay within a one-second response, set the sampling percentage to 0.1% for a 254 GB table, or about 0.025% for a 1 TB table (1 TB × 0.025% ≈ 256 MB). Even TB-scale tables can therefore be estimated in seconds across arbitrary dimensions.
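A quick, back-of-the-envelope way to derive that percentage for any table is sketched below; 254 MB is the per-second single-core scan volume observed above, and pg_table_size reports the table size in bytes:
-- sampling percentage that keeps the sampled data volume at roughly 254 MB,
-- i.e. about what one core scanned per second in the tests above
select round(100.0 * 254 * 1024 * 1024 / pg_table_size('tbl1'), 4) as sample_pct;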
4. Sampling methods
4.1 https://www.postgresql.org/docs/9.6/static/sql-select.html
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
A TABLESAMPLE clause after a table_name indicates that the specified sampling_method should be used to retrieve a subset of the rows in that table. This sampling precedes the application of any other filters such as WHERE clauses. The standard PostgreSQL distribution includes two sampling methods, BERNOULLI and SYSTEM, and other sampling methods can be installed in the database via extensions.
The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression. (Other sampling methods might accept more or different arguments.) These two methods each return a randomly-chosen sample of the table that will contain approximately the specified percentage of the table’s rows. The BERNOULLI method scans the whole table and selects or ignores individual rows independently with the specified probability. The SYSTEM method does block-level sampling with each block having the specified chance of being selected; all rows in each selected block are returned. The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.
The optional REPEATABLE clause specifies a seed number or expression to use for generating random numbers within the sampling method. The seed value can be any non-null floating-point value. Two queries that specify the same seed and argument values will select the same sample of the table, if the table has not been changed meanwhile. But different seed values will usually produce different samples. If REPEATABLE is not given then a new random sample is selected for each query, based upon a system-generated seed. Note that some add-on sampling methods do not accept REPEATABLE, and will always produce new samples on each use.
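For reference, a minimal illustration of the two built-in methods against the tbl table above (the 1% figure and seed 0 are arbitrary):
select count(*) from tbl TABLESAMPLE bernoulli (1);             -- row-level: scans the whole table, more uniform sample
select count(*) from tbl TABLESAMPLE system (1) REPEATABLE (0); -- block-level: the same seed returns the same sample while tbl is unchanged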
4.2 https://www.postgresql.org/docs/9.6/static/tsm-system-rows.html
CREATE EXTENSION tsm_system_rows;
SELECT * FROM my_table TABLESAMPLE SYSTEM_ROWS(100);
4.3 https://www.postgresql.org/docs/9.6/static/tsm-system-time.html
CREATE EXTENSION tsm_system_time;
SELECT * FROM my_table TABLESAMPLE SYSTEM_TIME(1000);
5. PostgreSQL supports many other estimation methods; see the introduction at the beginning of this article.
6. If sampling supported parallelism, even larger tables could be estimated more precisely. For example, 1% is a fairly good sampling ratio; for a 1 TB table that is 10 GB, and with parallel sampling the whole sample-and-compute process could be expected to finish within about 3 seconds.