PostgreSQL hll (HyperLogLog) extension for State of The Art Cardinality Estimation Algorithm - 2
背景
一、PostgreSQL hll的主要功能 :
1. 快速的检索hll中存储的唯一值.
例如计算网站某天的访问用户数. 以前的算法可能是这样的 :
select count(distinct userid) from access_log where date(crt_time)='2013-02-01'; -- 非常耗时.
hll解决了耗时的问题, 使用方法是将用户ID聚合存储到hll类型中. 如下(假设user_id的类型为int) :
create table access_date (acc_date date unique, userids hll);
insert into access_date select date(crt_time), hll_add_agg(hll_hash_integer(user_id)) from access_log group by 1;
select #userids from access_date where acc_date='2013-02-01'; -- 这条语句返回只要1毫秒左右. (10亿个唯一值返回也在1毫秒左右)
而hll仅仅需要1.2KB就可以存储1.6e+12的唯一值.
2. 因为hll中实际上存储了键值信息, 所以还可以做类似新增用户等的统计. 如下 :
digoal=> create table access_date(acc_date date unique, userids hll);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "access_date_acc_date_key" for table "access_date"
CREATE TABLE
Time: 8.219 ms
digoal=> insert into access_date select current_date, hll_add_agg(hll_hash_integer(user_id)) from generate_series(1,10000) t(user_id);
INSERT 0 1
Time: 14.598 ms
digoal=> insert into access_date select current_date-1, hll_add_agg(hll_hash_integer(user_id)) from generate_series(5000,20000) t(user_id);
INSERT 0 1
Time: 16.397 ms
digoal=> insert into access_date select current_date-2, hll_add_agg(hll_hash_integer(user_id)) from generate_series(9000,40000) t(user_id);
INSERT 0 1
Time: 25.487 ms
digoal=> select *,total_users-coalesce(lag(total_users,1) over (order by rn),0) AS new_users from (
digoal(> SELECT acc_date, row_number() over date as rn,#hll_union_agg(userids) OVER date as total_users
digoal(> FROM access_date
digoal(> WINDOW date AS (ORDER BY acc_date ASC ROWS UNBOUNDED PRECEDING)
digoal(> ) t;
acc_date | rn | total_users | new_users
------------+----+------------------+------------------
2013-02-25 | 1 | 30324.8563878223 | 30324.8563878223
2013-02-26 | 2 | 33944.8370446358 | 3619.98065681347
2013-02-27 | 3 | 38696.2201822711 | 4751.38313763532
(3 rows)
Time: 2.327 ms
二、PostgreSQL hll的主要组成部分 :
1. 两个数据类型 :
digoal=> \dT+
List of data types
Schema | Name | Internal name | Size | Elements | Access privileges | Description
--------+-------------+---------------+------+----------+-------------------+-------------
public | hll | hll | var | | |
public | hll_hashval | hll_hashval | 8 | | |
hll 你可以想象成多个hll_hashval组成的集合.
hll_hashval是通过hash函数生成的hash值, 函数在后面介绍.
2. 几个操作函数 :
2.1 生成hash值的函数
Function Input Example
hll_hash_boolean boolean hll_hash_boolean(TRUE)
or
hll_hash_boolean(TRUE, 123/*hash seed*/)
hll_hash_smallint smallint hll_hash_smallint(4)
or
hll_hash_smallint(4, 123/*hash seed*/)
hll_hash_integer integer hll_hash_integer(21474836)
or
hll_hash_integer(21474836, 123/*hash seed*/)
hll_hash_bigint bigint hll_hash_bigint(223372036854775808)
or
hll_hash_bigint(223372036854775808, 123/*hash seed*/)
hll_hash_bytea bytea hll_hash_bytea(E'\\xDEADBEEF')
or
hll_hash_bytea(E'\\xDEADBEEF', 123/*hash seed*/)
hll_hash_text text hll_hash_text('foobar')
or
hll_hash_text('foobar', 123/*hash seed*/)
2.2 将多个hll_hashval聚合成hll的函数
digoal=> \df *.*hll*agg*
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+---------------+------------------+------------------------------------------------+------
public | hll_add_agg | hll | hll_hashval | agg
public | hll_add_agg | hll | hll_hashval, integer | agg
public | hll_add_agg | hll | hll_hashval, integer, integer | agg
public | hll_add_agg | hll | hll_hashval, integer, integer, bigint | agg
public | hll_add_agg | hll | hll_hashval, integer, integer, bigint, integer | agg
由于聚合函数不支持默认值, 所以定义了5个, 如上. 分别用来设置几个调整精度和阈值.
hll.c
int32 log2m = PG_GETARG_INT32(2);
int32 regwidth = PG_GETARG_INT32(3);
int64 expthresh = PG_GETARG_INT64(4);
int32 sparseon = PG_GETARG_INT32(5);
使用方法举例 :
digoal=> select hll_add_agg(hll_hash_integer(t)) from generate_series(1,10) g(t);
hll_add_agg
------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------
\x128c7f8895a3f5af28cafeb2c0b33a441f4218da0ce907e4355b6018b150d055d8e3d31cc3e945c98357f02c98c8925c9ed0524848de7f7bd2a13b5c5c9d3935c
f09bf72f24e286b62c7e47f2769b67e461dfb
(1 row)
2.3 将多个hll聚合的函数
public | hll_union_agg | hll | hll
去重复.
2.4 比较两个hll或hll_hashval值的函数
public | hll_ne | boolean | hll, hll
public | hll_eq | boolean | hll, hll
public | hll_hashval_ne | boolean | hll_hashval, hll_hashval
public | hll_hashval_eq | boolean | hll_hashval, hll_hashval
2.5 union两个hll的函数. 或者理解为在现有的hll中增加hll值.
public | hll_union | hll | hll, hll
2.6 在现有的hll中增加hll_hashval值
public | hll_add_rev | hll | hll_hashval, hll
public | hll_add | hll | hll, hll_hashval
例如 :
digoal=> select hll_add_rev(hll_hash_bigint(1),hll_add_agg(hll_hash_bigint(t))) from generate_series(3,10) g(t);
hll_add_rev
------------------------------------------------------------------------------------------------------------------------------------
----------------------
\x128c7fb419d210486ab6c1004403b7fb05c44a06c9ec78d52c26530fd4c5f69b6c771b122f34392c621e721b341d80dfead7e630e12993257f8fb23987d28c06f
0df795b3d5839b2488b0c
(1 row)
Time: 0.558 ms
digoal=> select hll_add(hll_add_agg(hll_hash_bigint(t)),hll_hash_bigint(1)) from generate_series(3,10) g(t);
hll_add
------------------------------------------------------------------------------------------------------------------------------------
----------------------
\x128c7fb419d210486ab6c1004403b7fb05c44a06c9ec78d52c26530fd4c5f69b6c771b122f34392c621e721b341d80dfead7e630e12993257f8fb23987d28c06f
0df795b3d5839b2488b0c
(1 row)
2.7 计算hll中的唯一值的函数
public | hll_cardinality | double precision | hll
例如 :
digoal=> insert into acc_agg select current_date,hll_add_agg(hll_hash_bigint(t)) from generate_series(1,100000000) g(t);
INSERT 0 1
Time: 75452.588 ms
digoal=> select #userids from acc_agg;
?column?
------------------
NaN
10017.3757531911
98388.5525610614
1004928.03336943
9817781.35207598
99813892.0484923
(6 rows)
2.8 设置或检查阈值和精度的函数
-- Returns the schema version of an hll.
--
CREATE FUNCTION hll_schema_version(hll)
RETURNS integer
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
-- Returns the type of an hll.
--
CREATE FUNCTION hll_type(hll)
RETURNS integer
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
-- Returns the log2m value of an hll.
--
CREATE FUNCTION hll_log2m(hll)
RETURNS integer
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
-- Returns the register width of an hll.
--
CREATE FUNCTION hll_regwidth(hll)
RETURNS integer
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
-- Returns the maximum explicit threshold of an hll.
--
CREATE FUNCTION hll_expthresh(hll, OUT specified bigint, OUT effective bigint)
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
-- Returns the sparse enabled value of an hll.
--
CREATE FUNCTION hll_sparseon(hll)
RETURNS integer
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
-- Set output version.
--
CREATE FUNCTION hll_set_output_version(integer)
RETURNS integer
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
-- Set sparse to full compressed threshold to fixed value.
--
CREATE FUNCTION hll_set_max_sparse(integer)
RETURNS integer
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
-- Change the default type modifier, empty and add aggregate defaults.
CREATE FUNCTION hll_set_defaults(IN i_log2m integer,
IN i_regwidth integer,
IN i_expthresh bigint,
IN i_sparseon integer,
OUT o_log2m integer,
OUT o_regwidth integer,
OUT o_expthresh bigint,
OUT o_sparseon integer)
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT IMMUTABLE;
-- 输出的是老的精度和阈值 :
values[j] = palloc(32);
snprintf(values[j++], 32, "%d", old_log2m);
values[j] = palloc(32);
snprintf(values[j++], 32, "%d", old_regwidth);
values[j] = palloc(32);
snprintf(values[j++], 32, INT64_FORMAT, old_expthresh);
values[j] = palloc(32);
snprintf(values[j++], 32, "%d", old_sparseon);
2.9 DEBUG函数 :
public | hll_print | cstring | hll
例如 :
digoal=> select hll_print(hll_add(hll_add_agg(hll_hash_bigint(t)),hll_hash_bigint(1))) from generate_series(3,10) g(t);
hll_print
---------------------------------------------------------------------------
EXPLICIT, 9 elements, nregs=4096, nbits=5, expthresh=-1(320), sparseon=1:+
0: -5469109305088493887 +
1: 19144387141682250 +
2: 489182038263080531 +
3: 1140754268591781659 +
4: 1310323436750511730 +
5: 1960224177162737638 +
6: 3522142095546486706 +
7: 4145513480871534457 +
8: 6574508035858270988
(1 row)
因为未达到阈值, 所以使用explicit存储. 返回的唯一值将会是精确值.
3. 几个操作符 :
-- ----------------------------------------------------------------
-- Operators
-- ----------------------------------------------------------------
-- 比较两个hll类型值是否相等
CREATE OPERATOR = (
LEFTARG = hll, RIGHTARG = hll, PROCEDURE = hll_eq,
COMMUTATOR = '=', NEGATOR = '<>',
RESTRICT = eqsel, JOIN = eqjoinsel,
MERGES
);
CREATE OPERATOR <> (
LEFTARG = hll, RIGHTARG = hll, PROCEDURE = hll_ne,
COMMUTATOR = '<>', NEGATOR = '=',
RESTRICT = neqsel, JOIN = neqjoinsel
);
-- 合并两个hll类型值, 去重复.
CREATE OPERATOR || (
LEFTARG = hll, RIGHTARG = hll, PROCEDURE = hll_union
);
-- 合并hll_hashval和hll, 去重复.
CREATE OPERATOR || (
LEFTARG = hll, RIGHTARG = hll_hashval, PROCEDURE = hll_add
);
-- 同上
CREATE OPERATOR || (
LEFTARG = hll_hashval, RIGHTARG = hll, PROCEDURE = hll_add_rev
);
-- 计算hll类型值中包含的唯一hll_hashval值.
CREATE OPERATOR # (
RIGHTARG = hll, PROCEDURE = hll_cardinality
);
三、精度调整 :
在hll.c中 :
// Defaults if type modifier values are not specified.
//
#define DEFAULT_LOG2M 15
#define DEFAULT_REGWIDTH 5
#define DEFAULT_EXPTHRESH -1
#define DEFAULT_SPARSEON 1
static int32 g_default_log2m = DEFAULT_LOG2M;
static int32 g_default_regwidth = DEFAULT_REGWIDTH;
static int64 g_default_expthresh = DEFAULT_EXPTHRESH;
static int32 g_default_sparseon = DEFAULT_SPARSEON;
也可以在数据库中调整 :
-- 返回为老的值.
digoal=> select * from hll_set_defaults(15,5,-1,1);
o_log2m | o_regwidth | o_expthresh | o_sparseon
---------+------------+-------------+------------
12 | 5 | -1 | 1
(1 row)
Time: 0.330 ms
四、用法注意
1. 更新hll时小心锁, 最好是由程序调度聚合. 不要实时更新.
2. 浮点型转化成hll_hashval时, 不建议直接转化, 建议将浮点型数值固化后转化.
http://stackoverflow.com/questions/7403210/hashing-floating-point-values
参考
1. http://www.postgresql.org/docs/9.2/static/sql-select.html
2. http://www.postgresql.org/docs/9.2/static/tutorial-window.html
3. http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
4. http://www.postgresql.org/docs/9.2/static/queries-table-expressions.html#QUERIES-WINDOW
5. http://www.postgresql.org/docs/9.2/static/functions-window.html
6. http://blog.163.com/digoal@126/blog/static/16387704020131264480325/
7. REFERENCE.markdown
8. STORAGE.markdown