PostgreSQL hll (HyperLogLog) extension for State of The Art Cardinality Estimation Algorithm - 2

5 minute read

背景

一、PostgreSQL hll的主要功能 :

1. 快速的检索hll中存储的唯一值.

例如计算网站某天的访问用户数. 以前的算法可能是这样的 :

select count(distinct userid) from access_log where date(crt_time)='2013-02-01'; -- 非常耗时.  

hll解决了耗时的问题, 使用方法是将用户ID聚合存储到hll类型中. 如下(假设user_id的类型为int) :

create table access_date (acc_date date unique, userids hll);  
insert into access_date select date(crt_time), hll_add_agg(hll_hash_integer(user_id)) from access_log group by 1;  
select #userids from access_date where acc_date='2013-02-01'; -- 这条语句返回只要1毫秒左右. (10亿个唯一值返回也在1毫秒左右)  

而hll仅仅需要1.2KB就可以存储1.6e+12的唯一值.

2. 因为hll中实际上存储了键值信息, 所以还可以做类似新增用户等的统计. 如下 :

digoal=> create table access_date(acc_date date unique, userids hll);  
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "access_date_acc_date_key" for table "access_date"  
CREATE TABLE  
Time: 8.219 ms  
digoal=> insert into access_date select current_date, hll_add_agg(hll_hash_integer(user_id)) from generate_series(1,10000) t(user_id);  
INSERT 0 1  
Time: 14.598 ms  
digoal=> insert into access_date select current_date-1, hll_add_agg(hll_hash_integer(user_id)) from generate_series(5000,20000) t(user_id);  
INSERT 0 1  
Time: 16.397 ms  
digoal=> insert into access_date select current_date-2, hll_add_agg(hll_hash_integer(user_id)) from generate_series(9000,40000) t(user_id);  
INSERT 0 1  
Time: 25.487 ms  
digoal=> select *,total_users-coalesce(lag(total_users,1) over (order by rn),0) AS new_users from (  
digoal(> SELECT acc_date, row_number() over date as rn,#hll_union_agg(userids) OVER date as total_users   
digoal(> FROM access_date  
digoal(> WINDOW date AS (ORDER BY acc_date ASC ROWS UNBOUNDED PRECEDING)  
digoal(> ) t;  
  acc_date  | rn |   total_users    |    new_users       
------------+----+------------------+------------------  
 2013-02-25 |  1 | 30324.8563878223 | 30324.8563878223  
 2013-02-26 |  2 | 33944.8370446358 | 3619.98065681347  
 2013-02-27 |  3 | 38696.2201822711 | 4751.38313763532  
(3 rows)  
Time: 2.327 ms  

二、PostgreSQL hll的主要组成部分 :

1. 两个数据类型 :

digoal=> \dT+  
                                    List of data types  
 Schema |    Name     | Internal name | Size | Elements | Access privileges | Description   
--------+-------------+---------------+------+----------+-------------------+-------------  
 public | hll         | hll           | var  |          |                   |   
 public | hll_hashval | hll_hashval   | 8    |          |                   |   

hll 你可以想象成多个hll_hashval组成的集合.

hll_hashval是通过hash函数生成的hash值, 函数在后面介绍.

2. 几个操作函数 :

2.1 生成hash值的函数

Function	Input	Example  
hll_hash_boolean	boolean	hll_hash_boolean(TRUE)  
or  
hll_hash_boolean(TRUE, 123/*hash seed*/)  
hll_hash_smallint	smallint	hll_hash_smallint(4)  
or  
hll_hash_smallint(4, 123/*hash seed*/)  
hll_hash_integer	integer	hll_hash_integer(21474836)  
or  
hll_hash_integer(21474836, 123/*hash seed*/)  
hll_hash_bigint	bigint	hll_hash_bigint(223372036854775808)  
or  
hll_hash_bigint(223372036854775808, 123/*hash seed*/)  
hll_hash_bytea	bytea	hll_hash_bytea(E'\\xDEADBEEF')  
or  
hll_hash_bytea(E'\\xDEADBEEF', 123/*hash seed*/)  
hll_hash_text	text	hll_hash_text('foobar')  
or  
hll_hash_text('foobar', 123/*hash seed*/)  

2.2 将多个hll_hashval聚合成hll的函数

digoal=> \df *.*hll*agg*  
                                         List of functions  
 Schema |     Name      | Result data type |              Argument data types               | Type   
--------+---------------+------------------+------------------------------------------------+------  
 public | hll_add_agg   | hll              | hll_hashval                                    | agg  
 public | hll_add_agg   | hll              | hll_hashval, integer                           | agg  
 public | hll_add_agg   | hll              | hll_hashval, integer, integer                  | agg  
 public | hll_add_agg   | hll              | hll_hashval, integer, integer, bigint          | agg  
 public | hll_add_agg   | hll              | hll_hashval, integer, integer, bigint, integer | agg  

由于聚合函数不支持默认值, 所以定义了5个, 如上. 分别用来设置几个调整精度和阈值.

hll.c  
            int32 log2m = PG_GETARG_INT32(2);  
            int32 regwidth = PG_GETARG_INT32(3);  
            int64 expthresh = PG_GETARG_INT64(4);  
            int32 sparseon = PG_GETARG_INT32(5);  

使用方法举例 :

digoal=> select hll_add_agg(hll_hash_integer(t)) from generate_series(1,10) g(t);  
                                                                               hll_add_agg                                            
                                        
------------------------------------------------------------------------------------------------------------------------------------  
--------------------------------------  
 \x128c7f8895a3f5af28cafeb2c0b33a441f4218da0ce907e4355b6018b150d055d8e3d31cc3e945c98357f02c98c8925c9ed0524848de7f7bd2a13b5c5c9d3935c  
f09bf72f24e286b62c7e47f2769b67e461dfb  
(1 row)  

2.3 将多个hll聚合的函数

public | hll_union_agg           | hll              | hll  

去重复.

2.4 比较两个hll或hll_hashval值的函数

public | hll_ne                  | boolean          | hll, hll  
public | hll_eq                  | boolean          | hll, hll  
public | hll_hashval_ne          | boolean          | hll_hashval, hll_hashval  
public | hll_hashval_eq          | boolean          | hll_hashval, hll_hashval  

2.5 union两个hll的函数. 或者理解为在现有的hll中增加hll值.

public | hll_union               | hll              | hll, hll  

2.6 在现有的hll中增加hll_hashval值

 public | hll_add_rev             | hll              | hll_hashval, hll  
 public | hll_add                 | hll              | hll, hll_hashval  

例如 :

digoal=> select hll_add_rev(hll_hash_bigint(1),hll_add_agg(hll_hash_bigint(t))) from generate_series(3,10) g(t);  
                                                                       hll_add_rev                                                    
                        
------------------------------------------------------------------------------------------------------------------------------------  
----------------------  
 \x128c7fb419d210486ab6c1004403b7fb05c44a06c9ec78d52c26530fd4c5f69b6c771b122f34392c621e721b341d80dfead7e630e12993257f8fb23987d28c06f  
0df795b3d5839b2488b0c  
(1 row)  
Time: 0.558 ms  
digoal=> select hll_add(hll_add_agg(hll_hash_bigint(t)),hll_hash_bigint(1)) from generate_series(3,10) g(t);  
                                                                         hll_add                                                      
                        
------------------------------------------------------------------------------------------------------------------------------------  
----------------------  
 \x128c7fb419d210486ab6c1004403b7fb05c44a06c9ec78d52c26530fd4c5f69b6c771b122f34392c621e721b341d80dfead7e630e12993257f8fb23987d28c06f  
0df795b3d5839b2488b0c  
(1 row)  

2.7 计算hll中的唯一值的函数

public | hll_cardinality         | double precision | hll  

例如 :

digoal=> insert into acc_agg select current_date,hll_add_agg(hll_hash_bigint(t)) from generate_series(1,100000000) g(t);  
INSERT 0 1  
Time: 75452.588 ms  
digoal=> select #userids from acc_agg;  
     ?column?       
------------------  
              NaN  
 10017.3757531911  
 98388.5525610614  
 1004928.03336943  
 9817781.35207598  
 99813892.0484923  
(6 rows)  

2.8 设置或检查阈值和精度的函数

-- Returns the schema version of an hll.  
--  
CREATE FUNCTION hll_schema_version(hll)  
     RETURNS integer  
     AS 'MODULE_PATHNAME'  
     LANGUAGE C STRICT IMMUTABLE;  
  
-- Returns the type of an hll.  
--  
CREATE FUNCTION hll_type(hll)  
     RETURNS integer  
     AS 'MODULE_PATHNAME'  
     LANGUAGE C STRICT IMMUTABLE;  
  
-- Returns the log2m value of an hll.  
--  
CREATE FUNCTION hll_log2m(hll)  
     RETURNS integer  
     AS 'MODULE_PATHNAME'  
     LANGUAGE C STRICT IMMUTABLE;  
  
-- Returns the register width of an hll.  
--  
CREATE FUNCTION hll_regwidth(hll)  
     RETURNS integer  
     AS 'MODULE_PATHNAME'  
     LANGUAGE C STRICT IMMUTABLE;  
  
-- Returns the maximum explicit threshold of an hll.  
--  
CREATE FUNCTION hll_expthresh(hll, OUT specified bigint, OUT effective bigint)  
     AS 'MODULE_PATHNAME'  
     LANGUAGE C STRICT IMMUTABLE;  
  
-- Returns the sparse enabled value of an hll.  
--  
CREATE FUNCTION hll_sparseon(hll)  
     RETURNS integer  
     AS 'MODULE_PATHNAME'  
     LANGUAGE C STRICT IMMUTABLE;  
  
-- Set output version.  
--  
CREATE FUNCTION hll_set_output_version(integer)  
     RETURNS integer  
     AS 'MODULE_PATHNAME'  
     LANGUAGE C STRICT IMMUTABLE;  
  
-- Set sparse to full compressed threshold to fixed value.  
--  
CREATE FUNCTION hll_set_max_sparse(integer)  
     RETURNS integer  
     AS 'MODULE_PATHNAME'  
     LANGUAGE C STRICT IMMUTABLE;  
  
-- Change the default type modifier, empty and add aggregate defaults.  
CREATE FUNCTION hll_set_defaults(IN i_log2m integer,  
                                 IN i_regwidth integer,  
                                 IN i_expthresh bigint,  
                                 IN i_sparseon integer,  
                                 OUT o_log2m integer,  
                                 OUT o_regwidth integer,  
                                 OUT o_expthresh bigint,  
                                 OUT o_sparseon integer)  
     AS 'MODULE_PATHNAME'  
     LANGUAGE C STRICT IMMUTABLE;  
-- 输出的是老的精度和阈值 :   
                values[j] = palloc(32);  
                snprintf(values[j++], 32, "%d", old_log2m);  
                values[j] = palloc(32);  
                snprintf(values[j++], 32, "%d", old_regwidth);  
                values[j] = palloc(32);  
                snprintf(values[j++], 32, INT64_FORMAT, old_expthresh);  
                values[j] = palloc(32);  
                snprintf(values[j++], 32, "%d", old_sparseon);  

2.9 DEBUG函数 :

public | hll_print               | cstring          | hll  

例如 :

digoal=> select hll_print(hll_add(hll_add_agg(hll_hash_bigint(t)),hll_hash_bigint(1))) from generate_series(3,10) g(t);  
                                 hll_print                                   
---------------------------------------------------------------------------  
 EXPLICIT, 9 elements, nregs=4096, nbits=5, expthresh=-1(320), sparseon=1:+  
 0: -5469109305088493887                                                  +  
 1:    19144387141682250                                                  +  
 2:   489182038263080531                                                  +  
 3:  1140754268591781659                                                  +  
 4:  1310323436750511730                                                  +  
 5:  1960224177162737638                                                  +  
 6:  3522142095546486706                                                  +  
 7:  4145513480871534457                                                  +  
 8:  6574508035858270988   
(1 row)  

因为未达到阈值, 所以使用explicit存储. 返回的唯一值将会是精确值.

3. 几个操作符 :

-- ----------------------------------------------------------------  
-- Operators  
-- ----------------------------------------------------------------  
-- 比较两个hll类型值是否相等  
  
CREATE OPERATOR = (  
        LEFTARG = hll, RIGHTARG = hll, PROCEDURE = hll_eq,  
        COMMUTATOR = '=', NEGATOR = '<>',  
        RESTRICT = eqsel, JOIN = eqjoinsel,  
        MERGES  
);  
  
CREATE OPERATOR <> (  
        LEFTARG = hll, RIGHTARG = hll, PROCEDURE = hll_ne,  
        COMMUTATOR = '<>', NEGATOR = '=',  
        RESTRICT = neqsel, JOIN = neqjoinsel  
);  
  
-- 合并两个hll类型值, 去重复.  
CREATE OPERATOR || (  
       LEFTARG = hll, RIGHTARG = hll, PROCEDURE = hll_union  
);  
  
-- 合并hll_hashval和hll, 去重复.  
CREATE OPERATOR || (  
       LEFTARG = hll, RIGHTARG = hll_hashval, PROCEDURE = hll_add  
);  
  
-- 同上  
CREATE OPERATOR || (  
       LEFTARG = hll_hashval, RIGHTARG = hll, PROCEDURE = hll_add_rev  
);  
  
-- 计算hll类型值中包含的唯一hll_hashval值.  
CREATE OPERATOR # (  
       RIGHTARG = hll, PROCEDURE = hll_cardinality  
);  

三、精度调整 :

在hll.c中 :

// Defaults if type modifier values are not specified.  
//  
#define DEFAULT_LOG2M           15        
#define DEFAULT_REGWIDTH        5  
#define DEFAULT_EXPTHRESH       -1  
#define DEFAULT_SPARSEON        1  
  
static int32 g_default_log2m = DEFAULT_LOG2M;  
static int32 g_default_regwidth = DEFAULT_REGWIDTH;  
static int64 g_default_expthresh = DEFAULT_EXPTHRESH;  
static int32 g_default_sparseon = DEFAULT_SPARSEON;  

也可以在数据库中调整 :

-- 返回为老的值.  
digoal=> select * from hll_set_defaults(15,5,-1,1);  
 o_log2m | o_regwidth | o_expthresh | o_sparseon   
---------+------------+-------------+------------  
      12 |          5 |          -1 |          1  
(1 row)  
Time: 0.330 ms  

四、用法注意

1. 更新hll时小心锁, 最好是由程序调度聚合. 不要实时更新.

2. 浮点型转化成hll_hashval时, 不建议直接转化, 建议将浮点型数值固化后转化.

http://stackoverflow.com/questions/7403210/hashing-floating-point-values

参考

1. http://www.postgresql.org/docs/9.2/static/sql-select.html

2. http://www.postgresql.org/docs/9.2/static/tutorial-window.html

3. http://www.postgresql.org/docs/9.2/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS

4. http://www.postgresql.org/docs/9.2/static/queries-table-expressions.html#QUERIES-WINDOW

5. http://www.postgresql.org/docs/9.2/static/functions-window.html

6. http://blog.163.com/digoal@126/blog/static/16387704020131264480325/

7. REFERENCE.markdown

8. STORAGE.markdown

Flag Counter

digoal’s 大量PostgreSQL文章入口