PostgreSQL sharding有序UUID最佳实践 - serial global uuid stored in 64bit int8

4 minute read

背景

Instagram 使用PostgreSQL数据库, 2012year中国PostgreSQL用户大会的时候他们来做过一次交流。

现在Instagram的月度活跃用户数已经超过9000万,每天照片上传量超4000万。

sharding, 一个非常关键的算法是如何产生所有节点全局唯一的ID。

Instagram 使用int8来存储这个唯一ID. 把64个bit位拆成3个部分, 如下 :

1. 最高的41个bit位, 存储从某时间点开始经过的毫秒数. (区别于UNIX epoch, 自’1970-01-01 00:00:00’ 以来的秒数)

2. 接下来的13个bit位, 存储shard ID.

3. 最后10个bit位, 存储序列值.

例如 :

1. 指定’2010-01-01 00:00:00’ 为这个起点, 这41个bit存储的是从这个起点开始历经的毫秒数.

41个bit位无符号的情况下可以存储2^41=2199023255552个数字, 也就是约69.7year的数据.

postgres=# select (2^41)/1000/60/60/24/365.0;  
     ?column?       
------------------  
 69.7305700010147  
(1 row)  

如果把起始值设置为’2012-01-01’的话, 69.7year后也就是 ‘2081-09-01’ 后这个算法将会有问题. 因为数值将大于41个bit位.

2. shard ID用了13个bit位, 所以可以存储8192个shard节点的信息.

postgres=# select 2^13;  
 ?column?   
----------  
     8192  
(1 row)  

如果每个shard节点用到1个主机, 使用这个算法的集群最大可以扩到8192个主机.

3. 序列值占用10个bit位, 可以存储1024个值.

postgres=# select 2^10;  
 ?column?   
----------  
     1024  
(1 row)  

因此可以这么来理解. 在1毫秒内, 每个shard节点, 允许产生1024个唯一值. 1秒产生102.4万个唯一值.

整个集群1秒允许产生102.4*8192 = 83.88608亿个唯一值.

postgres=# select 1024*1000;  
 ?column?   
----------  
  1024000  
(1 row)  

前段时间测试过2.0GHz 至强 8核的主机每秒约生成11万个序列值. 所以102.4万个唯一值这个宽度对于一台shard节点来说应该是没有问题的.

PostgreSQL 的序列性能测试 :   
测试机 :   
CentOS 5.7 x64  
PostgreSQL 9.2.1  
DELL R610  
CPU 2 * Intel(R) Xeon(R) CPU E5504  @ 2.00GHz  
  
1. 测试不开启cache的情况下取序列的速度 :   
创建序列 :   
ocz@db-172-16-3-150-> psql   
psql (9.2.1)  
Type "help" for help.  
digoal=> create sequence seq_test;  
CREATE SEQUENCE  
查看当前序列ID :   
digoal=> select * from seq_test ;  
-[ RECORD 1 ]-+--------------------  
sequence_name | seq_test  
last_value    | 1  
start_value   | 1  
increment_by  | 1  
max_value     | 9223372036854775807  
min_value     | 1  
cache_value   | 1  
log_cnt       | 0  
is_cycled     | f  
is_called     | f  
  
pgbench测试脚本 :   
ocz@db-172-16-3-150-> cat t.sql  
select nextval('seq_test');  
  
测试结果 :   
ocz@db-172-16-3-150-> pgbench -M prepared -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 30 s  
number of transactions actually processed: 3085448  
tps = 102832.533289 (including connections establishing)  
tps = 102891.321540 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.153352        select nextval('seq_test');  
由此看出不启用cache的情况下每秒可取102891个序列值.  
  
2. 测试开启cache的情况下取序列的速度 :   
digoal=> alter sequence seq_test restart with 1;  
ALTER SEQUENCE  
digoal=> alter sequence seq_test cache 100;  
ALTER SEQUENCE  
ocz@db-172-16-3-150-> pgbench -M prepared -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 30 s  
number of transactions actually processed: 3359853  
tps = 111975.743127 (including connections establishing)  
tps = 112049.881187 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.140799        select nextval('seq_test');  
ocz@db-172-16-3-150-> psql digoal digoal  
psql (9.2.1)  
Type "help" for help.  
digoal=> \x  
Expanded display is on.  
digoal=> select * from seq_test ;  
-[ RECORD 1 ]-+--------------------  
sequence_name | seq_test  
last_value    | 3360400  
start_value   | 1  
increment_by  | 1  
max_value     | 9223372036854775807  
min_value     | 1  
cache_value   | 100  
log_cnt       | 32  
is_cycled     | f  
is_called     | t  
获取速度为112049每秒. 略有提高. 但是如果非长连接的话, 将造成巨大的浪费. 如下 :   
digoal=> alter sequence seq_test restart with 1;  
ALTER SEQUENCE  
  
调整pgbench参数 :   
  -C           establish new connection for each transaction  
  
测试结果 :   
ocz@db-172-16-3-150-> pgbench -M simple -C -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal  
transaction type: Custom query  
scaling factor: 1  
query mode: simple  
number of clients: 16  
number of threads: 4  
duration: 30 s  
number of transactions actually processed: 25865  
tps = 861.986914 (including connections establishing)  
tps = 59712.243088 (excluding connections establishing)  
statement latencies in milliseconds:  
        13.960707       select nextval('seq_test');  
查看最后的序列值 :   
ocz@db-172-16-3-150-> psql digoal digoal  
psql (9.2.1)  
Type "help" for help.  
digoal=> \x  
Expanded display is on.  
digoal=> select * from seq_test ;  
-[ RECORD 1 ]-+--------------------  
sequence_name | seq_test  
last_value    | 2588100  
start_value   | 1  
increment_by  | 1  
max_value     | 9223372036854775807  
min_value     | 1  
cache_value   | 100  
log_cnt       | 32  
is_cycled     | f  
is_called     | t  
实际处理事务为25865个,  但是序列值已经增长到了2588100, 100倍的浪费.  
测试结果仅供参考.  

例子 :

假设起点为’2012-01-01’, 转换成unix epoch再转换成毫秒后: 1325376000000

postgres=# select EXTRACT(EPOCH FROM '2012-01-01 00:00:00'::timestamp) * 1000;  
   ?column?      
---------------  
 1325376000000  
(1 row)  

逻辑的shard可以使用schema来区分, 当然也可以使用database来区分. 本例使用schema来区分.

需要为每个shard创建生成全局唯一ID的函数 :

以 shard_id = 5 这个shard节点为例, 起始epoch = 1325376000000.

函数如下 :

CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$  
DECLARE  
    our_epoch bigint := 1325376000000;  
    seq_id bigint;  
    now_millis bigint;  
    shard_id int := 5;  
BEGIN  
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;  
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;  
    result := (now_millis - our_epoch) << 23;  
    result := result | (shard_id << 10);  
    result := result | (seq_id);  
END;  
$$ LANGUAGE PLPGSQL;  

shard 5 中对应的表结构, id为全局唯一主键, 默认值来自上面的函数产生的值.

CREATE TABLE insta5.our_table (  
    "id" bigint NOT NULL DEFAULT insta5.next_id(),  
    ...rest of table schema...  
)  

小结

1. 因为int8是带符号整型, 如果第一个BIT=1, 得出负数.

因此前34.87year这个函数产生的是正数, 后34.87year这个函数产生的是负数.

postgres=# select 2^40/1000/60/60/24/365;  
     ?column?       
------------------  
 34.8652850005074  
(1 row)  

例如 :

正值 :

postgres=# select date '2012-01-01'+ interval '34.8652850005074 year'  
postgres-# ;  
      ?column?         
---------------------  
 2046-11-01 00:00:00  
(1 row)  
  
postgres=# do language plpgsql $$  
DECLARE  
    our_epoch bigint := 1325376000000;  
    seq_id bigint;  
    now_millis bigint;  
    shard_id int := 5;  
    result bigint;  
BEGIN  
    SELECT 112345 % 1024 INTO seq_id;  
  
    SELECT FLOOR(EXTRACT(EPOCH FROM '2046-11-01 00:00:00'::timestamp) * 1000) INTO now_millis;  
    result := (now_millis - our_epoch) << 23;  
    result := result | (shard_id << 10);  
    result := result | (seq_id);  
    raise notice '%', result;  
END;  
$$;  
NOTICE:  9221321628057605849  
DO  

负值 :

postgres=# select date '2012-01-01'+ interval '34.9652850005074 year';  
      ?column?         
---------------------  
 2046-12-01 00:00:00  
(1 row)  
  
postgres=# do language plpgsql $$                                       
DECLARE  
    our_epoch bigint := 1325376000000;  
    seq_id bigint;  
    now_millis bigint;  
    shard_id int := 5;  
    result bigint;  
BEGIN  
    SELECT 112345 % 1024 INTO seq_id;  
  
    SELECT FLOOR(EXTRACT(EPOCH FROM '2046-12-01 00:00:00'::timestamp) * 1000) INTO now_millis;  
    result := (now_millis - our_epoch) << 23;  
    result := result | (shard_id << 10);  
    result := result | (seq_id);  
    raise notice '%', result;  
END;  
$$;  
NOTICE:  -9203679173715945767  
DO  

2. 这个算法的好处还有1个就是它产生的值是有顺序的, 不是无序的UUID. 因此存储顺序和索引的顺序一致性非常高.

对于使用索引查找是非常有效的.

并且这个算法对于shard也非常方便.

参考

1.

postgres=# \do |  
                                  List of operators  
   Schema   | Name | Left arg type | Right arg type | Result type |    Description      
------------+------+---------------+----------------+-------------+-------------------  
 pg_catalog | |    | bigint        | bigint         | bigint      | bitwise or  

Flag Counter

digoal’s 大量PostgreSQL文章入口