PostgreSQL sharding有序UUID最佳实践 - serial global uuid stored in 64bit int8
背景
Instagram 使用PostgreSQL数据库, 2012year中国PostgreSQL用户大会的时候他们来做过一次交流。
现在Instagram的月度活跃用户数已经超过9000万,每天照片上传量超4000万。
sharding, 一个非常关键的算法是如何产生所有节点全局唯一的ID。
Instagram 使用int8来存储这个唯一ID. 把64个bit位拆成3个部分, 如下 :
1. 最高的41个bit位, 存储从某时间点开始经过的毫秒数. (区别于UNIX epoch, 自’1970-01-01 00:00:00’ 以来的秒数)
2. 接下来的13个bit位, 存储shard ID.
3. 最后10个bit位, 存储序列值.
例如 :
1. 指定’2010-01-01 00:00:00’ 为这个起点, 这41个bit存储的是从这个起点开始历经的毫秒数.
41个bit位无符号的情况下可以存储2^41=2199023255552个数字, 也就是约69.7year的数据.
postgres=# select (2^41)/1000/60/60/24/365.0;
?column?
------------------
69.7305700010147
(1 row)
如果把起始值设置为’2012-01-01’的话, 69.7year后也就是 ‘2081-09-01’ 后这个算法将会有问题. 因为数值将大于41个bit位.
2. shard ID用了13个bit位, 所以可以存储8192个shard节点的信息.
postgres=# select 2^13;
?column?
----------
8192
(1 row)
如果每个shard节点用到1个主机, 使用这个算法的集群最大可以扩到8192个主机.
3. 序列值占用10个bit位, 可以存储1024个值.
postgres=# select 2^10;
?column?
----------
1024
(1 row)
因此可以这么来理解. 在1毫秒内, 每个shard节点, 允许产生1024个唯一值. 1秒产生102.4万个唯一值.
整个集群1秒允许产生102.4*8192 = 83.88608亿个唯一值.
postgres=# select 1024*1000;
?column?
----------
1024000
(1 row)
前段时间测试过2.0GHz 至强 8核的主机每秒约生成11万个序列值. 所以102.4万个唯一值这个宽度对于一台shard节点来说应该是没有问题的.
PostgreSQL 的序列性能测试 :
测试机 :
CentOS 5.7 x64
PostgreSQL 9.2.1
DELL R610
CPU 2 * Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
1. 测试不开启cache的情况下取序列的速度 :
创建序列 :
ocz@db-172-16-3-150-> psql
psql (9.2.1)
Type "help" for help.
digoal=> create sequence seq_test;
CREATE SEQUENCE
查看当前序列ID :
digoal=> select * from seq_test ;
-[ RECORD 1 ]-+--------------------
sequence_name | seq_test
last_value | 1
start_value | 1
increment_by | 1
max_value | 9223372036854775807
min_value | 1
cache_value | 1
log_cnt | 0
is_cycled | f
is_called | f
pgbench测试脚本 :
ocz@db-172-16-3-150-> cat t.sql
select nextval('seq_test');
测试结果 :
ocz@db-172-16-3-150-> pgbench -M prepared -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 30 s
number of transactions actually processed: 3085448
tps = 102832.533289 (including connections establishing)
tps = 102891.321540 (excluding connections establishing)
statement latencies in milliseconds:
0.153352 select nextval('seq_test');
由此看出不启用cache的情况下每秒可取102891个序列值.
2. 测试开启cache的情况下取序列的速度 :
digoal=> alter sequence seq_test restart with 1;
ALTER SEQUENCE
digoal=> alter sequence seq_test cache 100;
ALTER SEQUENCE
ocz@db-172-16-3-150-> pgbench -M prepared -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 30 s
number of transactions actually processed: 3359853
tps = 111975.743127 (including connections establishing)
tps = 112049.881187 (excluding connections establishing)
statement latencies in milliseconds:
0.140799 select nextval('seq_test');
ocz@db-172-16-3-150-> psql digoal digoal
psql (9.2.1)
Type "help" for help.
digoal=> \x
Expanded display is on.
digoal=> select * from seq_test ;
-[ RECORD 1 ]-+--------------------
sequence_name | seq_test
last_value | 3360400
start_value | 1
increment_by | 1
max_value | 9223372036854775807
min_value | 1
cache_value | 100
log_cnt | 32
is_cycled | f
is_called | t
获取速度为112049每秒. 略有提高. 但是如果非长连接的话, 将造成巨大的浪费. 如下 :
digoal=> alter sequence seq_test restart with 1;
ALTER SEQUENCE
调整pgbench参数 :
-C establish new connection for each transaction
测试结果 :
ocz@db-172-16-3-150-> pgbench -M simple -C -n -r -f ./t.sql -c 16 -j 4 -T 30 -U digoal digoal
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 16
number of threads: 4
duration: 30 s
number of transactions actually processed: 25865
tps = 861.986914 (including connections establishing)
tps = 59712.243088 (excluding connections establishing)
statement latencies in milliseconds:
13.960707 select nextval('seq_test');
查看最后的序列值 :
ocz@db-172-16-3-150-> psql digoal digoal
psql (9.2.1)
Type "help" for help.
digoal=> \x
Expanded display is on.
digoal=> select * from seq_test ;
-[ RECORD 1 ]-+--------------------
sequence_name | seq_test
last_value | 2588100
start_value | 1
increment_by | 1
max_value | 9223372036854775807
min_value | 1
cache_value | 100
log_cnt | 32
is_cycled | f
is_called | t
实际处理事务为25865个, 但是序列值已经增长到了2588100, 100倍的浪费.
测试结果仅供参考.
例子 :
假设起点为’2012-01-01’, 转换成unix epoch再转换成毫秒后: 1325376000000
postgres=# select EXTRACT(EPOCH FROM '2012-01-01 00:00:00'::timestamp) * 1000;
?column?
---------------
1325376000000
(1 row)
逻辑的shard可以使用schema来区分, 当然也可以使用database来区分. 本例使用schema来区分.
需要为每个shard创建生成全局唯一ID的函数 :
以 shard_id = 5 这个shard节点为例, 起始epoch = 1325376000000.
函数如下 :
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
our_epoch bigint := 1325376000000;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
BEGIN
SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 23;
result := result | (shard_id << 10);
result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
shard 5 中对应的表结构, id为全局唯一主键, 默认值来自上面的函数产生的值.
CREATE TABLE insta5.our_table (
"id" bigint NOT NULL DEFAULT insta5.next_id(),
...rest of table schema...
)
小结
1. 因为int8是带符号整型, 如果第一个BIT=1, 得出负数.
因此前34.87year这个函数产生的是正数, 后34.87year这个函数产生的是负数.
postgres=# select 2^40/1000/60/60/24/365;
?column?
------------------
34.8652850005074
(1 row)
例如 :
正值 :
postgres=# select date '2012-01-01'+ interval '34.8652850005074 year'
postgres-# ;
?column?
---------------------
2046-11-01 00:00:00
(1 row)
postgres=# do language plpgsql $$
DECLARE
our_epoch bigint := 1325376000000;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
result bigint;
BEGIN
SELECT 112345 % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM '2046-11-01 00:00:00'::timestamp) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 23;
result := result | (shard_id << 10);
result := result | (seq_id);
raise notice '%', result;
END;
$$;
NOTICE: 9221321628057605849
DO
负值 :
postgres=# select date '2012-01-01'+ interval '34.9652850005074 year';
?column?
---------------------
2046-12-01 00:00:00
(1 row)
postgres=# do language plpgsql $$
DECLARE
our_epoch bigint := 1325376000000;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
result bigint;
BEGIN
SELECT 112345 % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM '2046-12-01 00:00:00'::timestamp) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 23;
result := result | (shard_id << 10);
result := result | (seq_id);
raise notice '%', result;
END;
$$;
NOTICE: -9203679173715945767
DO
2. 这个算法的好处还有1个就是它产生的值是有顺序的, 不是无序的UUID. 因此存储顺序和索引的顺序一致性非常高.
对于使用索引查找是非常有效的.
并且这个算法对于shard也非常方便.
参考
1.
postgres=# \do |
List of operators
Schema | Name | Left arg type | Right arg type | Result type | Description
------------+------+---------------+----------------+-------------+-------------------
pg_catalog | | | bigint | bigint | bigint | bitwise or