PostgreSQL 9.6 sharding based on FDW & pg_pathman

8 minute read

背景

可以阅读以下几篇文章先回顾一下FDW,基于FDW的shared以及高效的分区插件pg_pathman。

《PostgreSQL 9.6 单元化,sharding (based on postgres_fdw) - 内核层支持前传》

《PostgreSQL 9.6 sharding + 单元化 (based on postgres_fdw) 最佳实践 - 通用水平分库场景设计与实践》

《PostgreSQL 9.5+ 高效分区表实现 - pg_pathman》

由于pg_pathman支持foreign table,所以拿它来做sharding也是理所当然的事情,同时它支持的HASH partition不需要在QUERY中带入constraint对应的clause,也能走分区查询。

使用FDW来做分片,建议不要把它当成纯代理来使用,除非它已经达到了代理的处理吞吐能力。建议的做法是本地也有数据,同时分片走shard来访问,对应用透明。

应用程序可以连接任意一个节点,但是应该尽量的访问本地化的数据,而不是远程的数据。这种思想与ORACLE RAC的最佳实践类似,尽量避免交叉访问。

例子

依旧使用《PostgreSQL 9.6 sharding + 单元化 (based on postgres_fdw) 最佳实践 - 通用水平分库场景设计与实践》文章中的例子。

例子使用1个路由节点,4个数据节点。

1. 配置目标库的pg_hba.conf

使用md5认证

2. 创建用户

create role role0 nosuperuser login encrypted password 'pwd' ;
create role role1 nosuperuser login encrypted password 'pwd' ;
create role role2 nosuperuser login encrypted password 'pwd' ;
create role role3 nosuperuser login encrypted password 'pwd' ;
create role digoal nosuperuser login encrypted password 'pwd' ;

3. 创建数据库

create database db0 with template template0 ;
create database db1 with template template0 ;
create database db2 with template template0 ;
create database db3 with template template0 ;
create database mas1 with template template0 ;

4. 配置数据库权限

grant all on database db0 to role0 ;
grant all on database db1 to role1 ;
grant all on database db2 to role2 ;
grant all on database db3 to role3 ;
grant all on database mas1 to digoal ;

5. 创建schema, 建议与USER同名

\c db0 role0
create schema role0;
\c db1 role1
create schema role1;
\c db2 role2
create schema role2;
\c db3 role3
create schema role3;
\c mas1 digoal
create schema digoal;

6. 配置路由节点,需要用到postgres_fdw与pg_pathman插件

\c mas1 postgres
create extension postgres_fdw;
create extension pg_pathman;
grant usage on foreign data wrapper postgres_fdw to digoal;

7. ddl建表语句 , ddl-1.sql

create table userinfo(uid int8 primary key, info text, crt_time timestamp);

create table user_log(uid int8 not null, logintime timestamp, logouttime timestamp, client_addr inet, client_port int);

create table user_membership(uid1 int8 not null, uid2 int8, unique (uid1,uid2) );

create table user_membership_rev(uid2 int8 not null, uid1 int8, unique (uid2,uid1) );

8. 初始化数据节点,创建数据表,注意表名最好全局唯一,方便使用import foreign schema的方式导入。

\c db0 role0

ddl-1.sql

然后执行 :   
do language plpgsql $$
declare
  i int;
  mod int := 1;
  nod int := 0;
  sql text;
begin
  for i in (mod*nod)..(mod*(nod+1)-1) loop
      sql := 'create table userinfo_'||i||'(like userinfo including all)';
      execute sql;
      sql := 'create table user_log_'||i||'(like user_log including all)';
      execute sql;
      sql := 'create table user_membership_'||i||'(like user_membership including all)';
      execute sql;
      sql := 'create table user_membership_rev_'||i||'(like user_membership_rev including all)';
      execute sql;
  end loop;
end;
$$;
\c db1 role1

ddl-1.sql

然后执行 :   
do language plpgsql $$
declare
  i int;
  mod int := 1;
  nod int := 1;
  sql text;
begin
  for i in (mod*nod)..(mod*(nod+1)-1) loop
      sql := 'create table userinfo_'||i||'(like userinfo including all)';
      execute sql;
      sql := 'create table user_log_'||i||'(like user_log including all)';
      execute sql;
      sql := 'create table user_membership_'||i||'(like user_membership including all)';
      execute sql;
      sql := 'create table user_membership_rev_'||i||'(like user_membership_rev including all)';
      execute sql;
  end loop;
end;
$$;
\c db2 role2

ddl-1.sql

然后执行 :   
do language plpgsql $$
declare
  i int;
  mod int := 1;
  nod int := 2;
  sql text;
begin
  for i in (mod*nod)..(mod*(nod+1)-1) loop
      sql := 'create table userinfo_'||i||'(like userinfo including all)';
      execute sql;
      sql := 'create table user_log_'||i||'(like user_log including all)';
      execute sql;
      sql := 'create table user_membership_'||i||'(like user_membership including all)';
      execute sql;
      sql := 'create table user_membership_rev_'||i||'(like user_membership_rev including all)';
      execute sql;
  end loop;
end;
$$;
\c db3 role3

ddl-1.sql

然后执行 :   
do language plpgsql $$
declare
  i int;
  mod int := 1;
  nod int := 3;
  sql text;
begin
  for i in (mod*nod)..(mod*(nod+1)-1) loop
      sql := 'create table userinfo_'||i||'(like userinfo including all)';
      execute sql;
      sql := 'create table user_log_'||i||'(like user_log including all)';
      execute sql;
      sql := 'create table user_membership_'||i||'(like user_membership including all)';
      execute sql;
      sql := 'create table user_membership_rev_'||i||'(like user_membership_rev including all)';
      execute sql;
  end loop;
end;
$$;

9. 初始化master

\c mas1 digoal


-- 初始化foreign server
-- or user host (本例使用unix socket)
create server db0 foreign data wrapper postgres_fdw OPTIONS (host '/u02/digoal/pg_root5281', port '5281', dbname 'db0', use_remote_estimate 'false', fetch_size '5000000');
create server db1 foreign data wrapper postgres_fdw OPTIONS (host '/u02/digoal/pg_root5281', port '5281', dbname 'db1', use_remote_estimate 'false', fetch_size '5000000');
create server db2 foreign data wrapper postgres_fdw OPTIONS (host '/u02/digoal/pg_root5281', port '5281', dbname 'db2', use_remote_estimate 'false', fetch_size '5000000');
create server db3 foreign data wrapper postgres_fdw OPTIONS (host '/u02/digoal/pg_root5281', port '5281', dbname 'db3', use_remote_estimate 'false', fetch_size '5000000');
-- 设置use_remote_estimate 'false' 需收集外部表的统计信息, 但是可以节约explain的操作,对于高并发的小事务,建议使用FALSE


-- 创建user mapping
create user mapping FOR digoal SERVER db0 options (user 'role0', password 'pwd');
create user mapping FOR digoal SERVER db1 options (user 'role1', password 'pwd');
create user mapping FOR digoal SERVER db2 options (user 'role2', password 'pwd');
create user mapping FOR digoal SERVER db3 options (user 'role3', password 'pwd');


-- 导入FOREIGN TABLE,排除主表
import foreign schema role0 EXCEPT (userinfo,user_log,user_membership,mv_user_membership,user_membership_rev,mv_user_membership_rev) from server db0 into digoal;
import foreign schema role1 EXCEPT (userinfo,user_log,user_membership,mv_user_membership,user_membership_rev,mv_user_membership_rev) from server db1 into digoal;
import foreign schema role2 EXCEPT (userinfo,user_log,user_membership,mv_user_membership,user_membership_rev,mv_user_membership_rev) from server db2 into digoal;
import foreign schema role3 EXCEPT (userinfo,user_log,user_membership,mv_user_membership,user_membership_rev,mv_user_membership_rev) from server db3 into digoal;


-- 创建主表
create table userinfo(uid int8 primary key, info text, crt_time timestamp);
create table user_log(uid int8 not null, logintime timestamp, logouttime timestamp, client_addr inet, client_port int);
create table user_membership(uid1 int8 not null, uid2 int8, unique (uid1,uid2) );
create table user_membership_rev(uid2 int8 not null, uid1 int8, unique (uid2,uid1) );


-- 生成哈希分区
select create_hash_partitions('userinfo'::regclass, 'uid', 0, true) ;
select create_hash_partitions('user_log'::regclass, 'uid', 0, true) ;
select create_hash_partitions('user_membership'::regclass, 'uid1', 0, true) ;
select create_hash_partitions('user_membership_rev'::regclass, 'uid2', 0, true) ;


-- 设置foreign table 约束,目前pg_pathman不支持attach hash 分区表,所以我把步骤分解了,先构建约束
-- 注意约束名有固定的格式
do language plpgsql $$
declare
  i int;
  x int := 0;
  y int := 3;
  sql text;
begin
  for i in x..y loop
      sql := 'alter foreign table userinfo_'||i||' add constraint pathman_userinfo_'||i||'_1_check check(get_hash_part_idx(hashint8(uid), 4) = '||i||' )';
      execute sql;
      sql := 'alter foreign table user_log_'||i||' add constraint pathman_user_log_'||i||'_1_check check(get_hash_part_idx(hashint8(uid), 4) = '||i||' )';
      execute sql;
      sql := 'alter foreign table user_membership_'||i||' add constraint pathman_user_membership_'||i||'_1_check check(get_hash_part_idx(hashint8(uid1), 4) = '||i||' )';
      execute sql;
      sql := 'alter foreign table user_membership_rev_'||i||' add constraint pathman_user_membership_rev_'||i||'_1_check check(get_hash_part_idx(hashint8(uid2), 4) = '||i||' )';
      execute sql;
  end loop;
end;
$$;


-- 设置继承关系
do language plpgsql $$
declare
  i int;
  x int := 0;
  y int := 3;
  sql text;
begin
  for i in x..y loop
      sql := 'alter foreign table userinfo_'||i||' inherit userinfo';
      execute sql;
      sql := 'alter foreign table user_log_'||i||' inherit user_log';
      execute sql;
      sql := 'alter foreign table user_membership_'||i||' inherit user_membership';
      execute sql;
      sql := 'alter foreign table user_membership_rev_'||i||' inherit user_membership_rev';
      execute sql;
  end loop;
end;
$$;


-- 前面都准备好之后,可以设置pg_pathman的内存,刷新。
do language plpgsql $$
declare
  i int;
  x int := 0;
  y int := 3;
  tbls text[] := array['userinfo', 'user_log', 'user_membership', 'user_membership_rev'];
  tbl text;
  child text;
  sql text;
begin
  foreach tbl in array tbls loop 
    for i in x..y loop
      child := tbl||'_'||i;
      -- 将子表添加到pg_pathman hash memory中
      perform public.invoke_on_partition_created_callback(tbl::regclass, child::REGCLASS, 0);
    end loop;
    -- 刷新
    perform public.on_create_partitions(tbl);
  end loop;
end;
$$;

10. 插入压测

插入数据测试的函数

create or replace function f(int) returns void as $$
declare
begin 
  insert into userinfo values ($1, md5(random()::Text), now());
 exception when others then return;
end;
$$ language plpgsql strict;

压测

$ vi test1.sql
\set id random(1,50000000)
select f(:id);

$ pgbench -M simple -n -r -P 1 -f ./test1.sql -c 64 -j 64 -T 1200 -U digoal mas1
...
progress: 102.0 s, 42769.1 tps, lat 1.495 ms stddev 0.512
progress: 103.0 s, 42633.0 tps, lat 1.500 ms stddev 0.519
progress: 104.0 s, 42864.0 tps, lat 1.492 ms stddev 0.525
progress: 105.0 s, 42724.0 tps, lat 1.497 ms stddev 0.537
progress: 106.0 s, 42801.3 tps, lat 1.494 ms stddev 0.507
progress: 107.0 s, 42547.7 tps, lat 1.503 ms stddev 0.524
...
   PerfTop:   54084 irqs/sec  kernel:40.6%  exact:  0.0% [1000Hz cycles],  (all, 64 CPUs)
---------------------------------------------------------------------------------------------
             samples  pcnt function                      DSO
             _______ _____ _____________________________ ____________________________________

            18991.00  2.5% AllocSetAlloc                 /home/digoal/pgsql9.6/bin/postgres
            17321.00  2.3% _spin_lock                    [kernel.kallsyms]                   
            15946.00  2.1% GetSnapshotData               /home/digoal/pgsql9.6/bin/postgres
            15049.00  2.0% hash_search_with_hash_value   /home/digoal/pgsql9.6/bin/postgres
            12864.00  1.7% base_yyparse                  /home/digoal/pgsql9.6/bin/postgres
            10796.00  1.4% SearchCatCache                /home/digoal/pgsql9.6/bin/postgres
             9826.00  1.3% memcpy                        /lib64/libc-2.12.so                 
             8840.00  1.2% _int_malloc                   /lib64/libc-2.12.so                 
             8261.00  1.1% __strlen_sse42                /lib64/libc-2.12.so                 
             8245.00  1.1% schedule                      [kernel.kallsyms]                   
             7441.00  1.0% __GI_vfprintf                 /lib64/libc-2.12.so                 
             7348.00  1.0% AtEOXact_GUC                  /home/digoal/pgsql9.6/bin/postgres
             6679.00  0.9% __strcmp_sse42                /lib64/libc-2.12.so                 
             6531.00  0.9% mutex_spin_on_owner           [kernel.kallsyms]                   
             5899.00  0.8% reschedule_interrupt          [kernel.kallsyms]                   

直连测试

\c db0 role0

create or replace function f(int) returns void as $$
declare
begin 
  insert into userinfo values ($1, md5(random()::Text), now());
 exception when others then return;
end;
$$ language plpgsql strict;
$ pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 64 -j 64 -T 1200 -U role0 db0

progress: 6.0 s, 149615.0 tps, lat 0.427 ms stddev 0.389
progress: 7.0 s, 149022.1 tps, lat 0.428 ms stddev 0.398
progress: 8.0 s, 148586.6 tps, lat 0.430 ms stddev 0.430

11. query测试

压测

$ vi test1.sql
\set id random(1,50000000)
select * from userinfo where uid=:id;

$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 1200 -U digoal mas1
...
progress: 158.0 s, 46046.8 tps, lat 1.388 ms stddev 0.960
progress: 159.0 s, 46201.2 tps, lat 1.384 ms stddev 0.945
progress: 160.0 s, 46176.9 tps, lat 1.385 ms stddev 0.957
progress: 161.0 s, 46143.0 tps, lat 1.386 ms stddev 0.943
progress: 162.0 s, 46316.0 tps, lat 1.381 ms stddev 0.955
progress: 163.0 s, 46067.1 tps, lat 1.388 ms stddev 0.954
...
   PerfTop:   46994 irqs/sec  kernel:40.7%  exact:  0.0% [1000Hz cycles],  (all, 64 CPUs)
-------------------------------------------------------------------------------------------------
             samples  pcnt function                      DSO
             _______ _____ _____________________________ ________________________________________

            17838.00  2.4% GetSnapshotData               /home/digoal/pgsql9.6/bin/postgres    
            17711.00  2.4% _spin_lock                    [kernel.kallsyms]                       
            17372.00  2.4% AllocSetAlloc                 /home/digoal/pgsql9.6/bin/postgres    
            15205.00  2.1% base_yyparse                  /home/digoal/pgsql9.6/bin/postgres    
            12210.00  1.7% SearchCatCache                /home/digoal/pgsql9.6/bin/postgres    
            11821.00  1.6% hash_search_with_hash_value   /home/digoal/pgsql9.6/bin/postgres    
            10729.00  1.5% __mutex_lock_slowpath         [kernel.kallsyms]                       
            10694.00  1.5% mutex_spin_on_owner           [kernel.kallsyms]                       
            10168.00  1.4% _int_malloc                   /lib64/libc-2.12.so                     
             8217.00  1.1% schedule                      [kernel.kallsyms]                       
             6977.00  1.0% __strlen_sse42                /lib64/libc-2.12.so                     
             6918.00  0.9% __strcmp_sse42                /lib64/libc-2.12.so                     
             6711.00  0.9% _spin_lock_irqsave            [kernel.kallsyms]                       
             6629.00  0.9% reschedule_interrupt          [kernel.kallsyms]                       
             6599.00  0.9% find_busiest_group            [kernel.kallsyms]                       
             5984.00  0.8% core_yylex                    /home/digoal/pgsql9.6/bin/postgres                                   

直连测试

$ vi test.sql
\set id random(1,50000000)
select * from userinfo_0 where uid=:id;

$ pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 1200 -U role0 db0
progress: 3.0 s, 1065981.5 tps, lat 0.059 ms stddev 0.008
progress: 4.0 s, 1084914.1 tps, lat 0.058 ms stddev 0.006
progress: 5.0 s, 1093083.4 tps, lat 0.057 ms stddev 0.006

小结

1. 基于fdw的shard,从功能上来讲还有哪些值得改进的点:

例如, 支持聚合下推(10.0已经支持)

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f5d6bce63ceb3c59a964814bb0df5a0648e750e5

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7012b132d07c2b4ea15b0b3cb1ea9f3278801d98

append的并行化,目前如果要扫描多个分区,是串行的,并不是并行扫描,非常影响效率。

2. 从性能方面来讲,还有哪些值得改进的点:

从profile的结果来看,目前的锁较重,有非常大的性能提升空间,需要花点时间看看代码。

当然如果当前的性能能满足你,或者你通过构建多个对等的master加上负载均衡,也能解决这个问题。

现在与直连单节点的性能差距还是较大的。

Flag Counter

digoal’s 大量PostgreSQL文章入口