小微贷款、天使投资(风控助手)业务数据库设计(图式搜索图谱分析) - 阿里云RDS PostgreSQL, HybridDB for PostgreSQL最佳实践

4 minute read

背景

贷款是银行的主营业务之一，但是并不是只有银行能提供贷款，实际上资金雄厚的公司都有能力提供贷款（比如保险行业、资源垄断型企业等）。

除了放贷，我们常说的天使投资、A轮B轮啥的，也是类似的场景，凭什么投你，背后如何决策也需要决策系统的支撑。

与贷款相反的是吸金类业务，比如我们现在发现越来越多的理财产品、股市、甚至是游戏充值，只要是让你存钱，把你的钱留住，而且你还不舍得吧钱取出来的业务，都是吸金类业务。

所以，一些本来就资金雄厚，或者是靠吸金类业务起来的公司，只要吸到足够的金之后，就有放贷款的能力。

放贷本身是一件体力活，来申请贷款的人或企业鱼龙混杂。什么样的人，什么样的企业能获得贷款，需要经过层层审查。审查需谨慎，否则就容易成为放贷公司的烂账。

本文涉及的项目和放贷、审查、风控、舆情系统有关。

一、需求分析

展示企业的图谱信息，例如股权结构、舆情事件、管理层、知识产权，经营状况等。方便审查员查看。

数据的来源，例如爬虫得到的信息、来自合作伙伴（例如招聘网站、税务）的信息等，组成了企业相关的信息网络。

类比社交业务

这类数据和互联网的社交类业务的数据非常类似，A企业投资了B企业（类似微博的LIKE），A企业发布了财报（类似发微博），A企业获得了某个专利。。。。

类比图式搜索业务

企业图谱系统与互联网社交类业务截然不同的地方：

企业数据相对来说是比较静态的数据，所以我们更关注的是查询的性能，按一个企业顺藤摸瓜找出相关企业的数据的性能。与之对应的是下面这个案例，图式搜索的应用。

案例详见：

《金融风控、公安刑侦、社会关系、人脉分析等需求分析与数据库实现 - PostgreSQL图数据库场景应用》

数据量预估

企业：百万。

每个企业相关的关系、舆情事件、财报等，平均预计1000条。

平均每个企业的直接相关企业有50家。

个人（法人、股东、监理、。。。）：百万。

每个人有若干家相关企业。

总数据量约10亿级。

二、数据库架构设计

1、相对静态的数据(例如人、企业基本信息、主要成员、股东。。。)

量级：百万。

有些一对多的数据，可以使用数组类型（例如企业主要成员ID、股东ID，存为数组）来存储，从而提升查询效率。

create table corp_info (...);       --  企业相对静态信息  
  
create table people_info (...);     --  个人相对静态信息  

建议将这类静态数据设计为多张表，使用PK关联起来，减少查询少量字段时的数据扫描量（因为PG是行存储格式，查询少量字段时，可以提高命中率减少IO放大）。

2、舆情、流水数据

量级：十亿。

create table corp_feed (corp_id pk, typ int, event);

3、关系数据

量级：千万 ~ 亿级别。

这个量级基本上不需要考虑存储冗余的正向关系，反向关系，（方便sharding(FDW, hybriddb)）。

create table corp_rel (...);

4、线上线下一体化

阿里云线上的数据，通过逻辑订阅、物理流式复制，可以将数据复制到线下数据库中。

其中逻辑订阅支持表级订阅（每张表的订阅速度约3万行/s，指表在独立订阅通道时）。逻辑订阅还可以通过规则或触发器支持订阅表中的部分数据，甚至实现格式转换等工作。逻辑订阅的备库支持可读可写。（逻辑订阅的大事务延迟比物理复制高）

物理流式复制，支持全库复制，物理复制的好处是延迟低，备库支持只读。

用户建立RDS备库的需求的起因

有些企业因为SOX审计的需求，必须要在自己的机房放一个备节点。

有些企业可能期望通过这种方法建立日常开发或测试环境。

有些架构的需求，例如云端存储所有数据，线下存储部分关心的加速数据。或者云端所有数据拆分成了多个库，分析场景需要将数据合并起来进行分析，那么就有多对一的复制需求，使用PostgreSQL逻辑订阅可以支持。

5、sharding

虽然本文涉及的场景和数据量并不需要sharding(根据后面的测试，实际上RDS PostgreSQL单库支持1000亿数据量，在稳定性、性能各方面应该是没有问题的。)，但是考虑到案例的通用性，还是提一下。

可以根据被查询的ID进行切分，如有正反向关系的，可以采用冗余存储的方式达到切片的目的。

PostgreSQL sharding的案例很多：

1、内核层面支持的sharding

《PostgreSQL 9.6 sharding based on FDW & pg_pathman》

2、通过plproxy代理支持的sharding

《PostgreSQL 最佳实践 - 水平分库(基于plproxy)》

3、通过citus插件支持的sharding和mpp

https://github.com/citusdata/citus

4、通过客户端代理支持的sharding

https://github.com/dangdangdotcom/sharding-jdbc

https://github.com/go-pg/sharding/

三、DEMO性能

建表

create table corp_info (  -- 企业信息  
  id int primary key,  -- 企业ID，主键   
  info text,           -- 企业信息  
  core_team int[]      -- 核心团队成员ID  
  -- 忽略其他字段  
);  
  
create table people_info (  -- 人信息  
  id int primary key,  -- 人ID，主键   
  info text,           -- 信息  
  rel_corp int[],      -- 和哪些公司相关  
  rel_type int[]       -- 分别是什么关系(法人、建立、董事长，。。。。)  
  -- 忽略其他字段  
);  
  
create table corp_rel1 (  -- 企业正向关系  
  corp_id1 int,  -- 企业ID  
  corp_id2 int,  -- 企业ID  
  reltypid int   -- 关系类型  
);  
create index idx_corp_rel1 on corp_rel1 (corp_id1);  
  
create table corp_rel2 (  -- 企业反向关系  
  corp_id1 int,  -- 企业ID  
  corp_id2 int,  -- 企业ID  
  reltypid int   -- 关系类型  
);  
create index idx_corp_rel2 on corp_rel2 (corp_id1);  
  
create table corp_event ( -- 企业舆情  
  corp_id int,  -- 企业ID  
  event text,   -- 事件内容  
  crt_time timestamp  -- 时间  
  -- 其他字段略  
);  
create index idx_corp_event_1 on corp_event(corp_id, crt_time desc);  

生成测试数据

生成10.42亿测试数据，数据组成和生成方法如下：

-- 100万 企业数据  
insert into corp_info select generate_series(1,1000000), 'test', array(select (random()*1000000)::int from generate_series(1,20));  
  
-- 100万 人数据  
insert into people_info select generate_series(1,1000000), 'test', array(select (random()*1000000)::int from generate_series(1,20)), array(select (random()*50)::int from generate_series(1,20));  
  
-- 2000万 企业正向关系  
insert into corp_rel1 select random()*1000000, random()*1000000, random()*100 from generate_series(1,20000000);  
  
-- 2000万 企业反向关系  
insert into corp_rel2 select random()*1000000, random()*1000000, random()*100 from generate_series(1,20000000);  
  
-- 10亿 企业舆情  
insert into corp_event select random()*1000000, 'test', now()+(id||' second')::interval from generate_series(1,1000000000) t(id);  

由于数据相对静止，所以我们可以对数据进行cluster话，提高查询效率。（不这么做，实际上也是毫秒级的响应时间，这么做之后可以降低到0.0X 毫秒）

压测

1、企业静态数据查询

vi test.sql  
  
\set id random(1,1000000)  
select * from corp_info where id=:id;  
select * from people_info where id = any (array(select core_team from corp_info where id=:id));  

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 120  
  
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 120 s
number of transactions actually processed: 15175162
latency average = 0.506 ms
latency stddev = 0.080 ms
tps = 126454.038435 (including connections establishing)
tps = 126468.912494 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.001  \set id random(1,1000000)
         0.068  select * from corp_info where id=:id;
         0.444  select * from people_info where id = any (array(select core_team from corp_info where id=:id));

2、企业关系数据查询，查询正向和反向关系。

vi test1.sql  
  
\set id random(1,1000000)  
select * from corp_rel1 where corp_id1=:id;  
select * from corp_rel2 where corp_id1=:id;  

pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 64 -j 64 -T 120  
  
transaction type: ./test1.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 120 s
number of transactions actually processed: 49723004
latency average = 0.154 ms
latency stddev = 0.032 ms
tps = 414351.413094 (including connections establishing)
tps = 414396.709915 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.001  \set id random(1,1000000)
         0.077  select * from corp_rel1 where corp_id1=:id;
         0.077  select * from corp_rel2 where corp_id1=:id;

3、企业最近10条舆情数据查询

vi test2.sql  
  
\set id random(1,1000000)  
select * from corp_event where corp_id=:id order by crt_time desc limit 10;  

pgbench -M prepared -n -r -P 1 -f ./test2.sql -c 64 -j 64 -T 120  
  
transaction type: ./test2.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 120 s
number of transactions actually processed: 94303135
latency average = 0.081 ms
latency stddev = 0.020 ms
tps = 785845.099057 (including connections establishing)
tps = 785941.120081 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.001  \set id random(1,1000000)
         0.081  select * from corp_event where corp_id=:id order by crt_time desc limit 10;

四、达到的效果

1、性能

1.1、企业静态数据查询。

TPS：12.6万

平均响应时间：0.5毫秒

1.2、企业关系数据查询，输出正向和反向关系。

TPS：41.4万

平均响应时间：0.15毫秒

1.3、企业最近10条舆情数据查询。

TPS：78.5万

平均响应时间：0.08毫秒

2、使用阿里云RDS PostgreSQL，用户不需要关心数据库的运维，容灾，备份恢复，扩容，缩容，HA等基本问题，可以更加专注于业务。

3、分析需求，使用PostgreSQL 10提供的多核并行、JIT、算子复用等特性，处理10亿级的数据分析完全不在话下。

《分析加速引擎黑科技 - LLVM、列存、多核并行、算子复用大联姻 - 一起来开启PostgreSQL的百宝箱》

《TPC-H测试 - PostgreSQL 10 vs Deepgreen(Greenplum)》

4、机器学习需求

阿里云RDS PostgreSQL提供了机器学习插件MADlib，支持Classification, Regression, Clustering, Topic Modeling, Association Rule Mining, Descriptive Statistics, Validation等众多挖掘模型。