PostgreSQL 内容随机推荐系统开发实践 - 文章随机推荐

3 minute read

背景

内容推荐是蛮普遍的需求，例如论坛、电商、新闻客户端等。

比较简单的需求：编辑精选一些内容ID，生成推荐列表。（例如每天生成一个这样的推荐列表。）然后随机的推荐给用户（同时过滤已读的内容）。

更高级的推荐需求：应该是根据不同口味产生的，例如对会员本身进行画像，归类。服务端针对不同口味生成不同的推荐列表。定向推荐。

本文介绍第一种需求的实践，使用PostgreSQL，中等规格的PG实例(28核)，可以轻松达到每秒50万篇内容吞吐的推荐。

DEMO

以论坛为例，有文章，有编辑精选的文章列表，有会员，有会员的阅读记录，用户打开页面时，根据精选列表随机推荐20篇（同时过滤已读内容）。

假设精选列表有2000篇文档，有1000万会员。

1、文章表

create table tbl_art (  
  artid int8,  -- 文章ID  
  content text,  -- 文章内容  
  crt_time timestamp  -- 文章创建时间  
  -- ...  -- 其他，标题，作者，。。。。  
);  

会员表（略）。

2、推荐文章ID列表

create table tbl_art_list (  
  list_time timestamp primary key,  -- 列表生成时间  
  artid int8[] not null,   -- 包含哪些文章（ID），使用数组存储  
  min_crt_time timestamp not null,  -- 这些文章中，时间最老的文章时间。取自tbl_art.crt_time 。 用于清理用户阅读日志。  
  arrlen int not null   -- artid 的长度（包含几个元素，即几篇精选文章）  
  -- classid int  如果会员有归类（标签），可以按归类创建精选列表）
);  

3、写入精选列表，每次推荐时，获取最后一个列表进行推荐。

如果有定向需求（根据会员标签进行推荐，改一下表tbl_art_list结构，加个CLASS字段，同时会员表，增加CLASS字段。会员打开页面时，通过CLASS匹配tbl_art_list里的最后一个列表）

如下，生成2000篇精选文章ID。一条记录。

insert into tbl_art_list values (  
  now(),   
  array(select (random()*1000000)::int8 from generate_series(1,2000)),   
  now(),   
  2000  
) ;  

4、已阅读记录

create table tbl_read_rec (  
  uid int8,  -- 会员ID  
  crt_time timestamp,  -- 阅读时间  
  artid int8,  -- 文章ID  
  primary key(uid,artid)  -- 主键（一篇文档，一个会员被阅读后，仅记录一次）  
);  
  
create index idx_crt_time_1 on tbl_read_rec (crt_time);  

5、随机获取推荐文章ID

用户打开推荐页面时，输入用户ID，GET多少篇精选文档（从精选列表中，随机GET）。

返回一个数组，即GET到的来自精选文章列表，并且过滤掉已读过的，随机文章ID。

create or replace function get_artid(  
  i_uid int8,   -- 用户ID  
  rows int  -- 随机获取多少篇ID  
) returns int8[] as $$  
declare   
  v_artid int8[];  -- 精选ID列表  
  len int;  -- 精选ID列表文章个数  
  res int8[] := (array[])::int8[];   -- 结果  
  tmp int8;  -- 中间变量，从精选ID列表中得到的随机文章ID  
  loopi int := 0;  -- 循环变量，已获取到多少篇符合条件的ID  
  loopx int := 0;  -- 循环变量，已循环多少次（上限，取决于精选ID列表文章个数，例如1.5倍len）  
begin   
  select artid,arrlen into v_artid,len   
    from tbl_art_list order by list_time desc limit 1;  -- 从编辑精选列表，获取最后一条。  
  loop   
    if loopi >= rows or loopx >= 1.5*len then    -- 是否已遍历所有精选文章ID （随机遍历）  
      return res;    
    end if;  
    tmp := v_artid[floor(random()*len)+1];   -- 从精选文章IDs 获取随机ID  
    perform 1 from tbl_read_rec where uid=i_uid and artid=tmp;   -- 判断是否已读  
    if not found then  
      res := array_append(res, tmp);  -- 未读，APPEND到返回结果  
      loopi := loopi +1 ;  -- 递增  
    end if;  
    loopx := loopx +1 ;  -- 递增  
  end loop;   
  return res;  
end;  
$$ language plpgsql strict;   

6、清理阅读记录

使用 limit，每次清理若干条，

使用skip locked，支持并行DELETE。

delete from tbl_read_rec where ctid = any (array(  
  select ctid from tbl_read_rec where crt_time < (select min_crt_time from tbl_art_list order by list_time desc limit 1) limit 10000 for update skip locked  
));  

7、测试

GET精选文章ID，（满足随机，过滤已读）。

性能OK。

postgres=# select get_artid(1,20);  
                                                                  get_artid                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------  
 {919755,3386,100126,761631,447551,511825,168645,211819,862572,330666,942247,600470,843042,511825,295568,829303,382312,452915,499113,164219}  
(1 row)  
  
Time: 0.377 ms  
postgres=# select get_artid(1,20);  
                                                                 get_artid                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------  
 {257929,796892,343984,363615,418506,326628,91731,958663,127918,794101,49124,410347,852461,922276,366815,926232,134506,153306,123694,67087}  
(1 row)  
  
Time: 0.347 ms  

假设获取到的随机ID，立即阅读，获取到的记录全部写入。（用于压测）

postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.603 ms  
postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.494 ms  
postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.479 ms  
postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.494 ms  

8、压测

vi test.sql  
  
\set uid random(1,10000000)  
insert into tbl_read_rec select :uid, now(), unnest(get_artid(:uid,20)) on conflict do nothing;  

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 28 -j 28 -T 120

transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 28  
number of threads: 28  
duration: 120 s  
number of transactions actually processed: 3074866  
latency average = 1.093 ms  
latency stddev = 0.577 ms  
tps = 25623.620112 (including connections establishing)  
tps = 25625.634577 (excluding connections establishing)  
statement latencies in milliseconds:  
         0.002  \set uid random(1,10000000)  
         1.091  insert into tbl_read_rec select :uid, now(), unnest(get_artid(:uid,20)) on conflict do nothing;  

120秒压测后，已读记录表达到3GB。

postgres=# \dt+  
                          List of relations  
 Schema |     Name     | Type  |  Owner   |    Size    | Description   
--------+--------------+-------+----------+------------+-------------  
 public | tbl_art      | table | postgres | 8192 bytes |   
 public | tbl_art_list | table | postgres | 64 kB      |   
 public | tbl_read_rec | table | postgres | 3047 MB    |   
(3 rows)  

已读记录6120万。

postgres=# select count(*) from tbl_read_rec ;  
  count     
----------  
 61206909  
(1 row)  

如下

postgres=# select uid,count(*) from tbl_read_rec group by 1 limit 10;  
 uid | count   
-----+-------  
|  2000  
|    20  
|    20  
|    20  
|    19  
|    20  
|    40  
|    19  
|    20  
|    19  
(10 rows)  

对于已经全部阅读的，则不再推荐，因为精选列表已全部已读。

postgres=# select get_artid(1,20);  
 get_artid   
-----------  
 {}  
(1 row)  

性能

tps : 25623

每秒推荐返回 : 512460 篇ID

推荐部分还有优化空间，例如用户对整个精选列表都已读时(已读越多，GET越慢)，来获取列表会比较慢（因为需要遍历整个列表ID，都拿不到20条有效记录，消耗较大，最后返回NULL）。(实测这种情况下，GET约20毫秒)

这种情况下，建议可以给用户打个标记，表示本次已推荐完所有内容（避开GET，那么性能就会直线上升，几十万TPS没问题），返回其他内容。

digoal’s 大量PostgreSQL文章入口

Twitter Facebook Google+ LinkedIn

Digoal.zhou

PostgreSQL 内容随机推荐系统开发实践 - 文章随机推荐

背景

DEMO

性能

digoal’s 大量PostgreSQL文章入口

You May Also Enjoy

PostgreSQL(PPAS 兼容Oracle) 从零开始入门手册 - 珍藏版

PostgreSQL pipelinedb 流计算插件 - IoT应用 - 实时轨迹聚合

PostgreSQL plpgsql 存储过程、函数 - 状态、异常变量打印、异常捕获… - GET [STACKED] DIAGNOSTICS

PostgreSQL datediff 日期间隔（单位转换）兼容SQL用法