推荐系统分析 - 推荐算法, RecDB推荐数据库介绍

6 minute read

背景

中华文化源远流长,从古至今有很多有趣的学问。比如看风水、看相,在西方文化中有类似的比如星座学说。

这些和推荐系统有什么关系呢?

个人感觉星座学说这些学问有一定的理论基础,更多的则是也是经验的总结。

推荐系统实际上和星座学说类似,有一定的算法基础,算法则可能是来自经验的总结。

在现实场景中,可以找到很多类似的例子,这些例子是一些数据,根据推荐算法,可以找到用户可能感兴趣的东西。

1. 比如豆瓣有几千万用户,用户在豆瓣上可以对影片打分和点评,就形成了一批这样的数据:

用户ID,影片ID,打分  
  
1, '终结者', 90  
902, '笑傲江湖', 78  
......  
  
我暂且称之为打分表, 或者rating数据  

2. 电商应用,也有类似的rating数据,比如这样的数据结构:

用户ID,购买的商品ID,购买数量  
  
或者  
  
用户ID,浏览的店铺ID,浏览的次数  
  
或者  
  
用户ID,浏览的商品ID,浏览次数  

3. 其他应用, 使用应用程序的feed,可以找到用户与用户,用户与物体的亲密度(rating)。

有了用户与用户,用户与物体的亲密度(rating)数据之后,就可以根据推荐算法,给用户推荐其他用户或物体。

推荐算法相关的文章可以参考

https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html

接下来容重介绍一款继承了多种推荐算法的数据库RecDB。

RecDB(推荐数据库)

RecDB是基于PostgreSQL打造的一款专业的推荐数据库,目前继承了以下推荐算法。

Currently, the available recommendation algorithms that could be passed to the USING clause are the following:

  • ItemCosCF Item-Item Collaborative Filtering using Cosine Similarity measure.

  • ItemPearCF Item-Item Collaborative Filtering using Pearson Correlation Similarity measure.

  • UserCosCF User-User Collaborative Filtering using Cosine Similarity measure.

  • UserPearCF User-User Collaborative Filtering using Cosine Similarity measure.

  • SVD Simon Funk Singular Value Decomposition.

Applications powered by RecDB can produce online and flexible personalized recommendations to end-users.

RecDB的用法非常简单,就如同使用PostgreSQL一样,对用户非常友好,同时扩展了一些推荐相关的SQL语法。

如果你还不会用PostgreSQL,可以来看看这个

《2011年功力的德哥教你2天撸通PostgreSQL - 入门、开发、原理、管理、调优》

RecDB安装

1. 编译安装recdb

git clone https://github.com/Sarwat/recdb-postgresql.git  
  
cd recdb-postgresql/PostgreSQL  
  
./configure --prefix=/home/digoal/recdb  
  
make world -j 32  
  
make install-world  
  
export PS1="$USER@`/bin/hostname -s`-> "  
export PGPORT=9999  
export PGDATA=/home/digoal/pgdata/pg_root9999  
export LANG=en_US.utf8  
export PGHOME=/home/digoal/recdb  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH  
export LD_RUN_PATH=$LD_LIBRARY_PATH  
export DATE=`date +"%Y%m%d%H%M"`  
export PATH=$PGHOME/bin:$PATH:.  
export MANPATH=$PGHOME/share/man:$MANPATH  
export PGHOST=127.0.0.1  
export PGUSER=postgres  
export PGDATABASE=postgres  
alias rm='rm -i'  
alias ll='ls -lh'  
unalias vi  

2. 初始化数据库集群

initdb -D $PGDATA -E UTF8 --locale=C -U postgres  

3. 配置

cd $PGDATA  
  
vi postgresql.conf  
listen_addresses = '0.0.0.0'  
port = 9999   
max_connections = 100  
unix_socket_directory = '.'  
shared_buffers = 32GB  
maintenance_work_mem = 1GB  
vacuum_cost_delay = 0ms  
bgwriter_delay = 10ms  
bgwriter_lru_maxpages = 1000  
bgwriter_lru_multiplier = 5.0  
wal_level = minimal  
synchronous_commit = off  
wal_buffers = 128MB  
wal_writer_delay = 10ms  
checkpoint_segments = 2048    
checkpoint_timeout = 35min  
checkpoint_completion_target = 0.1  
random_page_cost = 1.2  
effective_cache_size = 128GB  
log_destination = 'csvlog'  
logging_collector = on  
log_directory = 'pg_log'  
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'  
log_truncate_on_rotation = on  
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose  
log_timezone = 'PRC'  
log_autovacuum_min_duration = 0  
autovacuum_max_workers = 8  
autovacuum_naptime = 10s  
autovacuum_vacuum_scale_factor = 0.05  
autovacuum_analyze_scale_factor = 0.1  
autovacuum_freeze_max_age = 1500000000  
autovacuum_vacuum_cost_delay = 0  
vacuum_freeze_table_age = 1300000000  
datestyle = 'iso, mdy'  
timezone = 'PRC'  
lc_messages = 'C'  
lc_monetary = 'C'  
lc_numeric = 'C'  
lc_time = 'C'  
default_text_search_config = 'pg_catalog.english'  
  
  
vi pg_hba.conf  
host all all 0.0.0.0/0 md5  

4. 启动数据库集群

pg_ctl start  

5. 测试数据库是否可用

psql  
psql (9.2.0)  
Type "help" for help.  
  
postgres=# \dt  
No relations found.  
postgres=# \l  
                             List of databases  
   Name    |  Owner   | Encoding | Collate | Ctype |   Access privileges     
-----------+----------+----------+---------+-------+-----------------------  
 postgres  | postgres | UTF8     | C       | C     |   
 template0 | postgres | UTF8     | C       | C     | =c/postgres          +  
           |          |          |         |       | postgres=CTc/postgres  
 template1 | postgres | UTF8     | C       | C     | =c/postgres          +  
           |          |          |         |       | postgres=CTc/postgres  
(3 rows)  

RecDB 使用例子

1. 创建测试表,三个字段,分别为用户ID,电影ID,评分。

postgres=# create table douban_rating(uid int, movie_id int, access_rating real);  
CREATE TABLE  

2. 插入100万评分数据,1000万用户ID,点评1000部电影

postgres=# insert into douban_rating select random()*10000000, random()*1000, random() from generate_series(1,1000000);  
INSERT 0 1000000  

3. 创建RECOMMENDER,需要指出推荐算法,字段。

推荐算法如下

  • ItemCosCF Item-Item Collaborative Filtering using Cosine Similarity measure.

  • ItemPearCF Item-Item Collaborative Filtering using Pearson Correlation Similarity measure.

  • UserCosCF User-User Collaborative Filtering using Cosine Similarity measure.

  • UserPearCF User-User Collaborative Filtering using Cosine Similarity measure.

  • SVD Simon Funk Singular Value Decomposition.

postgres=# set maintenance_work_mem ='32GB';  
SET  
postgres=# set work_mem ='32GB';  
SET  
postgres=# \timing  
Timing is on.  
  
postgres=# CREATE RECOMMENDER MovieRec ON douban_rating  
USERS FROM uid  
ITEMS FROM movie_id  
EVENTS FROM access_rating  
USING ItemCosCF;  
  
NOTICE:  CREATE TABLE will create implicit sequence "movierecindex_systemid_seq" for serial column "movierecindex.systemid"  
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "movierecindex_pkey" for table "movierecindex"  
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "movierecview1491804367109912_pkey" for table "movierecview1491804367109912"  
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "movierecmodel1491804367109912_pkey" for table "movierecmodel1491804367109912"  
CREATE RECOMMENDER  
postgres=# \dt+  
                                  List of relations  
 Schema |             Name              | Type  |  Owner   |    Size    | Description   
--------+-------------------------------+-------+----------+------------+-------------  
 public | douban_rating                 | table | postgres | 42 MB      |   
 public | movierecindex                 | table | postgres | 16 kB      |   
 public | movierecmodel1491804035307409 | table | postgres | 2136 kB    |   
 public | movierecview1491804035307409  | table | postgres | 8192 bytes |   
 public | recdbproperties               | table | postgres | 8192 bytes |   
 public | recmodelscatalogue            | table | postgres | 16 kB      |   
(6 rows)  

查看数据

postgres=# select * from douban_rating limit 10;  
   uid   | movie_id | access_rating   
---------+----------+---------------  
  359604 |     4798 |      0.796408  
 7749197 |     1764 |      0.194022  
 9288272 |      947 |      0.696304  
 6032232 |     1738 |      0.745247  
 6451861 |     6278 |      0.416638  
 3290076 |      510 |      0.571497  
  334635 |     4904 |      0.552451  
 2313039 |     3906 |      0.963749  
 2535368 |     6377 |      0.297736  
 2939719 |     7603 |      0.624071  
(10 rows)  

查看某个用户的点评数据

postgres=# select * from douban_rating where uid=359604;  
  uid   | movie_id | access_rating   
--------+----------+---------------  
 359604 |     4798 |      0.796408  
(1 row)  

查看具有共性的用户,点评了哪些电影

postgres=# select distinct movie_id from douban_rating where uid in (select uid from douban_rating where movie_id=4798);  
 movie_id   
----------  
     2667  
     4798  
     1686  
     9008  
     3994  
     1156  
     1679  
     4103  
     1090  
(9 rows)  

4. 使用推荐算法,给uid=359604的用户,推荐其他电影,返回的是具有共性的电影,分值从高到低排序。

postgres=# SELECT * FROM douban_rating R  
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF  
WHERE R.uid = 359604  
ORDER BY R.access_rating desc  
LIMIT 10;  
  uid   | movie_id | access_rating   
--------+----------+---------------  
 359604 |     2667 |      0.796408  
 359604 |     9008 |      0.796408  
 359604 |     1090 |      0.796408  
 359604 |     1679 |      0.796408  
 359604 |     1686 |      0.796408  
 359604 |     4103 |      0.796408  
 359604 |     1156 |      0.796408  
 359604 |     3994 |      0.796408  
 359604 |        6 |             0  
 359604 |        5 |             0  
(10 rows)  

执行计划如下

postgres=# explain (analyze,verbose,timing,costs,buffers) SELECT * FROM douban_rating R  
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF  
WHERE R.uid = 359604  
ORDER BY R.access_rating desc  
LIMIT 10;  
                                                                    QUERY PLAN                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=17906.01..17906.01 rows=1 width=12) (actual time=3337.091..3337.093 rows=10 loops=1)  
   Output: uid, movie_id, access_rating  
   Buffers: shared hit=61840, temp read=9993 written=9993  
   ->  Sort  (cost=17906.01..17906.01 rows=1 width=12) (actual time=3337.090..3337.092 rows=10 loops=1)  
         Output: uid, movie_id, access_rating  
         Sort Key: r.access_rating  
         Sort Method: top-N heapsort  Memory: 25kB  
         Buffers: shared hit=61840, temp read=9993 written=9993  
         ->  Result  (cost=0.00..17906.00 rows=1 width=12) (actual time=2948.739..3332.023 rows=10001 loops=1)  
               Output: uid, movie_id, access_rating  
               Buffers: shared hit=61840, temp read=9993 written=9993  
               ->  Recommend on public.douban_rating r  (cost=0.00..17906.00 rows=1 width=12) (actual time=2948.738..3329.177 rows=10001 loops=1)  
                     Output: uid, movie_id, access_rating  
                     Filter: (r.uid = 359604)  
                     Buffers: shared hit=61840, temp read=9993 written=9993  
 Total runtime: 3337.116 ms  
(16 rows)  

5. 由于共性用户较少,重新造一部分数据,让共性更多,1万用户点评1000部电影。

postgres=# drop RECOMMENDER MovieRec;  
DROP RECOMMENDER  
postgres=# truncate douban_rating ;  
TRUNCATE TABLE  
postgres=# insert into douban_rating select random()*10000, random()*1000, random() from generate_series(1,1000000);  
INSERT 0 1000000  
postgres=# CREATE RECOMMENDER MovieRec ON douban_rating  
USERS FROM uid  
ITEMS FROM movie_id  
EVENTS FROM access_rating  
USING ItemCosCF;  

查询原始数据如下

postgres=# select * from douban_rating limit 10;  
 uid  | movie_id | access_rating   
------+----------+---------------  
  128 |      848 |      0.796747  
 4953 |      401 |      0.832318  
 2766 |      874 |       0.61931  
 5572 |      430 |      0.550044  
    6 |      709 |      0.798314  
 1896 |      237 |      0.559974  
 4917 |      614 |      0.517259  
 6697 |      886 |      0.804338  
 2232 |      534 |      0.873135  
 4574 |      557 |       0.38828  
(10 rows)  
  
  
postgres=# select * from douban_rating where uid=128;  
 uid | movie_id | access_rating   
-----+----------+---------------  
 128 |      848 |      0.796747  
 128 |      755 |      0.139934  
 128 |       79 |      0.633511  
 128 |      979 |      0.145586  
 128 |      120 |      0.153884  
 128 |      839 |     0.0865545  
 ......  

7. 给用户128推荐电影,现在的结果有了一定的可参考性。

postgres=# SELECT * FROM MovieRec R                                                 
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF  
WHERE R.uid = 128  
ORDER BY R.access_rating desc  
LIMIT 10;  
 uid | movie_id | access_rating   
-----+----------+---------------  
 128 |      422 |      0.514567  
 128 |      127 |      0.514059  
 128 |      495 |      0.513637  
 128 |      974 |      0.513447  
 128 |      487 |      0.512524  
 128 |       64 |      0.512519  
 128 |      868 |      0.512367  
 128 |      132 |      0.512323  
 128 |        0 |       0.51225  
 128 |       54 |      0.512082  
(10 rows)  

8. recDB支持复杂的查询,例如JOIN.

SELECT * FROM ml_ratings R, Movies M  
RECOMMEND R.itemid TO R.userid ON R.ratingval USING ItemCosCF  
WHERE R.userid = 1 AND M.movieid = R.itemid AND M.genre LIKE '%Comedy%'  
ORDER BY R.ratingval  
LIMIT 10  

更多例子

https://github.com/DataSystemsLab/recdb-postgresql/tree/master/examples

小结

1. 未来RecDB可能会整合到PG内核,也可能作为一款PG的插件(可能性较大),Pipelinedb(流式计算数据库)也在做插件化的改造。

目前基于PG改造或扩展的产品非常的多,包括

1. 流计算数据库产品 pipelineDB

2. 推荐数据库产品 recDB

3. 时序数据库 timescaleDB

4. 分布式数据库插件 citus

5. 列存储插件 IMCS, cstore等

6. 面向OLAP的codegen数据库 pg_LLVM

7. 向量计算插件 vops

。。。

不同的场景,可以找到适合对应场景的插件。

2. recDB实现的推荐查询,与图数据也有一定的类似性,PostgreSQL在图式搜索方面的应用可以参考如下

《金融风控、公安刑侦、社会关系、人脉分析等需求分析与数据库实现 - PostgreSQL图数据库场景应用》

参考

https://github.com/DataSystemsLab/recdb-postgresql

https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html

<5分钟用PostgreSQL实现推荐系统>

Flag Counter

digoal’s 大量PostgreSQL文章入口