推荐系统分析 - 推荐算法, RecDB推荐数据库介绍
背景
中华文化源远流长,从古至今有很多有趣的学问。比如看风水、看相,在西方文化中有类似的比如星座学说。
这些和推荐系统有什么关系呢?
个人感觉星座学说这些学问有一定的理论基础,更多的则是也是经验的总结。
推荐系统实际上和星座学说类似,有一定的算法基础,算法则可能是来自经验的总结。
在现实场景中,可以找到很多类似的例子,这些例子是一些数据,根据推荐算法,可以找到用户可能感兴趣的东西。
1. 比如豆瓣有几千万用户,用户在豆瓣上可以对影片打分和点评,就形成了一批这样的数据:
用户ID,影片ID,打分
1, '终结者', 90
902, '笑傲江湖', 78
......
我暂且称之为打分表, 或者rating数据
2. 电商应用,也有类似的rating数据,比如这样的数据结构:
用户ID,购买的商品ID,购买数量
或者
用户ID,浏览的店铺ID,浏览的次数
或者
用户ID,浏览的商品ID,浏览次数
3. 其他应用, 使用应用程序的feed,可以找到用户与用户,用户与物体的亲密度(rating)。
有了用户与用户,用户与物体的亲密度(rating)数据之后,就可以根据推荐算法,给用户推荐其他用户或物体。
推荐算法相关的文章可以参考
https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html
接下来容重介绍一款继承了多种推荐算法的数据库RecDB。
RecDB(推荐数据库)
RecDB是基于PostgreSQL打造的一款专业的推荐数据库,目前继承了以下推荐算法。
Currently, the available recommendation algorithms that could be passed to the USING clause are the following:
-
ItemCosCF
Item-Item Collaborative Filtering using Cosine Similarity measure. -
ItemPearCF
Item-Item Collaborative Filtering using Pearson Correlation Similarity measure. -
UserCosCF
User-User Collaborative Filtering using Cosine Similarity measure. -
UserPearCF
User-User Collaborative Filtering using Cosine Similarity measure. -
SVD
Simon Funk Singular Value Decomposition.
Applications powered by RecDB can produce online and flexible personalized recommendations to end-users.
RecDB的用法非常简单,就如同使用PostgreSQL一样,对用户非常友好,同时扩展了一些推荐相关的SQL语法。
如果你还不会用PostgreSQL,可以来看看这个
《2011年功力的德哥教你2天撸通PostgreSQL - 入门、开发、原理、管理、调优》
RecDB安装
1. 编译安装recdb
git clone https://github.com/Sarwat/recdb-postgresql.git
cd recdb-postgresql/PostgreSQL
./configure --prefix=/home/digoal/recdb
make world -j 32
make install-world
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=9999
export PGDATA=/home/digoal/pgdata/pg_root9999
export LANG=en_US.utf8
export PGHOME=/home/digoal/recdb
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export LD_RUN_PATH=$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=127.0.0.1
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
2. 初始化数据库集群
initdb -D $PGDATA -E UTF8 --locale=C -U postgres
3. 配置
cd $PGDATA
vi postgresql.conf
listen_addresses = '0.0.0.0'
port = 9999
max_connections = 100
unix_socket_directory = '.'
shared_buffers = 32GB
maintenance_work_mem = 1GB
vacuum_cost_delay = 0ms
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 5.0
wal_level = minimal
synchronous_commit = off
wal_buffers = 128MB
wal_writer_delay = 10ms
checkpoint_segments = 2048
checkpoint_timeout = 35min
checkpoint_completion_target = 0.1
random_page_cost = 1.2
effective_cache_size = 128GB
log_destination = 'csvlog'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_truncate_on_rotation = on
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = verbose
log_timezone = 'PRC'
log_autovacuum_min_duration = 0
autovacuum_max_workers = 8
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.1
autovacuum_freeze_max_age = 1500000000
autovacuum_vacuum_cost_delay = 0
vacuum_freeze_table_age = 1300000000
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
vi pg_hba.conf
host all all 0.0.0.0/0 md5
4. 启动数据库集群
pg_ctl start
5. 测试数据库是否可用
psql
psql (9.2.0)
Type "help" for help.
postgres=# \dt
No relations found.
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+---------+-------+-----------------------
postgres | postgres | UTF8 | C | C |
template0 | postgres | UTF8 | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
(3 rows)
RecDB 使用例子
1. 创建测试表,三个字段,分别为用户ID,电影ID,评分。
postgres=# create table douban_rating(uid int, movie_id int, access_rating real);
CREATE TABLE
2. 插入100万评分数据,1000万用户ID,点评1000部电影
postgres=# insert into douban_rating select random()*10000000, random()*1000, random() from generate_series(1,1000000);
INSERT 0 1000000
3. 创建RECOMMENDER,需要指出推荐算法,字段。
推荐算法如下
-
ItemCosCF
Item-Item Collaborative Filtering using Cosine Similarity measure. -
ItemPearCF
Item-Item Collaborative Filtering using Pearson Correlation Similarity measure. -
UserCosCF
User-User Collaborative Filtering using Cosine Similarity measure. -
UserPearCF
User-User Collaborative Filtering using Cosine Similarity measure. -
SVD
Simon Funk Singular Value Decomposition.
postgres=# set maintenance_work_mem ='32GB';
SET
postgres=# set work_mem ='32GB';
SET
postgres=# \timing
Timing is on.
postgres=# CREATE RECOMMENDER MovieRec ON douban_rating
USERS FROM uid
ITEMS FROM movie_id
EVENTS FROM access_rating
USING ItemCosCF;
NOTICE: CREATE TABLE will create implicit sequence "movierecindex_systemid_seq" for serial column "movierecindex.systemid"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "movierecindex_pkey" for table "movierecindex"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "movierecview1491804367109912_pkey" for table "movierecview1491804367109912"
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "movierecmodel1491804367109912_pkey" for table "movierecmodel1491804367109912"
CREATE RECOMMENDER
postgres=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+-------------------------------+-------+----------+------------+-------------
public | douban_rating | table | postgres | 42 MB |
public | movierecindex | table | postgres | 16 kB |
public | movierecmodel1491804035307409 | table | postgres | 2136 kB |
public | movierecview1491804035307409 | table | postgres | 8192 bytes |
public | recdbproperties | table | postgres | 8192 bytes |
public | recmodelscatalogue | table | postgres | 16 kB |
(6 rows)
查看数据
postgres=# select * from douban_rating limit 10;
uid | movie_id | access_rating
---------+----------+---------------
359604 | 4798 | 0.796408
7749197 | 1764 | 0.194022
9288272 | 947 | 0.696304
6032232 | 1738 | 0.745247
6451861 | 6278 | 0.416638
3290076 | 510 | 0.571497
334635 | 4904 | 0.552451
2313039 | 3906 | 0.963749
2535368 | 6377 | 0.297736
2939719 | 7603 | 0.624071
(10 rows)
查看某个用户的点评数据
postgres=# select * from douban_rating where uid=359604;
uid | movie_id | access_rating
--------+----------+---------------
359604 | 4798 | 0.796408
(1 row)
查看具有共性的用户,点评了哪些电影
postgres=# select distinct movie_id from douban_rating where uid in (select uid from douban_rating where movie_id=4798);
movie_id
----------
2667
4798
1686
9008
3994
1156
1679
4103
1090
(9 rows)
4. 使用推荐算法,给uid=359604的用户,推荐其他电影,返回的是具有共性的电影,分值从高到低排序。
postgres=# SELECT * FROM douban_rating R
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF
WHERE R.uid = 359604
ORDER BY R.access_rating desc
LIMIT 10;
uid | movie_id | access_rating
--------+----------+---------------
359604 | 2667 | 0.796408
359604 | 9008 | 0.796408
359604 | 1090 | 0.796408
359604 | 1679 | 0.796408
359604 | 1686 | 0.796408
359604 | 4103 | 0.796408
359604 | 1156 | 0.796408
359604 | 3994 | 0.796408
359604 | 6 | 0
359604 | 5 | 0
(10 rows)
执行计划如下
postgres=# explain (analyze,verbose,timing,costs,buffers) SELECT * FROM douban_rating R
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF
WHERE R.uid = 359604
ORDER BY R.access_rating desc
LIMIT 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=17906.01..17906.01 rows=1 width=12) (actual time=3337.091..3337.093 rows=10 loops=1)
Output: uid, movie_id, access_rating
Buffers: shared hit=61840, temp read=9993 written=9993
-> Sort (cost=17906.01..17906.01 rows=1 width=12) (actual time=3337.090..3337.092 rows=10 loops=1)
Output: uid, movie_id, access_rating
Sort Key: r.access_rating
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=61840, temp read=9993 written=9993
-> Result (cost=0.00..17906.00 rows=1 width=12) (actual time=2948.739..3332.023 rows=10001 loops=1)
Output: uid, movie_id, access_rating
Buffers: shared hit=61840, temp read=9993 written=9993
-> Recommend on public.douban_rating r (cost=0.00..17906.00 rows=1 width=12) (actual time=2948.738..3329.177 rows=10001 loops=1)
Output: uid, movie_id, access_rating
Filter: (r.uid = 359604)
Buffers: shared hit=61840, temp read=9993 written=9993
Total runtime: 3337.116 ms
(16 rows)
5. 由于共性用户较少,重新造一部分数据,让共性更多,1万用户点评1000部电影。
postgres=# drop RECOMMENDER MovieRec;
DROP RECOMMENDER
postgres=# truncate douban_rating ;
TRUNCATE TABLE
postgres=# insert into douban_rating select random()*10000, random()*1000, random() from generate_series(1,1000000);
INSERT 0 1000000
postgres=# CREATE RECOMMENDER MovieRec ON douban_rating
USERS FROM uid
ITEMS FROM movie_id
EVENTS FROM access_rating
USING ItemCosCF;
查询原始数据如下
postgres=# select * from douban_rating limit 10;
uid | movie_id | access_rating
------+----------+---------------
128 | 848 | 0.796747
4953 | 401 | 0.832318
2766 | 874 | 0.61931
5572 | 430 | 0.550044
6 | 709 | 0.798314
1896 | 237 | 0.559974
4917 | 614 | 0.517259
6697 | 886 | 0.804338
2232 | 534 | 0.873135
4574 | 557 | 0.38828
(10 rows)
postgres=# select * from douban_rating where uid=128;
uid | movie_id | access_rating
-----+----------+---------------
128 | 848 | 0.796747
128 | 755 | 0.139934
128 | 79 | 0.633511
128 | 979 | 0.145586
128 | 120 | 0.153884
128 | 839 | 0.0865545
......
7. 给用户128推荐电影,现在的结果有了一定的可参考性。
postgres=# SELECT * FROM MovieRec R
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF
WHERE R.uid = 128
ORDER BY R.access_rating desc
LIMIT 10;
uid | movie_id | access_rating
-----+----------+---------------
128 | 422 | 0.514567
128 | 127 | 0.514059
128 | 495 | 0.513637
128 | 974 | 0.513447
128 | 487 | 0.512524
128 | 64 | 0.512519
128 | 868 | 0.512367
128 | 132 | 0.512323
128 | 0 | 0.51225
128 | 54 | 0.512082
(10 rows)
8. recDB支持复杂的查询,例如JOIN.
SELECT * FROM ml_ratings R, Movies M
RECOMMEND R.itemid TO R.userid ON R.ratingval USING ItemCosCF
WHERE R.userid = 1 AND M.movieid = R.itemid AND M.genre LIKE '%Comedy%'
ORDER BY R.ratingval
LIMIT 10
更多例子
https://github.com/DataSystemsLab/recdb-postgresql/tree/master/examples
小结
1. 未来RecDB可能会整合到PG内核,也可能作为一款PG的插件(可能性较大),Pipelinedb(流式计算数据库)也在做插件化的改造。
目前基于PG改造或扩展的产品非常的多,包括
1. 流计算数据库产品 pipelineDB
2. 推荐数据库产品 recDB
3. 时序数据库 timescaleDB
4. 分布式数据库插件 citus
5. 列存储插件 IMCS, cstore等
6. 面向OLAP的codegen数据库 pg_LLVM
7. 向量计算插件 vops
。。。
不同的场景,可以找到适合对应场景的插件。
2. recDB实现的推荐查询,与图数据也有一定的类似性,PostgreSQL在图式搜索方面的应用可以参考如下
《金融风控、公安刑侦、社会关系、人脉分析等需求分析与数据库实现 - PostgreSQL图数据库场景应用》
参考
https://github.com/DataSystemsLab/recdb-postgresql
https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html