performance tuning about multi-rows query aggregated to single-row query
背景
聚合操作是统计分析的常见需求,聚合实际上就是把多行变成单行。
比如count, sum, avg, min, max, 线性相关性等等
接下来的几篇文章介绍一下如何编写聚合函数。
本文介绍的是多行转单行后的优化手段和性能提升点。
正文
线上有一个这样的表 :
digoal=# \d recommendation_mpt
Column | Type | Modifiers
-------------+---------+-----------
user_id | text |
app_id | numeric |
rating | numeric |
记录了用户ID以及每个APP_ID对应的推荐比率.
这部分数据是由数据仓库每天生成并插入到生产环境的, 所以生产中以读为主, 没有写操作. 本文也是针对读进行的优化.
程序在向用户推荐APP_ID时,根据rating进行排序, rating越大, 推荐越排在前面。
SQL如下 :
数据库排序 :
select user_id,app_id from recommendation_mpt where user_id=$1 order by rating desc;
或者程序取到数据后进行排序 :
select user_id,app_id from recommendation_mpt where user_id=$1;
假如每个用户保留50条app_id的信息, 1亿用户的话需要50亿条记录, 来存储这部分信息.
不管是数据库排还是程序排序. 以上SQL都要取出50条记录, 走user_id索引的话也可能需要大量的离散IO。
比较合理的优化手段是每个用户记录1条信息, 用数组或text来存储app_id对应的rating信息.
如 :
digoal=# \d recommendation_mpt_new
Column | Type | Modifiers
-------------+---------+-----------
user_id | text |
recomm | text |
这里涉及到聚合, 最好是排序后再聚合, 那么可以省去程序取出数据后再排序的过程.
例如 :
select user_id, array_agg( app_id || '_' || rating order by rating desc ) from recommendation_mpt group by user_id;
如果数组已经排好了顺序的话, 并且程序不需要用到rating时, 那么这个表只需要user_id和app_id字段.
select user_id, array_agg( app_id order by rating desc ) from recommendation_mpt group by user_id;
注 :
聚合函数支持order by是从9.0.1开始的 :
Release 9.0.1
Release Date: 2010-10-04
* Allow aggregate functions to use ORDER BY (Andrew Gierth)
For example, this is now supported: array_agg(a ORDER BY b). This
is useful with aggregates for which the order of input values is
significant, and eliminates the need to use a nonstandard subquery
to determine the ordering.
性能测试 :
场景一、
100W用户, 每个用户41条记录(41和1000000不能整除, 比较好生成测试数据.) .
create table recommendation_mpt (user_id int8, app_id numeric, rating numeric);
insert into recommendation_mpt select generate_series(1,1000000), generate_series(1,41), random();
INSERT 0 41000000
create index idx_recommendation_mpt_1 on recommendation_mpt (user_id);
digoal=> explain select app_id from recommendation_mpt where user_id = 1 order by rating desc;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Sort (cost=55.27..55.37 rows=41 width=16)
Sort Key: rating
-> Index Scan using idx_recommendation_mpt_1 on recommendation_mpt (cost=0.00..54.17 rows=41 width=16)
Index Cond: (user_id = 1)
(4 rows)
测试脚本:
vi digoal.sql
\setrandom user_id 1 1000000
select app_id from recommendation_mpt where user_id = :user_id order by rating desc;
测试结果 :
ocz@db-172-16-3-150-> pgbench -M prepared -n -r -c 16 -j 4 -f ./digoal.sql -T 60 -h $PGDATA -U digoal digoal
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 60 s
number of transactions actually processed: 507317
tps = 8447.554947 (including connections establishing)
tps = 8450.407448 (excluding connections establishing)
statement latencies in milliseconds:
0.002683 \setrandom user_id 1 1000000
1.887224 select app_id from recommendation_mpt where user_id = :user_id order by rating desc;
场景一的优化 : 建立索引避免rating排序的开销.
create index idx_recommendation_mpt_2 on recommendation_mpt (user_id, rating, app_id);
digoal=> set enable_sort=off;
SET
digoal=> explain select app_id from recommendation_mpt where user_id = 1 order by rating desc;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Only Scan Backward using idx_recommendation_mpt_2 on recommendation_mpt (cost=0.00..64.33 rows=42 width=16)
Index Cond: (user_id = 1)
(2 rows)
vi $PGDATA/postgresql.conf
enable_sort = off
ocz@db-172-16-3-150-> pg_ctl reload
server signaled
在此测试后的结果 :
ocz@db-172-16-3-150-> pgbench -M prepared -n -r -c 16 -j 4 -f ./digoal.sql -T 60 -h $PGDATA -U digoal digoal
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 60 s
number of transactions actually processed: 515758
tps = 8587.821833 (including connections establishing)
tps = 8590.411963 (excluding connections establishing)
statement latencies in milliseconds:
0.002724 \setrandom user_id 1 1000000
1.856069 select app_id from recommendation_mpt where user_id = :user_id order by rating desc;
场景二、
100W用户, 每个用户1条聚合并排序好的记录.
create table recommendation_mpt_new ( user_id int8, app_id numeric[] );
insert into recommendation_mpt_new select user_id, array_agg(app_id order by rating desc) from recommendation_mpt group by user_id;
INSERT 0 1000000
digoal=> select * from recommendation_mpt_new limit 5;
user_id | app_id
---------+---------------------------------------------------------------------------------------------------------------------
1 | {16,19,9,8,35,20,31,40,28,7,5,15,36,10,13,11,34,24,32,38,26,39,6,18,30,29,3,12,17,37,25,22,27,1,2,41,4,21,33,23,14}
2 | {24,33,17,3,20,29,41,38,27,39,4,19,2,28,9,15,11,1,32,10,13,30,31,37,18,21,5,7,12,14,16,6,40,22,36,23,25,8,35,26,34}
3 | {18,22,14,20,26,35,34,8,1,16,11,41,15,30,27,21,19,7,3,5,9,13,17,4,40,10,23,32,6,24,28,31,12,36,38,29,25,2,37,39,33}
4 | {9,27,37,2,41,38,1,31,30,3,32,28,26,7,40,10,4,34,6,21,12,14,24,5,20,13,33,23,35,11,16,18,22,17,19,8,39,29,36,15,25}
5 | {35,9,39,27,11,23,18,24,13,5,10,36,22,1,34,16,7,28,20,38,4,26,21,8,17,2,32,15,41,33,31,29,30,40,14,25,6,37,19,12,3}
(5 rows)
digoal=> select * from recommendation_mpt where user_id=1 order by rating desc;
user_id | app_id | rating
---------+--------+--------------------
1 | 16 | 0.996264576911926
1 | 19 | 0.988723468966782
1 | 9 | 0.988221516367048
1 | 8 | 0.962889738380909
1 | 35 | 0.957569351885468
1 | 20 | 0.94144273782149
1 | 31 | 0.929523797240108
1 | 40 | 0.889733758755028
1 | 28 | 0.851767126005143
1 | 7 | 0.848528668750077
1 | 5 | 0.840631154365838
1 | 15 | 0.822692712768912
1 | 36 | 0.819620195310563
1 | 10 | 0.706364229787141
1 | 13 | 0.698610578197986
1 | 11 | 0.695095229428262
1 | 34 | 0.673047889955342
1 | 24 | 0.660528331529349
1 | 32 | 0.647978096734732
1 | 38 | 0.636633691377938
1 | 26 | 0.592413422185928
1 | 39 | 0.501070868223906
1 | 6 | 0.491314855404198
1 | 18 | 0.471619851421565
1 | 30 | 0.466300643049181
1 | 29 | 0.457061214838177
1 | 3 | 0.325578296091408
1 | 12 | 0.310081131756306
1 | 17 | 0.305294726509601
1 | 37 | 0.26613838179037
1 | 25 | 0.251236057840288
1 | 22 | 0.237996113952249
1 | 27 | 0.22669791476801
1 | 1 | 0.217828437685966
1 | 2 | 0.200171065982431
1 | 41 | 0.197947917506099
1 | 4 | 0.189090229570866
1 | 21 | 0.182843111455441
1 | 33 | 0.131565005518496
1 | 23 | 0.106283554807305
1 | 14 | 0.0274284891784191
(41 rows)
create index idx_recommendation_mpt_new_1 on recommendation_mpt_new (user_id);
digoal=> explain select app_id from recommendation_mpt_new where user_id = 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Index Scan using idx_recommendation_mpt_new_1 on recommendation_mpt_new (cost=0.00..2.54 rows=1 width=352)
Index Cond: (user_id = 1)
(2 rows)
测试脚本:
\setrandom user_id 1 1000000
select app_id from recommendation_mpt_new where user_id = :user_id;
测试结果 :
ocz@db-172-16-3-150-> pgbench -M prepared -n -r -c 16 -j 4 -f ./digoal.sql -T 60 -h $PGDATA -U digoal digoal
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 16
number of threads: 4
duration: 60 s
number of transactions actually processed: 4700467
tps = 78296.696341 (including connections establishing)
tps = 78327.675964 (excluding connections establishing)
statement latencies in milliseconds:
0.001730 \setrandom user_id 1 1000000
0.200860 select app_id from recommendation_mpt_new where user_id = :user_id;
小结
1. 将多行聚合成1行后, 查询性能从8590 qps提升到78327 qps .
2. 另一篇介绍PostgreSQL aggregate的文章 :
http://blog.163.com/digoal@126/blog/static/16387704020121118112533410/