performance tuning about multi-rows query aggregated to single-row query

4 minute read

背景

聚合操作是统计分析的常见需求,聚合实际上就是把多行变成单行。

比如count, sum, avg, min, max, 线性相关性等等

接下来的几篇文章介绍一下如何编写聚合函数。

本文介绍的是多行转单行后的优化手段和性能提升点。

正文

线上有一个这样的表 :

digoal=# \d recommendation_mpt  
   Column    |  Type   | Modifiers   
-------------+---------+-----------  
 user_id     | text    |   
 app_id      | numeric |   
 rating      | numeric |   

记录了用户ID以及每个APP_ID对应的推荐比率.

这部分数据是由数据仓库每天生成并插入到生产环境的, 所以生产中以读为主, 没有写操作. 本文也是针对读进行的优化.

程序在向用户推荐APP_ID时,根据rating进行排序, rating越大, 推荐越排在前面。

SQL如下 :

数据库排序 :

select user_id,app_id from recommendation_mpt where user_id=$1 order by rating desc;  

或者程序取到数据后进行排序 :

select user_id,app_id from recommendation_mpt where user_id=$1;  

假如每个用户保留50条app_id的信息, 1亿用户的话需要50亿条记录, 来存储这部分信息.

不管是数据库排还是程序排序. 以上SQL都要取出50条记录, 走user_id索引的话也可能需要大量的离散IO。

比较合理的优化手段是每个用户记录1条信息, 用数组或text来存储app_id对应的rating信息.

如 :

digoal=# \d recommendation_mpt_new  
   Column    |  Type   | Modifiers   
-------------+---------+-----------  
 user_id     | text    |   
 recomm      | text    |   

这里涉及到聚合, 最好是排序后再聚合, 那么可以省去程序取出数据后再排序的过程.

例如 :

select user_id, array_agg( app_id || '_' || rating order by rating desc ) from recommendation_mpt group by user_id;  

如果数组已经排好了顺序的话, 并且程序不需要用到rating时, 那么这个表只需要user_id和app_id字段.

select user_id, array_agg( app_id order by rating desc ) from recommendation_mpt group by user_id;  

注 :

聚合函数支持order by是从9.0.1开始的 :

Release 9.0.1  
Release Date: 2010-10-04  
     * Allow aggregate functions to use ORDER BY (Andrew Gierth)  
       For example, this is now supported: array_agg(a ORDER BY b). This  
       is useful with aggregates for which the order of input values is  
       significant, and eliminates the need to use a nonstandard subquery  
       to determine the ordering.  

性能测试 :

场景一、

100W用户, 每个用户41条记录(41和1000000不能整除, 比较好生成测试数据.) .

create table recommendation_mpt (user_id int8, app_id numeric, rating numeric);  
insert into recommendation_mpt select generate_series(1,1000000), generate_series(1,41), random();  
INSERT 0 41000000  
create index idx_recommendation_mpt_1 on recommendation_mpt (user_id);  
digoal=> explain select app_id from recommendation_mpt where user_id = 1 order by rating desc;  
                                                 QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------  
 Sort  (cost=55.27..55.37 rows=41 width=16)  
   Sort Key: rating  
   ->  Index Scan using idx_recommendation_mpt_1 on recommendation_mpt  (cost=0.00..54.17 rows=41 width=16)  
         Index Cond: (user_id = 1)  
(4 rows)  

测试脚本:

vi digoal.sql  
\setrandom user_id 1 1000000  
select app_id from recommendation_mpt where user_id = :user_id order by rating desc;  

测试结果 :

ocz@db-172-16-3-150-> pgbench -M prepared -n -r -c 16 -j 4 -f ./digoal.sql -T 60 -h $PGDATA -U digoal digoal  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 60 s  
number of transactions actually processed: 507317  
tps = 8447.554947 (including connections establishing)  
tps = 8450.407448 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.002683        \setrandom user_id 1 1000000  
        1.887224        select app_id from recommendation_mpt where user_id = :user_id order by rating desc;  

场景一的优化 : 建立索引避免rating排序的开销.

create index idx_recommendation_mpt_2 on  recommendation_mpt (user_id, rating, app_id);  
digoal=> set enable_sort=off;  
SET  
digoal=> explain select app_id from recommendation_mpt where user_id = 1 order by rating desc;  
                                                     QUERY PLAN                                                       
--------------------------------------------------------------------------------------------------------------------  
 Index Only Scan Backward using idx_recommendation_mpt_2 on recommendation_mpt  (cost=0.00..64.33 rows=42 width=16)  
   Index Cond: (user_id = 1)  
(2 rows)  
  
vi $PGDATA/postgresql.conf  
enable_sort = off  
  
ocz@db-172-16-3-150-> pg_ctl reload  
server signaled  

在此测试后的结果 :

ocz@db-172-16-3-150-> pgbench -M prepared -n -r -c 16 -j 4 -f ./digoal.sql -T 60 -h $PGDATA -U digoal digoal  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 60 s  
number of transactions actually processed: 515758  
tps = 8587.821833 (including connections establishing)  
tps = 8590.411963 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.002724        \setrandom user_id 1 1000000  
        1.856069        select app_id from recommendation_mpt where user_id = :user_id order by rating desc;  

场景二、

100W用户, 每个用户1条聚合并排序好的记录.

create table recommendation_mpt_new ( user_id int8, app_id numeric[] );  
insert into recommendation_mpt_new select user_id, array_agg(app_id order by rating desc)  from  recommendation_mpt group by user_id;  
INSERT 0 1000000  
digoal=> select * from recommendation_mpt_new limit 5;  
 user_id |                                                       app_id                                                          
---------+---------------------------------------------------------------------------------------------------------------------  
       1 | {16,19,9,8,35,20,31,40,28,7,5,15,36,10,13,11,34,24,32,38,26,39,6,18,30,29,3,12,17,37,25,22,27,1,2,41,4,21,33,23,14}  
       2 | {24,33,17,3,20,29,41,38,27,39,4,19,2,28,9,15,11,1,32,10,13,30,31,37,18,21,5,7,12,14,16,6,40,22,36,23,25,8,35,26,34}  
       3 | {18,22,14,20,26,35,34,8,1,16,11,41,15,30,27,21,19,7,3,5,9,13,17,4,40,10,23,32,6,24,28,31,12,36,38,29,25,2,37,39,33}  
       4 | {9,27,37,2,41,38,1,31,30,3,32,28,26,7,40,10,4,34,6,21,12,14,24,5,20,13,33,23,35,11,16,18,22,17,19,8,39,29,36,15,25}  
       5 | {35,9,39,27,11,23,18,24,13,5,10,36,22,1,34,16,7,28,20,38,4,26,21,8,17,2,32,15,41,33,31,29,30,40,14,25,6,37,19,12,3}  
(5 rows)  
digoal=> select * from recommendation_mpt where user_id=1 order by rating desc;  
 user_id | app_id |       rating         
---------+--------+--------------------  
       1 |     16 |  0.996264576911926  
       1 |     19 |  0.988723468966782  
       1 |      9 |  0.988221516367048  
       1 |      8 |  0.962889738380909  
       1 |     35 |  0.957569351885468  
       1 |     20 |   0.94144273782149  
       1 |     31 |  0.929523797240108  
       1 |     40 |  0.889733758755028  
       1 |     28 |  0.851767126005143  
       1 |      7 |  0.848528668750077  
       1 |      5 |  0.840631154365838  
       1 |     15 |  0.822692712768912  
       1 |     36 |  0.819620195310563  
       1 |     10 |  0.706364229787141  
       1 |     13 |  0.698610578197986  
       1 |     11 |  0.695095229428262  
       1 |     34 |  0.673047889955342  
       1 |     24 |  0.660528331529349  
       1 |     32 |  0.647978096734732  
       1 |     38 |  0.636633691377938  
       1 |     26 |  0.592413422185928  
       1 |     39 |  0.501070868223906  
       1 |      6 |  0.491314855404198  
       1 |     18 |  0.471619851421565  
       1 |     30 |  0.466300643049181  
       1 |     29 |  0.457061214838177  
       1 |      3 |  0.325578296091408  
       1 |     12 |  0.310081131756306  
       1 |     17 |  0.305294726509601  
       1 |     37 |   0.26613838179037  
       1 |     25 |  0.251236057840288  
       1 |     22 |  0.237996113952249  
       1 |     27 |   0.22669791476801  
       1 |      1 |  0.217828437685966  
       1 |      2 |  0.200171065982431  
       1 |     41 |  0.197947917506099  
       1 |      4 |  0.189090229570866  
       1 |     21 |  0.182843111455441  
       1 |     33 |  0.131565005518496  
       1 |     23 |  0.106283554807305  
       1 |     14 | 0.0274284891784191  
(41 rows)  
create index idx_recommendation_mpt_new_1 on recommendation_mpt_new (user_id);  
digoal=> explain select app_id from recommendation_mpt_new where user_id = 1;  
                                                 QUERY PLAN                                                    
-------------------------------------------------------------------------------------------------------------  
 Index Scan using idx_recommendation_mpt_new_1 on recommendation_mpt_new  (cost=0.00..2.54 rows=1 width=352)  
   Index Cond: (user_id = 1)  
(2 rows)  

测试脚本:

\setrandom user_id 1 1000000  
select app_id from recommendation_mpt_new where user_id = :user_id;  

测试结果 :

ocz@db-172-16-3-150-> pgbench -M prepared -n -r -c 16 -j 4 -f ./digoal.sql -T 60 -h $PGDATA -U digoal digoal  
transaction type: Custom query  
scaling factor: 1  
query mode: prepared  
number of clients: 16  
number of threads: 4  
duration: 60 s  
number of transactions actually processed: 4700467  
tps = 78296.696341 (including connections establishing)  
tps = 78327.675964 (excluding connections establishing)  
statement latencies in milliseconds:  
        0.001730        \setrandom user_id 1 1000000  
        0.200860        select app_id from recommendation_mpt_new where user_id = :user_id;  

小结

1. 将多行聚合成1行后, 查询性能从8590 qps提升到78327 qps .

2. 另一篇介绍PostgreSQL aggregate的文章 :

http://blog.163.com/digoal@126/blog/static/16387704020121118112533410/

Flag Counter

digoal’s 大量PostgreSQL文章入口