An Example of Optimizing a Slow SQL Caused by Redundant Data in JOINs
Background
CASE
Here is a query where every table holds only a few thousand rows, yet the query is extremely slow: after tens of seconds it still has not returned a result.
select
distinct abc.pro_col1, abc.col3
from
t0 p
INNER JOIN t1 abc
on p.id=abc.par_col2
inner join t2 s
on s.col3=abc.col3
inner join t3 po
on po.id=s.col4
where p.state=2 and po.state=3
order by abc.pro_col1, abc.col3;
Optimization Method
Semantically, this SQL joins several tables and then takes the distinct values of two columns from one of them.
However, every join can multiply matching rows, so redundant values accumulate and the intermediate result set grows larger and larger.
The suggested rewrite is to output only distinct values after each JOIN, so the redundancy is removed before it is carried into the next join.
select
distinct pro_col1, col3 from
(
select
distinct t1.pro_col1, t1.col3, s.col4 from
(
select
distinct abc.pro_col1, abc.col3 from
t1 abc INNER JOIN t0 p
on (p.id = abc.par_col2 and p.state=2)
) t1
inner join t2 s
on (s.col3 = t1.col3)
) t2
inner join t3 po
on (po.id = t2.col4 and po.state=3)
order by t2.pro_col1, t2.col3 ;
After the rewrite, the result is returned in a few dozen milliseconds.
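For readability, the same step-by-step deduplication can also be written with common table expressions. The sketch below is not from the original tuning session; it assumes the same tables and columns, and the CTE names s1 and s2 are purely illustrative. In PostgreSQL releases before 12 a CTE is an optimization fence, so each step is evaluated and deduplicated on its own, much like the nested subqueries above.
-- Hypothetical CTE form of the rewrite: deduplicate after every join
with s1 as (
    select distinct abc.pro_col1, abc.col3
    from t1 abc
    inner join t0 p on (p.id = abc.par_col2 and p.state = 2)
),
s2 as (
    select distinct s1.pro_col1, s1.col3, s.col4
    from s1
    inner join t2 s on (s.col3 = s1.col3)
)
select distinct s2.pro_col1, s2.col3
from s2
inner join t3 po on (po.id = s2.col4 and po.state = 3)
order by s2.pro_col1, s2.col3;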
Reproduction
postgres=# create table rt1(id int, info text);
CREATE TABLE
postgres=# create table rt2(id int, info text);
CREATE TABLE
postgres=# create table rt3(id int, info text);
CREATE TABLE
postgres=# create table rt4(id int, info text);
CREATE TABLE
postgres=# insert into rt1 select generate_series(1,1000),'test';
INSERT 0 1000
postgres=# insert into rt2 select 1,'test' from generate_series(1,1000);
INSERT 0 1000
postgres=# insert into rt3 select 1,'test' from generate_series(1,1000);
INSERT 0 1000
postgres=# insert into rt4 select 1,'test' from generate_series(1,1000);
INSERT 0 1000
In the following query, every JOIN produces a large amount of redundant data; the later the JOIN, the more redundancy accumulates, which makes the query extremely slow.
postgres=# explain select distinct rt1.id from rt1 join rt2 on rt1.id=rt2.id join rt3 on rt2.id=rt3.id join rt4 on rt3.id=rt4.id;
QUERY PLAN
-------------------------------------------------------------------------------------------------
HashAggregate (cost=145.25..155.25 rows=1000 width=4)
Group Key: rt1.id
-> Hash Join (cost=113.00..142.75 rows=1000 width=4)
Hash Cond: (rt4.id = rt1.id)
-> Seq Scan on rt4 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=100.50..100.50 rows=1000 width=12)
-> Hash Join (cost=70.75..100.50 rows=1000 width=12)
Hash Cond: (rt3.id = rt1.id)
-> Seq Scan on rt3 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=58.25..58.25 rows=1000 width=8)
-> Hash Join (cost=28.50..58.25 rows=1000 width=8)
Hash Cond: (rt2.id = rt1.id)
-> Seq Scan on rt2 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=16.00..16.00 rows=1000 width=4)
-> Seq Scan on rt1 (cost=0.00..16.00 rows=1000 width=4)
(15 rows)
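Note that the planner estimates only 1000 rows for every hash join above, but with this skewed data the intermediate results actually explode. A quick way to see the blow-up (my own check, not part of the original reproduction) is to count the intermediate join results:
-- rt1 joins rt2 only on id=1, so 1 x 1,000 = 1,000 rows are expected
select count(*) from rt1 join rt2 on rt1.id = rt2.id;
-- adding rt3 multiplies the duplicates again: 1,000 x 1,000 = 1,000,000 rows expected
select count(*) from rt1 join rt2 on rt1.id = rt2.id join rt3 on rt2.id = rt3.id;
-- adding rt4 on top would reach roughly 1,000,000,000 joined rows,
-- which is why the full four-table query crawls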
Rewritten as follows, the result comes back almost immediately; the plan further below shows a HashAggregate after each join, which is what keeps the intermediate results small.
postgres=# select distinct t2.id from
(
select distinct t1.id from
(select distinct rt1.id from rt1 join rt2 on rt1.id=rt2.id) t1
join
rt3 on t1.id=rt3.id
) t2
join rt4 on t2.id=rt4.id
;
id
----
1
(1 row)
Time: 2.052 ms
postgres=# explain select distinct t2.id from
postgres-# (
postgres(# select distinct t1.id from
postgres(# (select distinct rt1.id from rt1 join rt2 on rt1.id=rt2.id) t1
postgres(# join
postgres(# rt3 on t1.id=rt3.id
postgres(# ) t2
postgres-# join rt4 on t2.id=rt4.id
postgres-# ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
HashAggregate (cost=190.25..200.25 rows=1000 width=4)
Group Key: rt1.id
-> Hash Join (cost=158.00..187.75 rows=1000 width=4)
Hash Cond: (rt4.id = rt1.id)
-> Seq Scan on rt4 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=145.50..145.50 rows=1000 width=4)
-> HashAggregate (cost=125.50..135.50 rows=1000 width=4)
Group Key: rt1.id
-> Hash Join (cost=93.25..123.00 rows=1000 width=4)
Hash Cond: (rt3.id = rt1.id)
-> Seq Scan on rt3 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=80.75..80.75 rows=1000 width=4)
-> HashAggregate (cost=60.75..70.75 rows=1000 width=4)
Group Key: rt1.id
-> Hash Join (cost=28.50..58.25 rows=1000 width=4)
Hash Cond: (rt2.id = rt1.id)
-> Seq Scan on rt2 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=16.00..16.00 rows=1000 width=4)
-> Seq Scan on rt1 (cost=0.00..16.00 rows=1000 width=4)
(19 rows)
Time: 0.750 ms
Summary
For SQL like this, if one wanted to handle it inside the kernel, the optimizer could analyze the statistics (an n_distinct estimate is kept for every column) and perform a query rewrite that returns the same result while deduplicating early.
postgres=# \d pg_stats
View "pg_catalog.pg_stats"
Column | Type | Modifiers
------------------------+----------+-----------
schemaname | name |
tablename | name |
attname | name |
inherited | boolean |
null_frac | real |
avg_width | integer |
n_distinct | real |
most_common_vals | anyarray |
most_common_freqs | real[] |
histogram_bounds | anyarray |
correlation | real |
most_common_elems | anyarray |
most_common_elem_freqs | real[] |
elem_count_histogram | real[] |
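As an illustration, the per-column n_distinct estimates such a rewriter would rely on are already available once the tables have been analyzed; a query along these lines (not part of the original post) surfaces them for the reproduction tables:
-- run ANALYZE first so the statistics are populated
select tablename, attname, n_distinct
from pg_stats
where schemaname = 'public'
  and tablename in ('rt1', 'rt2', 'rt3', 'rt4');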
Have fun, everyone. You are welcome to drop by Alibaba Cloud anytime for a long chat about your business needs; we look forward to seeing you.
Keep it up, Alibaba Cloud teammates: keep improving the kernel and the service, and build the most down-to-earth cloud database.