An Example of Optimizing a Slow SQL Caused by Redundant Data in JOINs
Background
CASE
Here is a query where every table holds only a few thousand rows, yet the query is extremely slow: after tens of seconds it still has not returned a result.
select
distinct abc.pro_col1, abc.col3
from
t0 p
INNER JOIN t1 abc
on p.id=abc.par_col2
inner join t2 s
on s.col3=abc.col3
inner join t3 po
on po.id=s.col4
where p.state=2 and po.state=3
order by abc.pro_col1, abc.col3;
Optimization Method
Semantically, this SQL joins several tables and then takes the distinct values of two columns from one of them.
However, every join can multiply matching rows, so redundant values accumulate and the intermediate result set grows larger and larger.
The suggested rewrite is to output only distinct values after each JOIN, so the redundancy is removed before it is carried into the next join.
select
distinct pro_col1, col3 from
(
select
distinct t1.pro_col1, t1.col3, s.col4 from
(
select
distinct abc.pro_col1, abc.col3 from
t1 abc INNER JOIN t0 p
on (p.id = abc.par_col2 and p.state=2)
) t1
inner join t2 s
on (s.col3 = t1.col3)
) t2
inner join t3 po
on (po.id = t2.col4 and po.state=3)
order by t2.pro_col1, t2.col3 ;
After the rewrite, the result is returned in a few dozen milliseconds.
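For readability, the same step-by-step deduplication can also be written with common table expressions. The sketch below is not from the original tuning session; it assumes the same tables and columns, and the CTE names s1 and s2 are purely illustrative. In PostgreSQL releases before 12 a CTE is an optimization fence, so each step is evaluated and deduplicated on its own, much like the nested subqueries above.
-- Hypothetical CTE form of the rewrite: deduplicate after every join
with s1 as (
    select distinct abc.pro_col1, abc.col3
    from t1 abc
    inner join t0 p on (p.id = abc.par_col2 and p.state = 2)
),
s2 as (
    select distinct s1.pro_col1, s1.col3, s.col4
    from s1
    inner join t2 s on (s.col3 = s1.col3)
)
select distinct s2.pro_col1, s2.col3
from s2
inner join t3 po on (po.id = s2.col4 and po.state = 3)
order by s2.pro_col1, s2.col3;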
Reproduction
postgres=# create table rt1(id int, info text);
CREATE TABLE
postgres=# create table rt2(id int, info text);
CREATE TABLE
postgres=# create table rt3(id int, info text);
CREATE TABLE
postgres=# create table rt4(id int, info text);
CREATE TABLE
postgres=# insert into rt1 select generate_series(1,1000),'test';
INSERT 0 1000
postgres=# insert into rt2 select 1,'test' from generate_series(1,1000);
INSERT 0 1000
postgres=# insert into rt3 select 1,'test' from generate_series(1,1000);
INSERT 0 1000
postgres=# insert into rt4 select 1,'test' from generate_series(1,1000);
INSERT 0 1000
In the following query, every JOIN produces a large amount of redundant data; the later the JOIN, the more redundancy accumulates, which makes the query extremely slow.
postgres=# explain select distinct rt1.id from rt1 join rt2 on rt1.id=rt2.id join rt3 on rt2.id=rt3.id join rt4 on rt3.id=rt4.id;
QUERY PLAN
-------------------------------------------------------------------------------------------------
HashAggregate (cost=145.25..155.25 rows=1000 width=4)
Group Key: rt1.id
-> Hash Join (cost=113.00..142.75 rows=1000 width=4)
Hash Cond: (rt4.id = rt1.id)
-> Seq Scan on rt4 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=100.50..100.50 rows=1000 width=12)
-> Hash Join (cost=70.75..100.50 rows=1000 width=12)
Hash Cond: (rt3.id = rt1.id)
-> Seq Scan on rt3 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=58.25..58.25 rows=1000 width=8)
-> Hash Join (cost=28.50..58.25 rows=1000 width=8)
Hash Cond: (rt2.id = rt1.id)
-> Seq Scan on rt2 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=16.00..16.00 rows=1000 width=4)
-> Seq Scan on rt1 (cost=0.00..16.00 rows=1000 width=4)
(15 rows)
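Note that the planner estimates only 1000 rows for every hash join above, but with this skewed data the intermediate results actually explode. A quick way to see the blow-up (my own check, not part of the original reproduction) is to count the intermediate join results:
-- rt1 joins rt2 only on id=1, so 1 x 1,000 = 1,000 rows are expected
select count(*) from rt1 join rt2 on rt1.id = rt2.id;
-- adding rt3 multiplies the duplicates again: 1,000 x 1,000 = 1,000,000 rows expected
select count(*) from rt1 join rt2 on rt1.id = rt2.id join rt3 on rt2.id = rt3.id;
-- adding rt4 on top would reach roughly 1,000,000,000 joined rows,
-- which is why the full four-table query crawls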
Rewritten as follows, the result comes back almost immediately; the plan further below shows a HashAggregate after each join, which is what keeps the intermediate results small.
postgres=# select distinct t2.id from
(
select distinct t1.id from
(select distinct rt1.id from rt1 join rt2 on rt1.id=rt2.id) t1
join
rt3 on t1.id=rt3.id
) t2
join rt4 on t2.id=rt4.id
;
id
----
1
(1 row)
Time: 2.052 ms
postgres=# explain select distinct t2.id from
postgres-# (
postgres(# select distinct t1.id from
postgres(# (select distinct rt1.id from rt1 join rt2 on rt1.id=rt2.id) t1
postgres(# join
postgres(# rt3 on t1.id=rt3.id
postgres(# ) t2
postgres-# join rt4 on t2.id=rt4.id
postgres-# ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
HashAggregate (cost=190.25..200.25 rows=1000 width=4)
Group Key: rt1.id
-> Hash Join (cost=158.00..187.75 rows=1000 width=4)
Hash Cond: (rt4.id = rt1.id)
-> Seq Scan on rt4 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=145.50..145.50 rows=1000 width=4)
-> HashAggregate (cost=125.50..135.50 rows=1000 width=4)
Group Key: rt1.id
-> Hash Join (cost=93.25..123.00 rows=1000 width=4)
Hash Cond: (rt3.id = rt1.id)
-> Seq Scan on rt3 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=80.75..80.75 rows=1000 width=4)
-> HashAggregate (cost=60.75..70.75 rows=1000 width=4)
Group Key: rt1.id
-> Hash Join (cost=28.50..58.25 rows=1000 width=4)
Hash Cond: (rt2.id = rt1.id)
-> Seq Scan on rt2 (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=16.00..16.00 rows=1000 width=4)
-> Seq Scan on rt1 (cost=0.00..16.00 rows=1000 width=4)
(19 rows)
Time: 0.750 ms
Summary
For SQL like this, if one wanted to handle it inside the kernel, the optimizer could analyze the statistics (an n_distinct estimate is kept for every column) and perform a query rewrite that returns the same result while deduplicating early.
postgres=# \d pg_stats
View "pg_catalog.pg_stats"
Column | Type | Modifiers
------------------------+----------+-----------
schemaname | name |
tablename | name |
attname | name |
inherited | boolean |
null_frac | real |
avg_width | integer |
n_distinct | real |
most_common_vals | anyarray |
most_common_freqs | real[] |
histogram_bounds | anyarray |
correlation | real |
most_common_elems | anyarray |
most_common_elem_freqs | real[] |
elem_count_histogram | real[] |
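As an illustration, the per-column n_distinct estimates such a rewriter would rely on are already available once the tables have been analyzed; a query along these lines (not part of the original post) surfaces them for the reproduction tables:
-- run ANALYZE first so the statistics are populated
select tablename, attname, n_distinct
from pg_stats
where schemaname = 'public'
  and tablename in ('rt1', 'rt2', 'rt3', 'rt4');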
Have fun, everyone. You are welcome to drop by Alibaba Cloud anytime for a long chat about your business needs; we look forward to seeing you.
Keep it up, Alibaba Cloud teammates: keep improving the kernel and the service, and build the most down-to-earth cloud database.