PostgreSQL 10.0 preview 性能增强 - hash,nestloop join优化(聪明的优化器是这样的)
背景
两张表JOIN时,如果内表的JOIN字段确定是唯一的,那么在嵌套循环时,如果外表有重复值,循环过程中,对于内表来说,一个VALUE只需要扫描一次。
hash join同样适用。
例子
postgres=# create table intbl(id int);
CREATE TABLE
postgres=# create unique index idx_intbl on intbl(id);
CREATE INDEX
postgres=# insert into intbl select generate_series(1,1000000);
INSERT 0 1000000
postgres=# create table out(id int);
CREATE TABLE
postgres=# insert into out select 1 from generate_series(1,1000);
-- 对于外表的1000个1, 内表scan一次命中后,同一个值不需要再次scan内表
INSERT 0 1000
postgres=# set enable_hashjoin =off;
SET
postgres=# set enable_mergejoin =off;
SET
postgres=# set enable_material =off;
SET
9.6
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from out,intbl where out.id=intbl.id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.42..2736.00 rows=1000 width=8) (actual time=0.033..1.965 rows=1000 loops=1)
Output: "out".id, intbl.id
Buffers: shared hit=4005
-> Seq Scan on public."out" (cost=0.00..15.00 rows=1000 width=4) (actual time=0.013..0.101 rows=1000 loops=1)
Output: "out".id
Buffers: shared hit=5
-> Index Only Scan using idx_intbl on public.intbl (cost=0.42..2.71 rows=1 width=4) (actual time=0.001..0.002 rows=1 loops=1000)
Output: intbl.id
Index Cond: (intbl.id = "out".id)
Heap Fetches: 1000
Buffers: shared hit=4000
Planning time: 0.109 ms
Execution time: 2.048 ms
(13 rows)
10.0
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from out,intbl where out.id=intbl.id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.42..2202.50 rows=1000 width=8) (actual time=0.035..1.803 rows=1000 loops=1)
Output: "out".id, intbl.id
Inner Unique: true
Buffers: shared hit=4005
-> Seq Scan on public."out" (cost=0.00..15.00 rows=1000 width=4) (actual time=0.013..0.106 rows=1000 loops=1)
Output: "out".id
Buffers: shared hit=5
-> Index Only Scan using idx_intbl on public.intbl (cost=0.42..2.19 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1000)
Output: intbl.id
Index Cond: (intbl.id = "out".id)
Heap Fetches: 1000
Buffers: shared hit=4000
Planning time: 0.122 ms
Execution time: 1.887 ms
(14 rows)
patch如下
Optimize joins when the inner relation can be proven unique.
author Tom Lane <tgl@sss.pgh.pa.us>
Sat, 8 Apr 2017 10:20:03 +0800 (22:20 -0400)
committer Tom Lane <tgl@sss.pgh.pa.us>
Sat, 8 Apr 2017 10:20:13 +0800 (22:20 -0400)
commit 9c7f5229ad68d7e0e4dd149e3f80257893e404d4
tree 0a167d403952550f43941b01b24ed5e7526c5351 tree | snapshot
parent f13a9121f9822eafe05cc3178bf046155a248173 commit | diff
Optimize joins when the inner relation can be proven unique.
If there can certainly be no more than one matching inner row for a given
outer row, then the executor can move on to the next outer row as soon as
it's found one match; there's no need to continue scanning the inner
relation for this outer row. This saves useless scanning in nestloop
and hash joins. In merge joins, it offers the opportunity to skip
mark/restore processing, because we know we have not advanced past the
first possible match for the next outer row.
Of course, the devil is in the details: the proof of uniqueness must
depend only on joinquals (not otherquals), and if we want to skip
mergejoin mark/restore then it must depend only on merge clauses.
To avoid adding more planning overhead than absolutely necessary,
the present patch errs in the conservative direction: there are cases
where inner_unique or skip_mark_restore processing could be used, but
it will not do so because it's not sure that the uniqueness proof
depended only on "safe" clauses. This could be improved later.
David Rowley, reviewed and rather heavily editorialized on by me
Discussion: https://postgr.es/m/CAApHDvqF6Sw-TK98bW48TdtFJ+3a7D2mFyZ7++=D-RyPsL76gw@mail.gmail.com
这个patch的讨论,详见邮件组,本文末尾URL。
PostgreSQL社区的作风非常严谨,一个patch可能在邮件组中讨论几个月甚至几年,根据大家的意见反复的修正,patch合并到master已经非常成熟,所以PostgreSQL的稳定性也是远近闻名的。
参考
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9c7f5229ad68d7e0e4dd149e3f80257893e404d4