PostgreSQL merge join cost estimation may scan the index - watch out during hard parsing - a case where empty index pages left behind by a bulk delete slow down merge join planning
Background
PostgreSQL supports three join methods: nested loop, merge, and hash.
For the differences between these three join methods and how they work, see
https://www.postgresql.org/docs/devel/static/planner-optimizer.html
《PostgreSQL nestloop/hash/merge join讲解》
nested loop join:
The right relation is scanned once for every row found in the left relation.
This strategy is easy to implement but can be very time consuming.
(However, if the right relation can be scanned with an index scan, this can be a good strategy.
It is possible to use values from the current row of the left relation as keys for the index scan of the right.)
merge join:
Each relation is sorted on the join attributes before the join starts.
Then the two relations are scanned in parallel, and matching rows are combined to form join rows.
This kind of join is more attractive because each relation has to be scanned only once.
The required sorting might be achieved either by an explicit sort step,
or by scanning the relation in the proper order using an index on the join key.
hash join:
the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys.
Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.
For a merge join, when estimating the cost, if the join column has an index, the planner scans that index to obtain the column's minimum and maximum values.
(Note that the database statistics do not store exact min/max values; the pg_stats definition and the quick check below show this.)
View "pg_catalog.pg_stats"
Column | Type | Modifiers
------------------------+----------+-----------
schemaname | name |
tablename | name |
attname | name |
inherited | boolean |
null_frac | real |
avg_width | integer |
n_distinct | real |
most_common_vals | anyarray |
most_common_freqs | real[] |
histogram_bounds | anyarray |
correlation | real |
most_common_elems | anyarray |
most_common_elem_freqs | real[] |
elem_count_histogram | real[] |
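A quick way to confirm this for any analyzed table is to look at pg_stats directly; the closest things to min/max are the histogram bounds collected by ANALYZE (a sketch with hypothetical table/column names, output omitted):
select most_common_vals, histogram_bounds
from pg_stats
where tablename = 'some_table' and attname = 'some_column';  -- hypothetical names; histogram bounds are only a sampled approximation, not the live min/max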
So here is the problem: if the index becomes heavily skewed (or its empty pages are not released in time), evaluating the merge join plan may take a very long time.
Let's look at an example.
Merge join cost estimation
Create two test tables, with table-level autovacuum disabled so that it does not interfere with the results
postgres=# create unlogged table tbl1(id int, info text) with (autovacuum_enabled=off);
CREATE TABLE
postgres=# create unlogged table tbl2(id int, info text) with (autovacuum_enabled=off);
CREATE TABLE
Insert 10 million rows into each of the two tables
postgres=# insert into tbl1 select generate_series(1,10000000);
INSERT 0 10000000
postgres=# insert into tbl2 select generate_series(1,10000000);
INSERT 0 10000000
Verify that merge join is enabled
postgres=# show enable_mergejoin ;
enable_mergejoin
------------------
on
(1 row)
Turn on timing
postgres=# \timing
Timing is on.
Look at the execution plan; for now, generating the plan takes a perfectly normal amount of time
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
QUERY PLAN
----------------------------------------------------------------------------------
Merge Join (cost=1838455.64..2370285748.91 rows=157893676470 width=72)
Merge Cond: (tbl1.id = tbl2.id)
-> Sort (cost=919227.82..933276.56 rows=5619496 width=36)
Sort Key: tbl1.id
-> Seq Scan on tbl1 (cost=0.00..100442.96 rows=5619496 width=36)
-> Materialize (cost=919227.82..947325.30 rows=5619496 width=36)
-> Sort (cost=919227.82..933276.56 rows=5619496 width=36)
Sort Key: tbl2.id
-> Seq Scan on tbl2 (cost=0.00..100442.96 rows=5619496 width=36)
(9 rows)
Time: 1.134 ms
Collect statistics and generate the tables' vm and fsm files (a quick check follows the vacuum output below).
postgres=# vacuum analyze tbl1;
VACUUM
Time: 834.366 ms
postgres=# vacuum analyze tbl2;
VACUUM
Time: 835.022 ms
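To confirm that the vm and fsm forks now exist, their sizes can be checked with pg_relation_size (a quick sanity check; the exact sizes will vary):
select pg_relation_size('tbl1', 'vm')  as vm_bytes,
       pg_relation_size('tbl1', 'fsm') as fsm_bytes;  -- non-zero sizes mean the visibility map and free space map have been created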
Generate the plan again, then force a merge join; planning time is still normal.
When the join column has no index, estimating the merge join cost does not require fetching the min and max values.
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
QUERY PLAN
----------------------------------------------------------------------------
Hash Join (cost=347372.66..975995.14 rows=9999985 width=72)
Hash Cond: (tbl1.id = tbl2.id)
-> Seq Scan on tbl1 (cost=0.00..144247.85 rows=9999985 width=36)
-> Hash (cost=144247.85..144247.85 rows=9999985 width=36)
-> Seq Scan on tbl2 (cost=0.00..144247.85 rows=9999985 width=36)
(5 rows)
Time: 0.633 ms
postgres=# set enable_hashjoin=off;
SET
Time: 0.246 ms
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
QUERY PLAN
----------------------------------------------------------------------------------
Merge Join (cost=3285716.66..3510716.32 rows=9999985 width=72)
Merge Cond: (tbl1.id = tbl2.id)
-> Sort (cost=1642858.33..1667858.29 rows=9999985 width=36)
Sort Key: tbl1.id
-> Seq Scan on tbl1 (cost=0.00..144247.85 rows=9999985 width=36)
-> Materialize (cost=1642858.33..1692858.26 rows=9999985 width=36)
-> Sort (cost=1642858.33..1667858.29 rows=9999985 width=36)
Sort Key: tbl2.id
-> Seq Scan on tbl2 (cost=0.00..144247.85 rows=9999985 width=36)
(9 rows)
Time: 0.469 ms
postgres=# set enable_material =off;
SET
Time: 0.205 ms
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
QUERY PLAN
----------------------------------------------------------------------------
Merge Join (cost=3285716.66..3485716.36 rows=9999985 width=72)
Merge Cond: (tbl1.id = tbl2.id)
-> Sort (cost=1642858.33..1667858.29 rows=9999985 width=36)
Sort Key: tbl1.id
-> Seq Scan on tbl1 (cost=0.00..144247.85 rows=9999985 width=36)
-> Sort (cost=1642858.33..1667858.29 rows=9999985 width=36)
Sort Key: tbl2.id
-> Seq Scan on tbl2 (cost=0.00..144247.85 rows=9999985 width=36)
(8 rows)
Time: 0.436 ms
Create an index on tbl1's join column id
postgres=# create index idx_tbl1_id on tbl1(id);
CREATE INDEX
Time: 2813.772 ms
The index is currently 214 MB
postgres=# \di+ idx_tbl1_id
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-------------+-------+----------+-------+--------+-------------
public | idx_tbl1_id | index | postgres | tbl1 | 214 MB |
(1 row)
Delete the first 9,999,999 rows from tbl1
postgres=# delete from tbl1 where id<10000000;
DELETE 9999999
Time: 3123.303 ms
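At this point tbl1 still physically contains the deleted tuples, because table-level autovacuum is disabled and no manual vacuum has run. A hedged way to confirm this (the exact counters depend on stats collection timing):
select n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
from pg_stat_user_tables
where relname = 'tbl1';  -- expect roughly 10 million dead tuples and NULL vacuum timestamps here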
Generate the execution plan again; planning now takes much, much longer
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
QUERY PLAN
-------------------------------------------------------------------------------------------
Merge Join (cost=1642863.77..2047750.44 rows=9999985 width=72)
Merge Cond: (tbl1.id = tbl2.id)
-> Index Scan using idx_tbl1_id on tbl1 (cost=0.43..229897.34 rows=10000000 width=36)
-> Sort (cost=1642858.33..1667858.29 rows=9999985 width=36)
Sort Key: tbl2.id
-> Seq Scan on tbl2 (cost=0.00..144247.85 rows=9999985 width=36)
(6 rows)
Time: 1317.079 ms
Generate the plan one more time; it is still abnormally slow, but slightly better, probably because the index pages are now cached in memory.
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
QUERY PLAN
-------------------------------------------------------------------------------------------
Merge Join (cost=1642863.77..2047750.44 rows=9999985 width=72)
Merge Cond: (tbl1.id = tbl2.id)
-> Index Scan using idx_tbl1_id on tbl1 (cost=0.43..229897.34 rows=10000000 width=36)
-> Sort (cost=1642858.33..1667858.29 rows=9999985 width=36)
Sort Key: tbl2.id
-> Seq Scan on tbl2 (cost=0.00..144247.85 rows=9999985 width=36)
(6 rows)
Time: 81.410 ms
The planning time is almost identical to the time it takes to fetch the join column's min and max values through the index
postgres=# select min(id),max(id) from tbl1;
min | max
----------+----------
10000000 | 10000000
(1 row)
Time: 81.591 ms
When no merge join has to be evaluated, planning time is normal
postgres=# explain select min(id),max(id) from tbl1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Result (cost=0.91..0.92 rows=1 width=8)
InitPlan 1 (returns $0)
-> Limit (cost=0.43..0.46 rows=1 width=4)
-> Index Only Scan using idx_tbl1_id on tbl1 (cost=0.43..210649.04 rows=10000000 width=4)
Index Cond: (id IS NOT NULL)
InitPlan 2 (returns $1)
-> Limit (cost=0.43..0.46 rows=1 width=4)
-> Index Only Scan Backward using idx_tbl1_id on tbl1 tbl1_1 (cost=0.43..210649.04 rows=10000000 width=4)
Index Cond: (id IS NOT NULL)
(9 rows)
Time: 0.679 ms
After turning off enable_mergejoin in the optimizer, planning time returns to normal, so the root cause is the evaluation of the merge join plan itself; a more detailed analysis follows.
postgres=# set enable_mergejoin =off;
SET
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
QUERY PLAN
-------------------------------------------------------------------------------------
Nested Loop (cost=0.43..13754566787.32 rows=10000000 width=72)
-> Seq Scan on tbl2 (cost=0.00..144247.85 rows=9999985 width=36)
-> Index Scan using idx_tbl1_id on tbl1 (cost=0.43..875.44 rows=50000 width=36)
Index Cond: (id = tbl2.id)
(4 rows)
Time: 0.602 ms
The index size is still 214 MB
postgres=# \di+ idx_tbl1_id
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-------------+-------+----------+-------+--------+-------------
public | idx_tbl1_id | index | postgres | tbl1 | 214 MB |
(1 row)
Use the pageinspect extension to examine the current index
postgres=# create extension pageinspect ;
CREATE EXTENSION
First, look up the index's root page id from the metapage
postgres=# select * from bt_metap('idx_tbl1_id');
magic | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
340322 | 2 | 290 | 2 | 290 | 2
(1 row)
Check how many items the root page holds. Even though the rows have been deleted, the index has not been cleaned up yet, so these items still sit in the index pages.
This is also why finding min and max through this index is slow: the index does not know the rows have been deleted, and only finds out after following each index entry to the corresponding tuple in the heap page (a way to observe this from SQL follows the page dumps below).
postgres=# select * from bt_page_stats('idx_tbl1_id',290);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
290 | r | 97 | 0 | 15 | 8192 | 6216 | 0 | 0 | 2 | 2
(1 row)
查找root page索引条目的明细
postgres=# select * from bt_page_items('idx_tbl1_id',290);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-----------+---------+-------+------+-------------------------
1 | (3,1) | 8 | f | f |
2 | (289,1) | 16 | f | f | 09 96 01 00 00 00 00 00
3 | (575,1) | 16 | f | f | 11 2c 03 00 00 00 00 00
4 | (860,1) | 16 | f | f | 19 c2 04 00 00 00 00 00
5 | (1145,1) | 16 | f | f | 21 58 06 00 00 00 00 00
......
96 | (27080,1) | 16 | f | f | f9 ac 96 00 00 00 00 00
97 | (27365,1) | 16 | f | f | 01 43 98 00 00 00 00 00
(97 rows)
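One way to observe the cost of this from SQL is to run the min() query with buffer statistics; the planner's internal probe performs essentially the same index walk (a sketch, output omitted since the buffer counts vary):
explain (analyze, buffers) select min(id) from tbl1;  -- expect a large number of buffers touched: every dead index entry's heap tuple must be checked before it can be skipped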
Next, run vacuum tbl1 to reclaim garbage pages. This also reclaims garbage pages in tbl1's indexes; index pages whose entries are all dead are turned into empty pages.
postgres=# vacuum tbl1;
VACUUM
Time: 1797.681 ms
Now queries through the index are fast again
postgres=# select min(id),max(id) from tbl1;
min | max
----------+----------
10000000 | 10000000
(1 row)
Time: 0.542 ms
postgres=# explain select min(id),max(id) from tbl1;
QUERY PLAN
-----------------------------------------------------------------------------------
Aggregate (cost=1.44..1.45 rows=1 width=8)
-> Index Only Scan using idx_tbl1_id on tbl1 (cost=0.12..1.44 rows=1 width=4)
(2 rows)
Time: 0.467 ms
So has the merge join planning time returned to normal now?
Yes, it has.
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
QUERY PLAN
----------------------------------------------------------------------
Nested Loop (cost=0.00..313495.67 rows=1 width=72)
Join Filter: (tbl1.id = tbl2.id)
-> Seq Scan on tbl1 (cost=0.00..44248.01 rows=1 width=36)
-> Seq Scan on tbl2 (cost=0.00..144247.85 rows=9999985 width=36)
(4 rows)
Time: 0.488 ms
postgres=# set enable_nestloop=off;
SET
Time: 0.210 ms
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
QUERY PLAN
-----------------------------------------------------------------------------------
Merge Join (cost=1642863.46..1737103.01 rows=1 width=72)
Merge Cond: (tbl1.id = tbl2.id)
-> Index Scan using idx_tbl1_id on tbl1 (cost=0.12..44249.74 rows=1 width=36)
-> Sort (cost=1642858.33..1667858.29 rows=9999985 width=36)
Sort Key: tbl2.id
-> Seq Scan on tbl2 (cost=0.00..144247.85 rows=9999985 width=36)
(6 rows)
Time: 0.505 ms
Although the index size has not changed, the index pages that no longer reference any live tuples have in fact been turned into empty pages (pgstattuple offers a quick way to confirm this; see the sketch below).
postgres=# \di+ idx_tbl1_id
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-------------+-------+----------+-------+--------+-------------
public | idx_tbl1_id | index | postgres | tbl1 | 214 MB |
(1 row)
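If the pageinspect walk that follows feels too low-level, the pgstattuple extension gives a summary of the same situation (a sketch; it assumes the extension is available and the exact numbers will differ):
create extension if not exists pgstattuple;
select tree_level, internal_pages, leaf_pages, empty_pages, deleted_pages
from pgstatindex('idx_tbl1_id');  -- after the vacuum, most pages should show up as deleted/empty rather than live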
For details, see the btree README
src/backend/access/nbtree/README
Page Deletion
-------------
We consider deleting an entire page from the btree only when it's become
completely empty of items. (Merging partly-full pages would allow better
space reuse, but it seems impractical to move existing data items left or
right to make this happen --- a scan moving in the opposite direction
might miss the items if so.) Also, we *never* delete the rightmost page
on a tree level (this restriction simplifies the traversal algorithms, as
explained below). Page deletion always begins from an empty leaf page. An
internal page can only be deleted as part of a branch leading to a leaf
page, where each internal page has only one child and that child is also to
be deleted.
Observe how the index pages changed after the vacuum.
First fetch the metapage to get the root page id. Note that the index height has not changed, it is still 2 levels: the first level down holds branch (internal) pages and the second level holds leaf pages.
postgres=# select * from bt_metap('idx_tbl1_id');
magic | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
340322 | 2 | 290 | 2 | 27421 | 0
(1 row)
Read the root page; it now holds only a single item, pointing to one page at the branch level
postgres=# select * from bt_page_stats('idx_tbl1_id',290);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
290 | r | 1 | 0 | 8 | 8192 | 8136 | 0 | 0 | 2 | 2
(1 row)
postgres=# select * from bt_page_items('idx_tbl1_id',290);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-----------+---------+-------+------+------
1 | (27365,1) | 8 | f | f |
(1 row)
Inspect the first-level branch page and find the second-level leaf page.
postgres=# select * from bt_page_stats('idx_tbl1_id',27365);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
27365 | i | 1 | 0 | 8 | 8192 | 8136 | 0 | 0 | 1 | 0
(1 row)
postgres=# select * from bt_page_items('idx_tbl1_id',27365);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-----------+---------+-------+------+------
1 | (27421,1) | 8 | f | f |
(1 row)
Inspect the second-level leaf page
postgres=# select * from bt_page_stats('idx_tbl1_id', 27421);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
27421 | l | 1 | 0 | 16 | 8192 | 8128 | 0 | 0 | 0 | 1
(1 row)
postgres=# select * from bt_page_items('idx_tbl1_id',27421);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-------------+---------+-------+------+-------------------------
1 | (44247,178) | 16 | f | f | 80 96 98 00 00 00 00 00
(1 row)
The leaf item points to a heap tuple id (ctid), so the row can be fetched directly by ctid
postgres=# select * from tbl1 where ctid='(44247,178)';
id | info
----------+------
10000000 |
(1 row)
From the above analysis we can draw a conclusion:
When executing a multi-table join, unless enable_mergejoin=off is set, the database may choose a merge join; in other words, it has to estimate the cost of a merge join.
When the join column has an index, in order to produce a more accurate cost, the merge join cost estimation uses that column's min and max values (obtained by scanning the index on the join column).
Whatever the reason, if scanning the index for min and max is slow, plan generation time is stretched out accordingly.
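Conceptually, the probe that the planner performs (get_actual_variable_range, visible in the backtrace later in this post) behaves much like the following ordered index lookups, so whatever slows these queries down also stretches planning time (a simplified illustration, not the planner's actual code path):
select id from tbl1 order by id asc  limit 1;   -- actual current minimum, fetched via the index
select id from tbl1 order by id desc limit 1;   -- actual current maximum, fetched via the index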
A real-world case
One application deletes a few hundred thousand rows out of tens of millions every day, and then finds that planning time for some join SQL becomes very long (even though the plan ultimately chosen is a nested loop join, the planner still has to estimate the merge join cost along the way).
How was this discovered?
1. Using perf
Connect a session and get its PID with pg_backend_pid()
Collect profiling data for that session
perf record -avg -p $PID
Run explain QUERY; in that session
Then analyze where the time was spent in the code
perf report --tui
2. Using gdb, or dumping the process's stack with pstack
A backtrace captured in one such scenario
while true; do pstack $PID; sleep 0.01; done
#0  0x00000000004a8238 in _bt_checkkeys ()
#1 0x00000000004a6126 in _bt_readpage ()
#2 0x00000000004a67a9 in _bt_steppage ()
#3 0x00000000004a68a8 in _bt_next ()
#4 0x00000000004a53c8 in btgettuple ()
#5 0x00000000007da563 in FunctionCall2Coll ()
#6 0x000000000049e53e in index_getnext_tid ()
#7 0x000000000049e5fa in index_getnext ()
#8 0x0000000000782820 in get_actual_variable_range ()
#9 0x0000000000786092 in ineq_histogram_selectivity ()
#10 0x0000000000786a87 in scalarineqsel ()
#11 0x0000000000787062 in mergejoinscansel ()
#12 0x000000000061896e in initial_cost_mergejoin ()
#13 0x0000000000624113 in try_mergejoin_path ()
#14 0x0000000000624c2f in add_paths_to_joinrel ()
#15 0x0000000000626678 in make_join_rel ()
#16 0x0000000000626bd8 in join_search_one_level ()
#17 0x00000000006147e3 in standard_join_search ()
#18 0x00000000006337c1 in query_planner ()
#19 0x000000000063521c in grouping_planner ()
#20 0x0000000000637a80 in standard_planner ()
#21 0x00000000006cd396 in pg_plan_query ()
#22 0x000000000056c623 in ExplainOneQuery ()
#23 0x000000000056c9c5 in ExplainQuery ()
#24 0x00000000006d1fae in standard_ProcessUtility ()
#25 0x00007f9c19f8d261 in pgss_ProcessUtility ()
#26 0x00000000006cf1c7 in PortalRunUtility ()
#27 0x00000000006d003d in FillPortalStore ()
#28 0x00000000006d0340 in PortalRun ()
#29 0x00000000006cd7bb in exec_simple_query ()
#30 0x00000000006ce9e5 in PostgresMain ()
#31 0x00000000006682e1 in PostmasterMain ()
#32 0x00000000005f179c in main ()
How to avoid the problem
When autovacuum has been disabled, bulk deletes or updates can leave the index with a large number of pages that reference only dead tuples, so estimating joins that involve these columns (merge joins, specifically) can take much longer.
1. Using bind variables (prepared statements) may or may not work around this, because PostgreSQL's prepared statements include an algorithm to avoid plan skew: it records the number of custom plans and their average cost, and based on the plan cache and the supplied parameters, choose_custom_plan() compares the generic plan's cost against the average custom plan cost to decide whether a custom plan is needed (see the sketch after this list).
If a custom plan is needed, the costs of the candidate plans are estimated again and a new custom plan is generated.
The details are covered in the reference articles at the end of this post.
2. The default autovacuum threshold is too coarse (autovacuum_vacuum_scale_factor=0.2, i.e. 20%), which means automatic cleanup is triggered only after 20% of the table has changed.
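For reference, the prepared-statement behaviour described in point 1 can be observed with a minimal sketch (hypothetical statement name and parameter; whether a generic or custom plan is chosen follows the costing logic described above):
prepare p1 (int) as
  select * from tbl1, tbl2 where tbl1.id = tbl2.id and tbl2.id = $1;
explain execute p1(1);   -- early executions use custom plans, each one re-planned (and re-costed)
explain execute p1(2);
deallocate p1;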
So how do we avoid it?
1. Do not disable autovacuum on tables.
2. For large tables, set a table-level autovacuum threshold, for example 1% or even lower (see the sketch after this list).
This can be set with create table or alter table; see the PostgreSQL manual for the syntax.
3. Enable instance-level autovacuum and set a reasonable autovacuum_vacuum_scale_factor, not too large.
4. After deleting or updating a large amount of data, manually run vacuum analyze on those tables to avoid the problem described above.
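For example, point 2 above can be implemented with table-level storage parameters (the threshold values are illustrative only; tune them per table):
alter table tbl1 set (autovacuum_enabled = on, autovacuum_vacuum_scale_factor = 0.01);
-- or at table creation time:
create table tbl3 (id int, info text) with (autovacuum_vacuum_scale_factor = 0.01);  -- tbl3 is a hypothetical example table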
Summary
When the join column has an index and the optimizer is allowed to consider a merge join, estimating the merge join cost requires the column's min and max values, and these are obtained from the index.
When the join columns have no index, estimating the merge join cost does not need the min and max values, so evaluating the merge join plan is fast.
The time spent fetching the min and max values from the index therefore directly affects how long plan generation takes.
After data is deleted in bulk, if vacuum has not been triggered, estimating the merge join cost can become very slow, which is exactly the case described in this article.
After vacuum runs, the index garbage is cleaned up as well, and the min and max values needed for the merge join cost estimate can be fetched quickly again.
References
《为什么用 PostgreSQL 绑定变量 没有 Oracle pin S 等待问题》
《PostgreSQL plan cache 源码浅析 - 如何确保不会计划倾斜》
《执行计划选择算法 与 绑定变量 - PostgreSQL prepared statement: SPI_prepare, prepare | execute COMMAND, PL/pgsql STYLE: custom & generic plan cache》 (../201212/20121224_01.md)