PostgreSQL nested-loop cost estimation - nestloop loop cost & cost_material run_cost


Background

When computing a Nested Loop join's total cost, the general rule is that the inner node's total_cost is counted once for every loop over the inner node, i.e. once per outer row.

Take this example:

EXPLAIN SELECT *  
FROM tenk1 t1, tenk2 t2  
WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;  
  
                                      QUERY PLAN  
--------------------------------------------------------------------------------------  
 Nested Loop  (cost=4.65..118.62 rows=10 width=488)  
   ->  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.47 rows=10 width=244)  
         Recheck Cond: (unique1 < 10)  
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0)  
               Index Cond: (unique1 < 10)  
   ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..7.91 rows=1 width=244)  
         Index Cond: (unique2 = t1.unique2)  

For the Nested Loop node: 118.62 = 39.47 + 10 * 7.91 + a small amount of join-processing (output) cost.
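
As a quick sanity check, here is that arithmetic written out as a tiny standalone C program. This is only an illustration, not PostgreSQL source; it leaves out the small per-tuple join CPU charge, which is why it lands just below the planner's 118.62.

/*
 * Illustration only (not PostgreSQL source): the naive nested-loop formula,
 * outer cost plus one full inner scan per outer row, using the numbers from
 * the plan above.  The small per-tuple join CPU charge is omitted.
 */
#include <stdio.h>

int
main(void)
{
    double outer_total = 39.47;   /* Bitmap Heap Scan on tenk1 */
    double inner_total = 7.91;    /* parameterized Index Scan on tenk2 */
    double outer_rows  = 10;      /* inner scan repeated once per outer row */

    printf("approx nested loop total: %.2f\n",
           outer_total + outer_rows * inner_total);   /* 118.57 */
    return 0;
}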

But in the following example, if you do the math that way, the numbers no longer add up.

In this example the join's output row count is the same as the product of the two scans' row counts, but that's not true in all cases because there can be additional WHERE clauses that mention both tables and so can only be applied at the join point, not to either input scan. Here's an example:
  
EXPLAIN SELECT *  
FROM tenk1 t1, tenk2 t2  
WHERE t1.unique1 < 10 AND t2.unique2 < 10 AND t1.hundred < t2.hundred;  
  
                                         QUERY PLAN  
---------------------------------------------------------------------------------------------  
 Nested Loop  (cost=4.65..49.46 rows=33 width=488)  
   Join Filter: (t1.hundred < t2.hundred)  
   ->  Bitmap Heap Scan on tenk1 t1  (cost=4.36..39.47 rows=10 width=244)  
         Recheck Cond: (unique1 < 10)  
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..4.36 rows=10 width=0)  
               Index Cond: (unique1 < 10)  
   ->  Materialize  (cost=0.29..8.51 rows=10 width=244)  
         ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..8.46 rows=10 width=244)  
               Index Cond: (unique2 < 10)  
  
The condition t1.hundred < t2.hundred can't be tested in the tenk2_unique2 index, so it's applied at the join node.   
This reduces the estimated output row count of the join node, but does not change either input scan.  
  
Notice that here the planner has chosen to "materialize" the inner relation of the join, by putting a Materialize plan node atop it.   
This means that the t2 indexscan will be done just once, even though the nested-loop join node needs to read that data ten times, once for each row from the outer relation.   
The Materialize node saves the data in memory as it's read, and then returns the data from memory on each subsequent pass.  
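
As an aside, the rows=33 estimate on the Nested Loop node is consistent with the planner applying its default selectivity for an inequality between two columns (DEFAULT_INEQ_SEL, roughly 1/3) to the 10 x 10 candidate pairs. The sketch below is only a hedged reconstruction of that arithmetic under that assumption, not the planner's actual code path.

/*
 * Sketch only: assumes the planner falls back to its default inequality
 * selectivity (about 1/3) for t1.hundred < t2.hundred, since neither side
 * of the comparison is a constant.
 */
#include <stdio.h>

int
main(void)
{
    double outer_rows = 10;            /* tenk1 rows passing unique1 < 10 */
    double inner_rows = 10;            /* tenk2 rows passing unique2 < 10 */
    double default_ineq_sel = 1.0 / 3.0;

    /* cartesian product scaled by the join-filter selectivity */
    printf("estimated join rows: %.0f\n",
           outer_rows * inner_rows * default_ineq_sel);   /* ~33 */
    return 0;
}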

This Materialize node means that the index scan is performed only once. If work_mem can hold the materialized data, it all stays in memory; if not, part of it spills to disk. For the repeated loops, the charge is based on the Materialize node's run_cost rather than on rerunning the index scan. The code is as follows:

src/backend/optimizer/path/costsize.c

/*  
 * cost_material  
 *        Determines and returns the cost of materializing a relation, including  
 *        the cost of reading the input data.  
 *  
 * If the total volume of data to materialize exceeds work_mem, we will need  
 * to write it to disk, so the cost is much higher in that case.  
 *  
 * Note that here we are estimating the costs for the first scan of the  
 * relation, so the materialization is all overhead --- any savings will  
 * occur only on rescan, which is estimated in cost_rescan.  
 */  
void  
cost_material(Path *path,  
                          Cost input_startup_cost, Cost input_total_cost,  
                          double tuples, int width)  
{  
        Cost            startup_cost = input_startup_cost;  
        Cost            run_cost = input_total_cost - input_startup_cost;  
        double          nbytes = relation_byte_size(tuples, width);  
        long            work_mem_bytes = work_mem * 1024L;  
  
        path->rows = tuples;  
  
        /*  
         * Whether spilling or not, charge 2x cpu_operator_cost per tuple to  
         * reflect bookkeeping overhead.  (This rate must be more than what  
         * cost_rescan charges for materialize, ie, cpu_operator_cost per tuple;  
         * if it is exactly the same then there will be a cost tie between  
         * nestloop with A outer, materialized B inner and nestloop with B outer,  
         * materialized A inner.  The extra cost ensures we'll prefer  
         * materializing the smaller rel.)      Note that this is normally a good deal  
         * less than cpu_tuple_cost; which is OK because a Material plan node  
         * doesn't do qual-checking or projection, so it's got less overhead than  
         * most plan nodes.  
         */  
        run_cost += 2 * cpu_operator_cost * tuples;  
        /*  
         * If we will spill to disk, charge at the rate of seq_page_cost per page.  
         * This cost is assumed to be evenly spread through the plan run phase,  
         * which isn't exactly accurate but our cost model doesn't allow for  
         * nonuniform costs within the run phase.  
         */  
        if (nbytes > work_mem_bytes)  
        {  
                double          npages = ceil(nbytes / BLCKSZ);  
  
                run_cost += seq_page_cost * npages;  
        }  
  
        path->startup_cost = startup_cost;  
        path->total_cost = startup_cost + run_cost;  
}  
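
Plugging the example's numbers into this formula reproduces the Materialize node's 8.51. The program below is only a sketch, not PostgreSQL source: it assumes the default cost parameters (cpu_operator_cost = 0.0025, seq_page_cost = 1.0, work_mem = 4MB, BLCKSZ = 8192) and approximates relation_byte_size() as tuples * (MAXALIGN(width) + about 24 bytes of per-tuple header overhead).

/*
 * Standalone sketch (not PostgreSQL source) applying the same formula as
 * cost_material above to the example's Materialize node.
 */
#include <math.h>
#include <stdio.h>

#define MAXALIGN(x) (((x) + 7) & ~((long) 7))   /* 8-byte alignment */
#define BLCKSZ 8192

int
main(void)
{
    double input_startup_cost = 0.29;    /* Index Scan using tenk2_unique2 */
    double input_total_cost   = 8.46;
    double tuples             = 10;
    int    width              = 244;

    double cpu_operator_cost = 0.0025;   /* default GUC values assumed */
    double seq_page_cost     = 1.0;
    long   work_mem_bytes    = 4096 * 1024L;   /* default work_mem = 4MB */

    double startup_cost = input_startup_cost;
    double run_cost     = input_total_cost - input_startup_cost;
    double nbytes       = tuples * (MAXALIGN(width) + 24);   /* ~2720 bytes */

    /* bookkeeping overhead: 2 * 0.0025 * 10 = 0.05 */
    run_cost += 2 * cpu_operator_cost * tuples;

    /* 2720 bytes fit easily in 4MB of work_mem, so no spill charge here */
    if (nbytes > work_mem_bytes)
        run_cost += seq_page_cost * ceil(nbytes / BLCKSZ);

    printf("Materialize total cost: %.2f\n", startup_cost + run_cost);  /* 8.51 */
    return 0;
}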

So even though the inner side is looped over ten times, the t2 index scan itself (8.46) is charged only once. The first pass through the Materialize node adds the 2 * cpu_operator_cost * tuples bookkeeping overhead:

2 * 0.0025 * 10 = 0.05  

which is why the Materialize node's total cost is 8.46 + 0.05 = 8.51. Each of the remaining nine passes is charged only the much cheaper rescan rate estimated in cost_rescan (cpu_operator_cost per tuple).

So the Nested Loop node's total cost 49.46 = 39.47 + 8.51 + nine rescan charges + the join-filter and per-tuple CPU cost, as the check below shows.
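
Here is a back-of-the-envelope check of the whole Nested Loop cost. It is a sketch under stated assumptions rather than the planner's exact code path: default cost parameters (cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025); rescans of an in-memory Materialize node charged at cpu_operator_cost per tuple, as the cost_material comment above says cost_rescan does; and cpu_tuple_cost plus the join-filter operator cost charged for every outer x inner pair examined.

/*
 * Back-of-the-envelope check of the Nested Loop total (sketch only, with the
 * assumptions stated above).
 */
#include <stdio.h>

int
main(void)
{
    double cpu_tuple_cost    = 0.01;
    double cpu_operator_cost = 0.0025;

    double outer_total    = 39.47;   /* Bitmap Heap Scan on tenk1 */
    double material_total = 8.51;    /* first pass of the Materialize node */
    double outer_rows     = 10;
    double inner_rows     = 10;

    /* nine rescans of the in-memory Materialize node */
    double rescan_cost = (outer_rows - 1) * cpu_operator_cost * inner_rows;   /* 0.225 */

    /* per-pair CPU: output overhead plus the t1.hundred < t2.hundred filter */
    double pair_cpu = outer_rows * inner_rows * (cpu_tuple_cost + cpu_operator_cost);   /* 1.25 */

    printf("approx nested loop total: %.3f\n",
           outer_total + material_total + rescan_cost + pair_cpu);   /* 49.455 ~ 49.46 */
    return 0;
}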
