PostgreSQL 10.0 preview 性能增强 - mergesort(Gather merge)

6 minute read

背景

在数据库中,经常会有多个节点append,然后sort的情况。

例如一张表有10个分区,查询所有分区,并按某列排序输出,常规的做法是所有的记录append,然后sort。

PostgreSQL 10.0 将支持append node的并行计算,也就是说所有的分区表可以并行的sort,然后返回,此时就可以使用merge sort来提高排序的速度。

另外,像单表的并行计算,如果需要排序输出的话,每个worker process可以并行的排序,然后在merge sort输出结果。

在许多分布式数据库中,merge sort也是必备的,否则排序都收到上层节点做是非常耗费CPU和内存的。

merge sort原理,首先要保证下层的所有节点返回的数据是有序的(例如每个NODE都按ID的顺序返回),以轮询所有下层node的方式组装并返回有序结果。

merge sort和merge join的原理也类似。

merge sort的性能提升示例

Query 4:  With GM 7901.480 -> Without GM 9064.776  
Query 5:  With GM 53452.126 -> Without GM 55059.511  
Query 9:  With GM 52613.132 -> Without GM 98206.793  
Query 15: With GM 68051.058 -> Without GM 68918.378  
Query 17: With GM 129236.075 -> Without GM 160451.094  
Query 20: With GM 259144.232 -> Without GM 306256.322  
Query 21: With GM 153483.497 -> Without GM 168169.916  

讨论

Hi hackers,  
  
Attached is the patch to implement Gather Merge. The Gather Merge node would  
assume that the results from each worker are ordered with respect to each  
other,  
and then do a final merge pass over those. This is so that we get the  
top-level  
query ordering we want. The final plan for such a query would look something  
like this:  
  
Gather Merge  
-> Sort  
  -> Parallel Seq Scan on foo  
      Filter: something  
  
With this we now have a new parallel node which will always return the  
sorted  
results, so that any query having the pathkey can build the gather merge  
path.  
Currently if a query has a pathkey and we want to make it parallel-aware,  
the  
plan would be something like this:  
  
Sort  
-> Gather  
 -> Parallel Seq Scan on foo  
     Filter: something  
  
With GatherMerge now it is also possible to have plan like:  
  
Finalize GroupAggregate  
-> Gather Merge  
 -> Partial GroupAggregate  
  -> Sort  
  
With gather merge, sort can be pushed under the Gather Merge. It's valuable  
as it has very good performance benefits. Consider the following test  
results:  
  
1) ./db/bin/pgbench postgres -i -F 100 -s 20  
2) update pgbench_accounts set filler = 'foo' where aid%10 = 0;  
3) vacuum analyze pgbench_accounts;  
4) set max_parallel_workers_per_gather = 4;  
  
Without patch:  
  
postgres=# explain analyze select aid, sum(abalance) from pgbench_accounts  
where filler like '%foo%' group by aid;  
                                                               QUERY  
PLAN  
----------------------------------------------------------------------------------------------------------------------------------------  
 GroupAggregate  (cost=81696.51..85242.09 rows=202605 width=12) (actual  
time=1037.212..1162.086 rows=200000 loops=1)  
   Group Key: aid  
   ->  Sort  (cost=81696.51..82203.02 rows=202605 width=8) (actual  
time=1037.203..1072.446 rows=200000 loops=1)  
         Sort Key: aid  
         Sort Method: external sort  Disk: 3520kB  
         ->  Seq Scan on pgbench_accounts  (cost=0.00..61066.59 rows=202605  
width=8) (actual time=801.398..868.390 rows=200000 loops=1)  
               Filter: (filler ~~ '%foo%'::text)  
               Rows Removed by Filter: 1800000  
 Planning time: 0.133 ms  
 Execution time: 1171.656 ms  
(10 rows)  
  
With Patch:  
  
postgres=# explain analyze select aid, sum(abalance) from pgbench_accounts  
where filler like '%foo%' group by aid;  
  
QUERY  
PLAN  
  
-----------------------------------------------------------------------------------------------------------------------------------------------------------  
 Finalize GroupAggregate  (cost=47274.13..56644.58 rows=202605 width=12)  
(actual time=315.457..561.825 rows=200000 loops=1)  
   Group Key: aid  
   ->  Gather Merge  (cost=47274.13..54365.27 rows=50651 width=0) (actual  
time=315.451..451.886 rows=200000 loops=1)  
         Workers Planned: 4  
         Workers Launched: 4  
         ->  Partial GroupAggregate  (cost=46274.09..47160.49 rows=50651  
width=12) (actual time=306.830..333.908 rows=40000 loops=5)  
               Group Key: aid  
               ->  Sort  (cost=46274.09..46400.72 rows=50651 width=8)  
(actual time=306.822..310.800 rows=40000 loops=5)  
                     Sort Key: aid  
                     Sort Method: quicksort  Memory: 2543kB  
                     ->  Parallel Seq Scan on pgbench_accounts  
(cost=0.00..42316.15 rows=50651 width=8) (actual time=237.552..255.968  
rows=40000 loops=5)  
                           Filter: (filler ~~ '%foo%'::text)  
                           Rows Removed by Filter: 360000  
 Planning time: 0.200 ms  
 Execution time: 572.221 ms  
(15 rows)  
  
I ran the TPCH benchmark queries with the patch and found that 7 out of 22  
queries end up picking the Gather Merge path.  
  
Below benchmark numbers taken under following configuration:  
  
- Scale factor = 10  
- max_worker_processes = DEFAULT (8)  
- max_parallel_workers_per_gather = 4  
- Cold cache environment is ensured. With every query execution - server is  
  stopped and also OS caches were dropped.  
- The reported values of execution time (in ms) is median of 3 executions.  
- power2 machine with 512GB of RAM  
- PFA performance machine cpu into (benchmark_machine_info.txt)  
  
Query 4:  With GM 7901.480 -> Without GM 9064.776  
Query 5:  With GM 53452.126 -> Without GM 55059.511  
Query 9:  With GM 52613.132 -> Without GM 98206.793  
Query 15: With GM 68051.058 -> Without GM 68918.378  
Query 17: With GM 129236.075 -> Without GM 160451.094  
Query 20: With GM 259144.232 -> Without GM 306256.322  
Query 21: With GM 153483.497 -> Without GM 168169.916  
  
Here from the results we can see that query 9, 17 and 20 are the one which  
show good performance benefit with the Gather Merge.  
  
PFA tpch_result.tar.gz for the explain analysis output for TPCH queries  
(with  
and without patch)  
  
I ran the TPCH benchmark queries with different number of workers and found  
that  
Query 18 also started picking Gather merge with worker > 6. PFA attach  
TPCH_GatherMerge.pdf for the detail benchmark results.  
  
Implementation details:  
  
New Gather Merge node:  
  
The patch introduces a new node type for Gather Merge. The Gather Merge  
implementation is mostly similar to what Gather does. The major difference  
is  
that the Gather node has two TupleTableSlots; one for leader and one for the  
tuple read from the worker, and Gather Merge has a TupleTableSlot per  
worker,  
plus one for the leader. As for Gather Merge, we need to fill every slot,  
then  
build a heap of the tuples and return the lowest one.  
  
The patch generates the gather merge path from:  
  
a) create_ordered_paths() for partial_pathlist. If the pathkey contain the  
sort_pathkey, then directly create the gather merge. If not then create sort  
and then create the gather merge path.  
  
Example:  
  
explain analyze  
select * from pgbench_accounts where filler like '%foo%' order by aid;  
  
b) create_distinct_paths(): when sort-based implementations of DISTINCT is  
possible.  
  
Example:  
  
explain analyze  
select distinct * from pgbench_accounts where filler like '%foo%' order by  
aid;  
  
c) create_grouping_paths() : While generating a complete GroupAgg Path, loop  
over the partial path list and if partial path contains the group_pathkeys  
generate the gather merge path.  
  
Example:  
  
explain analyze  
select * from pgbench_accounts where filler like '%foo%' group by aid;  
  
In all the above mentioned cases, with the patches it's giving almost a 2x  
performance gain. PFA pgbench_query.out, for the explain analyze output for  
the  
queries.  
  
Gather Merge reads the tuple from each queue and then picks the lowest one,  
so  
every time it has to read the tuple from the queue into wait mode. During  
testing I found that some of the query spent some time reading tuple  
from the queue. So in the patch I introduced the tuple array; once we get  
the tuple into wait mode, it tries to read more tuples in nowait mode and  
store it into the tuple array. Once we get one tuple through the queue,  
there  
are chances to have more ready tuple into queue, so just read it and, if  
any,  
store it to the tuple array. With this I found good performance benefits  
with  
some of the TPC-H complex queries.  
  
Costing:  
  
GatherMerge merges several pre-sorted input streams using a heap.  
Considering  
that costing for Gather Merge is the combination of cost_gather +  
cost_merge_append.  
  
Open Issue:  
  
- Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed the leak into  
tqueue.c by  
calling gather_readnext() into per-tuple context. Now for gather merge that  
is  
not possible, as we storing the tuple into Tuple array and we want tuple to  
be  
free only its get pass through the merge sort algorithm. One idea is, we can  
also call gm_readnext_tuple() under per-tuple context (which will fix the  
leak  
into tqueue.c) and then store the copy of tuple into tuple array.  
- Need to see a way to add the simple test as part of regression.  
  
Thanks to my colleague Robert Haas for his help in design and some  
preliminary review for the patch.  
  
Please let me know your thought, and thanks for reading.  
  
Regards,  
Rushabh Lathia  
www.EnterpriseDB.com  

这个patch的讨论,详见邮件组,本文末尾URL。

PostgreSQL社区的作风非常严谨,一个patch可能在邮件组中讨论几个月甚至几年,根据大家的意见反复的修正,patch合并到master已经非常成熟,所以PostgreSQL的稳定性也是远近闻名的。

参考

https://commitfest.postgresql.org/13/810/

https://www.postgresql.org/message-id/flat/CAGPqQf09oPX-cQRpBKS0Gq49Z+m6KBxgxd_p9gX8CKk_d75HoQ@mail.gmail.com#CAGPqQf09oPX-cQRpBKS0Gq49Z+m6KBxgxd_p9gX8CKk_d75HoQ@mail.gmail.com

Flag Counter

digoal’s 大量PostgreSQL文章入口