PostgreSQL 10.0 preview sharding enhancement - support for parallel Append nodes

Background

An Append node typically appears when multiple tables are combined with UNION or UNION ALL, or when a query involves a partitioned parent table with multiple partitions.

Each child table or subquery has to be scanned separately, and the results are then concatenated.
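
As a quick illustration (the table names t1 and t2 here are hypothetical, not from the patch discussion), a UNION ALL of two tables typically yields a plan with an Append node over the per-table scans, roughly like this:

create table t1 (id int);
create table t2 (id int);
explain (costs off) select * from t1 union all select * from t2;

 Append
   ->  Seq Scan on t1
   ->  Seq Scan on t2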

PostgreSQL 10.0 adds a new kind of parallelism: a parallel-aware Append node, whose subplans can be executed in parallel by multiple workers.

This can greatly improve performance.

Asynchronous remote calls in postgres_fdw also depend on this parallel Append support.
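
To see why this matters for sharding with postgres_fdw, consider a parent table whose children are foreign tables on different remote servers (a hypothetical sketch; the server, host, and table names below are made up). The resulting plan is roughly an Append over Foreign Scans, and without parallel/asynchronous Append those remote scans are executed one after another:

create extension postgres_fdw;

create server shard1 foreign data wrapper postgres_fdw options (host 'node1', dbname 'db');
create server shard2 foreign data wrapper postgres_fdw options (host 'node2', dbname 'db');
create user mapping for current_user server shard1 options (user 'postgres');
create user mapping for current_user server shard2 options (user 'postgres');

create table orders (id int, info text);
create foreign table orders_1 () inherits (orders) server shard1 options (table_name 'orders_1');
create foreign table orders_2 () inherits (orders) server shard2 options (table_name 'orders_2');

explain (costs off) select count(*) from orders;

 Aggregate
   ->  Append
         ->  Seq Scan on orders
         ->  Foreign Scan on orders_1
         ->  Foreign Scan on orders_2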

Currently an Append plan node does not execute its subplans in  
parallel. There is no distribution of workers across its subplans. The  
second subplan starts running only after the first subplan finishes,  
although the individual subplans may be running parallel scans.  
  
Secondly, we create a partial Append path for an appendrel, but we do  
that only if all of its member subpaths are partial paths. If one or  
more of the subplans is a non-parallel path, there will be only a  
non-parallel Append. So whatever node is sitting on top of Append is  
not going to do a parallel plan; for example, a select count(*) won't  
divide it into partial aggregates if the underlying Append is not  
partial.  
  
The attached patch removes both of the above restrictions.  There has  
already been a mail thread [1] that discusses an approach suggested by  
Robert Haas for implementing this feature. This patch uses this same  
approach.  
  
Attached is pgbench_create_partition.sql (derived from the one  
included in the above thread) that distributes pgbench_accounts table  
data into 3 partitions pgbench_accounts_[1-3]. The below queries use
this schema.  
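
The attached script itself is not reproduced here; a minimal, hypothetical sketch of an inheritance-based setup in the same spirit could look like the following (the aid boundaries are illustrative assumptions and would have to match the pgbench scale factor actually used):

-- Sketch only: split pgbench_accounts into 3 child tables by aid range.
create table pgbench_accounts_1 (check (aid between     1 and  30000)) inherits (pgbench_accounts);
create table pgbench_accounts_2 (check (aid between 30001 and  60000)) inherits (pgbench_accounts);
create table pgbench_accounts_3 (check (aid between 60001 and 100000)) inherits (pgbench_accounts);

-- Move existing rows from the parent into the children.
insert into pgbench_accounts_1 select * from only pgbench_accounts where aid <= 30000;
insert into pgbench_accounts_2 select * from only pgbench_accounts where aid between 30001 and 60000;
insert into pgbench_accounts_3 select * from only pgbench_accounts where aid > 60000;
delete from only pgbench_accounts;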
  
Consider a query such as :  
select count(*) from pgbench_accounts;  
  
Now suppose these two partitions do not allow parallel scan :
alter table pgbench_accounts_1 set (parallel_workers=0);  
alter table pgbench_accounts_2 set (parallel_workers=0);  
  
On HEAD, due to some of the partitions having non-parallel scans, the
whole Append would be non-parallel :
  
 Aggregate  
   ->  Append  
         ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  
         ->  Seq Scan on pgbench_accounts_1  
         ->  Seq Scan on pgbench_accounts_2  
         ->  Seq Scan on pgbench_accounts_3  
  
Whereas, with the patch, the Append looks like this :  
  
 Finalize Aggregate  
   ->  Gather  
         Workers Planned: 6  
         ->  Partial Aggregate  
               ->  Parallel Append  
                     ->  Parallel Seq Scan on pgbench_accounts  
                     ->  Seq Scan on pgbench_accounts_1  
                     ->  Seq Scan on pgbench_accounts_2  
                     ->  Parallel Seq Scan on pgbench_accounts_3  
  
Above, Parallel Append is generated, and it executes all these  
subplans in parallel, with 1 worker executing each of the sequential  
scans, and multiple workers executing each of the parallel subplans.  
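
Note that for the planner to choose 6 workers as shown above, the parallelism-related settings presumably have to allow it; something along these lines (the value 6 is just an assumption matching the "Workers Planned: 6" in the plan):

set max_parallel_workers_per_gather = 6;   -- allow up to 6 workers under one Gather
set max_parallel_workers = 6;              -- PostgreSQL 10: cap on the total number of parallel workers
explain select count(*) from pgbench_accounts;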
  
  
======= Implementation details ========  
  
------- Adding parallel-awareness -------  
  
In a given worker, this Append plan node will be executing just like  
the usual partial Append node. It will run a subplan until completion.  
The subplan may or may not be a partial parallel-aware plan like  
parallelScan. After the subplan is done, Append will choose the next  
subplan. It is here that it differs from the current partial Append
plan: it is parallel-aware. The Append nodes in the
workers will be aware that there are other Append nodes running in  
parallel. The partial Append will have to coordinate with other Append  
nodes while choosing the next subplan.  
  
------- Distribution of workers --------  
  
The coordination info is stored in a shared array, each element of  
which describes the per-subplan info. This info contains the number of  
workers currently executing the subplan, and the maximum number of  
workers that should be executing it at the same time. For non-partial  
subplans, max workers would always be 1. For choosing the next subplan,
the Append executor will sequentially iterate over the array to find a  
subplan that currently has the fewest workers executing it, AND that
is not already being executed by the maximum number of workers
assigned for the subplan. Once it gets one, it increments
current_workers and releases the spinlock, so that other workers can
choose their next subplan if they are waiting.  
  
This way, workers would be fairly distributed across subplans.  
  
The shared array needs to be initialized and made available to  
workers. For this, we can do exactly what sequential scan does to be
parallel-aware : use a function ExecAppendInitializeDSM(), similar to
ExecSeqScanInitializeDSM(), in the backend to allocate the array.
Similarly, for workers, have ExecAppendInitializeWorker() retrieve the
shared array.

For the full discussion of this patch, see the mailing-list thread linked at the end of this post.

The PostgreSQL community works very rigorously: a patch may be discussed on the mailing list for months or even years and revised repeatedly based on feedback, so by the time it is merged into master it is already very mature. This is why PostgreSQL is so well known for its stability.

References

https://commitfest.postgresql.org/13/987/

https://www.postgresql.org/message-id/flat/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com#CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1+S+vRuUQ@mail.gmail.com

