An Append node typically appears when a query combines multiple tables with UNION or UNION ALL, or when it scans a partitioned parent table that has multiple partitions.
Currently an Append plan node does not execute its subplans in parallel. There is no distribution of workers across its subplans. The second subplan starts running only after the first subplan finishes, although the individual subplans may be running parallel scans. Secondly, we create a partial Append path for an appendrel, but only if all of its member subpaths are partial paths. If one or more of the subplans is a non-parallel path, there will be only a non-parallel Append. So whatever node is sitting on top of the Append cannot use a parallel plan; for example, a select count(*) won't be divided into partial aggregates if the underlying Append is not partial.

The attached patch removes both of the above restrictions. There has already been a mail thread that discusses an approach suggested by Robert Haas for implementing this feature; this patch uses the same approach.

Attached is pgbench_create_partition.sql (derived from the one included in the above thread), which distributes the pgbench_accounts table data into 3 partitions pgbench_accounts_[1-3]. The queries below use this schema.

Consider a query such as:

    select count(*) from pgbench_accounts;

Now suppose these two partitions do not allow parallel scan:

    alter table pgbench_accounts_1 set (parallel_workers=0);
    alter table pgbench_accounts_2 set (parallel_workers=0);

On HEAD, because some of the partitions have only non-parallel scans, the whole Append is non-parallel:

    Aggregate
      ->  Append
            ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
            ->  Seq Scan on pgbench_accounts_1
            ->  Seq Scan on pgbench_accounts_2
            ->  Seq Scan on pgbench_accounts_3

Whereas, with the patch, the plan looks like this:

    Finalize Aggregate
      ->  Gather
            Workers Planned: 6
            ->  Partial Aggregate
                  ->  Parallel Append
                        ->  Parallel Seq Scan on pgbench_accounts
                        ->  Seq Scan on pgbench_accounts_1
                        ->  Seq Scan on pgbench_accounts_2
                        ->  Parallel Seq Scan on pgbench_accounts_3

Above, a Parallel Append is generated, and it executes all of these subplans in parallel, with 1 worker executing each of the sequential scans, and multiple workers executing each of the parallel subplans.

======= Implementation details ========

------- Adding parallel-awareness -------

In a given worker, this Append plan node will execute just like the usual partial Append node: it will run a subplan to completion. The subplan may or may not be a partial parallel-aware plan, such as a parallel scan. After the subplan is done, Append will choose the next subplan. It is here that it differs from the current partial Append plan: it is parallel-aware. The Append nodes in the workers will be aware that there are other Append nodes running in parallel, and the partial Append will have to coordinate with those other Append nodes while choosing the next subplan.

------- Distribution of workers --------

The coordination info is stored in a shared array, each element of which describes the per-subplan info. This info contains the number of workers currently executing the subplan, and the maximum number of workers that should be executing it at the same time. For non-partial subplans, max workers would always be 1. To choose the next subplan, the Append executor takes a spinlock and sequentially iterates over the array to find a subplan that has the least number of workers currently executing it AND is not already being executed by the maximum number of workers assigned to it. Once it gets one, it increments that subplan's current_workers and releases the spinlock, so that other workers waiting to choose their next subplan can proceed. This way, workers are fairly distributed across subplans.
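To make that concrete, below is a minimal C sketch of what the shared per-subplan state and the choose-next-subplan loop could look like. All identifiers here (ParallelAppendState, PASubplanInfo, choose_next_subplan, the finished flag, and so on) are illustrative assumptions, not the names used in the actual patch, and the real code may track additional state.

/*
 * Illustrative sketch only -- structure and function names are assumptions,
 * not the identifiers used in the actual patch.
 */
#include "postgres.h"

#include <limits.h>

#include "storage/spin.h"

typedef struct PASubplanInfo
{
    int     current_workers;    /* workers currently executing this subplan */
    int     max_workers;        /* cap; always 1 for a non-partial subplan */
    bool    finished;           /* assumed flag: subplan has been run to completion */
} PASubplanInfo;

typedef struct ParallelAppendState
{
    slock_t         lock;       /* protects the array below */
    int             nsubplans;
    PASubplanInfo   subplans[FLEXIBLE_ARRAY_MEMBER];
} ParallelAppendState;

/*
 * Pick the next subplan for this worker: among subplans that are not yet
 * finished and not yet at their worker cap, choose the one with the fewest
 * workers currently on it.  Returns the subplan index, or -1 if none is
 * available.
 */
static int
choose_next_subplan(ParallelAppendState *pstate)
{
    int     best = -1;
    int     best_workers = INT_MAX;
    int     i;

    SpinLockAcquire(&pstate->lock);

    for (i = 0; i < pstate->nsubplans; i++)
    {
        PASubplanInfo *sub = &pstate->subplans[i];

        if (sub->finished || sub->current_workers >= sub->max_workers)
            continue;

        if (sub->current_workers < best_workers)
        {
            best = i;
            best_workers = sub->current_workers;
        }
    }

    /* Claim the chosen subplan before releasing the lock. */
    if (best >= 0)
        pstate->subplans[best].current_workers++;

    SpinLockRelease(&pstate->lock);

    return best;
}

A spinlock seems adequate for this, since the critical section is only a short scan over a small in-memory array with no I/O or allocation while the lock is held.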
The shared array needs to be initialized and made available to the workers. For this, we can do exactly what sequential scan does to be parallel-aware: use a function ExecAppendInitializeDSM(), similar to ExecSeqScanInitializeDSM(), in the backend to allocate the array, and similarly, for workers, have ExecAppendInitializeWorker() retrieve the shared array.
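For reference, here is a hedged sketch of how those hooks could be wired up, modeled on the parallel sequential scan pattern and reusing the structs from the previous sketch. The AppendState fields as_pstate and pstate_len, the subplan_is_partial[] flag, and the worker cap for partial subplans are assumptions made for illustration, and the exact hook signatures vary across PostgreSQL versions. An estimate hook is sketched as well, since the DSM chunk has to be sized before it can be allocated.

/*
 * Sketch only: mirrors the ExecSeqScan* parallel hooks.  Fields such as
 * as_pstate, pstate_len and subplan_is_partial are assumed additions to
 * AppendState, not existing members.
 */
#include "postgres.h"

#include "access/parallel.h"
#include "nodes/execnodes.h"
#include "storage/shm_toc.h"
#include "storage/shmem.h"
#include "storage/spin.h"

/* Leader: reserve room for the shared array in the DSM segment. */
void
ExecAppendEstimate(AppendState *node, ParallelContext *pcxt)
{
    node->pstate_len = add_size(offsetof(ParallelAppendState, subplans),
                                mul_size(sizeof(PASubplanInfo),
                                         node->as_nplans));
    shm_toc_estimate_chunk(&pcxt->estimator, node->pstate_len);
    shm_toc_estimate_keys(&pcxt->estimator, 1);
}

/* Leader: allocate and initialize the array, then publish it in the TOC. */
void
ExecAppendInitializeDSM(AppendState *node, ParallelContext *pcxt)
{
    ParallelAppendState *pstate;
    int         i;

    pstate = shm_toc_allocate(pcxt->toc, node->pstate_len);
    SpinLockInit(&pstate->lock);
    pstate->nsubplans = node->as_nplans;

    for (i = 0; i < pstate->nsubplans; i++)
    {
        pstate->subplans[i].current_workers = 0;
        pstate->subplans[i].finished = false;
        /* non-partial subplans must never get more than one worker;
         * the cap for partial subplans is an assumed choice here */
        pstate->subplans[i].max_workers =
            node->subplan_is_partial[i] ? pcxt->nworkers : 1;
    }

    shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, pstate);
    node->as_pstate = pstate;
}

/* Worker: look up the shared array that the leader created. */
void
ExecAppendInitializeWorker(AppendState *node, shm_toc *toc)
{
    node->as_pstate = shm_toc_lookup(toc, node->ps.plan->plan_node_id, false);
}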