PostgreSQL 10.0 preview, multi-core parallelism enhancement: parallel hash join with shared hash data, saving hash-table memory and improving efficiency
Background
PostgreSQL 9.6 can run hash joins in parallel, but every worker process builds its own copy of the hash table, which wastes memory. For a small inner table this hardly matters; for a large one the waste is considerable.
PostgreSQL 10.0 therefore proposes an improvement: a shared hash table. Thomas Munro's proposal to the pgsql-hackers list is quoted below.
Hi hackers,
In PostgreSQL 9.6, hash joins can be parallelised under certain
conditions, but a copy of the hash table is built in every
participating backend. That means that memory and CPU time are
wasted. In many cases, that's OK: if the hash table contents are
small and cheap to compute, then we don't really care, we're just
happy that the probing can be done in parallel. But in cases where
the hash table is large and/or expensive to build, we could do much
better. I am working on that problem.
To recap the situation in 9.6, a hash join can appear below a Gather
node and it looks much the same as a non-parallel hash join except
that it has a partial outer plan:
-> Hash Join
   -> <partial outer plan>
   -> Hash
      -> <non-partial parallel-safe inner plan>
A partial plan is one that has some kind of 'scatter' operation as its
ultimate source of tuples. Currently the only kind of scatter
operation is a Parallel Seq Scan (but see also the Parallel Index Scan
and Parallel Bitmap Scan proposals). The scatter operation enables
parallelism in all the executor nodes above it, as far as the
enclosing 'gather' operation which must appear somewhere above it.
Currently the only kind of gather operation is a Gather node (but see
also the Gather Merge proposal which adds a new one).
The inner plan is built from a non-partial parallel-safe path and will
be run in every worker.
Note that a Hash Join node in 9.6 isn't parallel-aware itself: it's
not doing anything special at execution time to support parallelism.
The planner has determined that correct partial results will be
produced by this plan, but the executor nodes are blissfully unaware
of parallelism.
PROPOSED NEW PLAN VARIANTS
Shortly I will post a patch which introduces two new hash join plan
variants that are parallel-aware:
1. Parallel Hash Join with Shared Hash
-> Parallel Hash Join
   -> <partial outer plan>
   -> Shared Hash
      -> <non-partial parallel-safe inner plan>
In this case, there is only one copy of the hash table and only one
participant loads it. The other participants wait patiently for one
chosen backend to finish building the hash table, and then they all
wake up and probe.
Call the number of participants P, being the number of workers + 1
(for the leader). Compared to a non-shared hash plan, we avoid
wasting CPU and IO resources running P copies of the inner plan in
parallel (something that is not well captured in our costing model for
parallel query today), and we can allow ourselves to use a hash table
P times larger while sticking to the same overall space target of
work_mem * P.
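A quick worked example of that space target (the numbers are illustrative, not taken from the email): with work_mem set to 64 MB and W = 3 planned workers,

\[ P = W + 1 = 4, \qquad P \times \text{work\_mem} = 4 \times 64\ \mathrm{MB} = 256\ \mathrm{MB}, \]

so a non-shared plan may build four identical 64 MB hash tables, while a shared plan may build a single table of up to 256 MB within the same overall budget.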
2. Parallel Hash Join with Parallel Shared Hash
-> Parallel Hash Join
   -> <partial outer plan>
   -> Parallel Shared Hash
      -> <partial inner plan>
In this case, the inner plan is run in parallel by all participants.
We have the advantages of a shared hash table as described above, and
now we can also divide the work of running the inner plan and hashing
the resulting tuples by P participants. Note that Parallel Shared
Hash is acting as a special kind of gather operation that is the
counterpart to the scatter operation contained in the inner plan.
PERFORMANCE
So far I have been unable to measure any performance degradation
compared with unpatched master for hash joins with non-shared hash.
That's good because it means that I didn't slow existing plans down
when I introduced a bunch of conditional branches to existing hash
join code.
Laptop testing shows greater than 2x speedups on several of the TPC-H
queries with single batches, and no slowdowns. I will post test
numbers on big rig hardware in the coming weeks when I have the
batching code in more complete and stable shape.
IMPLEMENTATION
I have taken the approach of extending the existing hash join
algorithm, rather than introducing separate hash join executor nodes
or a fundamentally different algorithm. Here's a short description of
what the patch does:
1. SHARED HASH TABLE
To share data between participants, the patch uses two other patches I
have proposed: DSA areas[1], which provide a higher level interface
to DSM segments to make programming with processes a little more like
programming with threads, and in particular a per-parallel-query DSA
area[2] that is made available for any executor node that needs some
shared work space.
The patch uses atomic operations to push tuples into the hash table
buckets while building, rehashing and loading, and then the hash table
is immutable during probing (except for match flags used to implement
outer joins). The existing memory chunk design is retained for dense
allocation of tuples, which provides a convenient way to rehash the
table when its size changes.
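The email does not include code, but the lock-free insertion idea can be illustrated with a compare-and-swap loop. This is a minimal, self-contained C11 sketch under my own assumptions, not the patch's code (the real table lives in a DSA area and chains tuples with DSM-relative pointers rather than raw pointers, and the type and function names here are invented):

#include <stdatomic.h>
#include <stddef.h>

typedef struct HashTuple
{
    struct HashTuple *next;        /* next tuple in the same bucket chain */
    unsigned int      hashvalue;
    /* tuple data would follow */
} HashTuple;

typedef struct SharedHashTable
{
    size_t                nbuckets;
    _Atomic(HashTuple *) *buckets; /* one atomic head pointer per bucket */
} SharedHashTable;

/* Push a tuple onto its bucket without taking a lock: all participants
 * can insert concurrently while building, rehashing or loading. */
static void
shared_hash_insert(SharedHashTable *ht, HashTuple *tuple)
{
    size_t     bucketno = tuple->hashvalue % ht->nbuckets;
    HashTuple *head = atomic_load(&ht->buckets[bucketno]);

    do
    {
        tuple->next = head;        /* link to the current head */
    } while (!atomic_compare_exchange_weak(&ht->buckets[bucketno],
                                           &head, tuple));

    /* Once building is finished the table is immutable, so probing needs
     * no atomics (apart from the outer-join match flags). */
}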
2. WORK COORDINATION
To coordinate parallel work, this patch uses two other patches:
barriers[3], to implement a 'barrier' or 'phaser' synchronisation
primitive, and those in turn use the condition variables proposed by
Robert Haas.
Barriers provide a way for participants to break work up into phases
that they unanimously agree to enter together, which is a basic
requirement for parallelising hash joins. It is not safe to insert
into the hash table until exactly one participant has created it; it
is not safe to probe the hash table until all participants have
finished inserting into it; it is not safe to scan it for unmatched
tuples until all participants have finished probing it; it is not safe
to discard it and start loading the next batch until ... you get the
idea. You could also construct appropriate synchronisation using
various other interlocking primitives or flow control systems, but
fundamentally these wait points would exist at some level, and I think
this way is quite clean and simple. YMMV.
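To make those wait points concrete, here is a deliberately simplified sketch using a classic fixed-size POSIX barrier; the helper functions are hypothetical stand-ins, and, as the next paragraphs explain, a static barrier like this is not actually sufficient for the real executor:

#include <pthread.h>

/* Initialised elsewhere for exactly P participants (leader + workers). */
extern pthread_barrier_t hj_barrier;

extern void build_shared_hash_table(void);    /* hypothetical */
extern void probe_outer_relation(void);       /* hypothetical */
extern void scan_for_unmatched_tuples(void);  /* hypothetical */

void
hash_join_participant(int is_chosen_builder)
{
    if (is_chosen_builder)
        build_shared_hash_table();        /* exactly one participant builds */
    pthread_barrier_wait(&hj_barrier);    /* no one probes before building is done */

    probe_outer_relation();               /* everyone probes in parallel */
    pthread_barrier_wait(&hj_barrier);    /* no one scans for unmatched tuples early */

    scan_for_unmatched_tuples();          /* e.g. for right/full outer joins */
    pthread_barrier_wait(&hj_barrier);    /* ...and so on for each later batch */
}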
If we had exactly W workers and the leader didn't participate, then we
could use a simple pthread- or MPI-style barrier without an
explicit notion of 'phase'. We would simply take the existing hash
join code, add the shared hash table, add barrier waits at various
points and make sure that all participants always hit all of those
points in the same order, and it should All Just Work. But we have a
variable party size and a dual-role leader process, and I want to
highlight the specific problems that causes here because they increase
the patch size significantly:
Problem 1: We don't know how many workers will actually start. We
know how many were planned, but at execution time we may have
exhausted limits and actually get a smaller number. So we can't use
"static" barriers like the classic barriers in POSIX or MPI where the
group size is known up front. We need "dynamic" barriers with attach
and detach operations. As soon as you have varying party size you
need some kind of explicit model of the current phase, so that a new
participant can know what to do when it joins. For that reason, this
patch uses a phase number to track progress through the parallel hash
join. See MultiExecHash and ExecHashJoin which have switch statements
allowing a newly joined participant to synchronise their own state
machine and program counter with the phase.
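The phase-driven catch-up that the email attributes to MultiExecHash and ExecHashJoin can be sketched generically like this; the phase names and helper functions below are hypothetical, not the patch's actual identifiers:

/* Hypothetical phases of one batch of a shared hash join. */
typedef enum HashJoinPhase
{
    PHASE_CREATING,     /* one participant is creating the empty table      */
    PHASE_BUILDING,     /* participants are inserting inner tuples          */
    PHASE_PROBING,      /* participants are probing with outer tuples       */
    PHASE_UNMATCHED     /* participants scan for unmatched inner tuples     */
} HashJoinPhase;

extern HashJoinPhase barrier_attach(void);                    /* hypothetical: join, learn current phase */
extern void          barrier_wait_for_phase(HashJoinPhase p); /* hypothetical */
extern void build(void);
extern void probe(void);
extern void scan_unmatched(void);

/* A participant that attaches late inspects the current phase and jumps to
 * the matching point in its own state machine instead of assuming it is
 * starting from the beginning. */
void
participate(void)
{
    switch (barrier_attach())
    {
        case PHASE_CREATING:
            barrier_wait_for_phase(PHASE_BUILDING);
            /* fall through */
        case PHASE_BUILDING:
            build();
            barrier_wait_for_phase(PHASE_PROBING);
            /* fall through */
        case PHASE_PROBING:
            probe();
            barrier_wait_for_phase(PHASE_UNMATCHED);
            /* fall through */
        case PHASE_UNMATCHED:
            scan_unmatched();
            break;
    }
}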
Problem 2: One participant is not like the others: Gather may or may
not decide to run its subplan directly if the worker processes aren't
producing any tuples (and the proposed Gather Merge is the same). The
problem is that it also needs to consume tuples from the fixed-size
queues of the regular workers. A deadlock could arise if the leader's
plan blocks waiting for other participants while another participant
has filled its output queue and is waiting for the leader to consume.
One way to avoid such deadlocks is to follow the rule that the leader
should never wait for other participants if there is any possibility
that they have emitted tuples. The simplest way to do that would be
to have shared hash plans refuse to run in the leader by returning
NULL to signal the end of this partial tuple stream, but then we'd
lose a CPU compared to non-shared hash plans. The latest point the
leader can exit while respecting that rule is at the end of probing
the first batch. That is the approach taken by the patch currently.
See ExecHashCheckForEarlyExit for logic and discussion. It would be
better to be able to use the leader in later batches too, but as far
as I can see that'd require changes that are out of scope for this
patch. One idea would be an executor protocol change allowing plans
running in the leader to detach and yield, saying 'I have no further
tuples right now, but I'm not finished; try again later', and then
reattach when you call it back. Clearly that sails close to
asynchronous execution territory.
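The rule described above can be paraphrased in a few lines (hypothetical code, not the patch's ExecHashCheckForEarlyExit): once the leader may have emitted tuples, it must never block on the other participants, so it detaches after probing batch 0 and leaves any later batches to the workers.

#include <stdbool.h>

extern bool i_am_the_leader(void);                /* hypothetical */
extern void detach_from_shared_hash_join(void);   /* hypothetical */

/* Called after probing of a batch completes; returns true if this
 * participant should stop producing tuples for this join. */
static bool
check_for_early_exit(int batchno_just_probed)
{
    if (i_am_the_leader() && batchno_just_probed == 0)
    {
        detach_from_shared_hash_join();   /* never wait on a later barrier */
        return true;                      /* end of this partial tuple stream */
    }
    return false;
}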
Problem 3: If the leader drops out after the first batch to solve
problem 2, then it may leave behind batch files which must be
processed by other participants. I had originally planned to defer
work on batch file sharing until a later iteration, thinking that it
would be a nice performance improvement to redistribute work from
uneven batch files, but it turns out to be necessary for correct
results because of participants exiting early. I am working on a very
simple batch sharing system to start with... Participants still
generate their own batch files, and then new operations BufFileExport
and BufFileImport are used to grant read-only access to the BufFile to
other participants. Each participant reads its own batch files
entirely and then tries to read from every other participant's batch
files until they are all exhausted, using a shared read head. The
per-tuple locking granularity, extra seeking and needless buffering in
every backend on batch file reads aren't great, and I'm still figuring
out temporary file cleanup/ownership semantics. There may be an
opportunity to make use of 'unified' BufFile concepts from Peter
Geoghegan's work, or create some new reusable shared tuple spilling
infrastructure.
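The reading scheme described above, reduced to a sketch: the shared read head and helper functions here are invented for illustration and are not the proposed BufFileExport/BufFileImport API.

#include <pthread.h>
#include <stdbool.h>

typedef struct SharedBatchReader
{
    pthread_mutex_t lock;     /* stand-in for the shared read head's lock */
    long            next;     /* next tuple index to hand out             */
    long            ntuples;  /* number of tuples in this batch file      */
} SharedBatchReader;

/* Claim the next tuple position under the shared read head; returns false
 * once this batch file is exhausted.  Real code would then seek into the
 * imported, read-only batch file and read that tuple. */
static bool
claim_next_tuple(SharedBatchReader *r, long *pos)
{
    bool ok = false;

    pthread_mutex_lock(&r->lock);
    if (r->next < r->ntuples)
    {
        *pos = r->next++;
        ok = true;
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}

/* Each participant drains its own batch file first, then helps finish every
 * other participant's file until all are exhausted. */
static void
drain_all_batch_files(int my_no, int nparticipants, SharedBatchReader *readers)
{
    for (int i = 0; i < nparticipants; i++)
    {
        SharedBatchReader *r = &readers[(my_no + i) % nparticipants];
        long pos;

        while (claim_next_tuple(r, &pos))
        {
            /* ... load or probe the tuple at position 'pos' ... */
            (void) pos;
        }
    }
}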
3. COSTING
For now, I have introduced a GUC called cpu_shared_tuple_cost which
provides a straw-man model of the overhead of exchanging tuples via a
shared hash table, and the extra process coordination required. If
it's zero then a non-shared hash plan (ie multiple copies) has the
same cost as a shared hash plan, even though the non-shared hash plan
wastefully runs P copies of the plan. If cost represents runtime and
we assume perfectly spherical cows running without interference
from each other, that makes some kind of sense, but it doesn't account
for the wasted resources and contention caused by running the same
plan in parallel. I don't know what to do about that yet. If
cpu_shared_tuple_cost is a positive number, as it probably should be
(more on that later), then shared hash tables look more expensive than
non-shared ones, which is technically true (CPU cache sharing etc) but
unhelpful because what you lose there you tend to gain by not running
all those plans in parallel. In other words cpu_shared_tuple_cost
doesn't really model the cost situation at all well, but it's a useful
GUC for development purposes for now as positive and negative numbers
can be used to turn the feature on and off for testing... As for
work_mem, it seems to me that 9.6 already established that work_mem is
a per participant limit, and it would be only fair to let a shared
plan use a total of work_mem * P too. I am still working on work_mem
accounting and reporting. Accounting for the parallelism in parallel
shared hash plans is easy though: their estimated tuple count is
already divided by P in the underlying partial path, and that is a
fairly accurate characterisation of what's going to happen at
execution time: it's often going to go a lot faster, and those plans
are the real goal of this work.
STATUS
Obviously this is a work in progress. I am actively working on the following:
* rescan
* batch number increases
* skew buckets
* costing model and policy/accounting for work_mem
* shared batch file reading
* preloading next batch
* debugging and testing
* tidying and refactoring
The basic approach is visible and simple cases are working though, so
I am submitting this WIP work for a round of review in the current
commitfest and hoping to get some feedback and ideas. I will post the
patch in a follow-up email shortly... Thanks for reading!
[1] https://www.postgresql.org/message-id/flat/CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com#CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAEepm%3D0HmRefi1%2BxDJ99Gj5APHr8Qr05KZtAxrMj8b%2Bay3o6sA%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CAEepm%3D2_y7oi01OjA_wLvYcWMc9_d%3DLaoxrY3eiROCZkB_qakA%40mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
Example
Unpatched behaviour first: although the hash join runs under Gather, the table being hashed is read in full by every worker process when building the hash table; that scan is not divided among them.
postgres=# create table tbla(id int8 primary key, c1 int8);
CREATE TABLE
postgres=# create table tblb(id int8 primary key, c1 int8);
CREATE TABLE
postgres=# insert into tbla select generate_series(1,5000000), random()*10000;
INSERT 0 5000000
postgres=# insert into tblb select generate_series(1,5000000), random()*100000;
INSERT 0 5000000
postgres=# set min_parallel_table_scan_size =0;
SET
postgres=# set min_parallel_index_scan_size =0;
SET
postgres=# set parallel_setup_cost =0;
SET
postgres=# set parallel_tuple_cost =0;
SET
postgres=# set max_parallel_workers=128;
SET
postgres=# set force_parallel_mode =on;
SET
postgres=# set max_parallel_workers_per_gather =4;
SET
postgres=# alter table tbla set (parallel_workers =32);
ALTER TABLE
postgres=# alter table tblb set (parallel_workers =32);
ALTER TABLE
postgres=# explain (analyze,verbose,timing,costs,buffers) select tbla.c1,count(*),sum(tblb.c1),avg(tblb.c1) from tbla join tblb using (id) group by tbla.c1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=208083.65..208232.93 rows=9952 width=80) (actual time=2865.351..2871.004 rows=10001 loops=1)
Output: tbla.c1, count(*), sum(tblb.c1), avg(tblb.c1)
Group Key: tbla.c1
Buffers: shared hit=162876
-> Gather (cost=207337.25..207486.53 rows=39808 width=80) (actual time=2766.289..2831.069 rows=50005 loops=1)
Output: tbla.c1, (PARTIAL count(*)), (PARTIAL sum(tblb.c1)), (PARTIAL avg(tblb.c1))
Workers Planned: 4
Workers Launched: 4
Buffers: shared hit=162876
-> Partial HashAggregate (cost=207337.25..207486.53 rows=9952 width=80) (actual time=2763.021..2769.581 rows=10001 loops=5)
Output: tbla.c1, PARTIAL count(*), PARTIAL sum(tblb.c1), PARTIAL avg(tblb.c1)
Group Key: tbla.c1
Buffers: shared hit=162180
Worker 0: actual time=2761.708..2768.271 rows=10001 loops=1
Buffers: shared hit=32220
Worker 1: actual time=2761.978..2768.377 rows=10001 loops=1
Buffers: shared hit=32512
Worker 2: actual time=2762.589..2769.099 rows=10001 loops=1
Buffers: shared hit=32185
Worker 3: actual time=2762.840..2769.427 rows=10001 loops=1
Buffers: shared hit=32185
-> Hash Join (cost=139528.00..194837.25 rows=1250000 width=16) (actual time=2012.875..2540.816 rows=1000000 loops=5)
Output: tbla.c1, tblb.c1
Inner Unique: true
Hash Cond: (tbla.id = tblb.id)
Buffers: shared hit=162180
Worker 0: actual time=2038.831..2550.300 rows=959965 loops=1
Buffers: shared hit=32220
Worker 1: actual time=2000.501..2540.438 rows=1013985 loops=1
Buffers: shared hit=32512
Worker 2: actual time=2039.682..2547.897 rows=953490 loops=1
Buffers: shared hit=32185
Worker 3: actual time=2039.998..2548.110 rows=953310 loops=1
Buffers: shared hit=32185
-> Parallel Seq Scan on public.tbla (cost=0.00..39528.00 rows=1250000 width=16) (actual time=0.012..107.563 rows=1000000 loops=5)
Output: tbla.id, tbla.c1
Buffers: shared hit=27028
Worker 0: actual time=0.014..107.047 rows=959965 loops=1
Buffers: shared hit=5189
Worker 1: actual time=0.011..113.478 rows=1013985 loops=1
Buffers: shared hit=5481
Worker 2: actual time=0.014..106.765 rows=953490 loops=1
Buffers: shared hit=5154
Worker 3: actual time=0.016..106.519 rows=953310 loops=1
Buffers: shared hit=5154
-- Note: the Hash node below runs in every worker, and each worker reads the full table.
-> Hash (cost=77028.00..77028.00 rows=5000000 width=16) (actual time=1987.927..1987.927 rows=5000000 loops=5)
Output: tblb.c1, tblb.id
Buckets: 8388608 Batches: 1 Memory Usage: 299911kB
Buffers: shared hit=135140
Worker 0: actual time=2014.261..2014.261 rows=5000000 loops=1
Buffers: shared hit=27028
Worker 1: actual time=1975.721..1975.721 rows=5000000 loops=1
Buffers: shared hit=27028
Worker 2: actual time=2014.507..2014.507 rows=5000000 loops=1
Buffers: shared hit=27028
Worker 3: actual time=2015.002..2015.002 rows=5000000 loops=1
Buffers: shared hit=27028
-> Seq Scan on public.tblb (cost=0.00..77028.00 rows=5000000 width=16) (actual time=0.018..702.727 rows=5000000 loops=5)
Output: tblb.c1, tblb.id
Buffers: shared hit=135140
Worker 0: actual time=0.020..722.961 rows=5000000 loops=1
Buffers: shared hit=27028
Worker 1: actual time=0.021..705.710 rows=5000000 loops=1
Buffers: shared hit=27028
Worker 2: actual time=0.020..714.658 rows=5000000 loops=1
Buffers: shared hit=27028
Worker 3: actual time=0.025..717.066 rows=5000000 loops=1
Buffers: shared hit=27028
Planning time: 0.576 ms
Execution time: 2874.424 ms
(70 rows)
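Reading the numbers in the plan above: the Hash node reports Memory Usage: 299911kB with loops=5, meaning the leader and each of the four workers built its own copy of the same hash table, and the inner Seq Scan on tblb shows 27028 buffer hits per participant, meaning every participant read the whole table:

\[ 5 \times 299911\ \mathrm{kB} \approx 1.43\ \mathrm{GB}, \qquad 5 \times 27028 = 135140\ \mathrm{buffer\ hits}. \]

About 1.43 GB of memory is spent holding five copies of a hash table that contains the same 5,000,000 rows.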
postgres=# explain select tbla.c1,count(*),sum(tblb.c1),avg(tblb.c1) from tbla join tblb using (id) group by tbla.c1;
QUERY PLAN
------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=208083.65..208232.93 rows=9952 width=80)
Group Key: tbla.c1
-> Gather (cost=207337.25..207486.53 rows=39808 width=80)
Workers Planned: 4
-> Partial HashAggregate (cost=207337.25..207486.53 rows=9952 width=80)
Group Key: tbla.c1
-> Hash Join (cost=139528.00..194837.25 rows=1250000 width=16)
Hash Cond: (tbla.id = tblb.id)
-> Parallel Seq Scan on tbla (cost=0.00..39528.00 rows=1250000 width=16)
-> Hash (cost=77028.00..77028.00 rows=5000000 width=16)
-> Seq Scan on tblb (cost=0.00..77028.00 rows=5000000 width=16)
(11 rows)
The patch and how to apply it:
https://commitfest.postgresql.org/14/871/
$tar -zxvf parallel-shared-hash-v15.patchset.tgz
$cd postgresql-10beta2/
$patch -p1 < ../parallel-shared-hash-v15.patchset/0001-hj-refactor-memory-accounting-v15.patch
...
$patch -p1 < ../parallel-shared-hash-v15.patchset/0010-hj-parallel-v15.patch
export USE_NAMED_POSIX_SEMAPHORES=1
LIBS=-lpthread CFLAGS="-O3" ./configure --prefix=/home/digoal/pgsql10
LIBS=-lpthread CFLAGS="-O3" make world -j 64
LIBS=-lpthread CFLAGS="-O3" make install-world
After patching (using a larger pair of test tables, a and b), the hash table is built in parallel and each worker reads only part of the inner table:
postgres=# set max_parallel_workers_per_gather =7;
SET
postgres=# explain (analyze,verbose,timing,costs,buffers) select a.c1,count(*),sum(b.c1) from a join b on (a.id=b.id) group by a.c1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=1788232.17..1789031.13 rows=9987 width=20) (actual time=22222.246..22249.421 rows=10001 loops=1)
Output: a.c1, count(*), sum(b.c1)
Group Key: a.c1
Buffers: shared hit=886251
-> Sort (cost=1788232.17..1788406.94 rows=69909 width=20) (actual time=22222.237..22233.510 rows=80008 loops=1)
Output: a.c1, (PARTIAL count(*)), (PARTIAL sum(b.c1))
Sort Key: a.c1
Sort Method: quicksort Memory: 9323kB
Buffers: shared hit=886251
-> Gather (cost=1782507.01..1782606.88 rows=69909 width=20) (actual time=21733.961..22201.706 rows=80008 loops=1)
Output: a.c1, (PARTIAL count(*)), (PARTIAL sum(b.c1))
Workers Planned: 7
Workers Launched: 7
Buffers: shared hit=886251
-> Partial HashAggregate (cost=1782507.01..1782606.88 rows=9987 width=20) (actual time=21729.524..21731.261 rows=10001 loops=8)
Output: a.c1, PARTIAL count(*), PARTIAL sum(b.c1)
Group Key: a.c1
Buffers: shared hit=884977
Worker 0: actual time=21727.744..21729.451 rows=10001 loops=1
Buffers: shared hit=112708
Worker 1: actual time=21728.234..21729.838 rows=10001 loops=1
Buffers: shared hit=110884
Worker 2: actual time=21728.696..21730.324 rows=10001 loops=1
Buffers: shared hit=109946
Worker 3: actual time=21729.089..21730.726 rows=10001 loops=1
Buffers: shared hit=109402
Worker 4: actual time=21729.271..21730.903 rows=10001 loops=1
Buffers: shared hit=109108
Worker 5: actual time=21729.585..21731.221 rows=10001 loops=1
Buffers: shared hit=108370
Worker 6: actual time=21729.944..21731.582 rows=10001 loops=1
Buffers: shared hit=108794
-> Parallel Hash Join (cost=783906.57..1675364.15 rows=14285714 width=8) (actual time=6740.541..19015.452 rows=12500000 loops=8)
Output: a.c1, b.c1
Hash Cond: (a.id = b.id)
Buffers: shared hit=884977
Worker 0: actual time=6738.838..18970.510 rows=12570572 loops=1
Buffers: shared hit=112708
Worker 1: actual time=6739.232..19045.940 rows=12618258 loops=1
Buffers: shared hit=110884
Worker 2: actual time=6739.621..19014.532 rows=12586364 loops=1
Buffers: shared hit=109946
Worker 3: actual time=6740.099..19035.616 rows=12430904 loops=1
Buffers: shared hit=109402
Worker 4: actual time=6740.401..19009.144 rows=12385704 loops=1
Buffers: shared hit=109108
Worker 5: actual time=6740.568..19034.750 rows=12357454 loops=1
Buffers: shared hit=108370
Worker 6: actual time=6740.920..19012.554 rows=12375534 loops=1
Buffers: shared hit=108794
-> Parallel Seq Scan on postgres.a (cost=0.00..585335.14 rows=14285714 width=8) (actual time=0.016..1324.436 rows=12500000 loops=8)
Output: a.id, a.c1
Buffers: shared hit=442478
Worker 0: actual time=0.017..1348.098 rows=12570572 loops=1
Buffers: shared hit=55622
Worker 1: actual time=0.015..1356.242 rows=12618258 loops=1
Buffers: shared hit=55833
Worker 2: actual time=0.016..1350.642 rows=12586364 loops=1
Buffers: shared hit=55692
Worker 3: actual time=0.017..1339.639 rows=12430904 loops=1
Buffers: shared hit=55004
Worker 4: actual time=0.016..1336.109 rows=12385704 loops=1
Buffers: shared hit=54804
Worker 5: actual time=0.018..1345.363 rows=12357454 loops=1
Buffers: shared hit=54679
Worker 6: actual time=0.017..1338.061 rows=12375534 loops=1
Buffers: shared hit=54759
-- Here the hash table is built in parallel, and each worker reads only part of the data.
-> Parallel Hash (cost=585335.14..585335.14 rows=14285714 width=8) (actual time=6696.394..6696.394 rows=12500000 loops=8)
Output: b.c1, b.id
Buckets: 67108864 Batches: 1 Memory Usage: 4436416kB
Buffers: shared hit=442478
Worker 0: actual time=6738.735..6738.735 rows=12900758 loops=1
Buffers: shared hit=57083
Worker 1: actual time=6739.147..6739.147 rows=12440848 loops=1
Buffers: shared hit=55048
Worker 2: actual time=6739.515..6739.515 rows=12260726 loops=1
Buffers: shared hit=54251
Worker 3: actual time=6739.999..6739.999 rows=12293242 loops=1
Buffers: shared hit=54395
Worker 4: actual time=6740.301..6740.301 rows=12272026 loops=1
Buffers: shared hit=54301
Worker 5: actual time=6740.474..6740.474 rows=12133488 loops=1
Buffers: shared hit=53688
Worker 6: actual time=6740.814..6740.814 rows=12211232 loops=1
Buffers: shared hit=54032
-> Parallel Seq Scan on postgres.b (cost=0.00..585335.14 rows=14285714 width=8) (actual time=0.018..1844.592 rows=12500000 loops=8)
Output: b.c1, b.id
Buffers: shared hit=442478
Worker 0: actual time=0.016..1898.661 rows=12900758 loops=1
Buffers: shared hit=57083
Worker 1: actual time=0.024..1863.159 rows=12440848 loops=1
Buffers: shared hit=55048
Worker 2: actual time=0.015..1852.866 rows=12260726 loops=1
Buffers: shared hit=54251
Worker 3: actual time=0.021..1845.705 rows=12293242 loops=1
Buffers: shared hit=54395
Worker 4: actual time=0.016..1844.956 rows=12272026 loops=1
Buffers: shared hit=54301
Worker 5: actual time=0.020..1838.015 rows=12133488 loops=1
Buffers: shared hit=53688
Worker 6: actual time=0.017..1839.597 rows=12211232 loops=1
Buffers: shared hit=54032
Planning time: 0.268 ms
Execution time: 23271.260 ms
(104 rows)
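Reading the numbers in this plan: the Parallel Seq Scan on b reports 442478 buffer hits in total, and each of the eight participants touched only roughly one eighth of them (about 54,000 to 60,000 each), so the inner table is scanned once with the work divided among the participants, rather than being scanned in full by every participant as in the unpatched plan. The seven workers account for 382,798 hits and the leader for the remainder:

\[ 442478 - (57083 + 55048 + 54251 + 54395 + 54301 + 53688 + 54032) = 59680. \]

Likewise, a single shared hash table is built from those tuples instead of one private copy per participant.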
The full discussion of this patch is in the mailing-list thread linked at the end of this article.
The PostgreSQL community is rigorous about this process: a patch may be discussed on the list for months or even years and revised repeatedly in response to feedback, so by the time it is merged into master it is already quite mature. This is one reason PostgreSQL is well known for its stability.
References
https://commitfest.postgresql.org/13/871/
https://www.postgresql.org/message-id/flat/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com#CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com