使用 PGStrom 2 (GPU JOIN, BulkScan, GpuPreAgg, …)

8 minute read

背景

BUG fixed, 测试了一下。

JOIN性能确实好很多,主要体现在。

1. BuldkScan

300毫秒左右,而seqscan需要1500毫秒。

2. GpuJoin

约4.5秒。而hash join约8秒。

postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 natural join t2 natural join t3;  
                                                                          QUERY PLAN                                                                             
---------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=592268.52..592268.53 rows=1 width=0) (actual time=12571.551..12571.551 rows=1 loops=1)  
   Output: pgstrom.count((pgstrom.nrows()))  
   Buffers: shared hit=132771  
   ->  Custom Scan (GpuPreAgg)  (cost=425951.98..570768.69 rows=17 width=4) (actual time=12548.700..12571.530 rows=28 loops=1)  
         Output: pgstrom.nrows()  
         Bulkload: On (density: 100.00%)  
         Reduction: NoGroup  
         Features: format: tuple-slot, bulkload: unsupported  
         Buffers: shared hit=132771  
         ->  Custom Scan (GpuJoin)  (cost=422951.98..567268.52 rows=10000000 width=0) (actual time=12337.214..12531.078 rows=10000000 loops=1)  
               Pseudo Scan: (t1.c1)::integer, (t1.c2)::integer, (t3.c1)::integer, (t3.c2)::integer  
               Bulkload: On (density: 100.00%)  
               Depth 1: GpuHashJoin, HashKeys: (t1.c1, t1.c2), JoinQual: ((t1.c1 = t3.c1) AND (t1.c2 = t3.c2))  
                        Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)  
               Features: format: heap-tuple, bulkload: supported  
               Buffers: shared hit=132771  
               ->  Custom Scan (GpuJoin)  (cost=211479.50..355978.48 rows=10000000 width=16) (actual time=6805.095..7022.021 rows=10000000 loops=1)  
                     Output: t1.c1, t1.c2, t2.c1, t2.c2  
                     Pseudo Scan: t1.c1::integer, t1.c2::integer, t2.c1::integer, t2.c2::integer  
                     Bulkload: On (density: 100.00%)  
                     Depth 1: GpuHashJoin, HashKeys: (t1.c1, t1.c2), JoinQual: ((t1.c1 = t2.c1) AND (t1.c2 = t2.c2))  
                              Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)  
                     Features: format: tuple-slot, bulkload: supported  
                     Buffers: shared hit=88523  
                     ->  Custom Scan (BulkScan) on public.t1  (cost=0.00..144248.00 rows=10000000 width=8) (actual time=12.278..335.416 rows=10000000 loops=1)  
                           Output: t1.c1, t1.c2  
                           Features: format: tuple-slot, bulkload: supported  
                           Buffers: shared hit=44275  
                     ->  Seq Scan on public.t2  (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.009..1510.590 rows=10000000 loops=1)  
                           Output: t2.c1, t2.c2  
                           Buffers: shared hit=44248  
               ->  Seq Scan on public.t3  (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.017..1539.207 rows=10000000 loops=1)  
                     Output: t3.c1, t3.c2  
                     Buffers: shared hit=44248  
 Planning time: 0.856 ms  
 Execution time: 14226.754 ms  
(36 rows)  
  
postgres=# set pg_strom.enabled=off;  
SET  
postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 natural join t2 natural join t3;  
                                                                   QUERY PLAN                                                                      
-------------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=1889780.00..1889780.01 rows=1 width=0) (actual time=23863.974..23863.974 rows=1 loops=1)  
   Output: count(*)  
   Buffers: shared hit=132744, temp read=184849 written=184073  
   ->  Hash Join  (cost=666622.00..1864780.00 rows=10000000 width=0) (actual time=6751.553..23007.967 rows=10000000 loops=1)  
         Hash Cond: ((t1.c1 = t3.c1) AND (t1.c2 = t3.c2))  
         Buffers: shared hit=132744, temp read=184849 written=184073  
         ->  Hash Join  (cost=333311.00..994748.00 rows=10000000 width=16) (actual time=3369.316..12137.123 rows=10000000 loops=1)  
               Output: t1.c1, t1.c2, t2.c1, t2.c2  
               Hash Cond: ((t1.c1 = t2.c1) AND (t1.c2 = t2.c2))  
               Buffers: shared hit=88496, temp read=86300 written=85912  
               ->  Seq Scan on public.t1  (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.008..1079.109 rows=10000000 loops=1)  
                     Output: t1.c1, t1.c2  
                     Buffers: shared hit=44248  
               ->  Hash  (cost=144248.00..144248.00 rows=10000000 width=8) (actual time=3368.576..3368.576 rows=10000000 loops=1)  
                     Output: t2.c1, t2.c2  
                     Buckets: 131072 (originally 131072)  Batches: 256 (originally 128)  Memory Usage: 4073kB  
                     Buffers: shared hit=44248, temp written=33847  
                     ->  Seq Scan on public.t2  (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.007..1096.536 rows=10000000 loops=1)  
                           Output: t2.c1, t2.c2  
                           Buffers: shared hit=44248  
         ->  Hash  (cost=144248.00..144248.00 rows=10000000 width=8) (actual time=3382.059..3382.059 rows=10000000 loops=1)  
               Output: t3.c1, t3.c2  
               Buckets: 131072 (originally 131072)  Batches: 256 (originally 128)  Memory Usage: 4073kB  
               Buffers: shared hit=44248, temp written=33847  
               ->  Seq Scan on public.t3  (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.010..1102.204 rows=10000000 loops=1)  
                     Output: t3.c1, t3.c2  
                     Buffers: shared hit=44248  
 Planning time: 0.594 ms  
 Execution time: 23864.050 ms  
(29 rows)  

8个表的JOIN,不使用 GPU。

postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 natural join t2 natural join t3 natural join t4 natural join t5 natural join t6 natural join t7 natural join t8 ;  
                                                                                QUERY PLAN                                                                                  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=5773641.04..5773641.05 rows=1 width=0) (actual time=78621.151..78621.151 rows=1 loops=1)  
   Output: count(*)  
   Buffers: shared hit=435962, temp read=518383 written=516199  
   ->  Hash Join  (cost=2720512.24..5748641.21 rows=9999931 width=0) (actual time=33751.548..77761.816 rows=10000000 loops=1)  
         Hash Cond: ((t1.c1 = t4.c1) AND (t1.c2 = t4.c2))  
         Buffers: shared hit=435962, temp read=518383 written=516199  
         ->  Hash Join  (cost=1526935.54..3930456.05 rows=9999954 width=48) (actual time=18264.080..53774.071 rows=10000000 loops=1)  
               Output: t1.c1, t1.c2, t2.c1, t2.c2, t3.c1, t3.c2, t5.c1, t5.c2, t7.c1, t7.c2, t8.c1, t8.c2  
               Hash Cond: ((t1.c1 = t5.c1) AND (t1.c2 = t5.c2))  
               Buffers: shared hit=347463, temp read=304863 written=303577  
               ->  Hash Join  (cost=333358.83..2151331.98 rows=9999977 width=32) (actual time=3509.315..30357.234 rows=10000000 loops=1)  
                     Output: t1.c1, t1.c2, t2.c1, t2.c2, t3.c1, t3.c2, t8.c1, t8.c2  
                     Hash Cond: ((t1.c1 = t8.c1) AND (t1.c2 = t8.c2))  
                     Buffers: shared hit=258967, temp read=110794 written=110406  
                     ->  Merge Join  (cost=48.41..1261770.78 rows=10000000 width=24) (actual time=0.077..17169.673 rows=10000000 loops=1)  
                           Output: t1.c1, t1.c2, t2.c1, t2.c2, t3.c1, t3.c2  
                           Merge Cond: (t1.c1 = t3.c1)  
                           Join Filter: (t1.c2 = t3.c2)  
                           Buffers: shared hit=214719  
                           ->  Merge Join  (cost=23.83..782865.84 rows=10000000 width=16) (actual time=0.059..9851.866 rows=10000000 loops=1)  
                                 Output: t1.c1, t1.c2, t2.c1, t2.c2  
                                 Merge Cond: (t1.c1 = t2.c1)  
                                 Join Filter: (t1.c2 = t2.c2)  
                                 Buffers: shared hit=143146  
                                 ->  Index Scan using idx1 on public.t1  (cost=0.43..303939.43 rows=10000000 width=8) (actual time=0.034..2221.148 rows=10000000 loops=1)  
                                       Output: t1.c1, t1.c2  
                                       Buffers: shared hit=71573  
                                 ->  Index Scan using idx2 on public.t2  (cost=0.43..303939.43 rows=10000000 width=8) (actual time=0.013..2620.760 rows=10000000 loops=1)  
                                       Output: t2.c1, t2.c2  
                                       Buffers: shared hit=71573  
                           ->  Index Scan using idx3 on public.t3  (cost=0.43..303939.43 rows=10000000 width=8) (actual time=0.013..2675.423 rows=10000000 loops=1)  
                                 Output: t3.c1, t3.c2  
                                 Buffers: shared hit=71573  
                     ->  Hash  (cost=144247.77..144247.77 rows=9999977 width=8) (actual time=3508.083..3508.083 rows=10000000 loops=1)  
                           Output: t8.c1, t8.c2  
                           Buckets: 131072 (originally 131072)  Batches: 256 (originally 128)  Memory Usage: 4073kB  
                           Buffers: shared hit=44248, temp written=33847  
                           ->  Seq Scan on public.t8  (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.034..1189.104 rows=10000000 loops=1)  
                                 Output: t8.c1, t8.c2  
                                 Buffers: shared hit=44248  
               ->  Hash  (cost=994748.05..994748.05 rows=9999977 width=16) (actual time=14751.206..14751.206 rows=10000000 loops=1)  
                     Output: t5.c1, t5.c2, t7.c1, t7.c2  
                     Buckets: 131072  Batches: 256  Memory Usage: 2855kB  
                     Buffers: shared hit=88496, temp read=86300 written=129560  
                     ->  Hash Join  (cost=333312.20..994748.05 rows=9999977 width=16) (actual time=3485.575..12066.617 rows=10000000 loops=1)  
                           Output: t5.c1, t5.c2, t7.c1, t7.c2  
                           Hash Cond: ((t7.c1 = t5.c1) AND (t7.c2 = t5.c2))  
                           Buffers: shared hit=88496, temp read=86300 written=85912  
                           ->  Seq Scan on public.t7  (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.017..1176.896 rows=10000000 loops=1)  
                                 Output: t7.c1, t7.c2  
                                 Buffers: shared hit=44248  
                           ->  Hash  (cost=144248.48..144248.48 rows=10000048 width=8) (actual time=3484.279..3484.279 rows=10000000 loops=1)  
                                 Output: t5.c1, t5.c2  
                                 Buckets: 131072 (originally 131072)  Batches: 256 (originally 128)  Memory Usage: 4073kB  
                                 Buffers: shared hit=44248, temp written=33847  
                                 ->  Seq Scan on public.t5  (cost=0.00..144248.48 rows=10000048 width=8) (actual time=0.012..1191.389 rows=10000000 loops=1)  
                                       Output: t5.c1, t5.c2  
                                       Buffers: shared hit=44248  
         ->  Hash  (cost=994748.05..994748.05 rows=9999977 width=16) (actual time=15486.754..15486.754 rows=10000000 loops=1)  
               Output: t4.c1, t4.c2, t6.c1, t6.c2  
               Buckets: 131072  Batches: 256  Memory Usage: 2855kB  
               Buffers: shared hit=88496, temp read=86300 written=129560  
               ->  Hash Join  (cost=333312.20..994748.05 rows=9999977 width=16) (actual time=3494.887..12813.050 rows=10000000 loops=1)  
                     Output: t4.c1, t4.c2, t6.c1, t6.c2  
                     Hash Cond: ((t6.c1 = t4.c1) AND (t6.c2 = t4.c2))  
                     Buffers: shared hit=88496, temp read=86300 written=85912  
                     ->  Seq Scan on public.t6  (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.025..1214.920 rows=10000000 loops=1)  
                           Output: t6.c1, t6.c2  
                           Buffers: shared hit=44248  
                     ->  Hash  (cost=144248.48..144248.48 rows=10000048 width=8) (actual time=3493.640..3493.640 rows=10000000 loops=1)  
                           Output: t4.c1, t4.c2  
                           Buckets: 131072 (originally 131072)  Batches: 256 (originally 128)  Memory Usage: 4073kB  
                           Buffers: shared hit=44248, temp written=33847  
                           ->  Seq Scan on public.t4  (cost=0.00..144248.48 rows=10000048 width=8) (actual time=0.016..1191.688 rows=10000000 loops=1)  
                                 Output: t4.c1, t4.c2  
                                 Buffers: shared hit=44248  
 Planning time: 46.170 ms  
 Execution time: 78621.977 ms  
(78 rows)  

使用GPU。

postgres=# set pg_strom.enabled=on;  
SET  
postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 natural join t2 natural join t3 natural join t4 natural join t5 natural join t6 natural join t7 natural join t8 ;  
                                                                                           QUERY PLAN                                                                                              
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Aggregate  (cost=1663562.75..1663562.76 rows=1 width=0) (actual time=48858.541..48858.541 rows=1 loops=1)  
   Output: pgstrom.count((pgstrom.nrows()))  
   Buffers: shared hit=354038  
   ->  Custom Scan (GpuPreAgg)  (cost=1453270.24..1642063.09 rows=17 width=4) (actual time=48794.894..48858.504 rows=28 loops=1)  
         Output: pgstrom.nrows()  
         Bulkload: On (density: 100.00%)  
         Reduction: NoGroup  
         Features: format: tuple-slot, bulkload: unsupported  
         Buffers: shared hit=354038  
         ->  Custom Scan (GpuJoin)  (cost=1450270.24..1638562.92 rows=9999931 width=0) (actual time=48438.139..48777.446 rows=10000000 loops=1)  
               Pseudo Scan: (t1.c1)::integer, (t1.c2)::integer, (t4.c1)::integer, (t4.c2)::integer  
               Bulkload: On (density: 100.00%)  
               Depth 1: GpuHashJoin, HashKeys: (t1.c1, t1.c2), JoinQual: ((t1.c1 = t4.c1) AND (t1.c2 = t4.c2))  
                        Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 801.09MB planned 560.27MB, nbatches: 1 planned 2)  
               Features: format: heap-tuple, bulkload: supported  
               Buffers: shared hit=354038  
               ->  Custom Scan (GpuJoin)  (cost=422951.98..567268.52 rows=10000000 width=24) (actual time=9964.714..10167.462 rows=10000000 loops=1)  
                     Output: t1.c1, t1.c2, t2.c1, t2.c2, t3.c1, t3.c2  
                     Pseudo Scan: t1.c1::integer, t1.c2::integer, t2.c1::integer, t2.c2::integer, t3.c1::integer, t3.c2::integer  
                     Bulkload: On (density: 100.00%)  
                     Depth 1: GpuHashJoin, HashKeys: (t1.c1, t1.c2), JoinQual: ((t1.c1 = t3.c1) AND (t1.c2 = t3.c2))  
                              Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)  
                     Features: format: tuple-slot, bulkload: supported  
                     Buffers: shared hit=132771  
                     ->  Custom Scan (GpuJoin)  (cost=211479.50..355978.48 rows=10000000 width=16) (actual time=5018.376..5238.156 rows=10000000 loops=1)  
                           Output: t1.c1, t1.c2, t2.c1, t2.c2  
                           Pseudo Scan: t1.c1::integer, t1.c2::integer, t2.c1::integer, t2.c2::integer  
                           Bulkload: On (density: 100.00%)  
                           Depth 1: GpuHashJoin, HashKeys: (t1.c1, t1.c2), JoinQual: ((t1.c1 = t2.c1) AND (t1.c2 = t2.c2))  
                                    Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)  
                           Features: format: tuple-slot, bulkload: supported  
                           Buffers: shared hit=88523  
                           ->  Custom Scan (BulkScan) on public.t1  (cost=0.00..144248.00 rows=10000000 width=8) (actual time=12.040..322.669 rows=10000000 loops=1)  
                                 Output: t1.c1, t1.c2  
                                 Features: format: tuple-slot, bulkload: supported  
                                 Buffers: shared hit=44275  
                           ->  Seq Scan on public.t2  (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.008..1528.913 rows=10000000 loops=1)  
                                 Output: t2.c1, t2.c2  
                                 Buffers: shared hit=44248  
                     ->  Seq Scan on public.t3  (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.022..1529.554 rows=10000000 loops=1)  
                           Output: t3.c1, t3.c2  
                           Buffers: shared hit=44248  
               ->  Custom Scan (GpuJoin)  (cost=845250.50..990105.67 rows=9999835 width=40) (actual time=25935.700..27197.613 rows=10000000 loops=1)  
                     Output: t4.c1, t4.c2, t5.c1, t5.c2, t6.c1, t6.c2, t7.c1, t7.c2, t8.c1, t8.c2  
                     Pseudo Scan: t4.c1::integer, t4.c2::integer, t5.c1::integer, t5.c2::integer, t6.c1::integer, t6.c2::integer, t7.c1::integer, t7.c2::integer, t8.c1::integer, t8.c2::integer  
                     Bulkload: On (density: 100.00%)  
                     Depth 1: GpuHashJoin, HashKeys: (t4.c1, t4.c2), JoinQual: ((t4.c1 = t5.c1) AND (t4.c2 = t5.c2))  
                              Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)  
                     Features: format: tuple-slot, bulkload: supported  
                     Buffers: shared hit=221267  
                     ->  Custom Scan (GpuJoin)  (cost=634200.09..778635.44 rows=9999835 width=32) (actual time=20268.808..20269.446 rows=10000000 loops=1)  
                           Output: t4.c1, t4.c2, t6.c1, t6.c2, t7.c1, t7.c2, t8.c1, t8.c2  
                           Pseudo Scan: t4.c1::integer, t4.c2::integer, t6.c1::integer, t6.c2::integer, t7.c1::integer, t7.c2::integer, t8.c1::integer, t8.c2::integer  
                           Bulkload: On (density: 100.00%)  
                           Depth 1: GpuHashJoin, HashKeys: (t4.c1, t4.c2), JoinQual: ((t4.c1 = t8.c1) AND (t4.c2 = t8.c2))  
                                    Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)  
                           Features: format: tuple-slot, bulkload: supported  
                           Buffers: shared hit=177019  
                           ->  Custom Scan (GpuJoin)  (cost=422951.23..567266.84 rows=9999906 width=24) (actual time=14136.992..14560.294 rows=10000000 loops=1)  
                                 Output: t4.c1, t4.c2, t6.c1, t6.c2, t7.c1, t7.c2  
                                 Pseudo Scan: t4.c1::integer, t4.c2::integer, t6.c1::integer, t6.c2::integer, t7.c1::integer, t7.c2::integer  
                                 Bulkload: On (density: 100.00%)  
                                 Depth 1: GpuHashJoin, HashKeys: (t4.c1, t4.c2), JoinQual: ((t4.c1 = t7.c1) AND (t4.c2 = t7.c2))  
                                          Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)  
                                 Features: format: tuple-slot, bulkload: supported  
                                 Buffers: shared hit=132771  
                                 ->  Custom Scan (GpuJoin)  (cost=211479.13..355977.88 rows=9999977 width=16) (actual time=7517.167..7812.485 rows=10000000 loops=1)  
                                       Output: t4.c1, t4.c2, t6.c1, t6.c2  
                                       Pseudo Scan: t4.c1::integer, t4.c2::integer, t6.c1::integer, t6.c2::integer  
                                       Bulkload: On (density: 100.00%)  
                                       Depth 1: GpuHashJoin, HashKeys: (t4.c1, t4.c2), JoinQual: ((t4.c1 = t6.c1) AND (t4.c2 = t6.c2))  
                                                Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)  
                                       Features: format: tuple-slot, bulkload: supported  
                                       Buffers: shared hit=88523  
                                       ->  Custom Scan (BulkScan) on public.t4  (cost=0.00..144248.48 rows=10000048 width=8) (actual time=12.063..322.445 rows=10000000 loops=1)  
                                             Output: t4.c1, t4.c2  
                                             Features: format: tuple-slot, bulkload: supported  
                                             Buffers: shared hit=44275  
                                       ->  Seq Scan on public.t6  (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.009..1559.817 rows=10000000 loops=1)  
                                             Output: t6.c1, t6.c2  
                                             Buffers: shared hit=44248  
                                 ->  Seq Scan on public.t7  (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.022..1527.810 rows=10000000 loops=1)  
                                       Output: t7.c1, t7.c2  
                                       Buffers: shared hit=44248  
                           ->  Seq Scan on public.t8  (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.023..1521.776 rows=10000000 loops=1)  
                                 Output: t8.c1, t8.c2  
                                 Buffers: shared hit=44248  
                     ->  Seq Scan on public.t5  (cost=0.00..144248.48 rows=10000048 width=8) (actual time=0.015..1499.270 rows=10000000 loops=1)  
                           Output: t5.c1, t5.c2  
                           Buffers: shared hit=44248  
 Planning time: 82.220 ms  
 Execution time: 52799.548 ms  
(92 rows)  

参考

1. 《试用 PGStrom》

Flag Counter

digoal’s 大量PostgreSQL文章入口