Using PGStrom, Part 2 (GpuJoin, BulkScan, GpuPreAgg, ...)
Background
The bug has been fixed, so I reran the tests (see the earlier post in References).
JOIN performance is indeed much better, mainly in two areas (a sketch of the test schema follows this list):
1. BulkScan
Around 300 ms, versus roughly 1500 ms for a plain Seq Scan of the same table.
2. GpuJoin
Each join step takes about 4.5 seconds, versus roughly 8 seconds for the equivalent CPU hash join.
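The original post never shows the test schema, so here is a minimal reconstruction inferred from the plans below: eight tables t1 .. t8 with two integer columns each, 10 million identical rows (so the natural joins stay exactly 1:1), and btree indexes idx1 .. idx3 on t1 .. t3 (apparently on c1 alone, since the merge joins later match on c1 and filter on c2). Treat every statement here as an assumption, not the author's exact DDL.

create table t1 (c1 int, c2 int);                        -- repeat for t2 .. t8
insert into t1 select id, id from generate_series(1, 10000000) as id;  -- identical data keeps the joins 1:1
create index idx1 on t1 (c1);                            -- likewise idx2 on t2, idx3 on t3
analyze t1;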
postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 natural join t2 natural join t3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=592268.52..592268.53 rows=1 width=0) (actual time=12571.551..12571.551 rows=1 loops=1)
   Output: pgstrom.count((pgstrom.nrows()))
   Buffers: shared hit=132771
   -> Custom Scan (GpuPreAgg) (cost=425951.98..570768.69 rows=17 width=4) (actual time=12548.700..12571.530 rows=28 loops=1)
         Output: pgstrom.nrows()
         Bulkload: On (density: 100.00%)
         Reduction: NoGroup
         Features: format: tuple-slot, bulkload: unsupported
         Buffers: shared hit=132771
         -> Custom Scan (GpuJoin) (cost=422951.98..567268.52 rows=10000000 width=0) (actual time=12337.214..12531.078 rows=10000000 loops=1)
               Pseudo Scan: (t1.c1)::integer, (t1.c2)::integer, (t3.c1)::integer, (t3.c2)::integer
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (t1.c1, t1.c2), JoinQual: ((t1.c1 = t3.c1) AND (t1.c2 = t3.c2))
               Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)
               Features: format: heap-tuple, bulkload: supported
               Buffers: shared hit=132771
               -> Custom Scan (GpuJoin) (cost=211479.50..355978.48 rows=10000000 width=16) (actual time=6805.095..7022.021 rows=10000000 loops=1)
                     Output: t1.c1, t1.c2, t2.c1, t2.c2
                     Pseudo Scan: t1.c1::integer, t1.c2::integer, t2.c1::integer, t2.c2::integer
                     Bulkload: On (density: 100.00%)
                     Depth 1: GpuHashJoin, HashKeys: (t1.c1, t1.c2), JoinQual: ((t1.c1 = t2.c1) AND (t1.c2 = t2.c2))
                     Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)
                     Features: format: tuple-slot, bulkload: supported
                     Buffers: shared hit=88523
                     -> Custom Scan (BulkScan) on public.t1 (cost=0.00..144248.00 rows=10000000 width=8) (actual time=12.278..335.416 rows=10000000 loops=1)
                           Output: t1.c1, t1.c2
                           Features: format: tuple-slot, bulkload: supported
                           Buffers: shared hit=44275
                     -> Seq Scan on public.t2 (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.009..1510.590 rows=10000000 loops=1)
                           Output: t2.c1, t2.c2
                           Buffers: shared hit=44248
               -> Seq Scan on public.t3 (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.017..1539.207 rows=10000000 loops=1)
                     Output: t3.c1, t3.c2
                     Buffers: shared hit=44248
 Planning time: 0.856 ms
 Execution time: 14226.754 ms
(36 rows)
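Before turning PG-Strom off for the CPU baseline, note that its switches are ordinary GUCs, so you can list whatever your build exposes. A small sketch; apart from pg_strom.enabled, which appears below, the available names vary by version:

select name, setting from pg_settings where name like 'pg_strom.%';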
postgres=# set pg_strom.enabled=off;
SET
postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 natural join t2 natural join t3;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=1889780.00..1889780.01 rows=1 width=0) (actual time=23863.974..23863.974 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=132744, temp read=184849 written=184073
   -> Hash Join (cost=666622.00..1864780.00 rows=10000000 width=0) (actual time=6751.553..23007.967 rows=10000000 loops=1)
         Hash Cond: ((t1.c1 = t3.c1) AND (t1.c2 = t3.c2))
         Buffers: shared hit=132744, temp read=184849 written=184073
         -> Hash Join (cost=333311.00..994748.00 rows=10000000 width=16) (actual time=3369.316..12137.123 rows=10000000 loops=1)
               Output: t1.c1, t1.c2, t2.c1, t2.c2
               Hash Cond: ((t1.c1 = t2.c1) AND (t1.c2 = t2.c2))
               Buffers: shared hit=88496, temp read=86300 written=85912
               -> Seq Scan on public.t1 (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.008..1079.109 rows=10000000 loops=1)
                     Output: t1.c1, t1.c2
                     Buffers: shared hit=44248
               -> Hash (cost=144248.00..144248.00 rows=10000000 width=8) (actual time=3368.576..3368.576 rows=10000000 loops=1)
                     Output: t2.c1, t2.c2
                     Buckets: 131072 (originally 131072) Batches: 256 (originally 128) Memory Usage: 4073kB
                     Buffers: shared hit=44248, temp written=33847
                     -> Seq Scan on public.t2 (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.007..1096.536 rows=10000000 loops=1)
                           Output: t2.c1, t2.c2
                           Buffers: shared hit=44248
         -> Hash (cost=144248.00..144248.00 rows=10000000 width=8) (actual time=3382.059..3382.059 rows=10000000 loops=1)
               Output: t3.c1, t3.c2
               Buckets: 131072 (originally 131072) Batches: 256 (originally 128) Memory Usage: 4073kB
               Buffers: shared hit=44248, temp written=33847
               -> Seq Scan on public.t3 (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.010..1102.204 rows=10000000 loops=1)
                     Output: t3.c1, t3.c2
                     Buffers: shared hit=44248
 Planning time: 0.594 ms
 Execution time: 23864.050 ms
(29 rows)
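Note the "Batches: 256" and temp read/written counters in the plan above: with the default work_mem, each 10-million-row hash table spills to disk, which accounts for part of the CPU hash join's cost. A hedged tuning sketch (the value is illustrative, not something the original test used):

set work_mem = '2GB';  -- assumed value, large enough to keep Batches at 1 if RAM allows
explain (analyze, buffers) select count(*) from t1 natural join t2 natural join t3;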
An 8-table JOIN, first without the GPU:
postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 natural join t2 natural join t3 natural join t4 natural join t5 natural join t6 natural join t7 natural join t8 ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=5773641.04..5773641.05 rows=1 width=0) (actual time=78621.151..78621.151 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=435962, temp read=518383 written=516199
   -> Hash Join (cost=2720512.24..5748641.21 rows=9999931 width=0) (actual time=33751.548..77761.816 rows=10000000 loops=1)
         Hash Cond: ((t1.c1 = t4.c1) AND (t1.c2 = t4.c2))
         Buffers: shared hit=435962, temp read=518383 written=516199
         -> Hash Join (cost=1526935.54..3930456.05 rows=9999954 width=48) (actual time=18264.080..53774.071 rows=10000000 loops=1)
               Output: t1.c1, t1.c2, t2.c1, t2.c2, t3.c1, t3.c2, t5.c1, t5.c2, t7.c1, t7.c2, t8.c1, t8.c2
               Hash Cond: ((t1.c1 = t5.c1) AND (t1.c2 = t5.c2))
               Buffers: shared hit=347463, temp read=304863 written=303577
               -> Hash Join (cost=333358.83..2151331.98 rows=9999977 width=32) (actual time=3509.315..30357.234 rows=10000000 loops=1)
                     Output: t1.c1, t1.c2, t2.c1, t2.c2, t3.c1, t3.c2, t8.c1, t8.c2
                     Hash Cond: ((t1.c1 = t8.c1) AND (t1.c2 = t8.c2))
                     Buffers: shared hit=258967, temp read=110794 written=110406
                     -> Merge Join (cost=48.41..1261770.78 rows=10000000 width=24) (actual time=0.077..17169.673 rows=10000000 loops=1)
                           Output: t1.c1, t1.c2, t2.c1, t2.c2, t3.c1, t3.c2
                           Merge Cond: (t1.c1 = t3.c1)
                           Join Filter: (t1.c2 = t3.c2)
                           Buffers: shared hit=214719
                           -> Merge Join (cost=23.83..782865.84 rows=10000000 width=16) (actual time=0.059..9851.866 rows=10000000 loops=1)
                                 Output: t1.c1, t1.c2, t2.c1, t2.c2
                                 Merge Cond: (t1.c1 = t2.c1)
                                 Join Filter: (t1.c2 = t2.c2)
                                 Buffers: shared hit=143146
                                 -> Index Scan using idx1 on public.t1 (cost=0.43..303939.43 rows=10000000 width=8) (actual time=0.034..2221.148 rows=10000000 loops=1)
                                       Output: t1.c1, t1.c2
                                       Buffers: shared hit=71573
                                 -> Index Scan using idx2 on public.t2 (cost=0.43..303939.43 rows=10000000 width=8) (actual time=0.013..2620.760 rows=10000000 loops=1)
                                       Output: t2.c1, t2.c2
                                       Buffers: shared hit=71573
                           -> Index Scan using idx3 on public.t3 (cost=0.43..303939.43 rows=10000000 width=8) (actual time=0.013..2675.423 rows=10000000 loops=1)
                                 Output: t3.c1, t3.c2
                                 Buffers: shared hit=71573
                     -> Hash (cost=144247.77..144247.77 rows=9999977 width=8) (actual time=3508.083..3508.083 rows=10000000 loops=1)
                           Output: t8.c1, t8.c2
                           Buckets: 131072 (originally 131072) Batches: 256 (originally 128) Memory Usage: 4073kB
                           Buffers: shared hit=44248, temp written=33847
                           -> Seq Scan on public.t8 (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.034..1189.104 rows=10000000 loops=1)
                                 Output: t8.c1, t8.c2
                                 Buffers: shared hit=44248
               -> Hash (cost=994748.05..994748.05 rows=9999977 width=16) (actual time=14751.206..14751.206 rows=10000000 loops=1)
                     Output: t5.c1, t5.c2, t7.c1, t7.c2
                     Buckets: 131072 Batches: 256 Memory Usage: 2855kB
                     Buffers: shared hit=88496, temp read=86300 written=129560
                     -> Hash Join (cost=333312.20..994748.05 rows=9999977 width=16) (actual time=3485.575..12066.617 rows=10000000 loops=1)
                           Output: t5.c1, t5.c2, t7.c1, t7.c2
                           Hash Cond: ((t7.c1 = t5.c1) AND (t7.c2 = t5.c2))
                           Buffers: shared hit=88496, temp read=86300 written=85912
                           -> Seq Scan on public.t7 (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.017..1176.896 rows=10000000 loops=1)
                                 Output: t7.c1, t7.c2
                                 Buffers: shared hit=44248
                           -> Hash (cost=144248.48..144248.48 rows=10000048 width=8) (actual time=3484.279..3484.279 rows=10000000 loops=1)
                                 Output: t5.c1, t5.c2
                                 Buckets: 131072 (originally 131072) Batches: 256 (originally 128) Memory Usage: 4073kB
                                 Buffers: shared hit=44248, temp written=33847
                                 -> Seq Scan on public.t5 (cost=0.00..144248.48 rows=10000048 width=8) (actual time=0.012..1191.389 rows=10000000 loops=1)
                                       Output: t5.c1, t5.c2
                                       Buffers: shared hit=44248
         -> Hash (cost=994748.05..994748.05 rows=9999977 width=16) (actual time=15486.754..15486.754 rows=10000000 loops=1)
               Output: t4.c1, t4.c2, t6.c1, t6.c2
               Buckets: 131072 Batches: 256 Memory Usage: 2855kB
               Buffers: shared hit=88496, temp read=86300 written=129560
               -> Hash Join (cost=333312.20..994748.05 rows=9999977 width=16) (actual time=3494.887..12813.050 rows=10000000 loops=1)
                     Output: t4.c1, t4.c2, t6.c1, t6.c2
                     Hash Cond: ((t6.c1 = t4.c1) AND (t6.c2 = t4.c2))
                     Buffers: shared hit=88496, temp read=86300 written=85912
                     -> Seq Scan on public.t6 (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.025..1214.920 rows=10000000 loops=1)
                           Output: t6.c1, t6.c2
                           Buffers: shared hit=44248
                     -> Hash (cost=144248.48..144248.48 rows=10000048 width=8) (actual time=3493.640..3493.640 rows=10000000 loops=1)
                           Output: t4.c1, t4.c2
                           Buckets: 131072 (originally 131072) Batches: 256 (originally 128) Memory Usage: 4073kB
                           Buffers: shared hit=44248, temp written=33847
                           -> Seq Scan on public.t4 (cost=0.00..144248.48 rows=10000048 width=8) (actual time=0.016..1191.688 rows=10000000 loops=1)
                                 Output: t4.c1, t4.c2
                                 Buffers: shared hit=44248
 Planning time: 46.170 ms
 Execution time: 78621.977 ms
(78 rows)
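Planning time also grows with the number of relations (46 ms here versus under 1 ms for three tables), since the planner searches join orders exhaustively up to its collapse limits. The stock PostgreSQL knobs that govern this are worth knowing; defaults noted in the comments:

show join_collapse_limit;   -- default 8: beyond this, explicit JOIN order is kept as written
show from_collapse_limit;   -- default 8
show geqo_threshold;        -- default 12: genetic (non-exhaustive) search kicks in here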
Now the same 8-table JOIN with the GPU:
postgres=# set pg_strom.enabled=on;
SET
postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 natural join t2 natural join t3 natural join t4 natural join t5 natural join t6 natural join t7 natural join t8 ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=1663562.75..1663562.76 rows=1 width=0) (actual time=48858.541..48858.541 rows=1 loops=1)
   Output: pgstrom.count((pgstrom.nrows()))
   Buffers: shared hit=354038
   -> Custom Scan (GpuPreAgg) (cost=1453270.24..1642063.09 rows=17 width=4) (actual time=48794.894..48858.504 rows=28 loops=1)
         Output: pgstrom.nrows()
         Bulkload: On (density: 100.00%)
         Reduction: NoGroup
         Features: format: tuple-slot, bulkload: unsupported
         Buffers: shared hit=354038
         -> Custom Scan (GpuJoin) (cost=1450270.24..1638562.92 rows=9999931 width=0) (actual time=48438.139..48777.446 rows=10000000 loops=1)
               Pseudo Scan: (t1.c1)::integer, (t1.c2)::integer, (t4.c1)::integer, (t4.c2)::integer
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (t1.c1, t1.c2), JoinQual: ((t1.c1 = t4.c1) AND (t1.c2 = t4.c2))
               Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 801.09MB planned 560.27MB, nbatches: 1 planned 2)
               Features: format: heap-tuple, bulkload: supported
               Buffers: shared hit=354038
               -> Custom Scan (GpuJoin) (cost=422951.98..567268.52 rows=10000000 width=24) (actual time=9964.714..10167.462 rows=10000000 loops=1)
                     Output: t1.c1, t1.c2, t2.c1, t2.c2, t3.c1, t3.c2
                     Pseudo Scan: t1.c1::integer, t1.c2::integer, t2.c1::integer, t2.c2::integer, t3.c1::integer, t3.c2::integer
                     Bulkload: On (density: 100.00%)
                     Depth 1: GpuHashJoin, HashKeys: (t1.c1, t1.c2), JoinQual: ((t1.c1 = t3.c1) AND (t1.c2 = t3.c2))
                     Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)
                     Features: format: tuple-slot, bulkload: supported
                     Buffers: shared hit=132771
                     -> Custom Scan (GpuJoin) (cost=211479.50..355978.48 rows=10000000 width=16) (actual time=5018.376..5238.156 rows=10000000 loops=1)
                           Output: t1.c1, t1.c2, t2.c1, t2.c2
                           Pseudo Scan: t1.c1::integer, t1.c2::integer, t2.c1::integer, t2.c2::integer
                           Bulkload: On (density: 100.00%)
                           Depth 1: GpuHashJoin, HashKeys: (t1.c1, t1.c2), JoinQual: ((t1.c1 = t2.c1) AND (t1.c2 = t2.c2))
                           Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)
                           Features: format: tuple-slot, bulkload: supported
                           Buffers: shared hit=88523
                           -> Custom Scan (BulkScan) on public.t1 (cost=0.00..144248.00 rows=10000000 width=8) (actual time=12.040..322.669 rows=10000000 loops=1)
                                 Output: t1.c1, t1.c2
                                 Features: format: tuple-slot, bulkload: supported
                                 Buffers: shared hit=44275
                           -> Seq Scan on public.t2 (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.008..1528.913 rows=10000000 loops=1)
                                 Output: t2.c1, t2.c2
                                 Buffers: shared hit=44248
                     -> Seq Scan on public.t3 (cost=0.00..144248.00 rows=10000000 width=8) (actual time=0.022..1529.554 rows=10000000 loops=1)
                           Output: t3.c1, t3.c2
                           Buffers: shared hit=44248
               -> Custom Scan (GpuJoin) (cost=845250.50..990105.67 rows=9999835 width=40) (actual time=25935.700..27197.613 rows=10000000 loops=1)
                     Output: t4.c1, t4.c2, t5.c1, t5.c2, t6.c1, t6.c2, t7.c1, t7.c2, t8.c1, t8.c2
                     Pseudo Scan: t4.c1::integer, t4.c2::integer, t5.c1::integer, t5.c2::integer, t6.c1::integer, t6.c2::integer, t7.c1::integer, t7.c2::integer, t8.c1::integer, t8.c2::integer
                     Bulkload: On (density: 100.00%)
                     Depth 1: GpuHashJoin, HashKeys: (t4.c1, t4.c2), JoinQual: ((t4.c1 = t5.c1) AND (t4.c2 = t5.c2))
                     Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)
                     Features: format: tuple-slot, bulkload: supported
                     Buffers: shared hit=221267
                     -> Custom Scan (GpuJoin) (cost=634200.09..778635.44 rows=9999835 width=32) (actual time=20268.808..20269.446 rows=10000000 loops=1)
                           Output: t4.c1, t4.c2, t6.c1, t6.c2, t7.c1, t7.c2, t8.c1, t8.c2
                           Pseudo Scan: t4.c1::integer, t4.c2::integer, t6.c1::integer, t6.c2::integer, t7.c1::integer, t7.c2::integer, t8.c1::integer, t8.c2::integer
                           Bulkload: On (density: 100.00%)
                           Depth 1: GpuHashJoin, HashKeys: (t4.c1, t4.c2), JoinQual: ((t4.c1 = t8.c1) AND (t4.c2 = t8.c2))
                           Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)
                           Features: format: tuple-slot, bulkload: supported
                           Buffers: shared hit=177019
                           -> Custom Scan (GpuJoin) (cost=422951.23..567266.84 rows=9999906 width=24) (actual time=14136.992..14560.294 rows=10000000 loops=1)
                                 Output: t4.c1, t4.c2, t6.c1, t6.c2, t7.c1, t7.c2
                                 Pseudo Scan: t4.c1::integer, t4.c2::integer, t6.c1::integer, t6.c2::integer, t7.c1::integer, t7.c2::integer
                                 Bulkload: On (density: 100.00%)
                                 Depth 1: GpuHashJoin, HashKeys: (t4.c1, t4.c2), JoinQual: ((t4.c1 = t7.c1) AND (t4.c2 = t7.c2))
                                 Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)
                                 Features: format: tuple-slot, bulkload: supported
                                 Buffers: shared hit=132771
                                 -> Custom Scan (GpuJoin) (cost=211479.13..355977.88 rows=9999977 width=16) (actual time=7517.167..7812.485 rows=10000000 loops=1)
                                       Output: t4.c1, t4.c2, t6.c1, t6.c2
                                       Pseudo Scan: t4.c1::integer, t4.c2::integer, t6.c1::integer, t6.c2::integer
                                       Bulkload: On (density: 100.00%)
                                       Depth 1: GpuHashJoin, HashKeys: (t4.c1, t4.c2), JoinQual: ((t4.c1 = t6.c1) AND (t4.c2 = t6.c2))
                                       Nrows (in:10000000 out:10000000, 100.00% planned 100.00%), KDS-Hash (size: 495.91MB planned 739.10MB, nbatches: 1 planned 1)
                                       Features: format: tuple-slot, bulkload: supported
                                       Buffers: shared hit=88523
                                       -> Custom Scan (BulkScan) on public.t4 (cost=0.00..144248.48 rows=10000048 width=8) (actual time=12.063..322.445 rows=10000000 loops=1)
                                             Output: t4.c1, t4.c2
                                             Features: format: tuple-slot, bulkload: supported
                                             Buffers: shared hit=44275
                                       -> Seq Scan on public.t6 (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.009..1559.817 rows=10000000 loops=1)
                                             Output: t6.c1, t6.c2
                                             Buffers: shared hit=44248
                                 -> Seq Scan on public.t7 (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.022..1527.810 rows=10000000 loops=1)
                                       Output: t7.c1, t7.c2
                                       Buffers: shared hit=44248
                           -> Seq Scan on public.t8 (cost=0.00..144247.77 rows=9999977 width=8) (actual time=0.023..1521.776 rows=10000000 loops=1)
                                 Output: t8.c1, t8.c2
                                 Buffers: shared hit=44248
                     -> Seq Scan on public.t5 (cost=0.00..144248.48 rows=10000048 width=8) (actual time=0.015..1499.270 rows=10000000 loops=1)
                           Output: t5.c1, t5.c2
                           Buffers: shared hit=44248
 Planning time: 82.220 ms
 Execution time: 52799.548 ms
(92 rows)
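Summing up the two comparisons above: 14.2 s with PG-Strom versus 23.9 s without for the 3-table join, and 52.8 s versus 78.6 s for the 8-table join. To A/B the two modes in a single psql session (pg_strom.enabled is the only PG-Strom GUC shown in this post):

\timing on
set pg_strom.enabled = on;
select count(*) from t1 natural join t2 natural join t3;
set pg_strom.enabled = off;
select count(*) from t1 natural join t2 natural join t3;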
References
1. 《试用 PGStrom》 ("Trying Out PGStrom")