PostgreSQL 多个数组聚合为一维数组加速(array_agg)
背景
多个数组聚合为一维数组,求PC。业务背景见:
《PostgreSQL APP海量FEED LOG实时质量统计CASE(含percentile_disc)》
由于PostgreSQL内置的聚合函数array_agg支持的数组聚合实际上是将多个数组聚合为多维数组。并不是一维数组。
例如:
postgres=# select array_agg(arr) from (values(array[1,2,3]), (array[4,5,6])) t(arr);
array_agg
-------------------
{{1,2,3},{4,5,6}}
(1 row)
而实际上我们要的是一维数组的结果
{1,2,3,4,5,6}
此时需要自定义一个聚合函数
create aggregate arragg (anyarray) (sfunc = array_cat, stype=anyarray, PARALLEL=safe);
效果如下
postgres=# select arragg(arr) from (values(array[1,2,3]), (array[4,5,6])) t(arr);
arragg
---------------
{1,2,3,4,5,6}
(1 row)
但是这个新加的聚合用到了array_cat,大量的memcpy导致性能并不好。
array_agg性能对比arragg
聚合100万个元素.
1、array_agg,耗时0.14秒
postgres=# explain (analyze,verbose,timing,costs,buffers) select array_agg(array[1,2,3,4,5,6,7,8,9,10]) from generate_series(1,100000);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=12.50..12.51 rows=1 width=32) (actual time=113.134..113.134 rows=1 loops=1)
Output: array_agg('{1,2,3,4,5,6,7,8,9,10}'::integer[])
-> Function Scan on pg_catalog.generate_series (cost=0.00..10.00 rows=1000 width=0) (actual time=53.585..66.200 rows=100000 loops=1)
Output: generate_series
Function Call: generate_series(1, 100000)
Planning time: 0.064 ms
Execution time: 143.075 ms
(7 rows)
2、arragg(use array_cat),耗时108.15秒
postgres=# explain (analyze,verbose,timing,costs,buffers) select arragg(array[1,2,3,4,5,6,7,8,9,10]) from generate_series(1,100000);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=12.50..12.51 rows=1 width=32) (actual time=108081.186..108081.186 rows=1 loops=1)
Output: arragg('{1,2,3,4,5,6,7,8,9,10}'::integer[])
-> Function Scan on pg_catalog.generate_series (cost=0.00..10.00 rows=1000 width=0) (actual time=11.121..81.467 rows=100000 loops=1)
Output: generate_series
Function Call: generate_series(1, 100000)
Planning time: 0.148 ms
Execution time: 108154.846 ms
(7 rows)
3、unnest聚合,耗时0.59秒
postgres=# explain (analyze,verbose,timing,costs,buffers) select array(select unnest(array[1,2,3,4,5,6,7,8,9,10]) from generate_series(1,100000));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=517.50..517.51 rows=1 width=32) (actual time=520.327..520.327 rows=1 loops=1)
Output: $0
InitPlan 1 (returns $0)
-> ProjectSet (cost=0.00..517.50 rows=100000 width=4) (actual time=11.979..223.223 rows=1000000 loops=1)
Output: unnest('{1,2,3,4,5,6,7,8,9,10}'::integer[])
-> Function Scan on pg_catalog.generate_series (cost=0.00..10.00 rows=1000 width=0) (actual time=11.972..27.014 rows=100000 loops=1)
Output: generate_series
Function Call: generate_series(1, 100000)
Planning time: 0.082 ms
Execution time: 590.976 ms
(10 rows)
4、使用string_agg,再转换,耗时0.18秒。
postgres=# explain (analyze,verbose,timing,costs,buffers) select ('{'||string_agg(rtrim(ltrim((array[1,2,3,4,5,6,7,8,9,10])::text,'{'),'}'),',')||'}')::int[] from generate_series(1,100000);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=22.50..22.53 rows=1 width=32) (actual time=179.099..179.099 rows=1 loops=1)
Output: ((('{'::text || string_agg(rtrim(ltrim(('{1,2,3,4,5,6,7,8,9,10}'::integer[])::text, '{'::text), '}'::text), ','::text)) || '}'::text))::integer[]
-> Function Scan on pg_catalog.generate_series (cost=0.00..10.00 rows=1000 width=0) (actual time=7.217..16.844 rows=100000 loops=1)
Output: generate_series
Function Call: generate_series(1, 100000)
Planning time: 0.062 ms
Execution time: 179.981 ms
(7 rows)
Time: 180.547 ms
小结
array_cat构建的聚合耗时较多,性能优化提升空间明显。
array_agg代码参考:
src/backend/utils/adt/arrayfuncs.c
即使使用unnest再聚合,性能也比array_cat好很多。
实在不行,还不如使用string_agg,再转换,性能也比array_cat快很多。