Thread-parallel reduce(vcat, ...) performance issues when providing init

Base’s reduce has a manual specialization for reduce(vcat, _) when no init is given. Actually folding with vcat pair by pair would be extremely slow, since it allocates a new intermediate array for each input element, and that is what happens in Folds.reduce and in reduce with init. A better approach here is to use append!! from BangBang.jl.
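
For reference, the input vec_of_vecs is not shown in the benchmarks below; a plausible setup (assumed here, not taken from the original benchmark) that roughly matches the ~3.8 MiB result would be something like

vec_of_vecs = [rand(100) for _ in 1:5000]  # assumed input: 5000 chunks, ~5 * 10^5 Float64s in total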

julia> using BenchmarkTools, Folds, BangBang

julia> @btime reduce(vcat, $vec_of_vecs);
  265.167 μs (2 allocations: 3.81 MiB)

julia> Threads.nthreads()
4

julia> @btime Folds.reduce(vcat, $vec_of_vecs);
  3.074 ms (224 allocations: 64.39 MiB)

julia> @btime Folds.reduce(append!!, $vec_of_vecs);
  767.082 μs (50 allocations: 9.77 MiB)

julia> @btime Folds.reduce(vcat, $vec_of_vecs, SequentialEx());
  29.569 ms (200 allocations: 216.73 MiB)

julia> @btime Folds.reduce(append!!, $vec_of_vecs, SequentialEx());
  471.255 μs (9 allocations: 4.88 MiB)

I suspect that Folds.reduce(append!!, _, SequentialEx()) is slower than reduce(vcat, _) because the latter pre-computes the output array size before copying. A similar optimization is possible in Folds.
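
This is roughly the kind of optimization Base’s specialization can do. A minimal sequential sketch (function name and implementation are mine, not Base’s or Folds’ code):

# Pre-compute the total output length once, then copy each chunk into place.
function flatten_precomputed(vec_of_vecs)
    n = sum(length, vec_of_vecs)
    out = Vector{eltype(eltype(vec_of_vecs))}(undef, n)
    i = 1
    for v in vec_of_vecs
        copyto!(out, i, v, 1, length(v))
        i += length(v)
    end
    return out
end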

I suspect that the threaded Folds.reduce(append!!, _) is slower than Folds.reduce(append!!, _, SequentialEx()) because it triggers the GC a lot.
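
One way to check the GC hypothesis (my addition, not from the original benchmarks) is to compare the GC fraction that BenchmarkTools reports for the two executors:

# Compare how much of each run is spent in GC.
b_par = @benchmark Folds.reduce(append!!, $vec_of_vecs)
b_seq = @benchmark Folds.reduce(append!!, $vec_of_vecs, SequentialEx())
minimum(b_par).gctime / minimum(b_par).time
minimum(b_seq).gctime / minimum(b_seq).time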

A proper parallel implementation requires a two-pass approach: first compute the output offsets using a parallel cumsum of the chunk lengths, then copy the elements into place in parallel. This is not implemented in Folds yet.
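
A minimal sketch of that two-pass idea (not Folds’ implementation; a sequential cumsum is used here for simplicity, and the input is assumed non-empty):

using Base.Threads: @threads

function parallel_flatten(vec_of_vecs)
    lens = map(length, vec_of_vecs)
    offsets = cumsum(lens)  # end index of each chunk in the output
    out = Vector{eltype(eltype(vec_of_vecs))}(undef, last(offsets))
    @threads for k in eachindex(vec_of_vecs)
        start = offsets[k] - lens[k] + 1
        copyto!(out, start, vec_of_vecs[k], 1, lens[k])
    end
    return out
end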
