# Thread-parallel reduce(vcat, ...) performance issues when providing init

Hey all—

I’ve got a problem where I have a `vec_of_vecs::Vector{Vector{T}}` where the length is long and the length of each element is long. I’d love to use multithreading to do `reduce(vcat, vec_of_vecs)` more efficiently. From some benchmarking, it looks like `Folds.jl` provides a very convenient way to do this, and the parallel speedup of `ThreadedEx()` compared to `SequentialEx()` is seriously impressive and almost perfectly linear with the number of threads I provide (thanks @tkf!).

…But, the actual speed compared to `reduce(vcat, vec_of_vecs)` is two orders of magnitude slower, and I think the problem is having to provide the `init` kwarg. Because `reduce(vcat, vec_of_vecs, init=T[])`, or `init=first(...)`, is similarly slow! Can anybody help me work out a sensible initialization to fix the speed disparity? Here is an example you can run:

``````# run with julia -O3 -t 3 on a 6-core CPU.

using Folds, BenchmarkTools, Serialization

const N = 5000
const M = 100
const vec_of_vecs = [rand(Int64, N) for _ in 1:M]
const (v1, vrest) = (vec_of_vecs[1], vec_of_vecs[2:end])

print("Serial time (no init, base):") # 127 μs, 2 alloc
@btime reduce(vcat, \$vec_of_vecs)

print("Serial time (init, base):") # 13.3 ms, 198 alloc
@btime reduce(vcat, \$vrest, init=\$v1)

print("Serial time (Folds):") # 13.48 ms, 198 alloc
@btime Folds.reduce(vcat, \$vrest, SequentialEx(), init=\$v1)

print("Thread-parallel time (Folds):") # 3.39 ms, 223 alloc
``````

The results for me were similar if I instead provided `init=T[]`. I would appreciate any thoughts people have—clearly `Folds.jl` is capable of really speeding things up, but just not in the way I’m trying to use it currently.

`Base`'s `reduce` provides a manual specialization for `reduce(vcat, _)` without `init`. Actually executing `reduce(vcat, _)` would be extremely slow since it allocates a new array for each input element which is what is happening in `Folds.reduce` and `reduce` with `init`. A bit better approach here is to use `append!!`

``````julia> using BenchmarkTools, Folds, BangBang

julia> @btime reduce(vcat, \$vec_of_vecs);
265.167 μs (2 allocations: 3.81 MiB)

4

julia> @btime Folds.reduce(vcat, \$vec_of_vecs);
3.074 ms (224 allocations: 64.39 MiB)

julia> @btime Folds.reduce(append!!, \$vec_of_vecs);
767.082 μs (50 allocations: 9.77 MiB)

julia> @btime Folds.reduce(vcat, \$vec_of_vecs, SequentialEx());
29.569 ms (200 allocations: 216.73 MiB)

julia> @btime Folds.reduce(append!!, \$vec_of_vecs, SequentialEx());
471.255 μs (9 allocations: 4.88 MiB)
``````

I suspect that `Folds.reduce(append!!, _, SequentialEx())` is slower than `reduce(vcat, _)` because the latter pre-compute the output array size. A similar optimization is possible in Folds.

I suspect that `Folds.reduce(append!!, _s)` is slower than `Folds.reduce(append!!, _, SequentialEx())` because it invokes GC a lot.

Proper parallel implementation requires a two-pass approach: compute the output indices using parallel `cumsum` and then copy the elements in parallel. It’s not implemented in Folds yet.

1 Like

That all makes a ton of sense. I should have looked at more than just the docstrings for `reduce`, I apologize. I also should have just guessed that it would be doing something particularly clever with no `init`. I’ll definitely play with using `append!!`, but based on your thoughts here maybe I should just implement a slight manual methods and use a good old `Folds.foreach(...)` or something to more manually update a pre-allocated output.

Thanks again!

1 Like