Hey all—
I’ve got a problem where I have a `vec_of_vecs::Vector{Vector{T}}` in which both the outer vector and each inner vector are long, and I’d love to use multithreading to do `reduce(vcat, vec_of_vecs)` more efficiently. From some benchmarking, Folds.jl provides a very convenient way to do this, and the parallel speedup of `ThreadedEx()` compared to `SequentialEx()` is seriously impressive, almost perfectly linear in the number of threads I provide (thanks @tkf!).

…But the actual speed compared to plain `reduce(vcat, vec_of_vecs)` is two orders of magnitude slower, and I think the problem is having to provide the `init` kwarg, because `reduce(vcat, vec_of_vecs, init=T[])`, or with `init=first(...)`, is similarly slow! Can anybody help me work out a sensible initialization to fix the speed disparity? Here is an example you can run:
```julia
# run with julia -O3 -t 3 on a 6-core CPU.
using Folds, BenchmarkTools, Serialization

const N = 5000
const M = 100
const vec_of_vecs = [rand(Int64, N) for _ in 1:M]
const (v1, vrest) = (vec_of_vecs[1], vec_of_vecs[2:end])

print("Serial time (no init, base):")  # 127 μs, 2 alloc
@btime reduce(vcat, $vec_of_vecs)

print("Serial time (init, base):")     # 13.3 ms, 198 alloc
@btime reduce(vcat, $vrest, init=$v1)

print("Serial time (Folds):")          # 13.48 ms, 198 alloc
@btime Folds.reduce(vcat, $vrest, SequentialEx(), init=$v1)

print("Thread-parallel time (Folds):") # 3.39 ms, 223 alloc
@btime Folds.reduce(vcat, $vrest, ThreadedEx(), init=$v1)
```
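For context, my working hypothesis about the `init` slowdown: with an initial value, the fold degenerates into repeated pairwise `vcat`, allocating a fresh output array and re-copying the accumulator at every step, so total copying grows quadratically in the number of chunks. A tiny sketch of what I believe that path is roughly equivalent to (small stand-in data, just to illustrate the pattern):

```julia
# Small stand-in for vec_of_vecs: 4 chunks of 5 Int64s each.
vecs = [rand(Int64, 5) for _ in 1:4]

# What (I think) the init= fold effectively does:
acc = copy(first(vecs))
for v in vecs[2:end]
    acc = vcat(acc, v)  # new allocation + full copy of acc on every iteration
end

acc == reduce(vcat, vecs)  # true: same result, just built with quadratic copying
```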
The results for me were similar if I instead provided `init=T[]`. I would appreciate any thoughts people have: clearly Folds.jl is capable of really speeding things up, just not in the way I’m currently trying to use it.