I have two parallel arrays a and b of type Vector{Complex128}, and each of them is very large (say, each contains 2^30 elements, thus each requires around 16 GB of RAM). I would like to multiply these arrays elementwise on a machine with many cores available. The obvious way is to use the “dot” syntax
a .*= b
but this only uses a single core. At the moment, there is no parallel version of broadcast, so I am trying to figure out what the next best alternative may be. As far as I can tell, none of the BLAS operations fit this pattern (though BLAS.gbmv! comes close, except I would need to pass a as two different arguments, and I am pretty sure BLAS assumes non-overlapping memory). I also tried using Threads.@threads but this did not speed things up (I can elaborate more on this if necessary).
What is the most efficient way to implement such a calculation? I’m getting close to implementing this part of the calculation as a kernel written in C using pthreads.
function f1!(a, b)
a .*= b
a
end
function f2!(a, b)
Threads.@threads for i in 1:4
s = div(length(a), 4)
r1, r2 = s*(i-1)+1, s*i
@inbounds a[r1:r2] .*= @view b[r1:r2]
end
a
end
for N in [8, 25]
@show N
a = rand(Complex128, 2 ^ N)
b = rand(Complex128, 2 ^ N)
@time f1!(a, b)
@time f2!(a, b)
end
which I run with JULIA_NUM_THREADS=4.
EDIT: Oddly enough, even f1! allocates, which I do not understand:
I would also recommend using BenchmarkTools and then doing @btime f1!($a, $b), so it will be more sophisticated about collecting timing statistics.
That being said, I’m pretty surprised that you are trying to multi-thread such a simple operation. Multi-threading will be much more effective if you are doing more computation on each data item, so that the overhead of memory access (which doesn’t parallelize as well due to contention) is less. If you have a real application (rather than just doing synthetic benchmarks), I would strongly recommend trying to multi-thread at a coarser level. (i.e. don’t try to multi-thread individual vector operations: fuse multiple operations together into a single @threads loop).