Elementwise multiplication of arrays across many cores

I have two parallel arrays a and b of type Vector{Complex128}, and each of them is very large (say, each contains 2^30 elements, thus each requires around 16 GB of RAM). I would like to multiply these arrays elementwise on a machine with many cores available. The obvious way is to use the “dot” syntax

a .*= b

but this only uses a single core. At the moment, there is no parallel version of broadcast, so I am trying to figure out what the next best alternative may be. As far as I can tell, none of the BLAS operations fit this pattern (though BLAS.gbmv! comes close, except I would need to pass a as two different arguments, and I am pretty sure BLAS assumes non-overlapping memory). I also tried using Threads.@threads but this did not speed things up (I can elaborate more on this if necessary).

What is the most efficient way to implement such a calculation? I’m getting close to implementing this part of the calculation as a kernel written in C using pthreads.


By the way, here is my code testing @threads:

function f1!(a, b)
    a .*= b
end

function f2!(a, b)
    Threads.@threads for i in 1:4
        s = div(length(a), 4)
        r1, r2 = s*(i-1)+1, s*i
        @inbounds a[r1:r2] .*= @view b[r1:r2]
    end
end

for N in [8, 25]
    @show N
    a = rand(Complex128, 2 ^ N)
    b = rand(Complex128, 2 ^ N)
    @time f1!(a, b)
    @time f2!(a, b)
end

which I run with JULIA_NUM_THREADS=4.

EDIT: Oddly enough, even f1! allocates, which I do not understand:

N = 8
  0.000001 seconds (2 allocations: 4.156 KB)
  0.061202 seconds (47.53 k allocations: 2.154 MB)
N = 25
  0.757689 seconds (3 allocations: 512.000 MB, 6.71% gc time)
  2.006967 seconds (31 allocations: 1.375 GB, 19.78% gc time)

This is the wrong way to use threads; do this instead:

Threads.@threads for i in 1:length(a)
    @inbounds a[i] *= b[i]
end
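Wrapped up as a self-contained function, the corrected loop might look like this (the name `threaded_mul!` is just illustrative, not from the thread):

```julia
# Hypothetical helper: threaded elementwise in-place multiply.
# Each thread handles a contiguous chunk of indices automatically;
# no manual range splitting (and no slice allocation) is needed.
function threaded_mul!(a, b)
    @assert length(a) == length(b)
    Threads.@threads for i in 1:length(a)
        @inbounds a[i] *= b[i]
    end
    return a
end
```

Run with `JULIA_NUM_THREADS` set as before; the loop body allocates nothing, unlike the slicing version in `f2!`.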



`.*=` is only in-place in Julia 0.6.
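On earlier versions, where `a .*= b` allocates a temporary for the right-hand side, an explicitly in-place single-threaded alternative is `broadcast!`, whose destination may alias an argument. A minimal sketch:

```julia
a = [1.0 + 2.0im, 3.0 - 1.0im]
b = [2.0 + 0.0im, 0.0 + 1.0im]

# broadcast!(f, dest, args...) writes f.(args...) directly into dest,
# so no temporary array is allocated; dest may be one of the args.
broadcast!(*, a, a, b)
```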

Thanks @yuyichao and @stevengj, that solves both of my problems.

I would also recommend using BenchmarkTools and then doing @btime f1!($a, $b); it is more sophisticated about collecting timing statistics.
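Concretely, that would look something like the following (assuming the BenchmarkTools package is installed; `f1!` is redefined here so the snippet stands alone):

```julia
using BenchmarkTools

f1!(a, b) = (a .*= b)   # same definition as above

a = rand(Complex128, 2^20)
b = rand(Complex128, 2^20)

# The $-interpolation passes a and b as local values, so the cost of
# accessing non-const globals is excluded from the measurement, and
# @btime reports the minimum over many samples rather than one run.
@btime f1!($a, $b)
```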

That being said, I’m pretty surprised that you are trying to multi-thread such a simple operation. Multi-threading is much more effective when you do more computation per data item, so that the overhead of memory access (which doesn’t parallelize as well, due to contention) is proportionally smaller. If you have a real application (rather than just synthetic benchmarks), I would strongly recommend multi-threading at a coarser level: don’t thread individual vector operations; fuse multiple operations together into a single @threads loop.
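As an illustration of that fusion idea (the function name and the particular arithmetic here are made up for the example, not from the thread):

```julia
# Hypothetical fused kernel: what would otherwise be two separate
# vectorized passes over memory (a .*= b, then a .+= c) becomes one
# threaded sweep that reads and writes each element exactly once.
function fused_muladd!(a, b, c)
    Threads.@threads for i in 1:length(a)
        @inbounds a[i] = a[i] * b[i] + c[i]
    end
    return a
end
```

With one pass, each thread does more arithmetic per byte loaded, so memory-bandwidth contention matters less relative to the work done.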