Elementwise multiplication of arrays across many cores

garrison · April 14, 2017, 6:29pm

I have two parallel arrays a and b of type Vector{Complex128}, and each of them is very large (say, each contains 2^30 elements, thus each requires around 16 GB of RAM). I would like to multiply these arrays elementwise on a machine with many cores available. The obvious way is to use the “dot” syntax

a .*= b

but this only uses a single core. At the moment, there is no parallel version of broadcast, so I am trying to figure out what the next best alternative may be. As far as I can tell, none of the BLAS operations fit this pattern (though BLAS.gbmv! comes close, except I would need to pass a as two different arguments, and I am pretty sure BLAS assumes non-overlapping memory). I also tried using Threads.@threads but this did not speed things up (I can elaborate more on this if necessary).

What is the most efficient way to implement such a calculation? I’m getting close to implementing this part of the calculation as a kernel written in C using pthreads.

Thanks!

garrison · April 14, 2017, 6:31pm

By the way, here is my code testing @threads:

function f1!(a, b)
    a .*= b
    a
end

function f2!(a, b)
    Threads.@threads for i in 1:4
        s = div(length(a), 4)
        r1, r2 = s*(i-1)+1, s*i
        @inbounds a[r1:r2] .*= @view b[r1:r2]
    end
    a
end

for N in [8, 25]
    @show N
    a = rand(Complex128, 2 ^ N)
    b = rand(Complex128, 2 ^ N)
    @time f1!(a, b)
    @time f2!(a, b)
end

which I run with JULIA_NUM_THREADS=4.

EDIT: Oddly enough, even f1! allocates, which I do not understand:

N = 8
  0.000001 seconds (2 allocations: 4.156 KB)
  0.061202 seconds (47.53 k allocations: 2.154 MB)
N = 25
  0.757689 seconds (3 allocations: 512.000 MB, 6.71% gc time)
  2.006967 seconds (31 allocations: 1.375 GB, 19.78% gc time)

yuyichao · April 14, 2017, 6:40pm

This is the wrong way to use threads, do

Threads.@threads for i in 1:length(a)
    @inbounds a[i] *= b[i]
end

Directly.

stevengj · April 14, 2017, 6:40pm

.* is only in-place in Julia 0.6

garrison · April 14, 2017, 6:46pm

Thanks @yuyichao and @stevengj, that solves both of my problems.

stevengj · April 14, 2017, 6:51pm

I would also recommend using BenchmarkTools and then doing @btime f1!($a, $b), so it will be more sophisticated about collecting timing statistics.

That being said, I’m pretty surprised that you are trying to multi-thread such a simple operation. Multi-threading will be much more effective if you are doing more computation on each data item, so that the overhead of memory access (which doesn’t parallelize as well due to contention) is less. If you have a real application (rather than just doing synthetic benchmarks), I would strongly recommend trying to multi-thread at a coarser level. (i.e. don’t try to multi-thread individual vector operations: fuse multiple operations together into a single @threads loop).

Topic		Replies	Views
Memory usage in nested loops New to Julia multithreading , memory-allocation	2	770	December 10, 2019
How to utilize all cores when operating on large matrices General Usage	4	558	May 2, 2018
Parallel computation of multiplication of large matrices Performance	5	2436	April 8, 2019
Multiply many-matrices by many-vectors Performance matlab , parallel , multithreading , tensors	33	6737	December 14, 2018
Speeding up array multiplication General Usage multithreading , high-performance	7	420	March 9, 2025

Elementwise multiplication of arrays across many cores

Related topics