Parallel calls to built-in functions


Hello, I am not new to Julia, but until now I was never forced to do any parallel computation with it.

Now I have a simple case where I want to use threads or processes to speed up the computation, but I don’t really see how to start.

I have the following function:

function reweight(A::AbstractArray, w::AbstractArray, ω::Real)
    P = exp.(-ω .* w)
    return sum(A .* P) / sum(P)
end
and the arrays A, w are very large (on the order of 10^8 elements). Now I want to make parallel calls to the sum and exp functions, but I don’t see any easy way to do this without manually defining a for loop for the sum computation.

Could you give me some ideas on how this code could be parallelized in julia? Thanks.


In this case, using a for loop is interesting in itself, since it allows

  1. fusing the two loops, and
  2. avoiding the materialization of intermediate vector P.

For example:

function reweight_for(A::AbstractArray, w::AbstractArray, ω::Real)
    s1 = zero(eltype(A))
    s2 = zero(eltype(A))
    @inbounds for i in eachindex(A)
        p = exp(-ω * w[i])
        s1 += A[i] * p
        s2 += p
    end
    return s1 / s2
end

julia> using Test, BenchmarkTools

julia> N = 100_000_000;

julia> A = rand(N);

julia> w = rand(N);

julia> ω = rand();

julia> @test reweight(A, w, ω) ≈ reweight_for(A, w, ω)
Test Passed

julia> @btime reweight($A, $w, $ω);
  949.582 ms (4 allocations: 1.49 GiB)

julia> @btime reweight_for($A, $w, $ω);
  782.361 ms (0 allocations: 0 bytes)
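For completeness, the same fused, allocation-free computation can also be expressed without an explicit loop by using Base’s mapreduce over both arrays, accumulating a tuple of the two running sums. This is just a sketch; the name reweight_mapreduce is only for illustration:

```julia
function reweight_mapreduce(A::AbstractArray, w::AbstractArray, ω::Real)
    # Map each pair (a, wᵢ) to the tuple (a * p, p) and reduce by
    # element-wise addition, so both sums are accumulated in one pass
    # without materializing the intermediate vector P.
    s1, s2 = mapreduce((a, wi) -> (p = exp(-ω * wi); (a * p, p)),
                       (x, y) -> (x[1] + y[1], x[2] + y[2]),
                       A, w)
    return s1 / s2
end
```

Whether this beats the hand-written loop depends on how well the compiler handles the tuple accumulator, so it is worth benchmarking both.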

From there, it is more easily parallelized, although I don’t think your case features enough arithmetic intensity for it to be useful:

function reweight_threads(A::AbstractArray, w::AbstractArray, ω::Real)
    s1 = zeros(eltype(A), Threads.nthreads())
    s2 = zeros(eltype(A), Threads.nthreads())
    @inbounds Threads.@threads for i in eachindex(A)
        id = Threads.threadid()
        p = exp(-ω * w[i])
        s1[id] += A[i] * p
        s2[id] += p
    end
    return sum(s1) / sum(s2)
end

julia> @test reweight(A, w, ω) ≈ reweight_threads(A, w, ω)
Test Passed

julia> Threads.nthreads()

julia> @btime reweight_threads($A, $w, $ω);
  1.700 s (3 allocations: 288 bytes)
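A caveat on the per-threadid accumulators used above: on recent Julia versions, tasks started by Threads.@threads may migrate between threads, so indexing by Threads.threadid() inside the loop is not guaranteed to be race-free. A sketch of a safer variant spawns one task per chunk of indices and reduces the partial sums afterwards (the name reweight_chunks and the chunking scheme are illustrative):

```julia
function reweight_chunks(A::AbstractArray, w::AbstractArray, ω::Real)
    n = Threads.nthreads()
    # Split the index range into one contiguous chunk per thread.
    chunks = Iterators.partition(eachindex(A), cld(length(A), n))
    tasks = map(chunks) do idxs
        Threads.@spawn begin
            # Each task owns its local accumulators, so there is no
            # sharing between threads and no reliance on threadid().
            s1 = zero(eltype(A))
            s2 = zero(eltype(A))
            @inbounds @simd for i in idxs
                p = exp(-ω * w[i])
                s1 += A[i] * p
                s2 += p
            end
            (s1, s2)
        end
    end
    results = fetch.(tasks)
    return sum(first, results) / sum(last, results)
end
```

This also combines @simd within each chunk with threading across chunks, which may help once the arrays get as large as you describe.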

There might be some gain to get from vectorization, though:

function reweight_simd(A::AbstractArray, w::AbstractArray, ω::Real)
    s1 = zero(eltype(A))
    s2 = zero(eltype(A))
    @inbounds @simd for i in eachindex(A)
        p = exp(-ω * w[i])
        s1 += A[i] * p
        s2 += p
    end
    return s1 / s2
end

julia> @test reweight(A, w, ω) ≈ reweight_simd(A, w, ω)
Test Passed

julia> @btime reweight_simd($A, $w, $ω);
  672.069 ms (0 allocations: 0 bytes)

Hey, thanks for this long and extensive answer. The explicit for loop was something I figured out myself too, but only after posting. The threads approach looks really cool; I was thinking about the same idea but wanted to accumulate everything into one variable rather than one per thread, which does the trick here. I am quite sure it will be faster at the end of the day, because I am on a large cluster, and I also realized that the length of the arrays will be more on the order of 10^9–10^10. I could use up to 24 threads or even ≈ 100–600 CPUs.

However, with this in mind, how would the same thing work with processes and @distributed instead of threads? Would I need a SharedArray or something like it? Or is it fine to use a normal array when it is only read from? Would this read-only array still need to be copied to each worker separately? That would not work in my case due to memory limits.
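For reference, a minimal sketch of the process-based version using @distributed with a reduction operator might look as follows (the worker count and the name reweight_distributed are just for illustration). Note that with plain Arrays, the data captured by the loop body is serialized and copied to each worker, which is exactly the memory problem you mention; on a single machine, SharedArrays from the SharedArrays stdlib let workers read the data without per-worker copies:

```julia
using Distributed
addprocs(4)  # hypothetical worker count; adjust to your machine

function reweight_distributed(A::AbstractArray, w::AbstractArray, ω::Real)
    # @distributed with a reduction operator: each worker reduces its
    # share of the iterations locally, then the partial (s1, s2)
    # tuples from all workers are combined with the given operator.
    s1, s2 = @distributed ((x, y) -> (x[1] + y[1], x[2] + y[2])) for i in eachindex(A)
        p = exp(-ω * w[i])
        (A[i] * p, p)
    end
    return s1 / s2
end
```

Across multiple cluster nodes, SharedArrays no longer apply, and you would have to look at DistributedArrays.jl or split the data manually so each node only ever holds its own chunk.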


I think this sort of computation is more suitable for either threads or the GPU. Launching memory-independent worker processes (and then collecting the results afterwards) might actually incur too much overhead. The general rule of thumb: if each unit of work is computationally heavy, addprocs() is the way to go; if the computation per element is fast, use threads.


So you mean that in this case I should use as many threads as possible and stay away from processes?