Parallel calls to built-in functions


Hello, I am not new to Julia, but until now I was never forced to do any parallel computation with it.

Now I have a simple case where I want to use threads or processes to speed up the computation, but I don’t really see how to start.

I have the following function:

function reweight(A::AbstractArray, w::AbstractArray, ω::Real)
    P = exp.(-ω .* w)
    return sum(A .* P) / sum(P)
end
and the arrays A, w are very large (on the order of 10^8 elements). Now I want to make parallel calls to the sum and exp functions, but I don’t see any easy way to do this without manually defining a for loop for the sum computation.

Could you give me some ideas on how this code could be parallelized in julia? Thanks.


In this case, using a for loop is interesting in itself, since it allows

  1. fusing the two loops, and
  2. avoiding the materialization of intermediate vector P.

For example:

function reweight_for(A::AbstractArray, w::AbstractArray, ω::Real)
    s1 = zero(eltype(A))
    s2 = zero(eltype(A))
    @inbounds for i in eachindex(A)
        p = exp(-ω * w[i])
        s1 += A[i] * p
        s2 += p
    end
    return s1 / s2
end

julia> using Test, BenchmarkTools

julia> N = 100_000_000;

julia> A = rand(N);

julia> w = rand(N);

julia> ω = rand();

julia> @test reweight(A, w, ω) ≈ reweight_for(A, w, ω)
Test Passed

julia> @btime reweight($A, $w, $ω);
  949.582 ms (4 allocations: 1.49 GiB)

julia> @btime reweight_for($A, $w, $ω);
  782.361 ms (0 allocations: 0 bytes)
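For completeness, the same fused, allocation-free computation can also be expressed without an explicit loop by using Base’s mapreduce over both arrays, accumulating a tuple of the two running sums. This is just a sketch; the name reweight_mapreduce is only for illustration:

```julia
function reweight_mapreduce(A::AbstractArray, w::AbstractArray, ω::Real)
    # Map each pair (a, wᵢ) to the tuple (a * p, p) and reduce by
    # element-wise addition, so both sums are accumulated in one pass
    # without materializing the intermediate vector P.
    s1, s2 = mapreduce((a, wi) -> (p = exp(-ω * wi); (a * p, p)),
                       (x, y) -> (x[1] + y[1], x[2] + y[2]),
                       A, w)
    return s1 / s2
end
```

Whether this beats the hand-written loop depends on how well the compiler handles the tuple accumulator, so it is worth benchmarking both.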

From there, it is more easily parallelized, although I don’t think your case features enough arithmetic intensity for it to be useful:

function reweight_threads(A::AbstractArray, w::AbstractArray, ω::Real)
    s1 = zeros(eltype(A), Threads.nthreads())
    s2 = zeros(eltype(A), Threads.nthreads())
    @inbounds Threads.@threads for i in eachindex(A)
        id = Threads.threadid()
        p = exp(-ω * w[i])
        s1[id] += A[i] * p
        s2[id] += p
    end
    return sum(s1) / sum(s2)
end

julia> @test reweight(A, w, ω) ≈ reweight_threads(A, w, ω)
Test Passed

julia> Threads.nthreads()

julia> @btime reweight_threads($A, $w, $ω);
  1.700 s (3 allocations: 288 bytes)
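A caveat on the per-threadid accumulators used above: on recent Julia versions, tasks started by Threads.@threads may migrate between threads, so indexing by Threads.threadid() inside the loop is not guaranteed to be race-free. A sketch of a safer variant spawns one task per chunk of indices and reduces the partial sums afterwards (the name reweight_chunks and the chunking scheme are illustrative):

```julia
function reweight_chunks(A::AbstractArray, w::AbstractArray, ω::Real)
    n = Threads.nthreads()
    # Split the index range into one contiguous chunk per thread.
    chunks = Iterators.partition(eachindex(A), cld(length(A), n))
    tasks = map(chunks) do idxs
        Threads.@spawn begin
            # Each task owns its local accumulators, so there is no
            # sharing between threads and no reliance on threadid().
            s1 = zero(eltype(A))
            s2 = zero(eltype(A))
            @inbounds @simd for i in idxs
                p = exp(-ω * w[i])
                s1 += A[i] * p
                s2 += p
            end
            (s1, s2)
        end
    end
    results = fetch.(tasks)
    return sum(first, results) / sum(last, results)
end
```

This also combines @simd within each chunk with threading across chunks, which may help once the arrays get as large as you describe.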

There might be some gain to get from vectorization, though:

function reweight_simd(A::AbstractArray, w::AbstractArray, ω::Real)
    s1 = zero(eltype(A))
    s2 = zero(eltype(A))
    @inbounds @simd for i in eachindex(A)
        p = exp(-ω * w[i])
        s1 += A[i] * p
        s2 += p
    end
    return s1 / s2
end

julia> @test reweight(A, w, ω) ≈ reweight_simd(A, w, ω)
Test Passed

julia> @btime reweight_simd($A, $w, $ω);
  672.069 ms (0 allocations: 0 bytes)

Hey, thanks for this long and extensive answer. The explicit for loop was something I figured out myself too, but only after posting. The threads approach looks really cool; I was thinking about the same idea but wanted to accumulate everything into one variable rather than one per thread, which does the trick here. I am quite sure it will be faster at the end of the day, because I am on a large cluster, and I also realized that the length of the arrays will be more on the order of 10^9–10^10. I could use up to 24 threads or even ≈ 100–600 CPUs.

However, with this in mind, how would the same thing work with processes and @distributed instead of threads? Would I need a SharedArray or something like it? Or is it fine to use a normal array when it is only read from? Would this read-only array still need to be copied to each worker separately? That would not work in my case due to memory limits.
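For reference, a minimal sketch of the process-based version using @distributed with a reduction operator might look as follows (the worker count and the name reweight_distributed are just for illustration). Note that with plain Arrays, the data captured by the loop body is serialized and copied to each worker, which is exactly the memory problem you mention; on a single machine, SharedArrays from the SharedArrays stdlib let workers read the data without per-worker copies:

```julia
using Distributed
addprocs(4)  # hypothetical worker count; adjust to your machine

function reweight_distributed(A::AbstractArray, w::AbstractArray, ω::Real)
    # @distributed with a reduction operator: each worker reduces its
    # share of the iterations locally, then the partial (s1, s2)
    # tuples from all workers are combined with the given operator.
    s1, s2 = @distributed ((x, y) -> (x[1] + y[1], x[2] + y[2])) for i in eachindex(A)
        p = exp(-ω * w[i])
        (A[i] * p, p)
    end
    return s1 / s2
end
```

Across multiple cluster nodes, SharedArrays no longer apply, and you would have to look at DistributedArrays.jl or split the data manually so each node only ever holds its own chunk.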


I think this sort of computation is more suitable for either threads or the GPU. Launching memory-independent worker processes (and then collecting the results afterwards) might actually incur too much overhead. The general rule of thumb: if each unit of work is computationally heavy, addprocs() is the way to go; if the computation per element is fast, use threads.


So you mean that in this case I should use as many threads as possible and stay away from processes?