# Parallel calls to built-in functions

Hello, I am not new to Julia, but until now I have never needed to do any parallel computation with it.

Now I have a simple case where I want to use threads or processes to speed up the computation, but I don't really know where to start.

I have the following function

``````function reweight(A::AbstractArray, w::AbstractArray, ω::Real)
    P = exp.(-ω .* w)
    return sum(A .* P) / sum(P)
end
``````

and the arrays `A, w` are very large (on the order of 10^8 elements). I would like the `sum` and `exp` calls to run in parallel, but I don't see an easy way to do this without manually writing a `for` loop for the sum computation.

Could you give me some ideas on how this code could be parallelized in julia? Thanks.

In this case, using a `for` loop is interesting in itself, since it allows

1. fusing the two loops, and
2. avoiding the materialization of intermediate vector `P`.

For example:

``````function reweight_for(A::AbstractArray, w::AbstractArray, ω::Real)
    s1 = zero(eltype(A))
    s2 = zero(eltype(A))
    @inbounds for i in eachindex(A)
        p = exp(-ω * w[i])
        s1 += A[i] * p
        s2 += p
    end
    return s1 / s2
end
``````
``````julia> N = 100_000_000;

julia> A = rand(N);

julia> w = rand(N);

julia> ω = rand();

julia> @test reweight(A, w, ω) ≈ reweight_for(A, w, ω)
Test Passed

julia> @btime reweight($A, $w, $ω);
  949.582 ms (4 allocations: 1.49 GiB)

julia> @btime reweight_for($A, $w, $ω);
  782.361 ms (0 allocations: 0 bytes)
``````
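As an aside, the same fused, single-pass reduction can also be written without an explicit loop via `mapreduce` over both arrays at once (the name `reweight_mapreduce` is just for illustration):

```julia
# Loop-free variant of the fused reduction: map each (a, wᵢ) pair to the
# tuple of partial contributions, then add the tuples elementwise.
function reweight_mapreduce(A::AbstractArray, w::AbstractArray, ω::Real)
    s1, s2 = mapreduce((a, wi) -> (p = exp(-ω * wi); (a * p, p)),
                       (x, y) -> x .+ y,   # elementwise tuple addition
                       A, w)
    return s1 / s2
end
```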

From there, it is more easily parallelized, although I don’t think your case features enough arithmetic intensity for it to be useful:

``````function reweight_threads(A::AbstractArray, w::AbstractArray, ω::Real)
    s1 = zeros(eltype(A), Threads.nthreads())
    s2 = zeros(eltype(A), Threads.nthreads())
    Threads.@threads for i in eachindex(A)
        id = Threads.threadid()
        p = exp(-ω * w[i])
        s1[id] += A[i] * p
        s2[id] += p
    end
    return sum(s1) / sum(s2)
end
``````
``````julia> @test reweight(A, w, ω) ≈ reweight_threads(A, w, ω)
Test Passed

julia> Threads.nthreads()
4

julia> @btime reweight_threads($A, $w, $ω);
  1.700 s (3 allocations: 288 bytes)
``````
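As a side note, accumulating into per-`threadid()` slots is fragile in recent Julia versions, where tasks can migrate between threads. A sketch of a chunk-per-task variant that avoids this: split the index range into chunks, give each task its own local accumulators, and reduce the per-task partial sums at the end. The chunking here is done by hand (the name `reweight_chunks` and the `ntasks` keyword are my own choices; packages like ChunkSplitters.jl offer the same thing more conveniently):

```julia
function reweight_chunks(A::AbstractArray, w::AbstractArray, ω::Real;
                         ntasks::Int = Threads.nthreads())
    # Split the index range into ntasks contiguous chunks.
    ranges = Iterators.partition(eachindex(A), cld(length(A), ntasks))
    tasks = map(ranges) do r
        Threads.@spawn begin
            # Task-local accumulators: no sharing, no threadid() needed.
            s1 = zero(eltype(A))
            s2 = zero(eltype(A))
            @inbounds @simd for i in r
                p = exp(-ω * w[i])
                s1 += A[i] * p
                s2 += p
            end
            (s1, s2)
        end
    end
    # Reduce the per-task partial sums.
    parts = fetch.(tasks)
    return sum(first, parts) / sum(last, parts)
end
```

Because each chunk is summed with `@simd`, this also picks up the vectorization gain discussed below.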

There might be some gain to get from vectorization, though:

``````function reweight_simd(A::AbstractArray, w::AbstractArray, ω::Real)
    s1 = zero(eltype(A))
    s2 = zero(eltype(A))
    @inbounds @simd for i in eachindex(A)
        p = exp(-ω * w[i])
        s1 += A[i] * p
        s2 += p
    end
    return s1 / s2
end
``````
``````julia> @test reweight(A, w, ω) ≈ reweight_simd(A, w, ω)
Test Passed

julia> @btime reweight_simd($A, $w, $ω);
  672.069 ms (0 allocations: 0 bytes)
``````

Hey, thanks for this long and extensive answer. The explicit `for` loop was something I figured out myself as well, but only after posting. The threads approach looks really cool; I was thinking about the same idea but wanted to accumulate everything into one variable rather than one per thread, which is what does the trick here. I am quite sure it will be faster at the end of the day, because I am on a large cluster, and I also realized that the length of the arrays will be more on the order of 10^9 to 10^10. I could use up to 24 threads, or even roughly 100 to 600 CPUs.

However, with this in mind, how would the same thing work with processes and `@distributed` instead of threads? Would I need a `SharedArray` or something like it? Or is it fine to use a normal array when it is only read? Would such a read-only array still need to be copied to each worker separately? That would not work in my case due to memory limits.

I think this sort of computation is more suitable for either threads or the GPU. Launching memory-independent worker processes (and then collecting the results) might actually incur too much overhead. In general, the rule of thumb is: if your function is computationally heavy, then `addprocs()` is the way to go; if the computation is fast, use threads.
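For completeness, here is a sketch of how the `@distributed` route could look, assuming the workers all live on the same machine so that `SharedArray`s give them zero-copy read access to the data (across machines you would reach for DistributedArrays.jl instead; the names `tupleadd` and `reweight_distributed` are my own):

```julia
using Distributed
addprocs(4)                      # number of local workers: adjust to taste
@everywhere using SharedArrays

# The reducer must be defined on the workers, hence @everywhere.
@everywhere tupleadd(x, y) = x .+ y

function reweight_distributed(A::SharedArray, w::SharedArray, ω::Real)
    # Each iteration yields a tuple of partial contributions;
    # @distributed reduces them pairwise with tupleadd.
    s1, s2 = @distributed (tupleadd) for i in eachindex(A)
        p = exp(-ω * w[i])
        (A[i] * p, p)
    end
    return s1 / s2
end
```

The arrays are filled once (e.g. `A = SharedArray{Float64}(N)`) and then every worker reads them in place, without per-worker copies.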

So you mean that in this case I should use as many threads as possible and stay away from processes?