I want to make the following function parallel.
    function build_vector(du, elements)
        for k ∈ elements
            du[k] += sin(k)
        end
        return nothing
    end
This function takes a vector du (not necessarily all zeros) and a list of indices elements to update, e.g.

    k = [1, 3, 4, 16, 12, 50, 32, 23, 59, 61, 63, 97]
    du = zeros(100)
    build_vector(du, k)
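To make the intended behaviour concrete, here is a minimal self-contained check of the serial version. It only uses Base, and the assertions are my own illustration: listed indices accumulate sin(k), and everything else is left untouched.

```julia
# Serial reference version, repeated here so the snippet runs on its own.
function build_vector(du, elements)
    for k in elements
        du[k] += sin(k)
    end
    return nothing
end

k = [1, 3, 4, 16, 12, 50, 32, 23, 59, 61, 63, 97]
du = zeros(100)

build_vector(du, k)
@assert du[3] == sin(3)      # index 3 is listed, so it received sin(3)
@assert du[2] == 0.0         # index 2 is not listed, so it is unchanged

# du does not have to start at zero: a second call accumulates on top.
build_vector(du, k)
@assert du[3] == 2 * sin(3)
```

Any parallel version should produce the same du as this one for the same inputs.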
How can I use FLoops.jl to make this parallel and efficient (for larger arrays, at least)? I found the thread "Using FLoops.jl to update array counters" (reply #2 by tkf), which seems to cover it, and wrote the code as:
    using FLoops
    function build_vector_floop(du, elements)
        @floop for k ∈ elements
            ind_to_val = k => sin(k)
            @reduce() do (u = du; ind_to_val)
                if ind_to_val isa Pair
                    u[first(ind_to_val)] += last(ind_to_val)
                end
            end
        end
        return nothing
    end
This seems to be slower, though:
    using Random, BenchmarkTools, StatsBase
    Random.seed!(123)
    _k = sample(1:10_000_000, 500_000; replace = false)
    @benchmark build_vector($du, $k) setup=(du=zeros(10_000_000); k=_k) evals=1
    @benchmark build_vector_floop($du, $k) setup=(du=zeros(10_000_000); k=_k) evals=1
    julia> @benchmark build_vector($du, $k) setup=(du=zeros(10_000_000); k=_k) evals=1
    BenchmarkTools.Trial: 299 samples with 1 evaluation.
     Range (min … max):  2.000 μs … 8.700 μs    ┊ GC (min … max): 0.00% … 0.00%
     Time  (median):     3.300 μs               ┊ GC (median):    0.00%
     Time  (mean ± σ):   3.540 μs ± 977.458 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     2 μs            Histogram: frequency by time            7.1 μs <

     Memory estimate: 0 bytes, allocs estimate: 0.
    julia> @benchmark build_vector_floop($du, $k) setup=(du=zeros(10_000_000); k=_k) evals=1
    BenchmarkTools.Trial: 294 samples with 1 evaluation.
     Range (min … max):  55.100 μs … 529.000 μs  ┊ GC (min … max): 0.00% … 0.00%
     Time  (median):     85.000 μs               ┊ GC (median):    0.00%
     Time  (mean ± σ):   90.738 μs ±  36.432 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     55.1 μs         Histogram: frequency by time          193 μs <

     Memory estimate: 7.95 KiB, allocs estimate: 107.
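One caveat about these timings, based on my understanding of BenchmarkTools rather than anything verified here: `$du` and `$k` interpolate the global `du` and `k` (the 100-element vector and 12 indices defined earlier), so the variables created in `setup` may never be used. That would be consistent with a serial time of a few microseconds for what should be 500,000 updates. A sketch of how the large case could be benchmarked instead, with the setup variables referenced without `$`, is below. Using `randperm` in place of `StatsBase.sample` is a substitution made only to keep the snippet to fewer packages.

```julia
using BenchmarkTools, Random

# Serial reference version, repeated so the snippet is self-contained.
function build_vector(du, elements)
    for k in elements
        du[k] += sin(k)
    end
    return nothing
end

Random.seed!(123)
# 500_000 distinct indices; stands in for sample(...; replace = false).
_k = randperm(10_000_000)[1:500_000]

# `du` and `k` appear without `$`, so they refer to the variables created in
# `setup`; `$_k` interpolates the global index vector into the setup phase.
@benchmark build_vector(du, k) setup=(du = zeros(10_000_000); k = $_k) evals=1
```

The same pattern would apply to build_vector_floop, so that both versions are timed on the 500,000-index input.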
What is the correct way to get this working in parallel?