FLoops for updating a vector by index

I want to make the following function parallel.

function build_vector_explicit(du, elements)
    for k ∈ elements
        du[k] += sin(k)
    end
    return nothing
end

This function takes in a vector du (not necessarily all zeros) and a list of indices elements to update, e.g.

k = [1, 3, 4, 16, 12, 50, 32, 23, 59, 61, 63, 97]
du = zeros(100)
build_vector_explicit(du, k)

How can I use FLoops.jl to make this parallel and efficient (for larger arrays, at least)? I found this question Using FLoops.jl to update array counters - #2 by tkf, which seems to cover it, and wrote the code as

using FLoops
function build_vector_floop(du, elements)
    @floop for k ∈ elements
        ind_to_val = k => sin(k)
        @reduce() do (u = du; ind_to_val)
            if ind_to_val isa Pair
                u[first(ind_to_val)] += last(ind_to_val)
            end
        end
    end
    return nothing
end

This seems to be slower, though:

using Random, BenchmarkTools, StatsBase
Random.seed!(123)
_k = sample(1:10_000_000, 500_000; replace = false)
julia> @benchmark build_vector_explicit($du, $k) setup=(du=zeros(10_000_000); k=_k) evals=1
BenchmarkTools.Trial: 299 samples with 1 evaluation.
 Range (min … max):  2.000 ΞΌs …   8.700 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.300 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.540 ΞΌs Β± 977.458 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

          β–‚β–ˆβ–ƒβ–‚β–‚β–… β–…β–„β–ƒ      β–‚
  β–ƒβ–β–ƒβ–ƒβ–…β–‡β–ˆβ–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–β–ˆβ–ˆβ–ˆβ–†β–†β–†β–ˆβ–β–†β–ˆβ–…β–…β–„β–„β–β–„β–…β–β–ƒβ–ƒβ–ƒβ–β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–β–β–β–„β–β–β–β–β–β–ƒβ–β–β–ƒβ–β–ƒ β–ƒ
  2 ΞΌs            Histogram: frequency by time         7.1 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark build_vector_floop($du, $k) setup=(du=zeros(10_000_000); k=_k) evals=1
BenchmarkTools.Trial: 294 samples with 1 evaluation.
 Range (min … max):  55.100 ΞΌs … 529.000 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     85.000 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   90.738 ΞΌs Β±  36.432 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–‚β–„β–ˆβ–„β–„β–„β–„β–…β–ˆβ–‚β–†β–†β–‚β–‚ ▁
  β–…β–„β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–„β–β–ˆβ–ƒβ–…β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–β–ƒβ–β–ƒβ–β–β–β–ƒβ–ƒβ–β–β–ƒβ–β–β–ƒβ–β–ƒβ–β–β–β–β–β–ƒβ–β–ƒβ–ƒβ–ƒβ–β–ƒ β–ƒ
  55.1 ΞΌs         Histogram: frequency by time          193 ΞΌs <

 Memory estimate: 7.95 KiB, allocs estimate: 107.

What is the correct way to get this working in parallel?

UPDATE Oh, I just noticed that the elements, i.e. the indices, can appear multiple times which could lead to race conditions. Sorry, it’s very early in the morning here :smiley:

Since you don’t want to perform a reduction, I wouldn’t use @reduce at all. Did you try the following straightforward variant?

function build_vector_floop(du, elements)
    @floop for k ∈ elements
        du[k] += sin(k)
    end
    return nothing
end

For me, this gives

julia> @benchmark build_vector_floop(du, k) setup=(du=zeros(10_000_000); k=_k) evals=1
BenchmarkTools.Trial: 164 samples with 1 evaluation.
 Range (min … max):  13.154 ms …  15.656 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     13.374 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   13.497 ms Β± 352.965 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

     β–‚β–…β–…β–ˆ ▁
  β–ƒβ–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–…β–β–†β–…β–…β–β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–„β–„β–β–β–β–ƒβ–ƒβ–ƒβ–ƒβ–β–ƒβ–ƒβ–ƒβ–ƒβ–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒ β–ƒ
  13.2 ms         Histogram: frequency by time           15 ms <

 Memory estimate: 3.67 KiB, allocs estimate: 51.

julia> @benchmark build_vector_explicit(du, k) setup=(du=zeros(10_000_000); k=_k) evals=1
BenchmarkTools.Trial: 75 samples with 1 evaluation.
 Range (min … max):  50.402 ms …  54.720 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     51.398 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   51.471 ms Β± 766.897 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–„  β–ˆ            β–‚
  β–„β–„β–„β–β–†β–ˆβ–„β–ˆβ–ˆβ–„β–ˆβ–„β–„β–„β–†β–„β–„β–†β–„β–†β–†β–ˆβ–ˆβ–ˆβ–„β–ˆβ–β–†β–β–„β–β–„β–†β–ˆβ–β–„β–β–„β–†β–†β–„β–β–β–„β–β–β–†β–β–β–β–β–β–β–„β–β–β–β–β–β–„ ▁
  50.4 ms         Histogram: frequency by time         53.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

So, about 3.8x faster on my machine (with 6 threads).


@carstenbauer : Thanks for that. I should have tried your simpler loop first :slight_smile: I’m pretty confused about what a reduction actually is, so I didn’t realise that’s not what I’m doing here.

Regarding your edit: The provided elements won’t contain any duplicates. So is there no need to worry about race conditions, then? Seems to be the case:

using Test
du = zeros(10_000_000)
build_vector_explicit(du, _k)
true_val = deepcopy(du)
du = zeros(10_000_000)
build_vector_floop(du, _k)
floop_val = deepcopy(du)
@test floop_val == true_val
Test Passed

Glad that my post wasn’t entirely useless then :slight_smile:

A reduction is essentially an operation that “reduces” a collection of things (e.g. a vector of numbers) to just a single thing (e.g. a number). Example: summation.
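To illustrate the concept with plain Base Julia (no FLoops needed; the function names here are just for illustration):

```julia
# A reduction collapses a collection to a single value; summation is the
# canonical example.
sum_reduce(xs) = reduce(+, xs)

# The equivalent explicit loop makes the "combining" step visible:
function sum_loop(xs)
    s = zero(eltype(xs))
    for x in xs
        s += x   # each iteration folds one element into the accumulator
    end
    return s
end

sum_reduce(1:5)  # 15
sum_loop(1:5)    # 15
```

Updating du[k] in place doesn’t fit this pattern, since there is no single accumulated value being produced.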

If the indices don’t contain any duplicates, then there won’t be a race condition, since the loop iterations are entirely independent. (There could still be minor performance issues like false sharing, but that’s a different story.)
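For the duplicate-index case raised in the update above, one possible pattern (a sketch only; the function name and chunking strategy are illustrative, not from this thread) is to let each task accumulate into its own Dict of index => partial sum and merge the partials serially afterwards, so no two tasks ever write to du concurrently:

```julia
# Sketch: safe even when `elements` contains repeated indices.
function build_vector_chunked!(du, elements)
    nchunks = max(1, Threads.nthreads())
    chunks = Iterators.partition(elements, cld(length(elements), nchunks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            # Each task owns its Dict, so updates here are race-free.
            partial = Dict{eltype(elements),Float64}()
            for k in chunk
                partial[k] = get(partial, k, 0.0) + sin(k)
            end
            partial
        end
    end
    # Serial merge: the only writes to `du` happen on this one task.
    for t in tasks, (k, v) in fetch(t)
        du[k] += v
    end
    return nothing
end
```

Whether this beats a lock-based approach depends on how many distinct indices there are; for nearly-unique indices the Dicts add overhead compared to the plain @floop version.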


BTW, you should also try to use LoopVectorization, which appears to be faster on my system (presumably due to better SIMD utilization):

julia> using LoopVectorization

julia> function build_vector_tturbo(du, elements) # multithreaded
           @tturbo for i in eachindex(elements)
               k = elements[i]
               du[k] += sin(k)
           end
           return nothing
       end
build_vector_tturbo (generic function with 1 method)

julia> function build_vector_turbo(du, elements) # single-threaded
           @turbo for i in eachindex(elements)
               k = elements[i]
               du[k] += sin(k)
           end
           return nothing
       end
build_vector_turbo (generic function with 1 method)

julia> @btime build_vector_turbo(du, k) setup=(du=zeros(10_000_000); k=_k) evals=1;
  10.523 ms (0 allocations: 0 bytes)

julia> @btime build_vector_tturbo(du, k) setup=(du=zeros(10_000_000); k=_k) evals=1;
  3.386 ms (0 allocations: 0 bytes)

(Didn’t check correctness but should be fine.)


Thanks for these suggestions and the explanation of reduction, very helpful. I initially tried LoopVectorization, but it doesn’t seem to work with my actual application, which has some nested loops, unfortunately. I also have a lot more mutation to sort out. Maybe as I learn more I’ll be able to modify my code to use it.

Are you sure about this? Make sure that you’re not just lucky:

julia> k = sample(1:10, 100);

julia> k == unique(k) # no duplicates?
false
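As an aside (not from the thread): Base’s allunique does the same check more directly, without building the intermediate array that unique allocates:

```julia
# allunique returns true iff no element of the collection repeats.
allunique([1, 3, 4, 16, 12])  # true: no duplicates
allunique([1, 2, 1])          # false: 1 appears twice
```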

Yeah, my call to sample used replace = false, so it samples without replacement:

_k = sample(1:10_000_000, 500_000; replace = false)