FLoops for updating a vector by index

I want to make the following function parallel.

function build_vector_explicit(du, elements)
    for k ∈ elements
        du[k] += sin(k)
    end
    return nothing
end

This function takes in a vector du (not necessarily all zeros) and a list of indices elements to update, e.g.

k = [1, 3, 4, 16, 12, 50, 32, 23, 59, 61, 63, 97]
du = zeros(100)
build_vector_explicit(du, k)

How can I use FLoops.jl to make this parallel and efficient (for larger arrays, at least)? I found this question Using FLoops.jl to update array counters - #2 by tkf, which seems to cover it, and wrote the code as

using FLoops
function build_vector_floop(du, elements)
    @floop for k ∈ elements
        ind_to_val = k => sin(k)
        @reduce() do (u = du; ind_to_val)
            if ind_to_val isa Pair
                u[first(ind_to_val)] += last(ind_to_val)
            end
        end
    end
    return nothing
end

This seems to be slower, though:

using Random, BenchmarkTools, StatsBase
Random.seed!(123)
_k = sample(1:10_000_000, 500_000; replace = false)
julia> @benchmark build_vector_explicit($du, $k) setup=(du=zeros(10_000_000); k=_k) evals=1
BenchmarkTools.Trial: 299 samples with 1 evaluation.
 Range (min … max):  2.000 ΞΌs …   8.700 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.300 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.540 ΞΌs Β± 977.458 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

          β–‚β–ˆβ–ƒβ–‚β–‚β–… β–…β–„β–ƒ      β–‚
  β–ƒβ–β–ƒβ–ƒβ–…β–‡β–ˆβ–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–β–ˆβ–ˆβ–ˆβ–†β–†β–†β–ˆβ–β–†β–ˆβ–…β–…β–„β–„β–β–„β–…β–β–ƒβ–ƒβ–ƒβ–β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–β–β–β–„β–β–β–β–β–β–ƒβ–β–β–ƒβ–β–ƒ β–ƒ
  2 ΞΌs            Histogram: frequency by time         7.1 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark build_vector_floop($du, $k) setup=(du=zeros(10_000_000); k=_k) evals=1
BenchmarkTools.Trial: 294 samples with 1 evaluation.
 Range (min … max):  55.100 ΞΌs … 529.000 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     85.000 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   90.738 ΞΌs Β±  36.432 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–‚β–„β–ˆβ–„β–„β–„β–„β–…β–ˆβ–‚β–†β–†β–‚β–‚ ▁
  β–…β–„β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–„β–β–ˆβ–ƒβ–…β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–β–ƒβ–β–ƒβ–β–β–β–ƒβ–ƒβ–β–β–ƒβ–β–β–ƒβ–β–ƒβ–β–β–β–β–β–ƒβ–β–ƒβ–ƒβ–ƒβ–β–ƒ β–ƒ
  55.1 ΞΌs         Histogram: frequency by time          193 ΞΌs <

 Memory estimate: 7.95 KiB, allocs estimate: 107.

What is the correct way to get this working in parallel?

UPDATE Oh, I just noticed that the elements, i.e. the indices, can appear multiple times which could lead to race conditions. Sorry, it’s very early in the morning here :smiley:

Since you don’t want to perform a reduction, I wouldn’t use @reduce at all. Did you try the following straightforward variant?

function build_vector_floop(du, elements)
    @floop for k ∈ elements
        du[k] += sin(k)
    end
    return nothing
end

For me, this gives

julia> @benchmark build_vector_floop(du, k) setup=(du=zeros(10_000_000); k=_k) evals=1
BenchmarkTools.Trial: 164 samples with 1 evaluation.
 Range (min … max):  13.154 ms …  15.656 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     13.374 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   13.497 ms Β± 352.965 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

     β–‚β–…β–…β–ˆ ▁
  β–ƒβ–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–…β–β–†β–…β–…β–β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–„β–„β–β–β–β–ƒβ–ƒβ–ƒβ–ƒβ–β–ƒβ–ƒβ–ƒβ–ƒβ–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒ β–ƒ
  13.2 ms         Histogram: frequency by time           15 ms <

 Memory estimate: 3.67 KiB, allocs estimate: 51.

julia> @benchmark build_vector_explicit(du, k) setup=(du=zeros(10_000_000); k=_k) evals=1
BenchmarkTools.Trial: 75 samples with 1 evaluation.
 Range (min … max):  50.402 ms …  54.720 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     51.398 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   51.471 ms Β± 766.897 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–„  β–ˆ            β–‚
  β–„β–„β–„β–β–†β–ˆβ–„β–ˆβ–ˆβ–„β–ˆβ–„β–„β–„β–†β–„β–„β–†β–„β–†β–†β–ˆβ–ˆβ–ˆβ–„β–ˆβ–β–†β–β–„β–β–„β–†β–ˆβ–β–„β–β–„β–†β–†β–„β–β–β–„β–β–β–†β–β–β–β–β–β–β–„β–β–β–β–β–β–„ ▁
  50.4 ms         Histogram: frequency by time         53.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

So, about 3.8x faster on my machine (with 6 threads).


@carstenbauer : Thanks for that. I should have tried your simpler loop first :slight_smile: I’m pretty confused about what a reduction actually is, so I didn’t realise that’s not what I’m doing here.

Regarding your edit: The provided elements won’t contain any duplicates. So is there no need to worry about race conditions, then? Seems to be the case:

using Test
du = zeros(10_000_000)
build_vector_explicit(du, _k)
true_val = deepcopy(du)
du = zeros(10_000_000)
build_vector_floop(du, _k)
floop_val = deepcopy(du)
@test floop_val == true_val
Test Passed

Glad that my post wasn’t entirely useless then :slight_smile:

A reduction is essentially an operation that “reduces” a collection of things (e.g. a vector of numbers) to just a single thing (e.g. a number). Example: summation.
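To illustrate the concept with plain Base Julia (no FLoops needed; the function names here are just for illustration):

```julia
# A reduction collapses a collection to a single value; summation is the
# canonical example.
sum_reduce(xs) = reduce(+, xs)

# The equivalent explicit loop makes the "combining" step visible:
function sum_loop(xs)
    s = zero(eltype(xs))
    for x in xs
        s += x   # each iteration folds one element into the accumulator
    end
    return s
end

sum_reduce(1:5)  # 15
sum_loop(1:5)    # 15
```

Updating du[k] in place doesn’t fit this pattern, since there is no single accumulated value being produced.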

If the indices don’t contain any duplicates, then there won’t be a race condition, since the loop iterations are entirely independent. (There could still be minor performance issues like false sharing, but that’s a different story.)
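For the duplicate-index case raised in the update above, one possible pattern (a sketch only; the function name and chunking strategy are illustrative, not from this thread) is to let each task accumulate into its own Dict of index => partial sum and merge the partials serially afterwards, so no two tasks ever write to du concurrently:

```julia
# Sketch: safe even when `elements` contains repeated indices.
function build_vector_chunked!(du, elements)
    nchunks = max(1, Threads.nthreads())
    chunks = Iterators.partition(elements, cld(length(elements), nchunks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            # Each task owns its Dict, so updates here are race-free.
            partial = Dict{eltype(elements),Float64}()
            for k in chunk
                partial[k] = get(partial, k, 0.0) + sin(k)
            end
            partial
        end
    end
    # Serial merge: the only writes to `du` happen on this one task.
    for t in tasks, (k, v) in fetch(t)
        du[k] += v
    end
    return nothing
end
```

Whether this beats a lock-based approach depends on how many distinct indices there are; for nearly-unique indices the Dicts add overhead compared to the plain @floop version.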


BTW, you should also try to use LoopVectorization, which appears to be faster on my system (presumably due to better SIMD utilization):

julia> using LoopVectorization

julia> function build_vector_tturbo(du, elements) # multithreaded
           @tturbo for i in eachindex(elements)
               k = elements[i]
               du[k] += sin(k)
           end
           return nothing
       end
build_vector_tturbo (generic function with 1 method)

julia> function build_vector_turbo(du, elements) # single-threaded
           @turbo for i in eachindex(elements)
               k = elements[i]
               du[k] += sin(k)
           end
           return nothing
       end
build_vector_turbo (generic function with 1 method)

julia> @btime build_vector_turbo(du, k) setup=(du=zeros(10_000_000); k=_k) evals=1;
  10.523 ms (0 allocations: 0 bytes)

julia> @btime build_vector_tturbo(du, k) setup=(du=zeros(10_000_000); k=_k) evals=1;
  3.386 ms (0 allocations: 0 bytes)

(Didn’t check correctness but should be fine.)


Thanks for these suggestions and the explanation of reduction, very helpful. I initially tried LoopVectorization, but it doesn’t seem to work with my actual application, which has some nested loops, unfortunately. I also have a lot more mutation to sort out. Maybe as I learn more I’ll be able to modify my code to use it.

Are you sure about this? Make sure that you’re not just lucky:

julia> k = sample(1:10, 100);

julia> k == unique(k) # no duplicates?
false
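As an aside (not from the thread): Base’s allunique does the same check more directly, without building the intermediate array that unique allocates:

```julia
# allunique returns true iff no element of the collection repeats.
allunique([1, 3, 4, 16, 12])  # true: no duplicates
allunique([1, 2, 1])          # false: 1 appears twice
```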

Yeah, my call to sample used replace = false, so it samples without replacement:

_k = sample(1:10_000_000, 500_000; replace = false)