Problem converting serial code to parallel code with FLoops

DanielVandH · September 22, 2022, 1:00am

I have the following code which loops over some indices and adds values to two separate parts of an array.


using FLoops, Random
function update_u!(du, u, k, q, j)
    v = [k, (k + 1) % length(u) + 1, (k + 7) % length(u) + 1]
    q .= [u[v[1]], u[v[3]]]
    du[v[j]] += q' * q
    du[v[(j%3)+1]] -= sin.(q)' * sin.(q)
    nothing
end
function do_serial(du::AbstractVector{T}, u, elements) where {T}
    q = zeros(2, length(elements), 3)
    fill!(du, zero(T))
    for k ∈ elements
        for j in 1:3
            @views update_u!(du, u, k, q[:, k, j], j)
        end
    end
    nothing
end
function do_parallel(du::AbstractVector{T}, u, elements) where {T}
    q = zeros(2, length(elements), 3)
    @floop for k ∈ elements
        for j in 1:3
            @views update_u!(du, u, k, q[:, k, j], j)
        end
    end
    nothing
end
n = 10
u = rand(n)
du_serial = zero(u)
du_parallel = zero(u)
k = randperm(n)
do_serial(du_serial, u, k)
do_parallel(du_parallel, u, k)
[du_serial du_parallel abs.(du_serial .- du_parallel)]

10×3 Matrix{Float64}:
 0.316201   0.316201  0.0
 0.126995   0.126995  2.77556e-17
 0.472551   0.472551  0.0
 ⋮
 0.0436279  0.230509  0.186881
 0.574818   0.574818  0.0
 0.0795132  0.370962  0.291448

I should want these final results to be identical (i.e. have the same results when computing serially and with @floops. How can I modify the code so that this works with FLoops.jl? The code above does sometimes work, but for the larger examples that I do need (n \in (10000, 50000)), the errors accumulate too significantly to handle.

I can’t seem to identify where the issue is actually happening, since I already extended the array q to have a column for each k and for each j to avoid data races there. Is it possible to build reductions on individual vector elements?

jmair · September 22, 2022, 9:05pm

du still has a race condition. Since the index mutated can be k, k+1 or k+7 (ignoring the circular part). So k may be on j=2 while the k+1 loop is on j=1 which results in them both updating index k+1 in du at the same time.

jmair · September 22, 2022, 9:08pm

Also, take a look at StaticArrays.jl, which would let you use the array notation for v and q without allocations or requiring a temp array for storage.

DanielVandH · September 22, 2022, 9:24pm

I see, thank you. How would I rewrite it to fix that? I thought that might not be a worry with @floop since I’m just subtracting values pre-computed, not relying on other du values.

Also, re StaticArrays.jl: Do you mean to use something like

function update_u!(du, u, k, q, j)
    v = @SVector[k, (k + 1) % length(u) + 1, (k + 7) % length(u) + 1]
    q = @SVector[u[v[1]], u[v[3]]]
    du[v[j]] += q' * q
    du[v[(j%3)+1]] -= sin.(q)' * sin.(q)
    nothing
end

at each step? I assumed that would be slower and allocating since I’m creating the vector at each stage, though maybe the compiler is smarter than what I’m assuming.

e: Wow, you’re right:

julia> nonstatic = @benchmark $do_serial($du_serial, $u, $k) # original
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.700 μs … 615.286 μs  ┊ GC (min … max): 0.00% … 97.72%
 Time  (median):     4.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.193 μs ±  16.163 μs  ┊ GC (mean ± σ):  8.08% ±  3.09%

  ██▆▅▃▂▁▁       ▄▂            ▃▃▄▂▂▁                         ▂
  ████████████▇█▇███▆▇▇▆▆▆▆▅▅▅███████▇▇█▆▆▆▅▇▆▆▆▅▆▆▆▅▆▆▇▇▆▆▆▆ █
  4.7 μs       Histogram: log(frequency) by time      11.6 μs <

 Memory estimate: 9.91 KiB, allocs estimate: 121.

julia> static = @benchmark $do_serial_sv($du_serial, $u, $k) # with new update_u!
BenchmarkTools.Trial: 10000 samples with 141 evaluations.
 Range (min … max):  692.199 ns …   3.310 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     717.021 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   775.339 ns ± 184.877 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁██▃▁                                       ▃▂                ▁
  █████▆▆▆▇▆▆▆▆▅▅▃▅▅▃▃▄▄▅▅▅▄▅▄▄▄▄▄▃▄▅▃▅▅▄▅▅▇▇▆████▆▅▆▆▅▄▄▂▅▄▄▃▂ █
  692 ns        Histogram: log(frequency) by time       1.43 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thanks for that suggestion!

jmair · September 22, 2022, 10:44pm

When we say allocations, we mean heap allocations. Static arrays live on the stack and so don’t count towards allocations, they just make the code a lot nicer to write/read.

I don’t use FLoops myself, but I imagine the macro doesn’t see the race condition, and doesn’t compensate for it.

What you really want is to use some form of reduction, but the dependencies make it non-trivial, but I’ll have a think and get back to you. How large do you plan on making n? I ask as going parallel will likely only be faster for much larger n.

DanielVandH · September 22, 2022, 11:06pm

Ah, makes sense - thank you.

I initially tried using FLoops.@reduce, although it wasn’t working for individual array elements (or I just couldn’t figure it out, I guess). n will be typically be around 10,000 - 25,000. The parts of my actual application that I have successfully parallelised give about 20x faster results for these values.

jmair · September 23, 2022, 7:31am

I’ve been able to run the code now, and it seems like the results for the do_parallel is mostly correct, up to machine epsilon (~2e-16 for Float64). You can check the comparison with:

isapprox.(du_serial, du_parallel)

And you can see this is mostly true, but when n gets larger it sometimes stops being true. Since the floating point operations are not guaranteed to happen in the same order as in serial, they may return different results as floating point math is (EDIT after @Oscar_Smith kind correction) non-associative (i.e. a+b+c != c+a+b etc usually). Also the race condition too. I haven’t been able to get the parallel code working well, but what I do have is a more optimised serial one:

function do_serial(du::AbstractVector{T}, u, elements) where {T}
    fill!(du, zero(T))
    for k ∈ elements
        v = @SVector [k, (k + 1) % length(u) + 1, (k + 7) % length(u) + 1]
        q = @SVector [u[v[1]], u[v[3]]]
        mag_q = q' * q
        mag_sin_q = sin.(q)' * sin.(q)
        d_mag = mag_q - mag_sin_q
        # Unrolled the j=1:3 loop
        du[v[1]] += d_mag
        du[v[2]] += d_mag
        du[v[3]] += d_mag
    end
    nothing
end

Which is just under 3x faster for n=20000:

julia> @benchmark do_serial(du_serial, u, k)
BenchmarkTools.Trial: 9494 samples with 1 evaluation.
 Range (min … max):  519.300 μs … 775.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     523.600 μs               ┊ GC (median):    0.00% Time  (mean ± σ):   524.676 μs ±  11.364 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▂ ▄█▁
  ▄██▇███▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▁▂▂▁▂▁▁▁▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▂▂▂ ▃
  519 μs           Histogram: frequency by time          574 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

do_serial_old (generic function with 1 method)

julia> @benchmark do_serial_old(du_serial, u, k)
BenchmarkTools.Trial: 3533 samples with 1 evaluation.
 Range (min … max):  1.403 ms …  2.173 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.407 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.414 ms ± 30.639 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇▆▇▆▄▂▁                                ▁                 ▁
  ████████████▇▆▆▇▅▇▅▇▆▆▆▇▆▇█▆▆▅▇▆▅▅▅▆▆██▇███▇▆▅▃▅▃▃▆█▇██▇▇▆ █
  1.4 ms       Histogram: log(frequency) by time     1.47 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

DanielVandH · September 23, 2022, 2:48pm

Thanks for that. Unfortunately for my application these differences between du_serial and du_parallel seem to be a lot more severe in my actual application, with norm(du_serial - du_parallel) being around 150 rather than machine epsilon typically. That serial code improvement is nice though, cheers. Hopefully the solution can just be found somewhere in FLoops, I’ll keep looking.

Oscar_Smith · September 23, 2022, 2:58pm

Just a minor note. Floating point math is commutative, but not associative.

DanielVandH · October 3, 2022, 9:52pm

Still haven’t been able to get anywhere. I think the only way must be to somehow define reductions for individual vector entries, although I can’t find any examples of that as of yet.

Topic		Replies	Views
FLoops for updating a vector by index Performance parallel , floops	7	479	September 13, 2022
Julia multithreading is running slower than serial, can someone please explain why...? Performance multithreading , floops	15	1390	March 3, 2023
Julia multithreading is running slower than serial, can someone please explain why…? Performance multithreading , floops	14	704	March 24, 2023
How to speed up this function? Performance	2	458	August 8, 2021
Using FLoops.jl to update array counters General Usage parallel	2	740	April 8, 2021

Problem converting serial code to parallel code with FLoops

Related topics