Problem converting serial code to parallel code with FLoops

I have the following code which loops over some indices and adds values to two separate parts of an array.


using FLoops, Random

function update_u!(du, u, k, q, j)
    v = [k, (k + 1) % length(u) + 1, (k + 7) % length(u) + 1]
    q .= [u[v[1]], u[v[3]]]
    du[v[j]] += q' * q
    du[v[(j%3)+1]] -= sin.(q)' * sin.(q)
    nothing
end

function do_serial(du::AbstractVector{T}, u, elements) where {T}
    q = zeros(2, length(elements), 3)
    fill!(du, zero(T))
    for k ∈ elements
        for j in 1:3
            @views update_u!(du, u, k, q[:, k, j], j)
        end
    end
    nothing
end

function do_parallel(du::AbstractVector{T}, u, elements) where {T}
    q = zeros(2, length(elements), 3)
    @floop for k ∈ elements
        for j in 1:3
            @views update_u!(du, u, k, q[:, k, j], j)
        end
    end
    nothing
end
n = 10
u = rand(n)
du_serial = zero(u)
du_parallel = zero(u)
k = randperm(n)
do_serial(du_serial, u, k)
do_parallel(du_parallel, u, k)
[du_serial du_parallel abs.(du_serial .- du_parallel)]

10×3 Matrix{Float64}:
0.316201   0.316201  0.0
0.126995   0.126995  2.77556e-17
0.472551   0.472551  0.0
⋮
0.0436279  0.230509  0.186881
0.574818   0.574818  0.0
0.0795132  0.370962  0.291448


I want these final results to be identical (i.e. to get the same results when computing serially and with @floop). How can I modify the code so that this works with FLoops.jl? The code above does sometimes work, but for the larger examples that I actually need (n between 10,000 and 50,000), the errors accumulate too significantly to handle.

I can’t seem to identify where the issue is actually happening, since I already extended the array q to have a column for each k and for each j to avoid data races there. Is it possible to build reductions on individual vector elements?

du still has a race condition: the mutated index can be k, k + 1, or k + 7 (ignoring the wrap-around). So the iteration for k may be at j = 2 while the iteration for k + 1 is at j = 1, which results in both updating index k + 1 of du at the same time.
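One way around that race, sketched below with plain Base.Threads rather than FLoops (the function name and the chunking scheme are mine, purely illustrative): give each task its own scratch copy of du and sum the copies afterwards, so no two tasks ever write to the same array. Note the result still only matches the serial one up to floating-point rounding, since the summation order differs.

```julia
using Base.Threads

# Hypothetical sketch: per-task buffers instead of a shared `du`.
function do_parallel_buffered!(du::AbstractVector{T}, u, elements) where {T}
    n = length(u)
    nchunks = max(nthreads(), 1)
    bufs = [zeros(T, length(du)) for _ in 1:nchunks]
    chunks = collect(Iterators.partition(eachindex(elements), cld(length(elements), nchunks)))
    @threads for c in 1:length(chunks)
        buf = bufs[c]               # this task writes only into its own buffer
        for i in chunks[c]
            k = elements[i]
            v = (k, (k + 1) % n + 1, (k + 7) % n + 1)
            q1, q2 = u[v[1]], u[v[3]]
            mag = q1^2 + q2^2                 # q' * q
            sinmag = sin(q1)^2 + sin(q2)^2    # sin.(q)' * sin.(q)
            for j in 1:3
                buf[v[j]] += mag
                buf[v[(j % 3) + 1]] -= sinmag
            end
        end
    end
    # Single-threaded merge: no concurrent writes to `du`.
    fill!(du, zero(T))
    for buf in bufs
        du .+= buf
    end
    nothing
end
```

The trade-off is one length-n buffer per task, which is cheap compared to racing updates but does cost memory for very large n.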

Also, take a look at StaticArrays.jl, which would let you use the array notation for v and q without allocations or requiring a temp array for storage.

I see, thank you. How would I rewrite it to fix that? I thought that might not be a worry with @floop, since I’m just adding and subtracting pre-computed values, not relying on other du values.

Also, re StaticArrays.jl: Do you mean to use something like

using StaticArrays

function update_u!(du, u, k, q, j)
    v = @SVector [k, (k + 1) % length(u) + 1, (k + 7) % length(u) + 1]
    q = @SVector [u[v[1]], u[v[3]]]
    du[v[j]] += q' * q
    du[v[(j%3)+1]] -= sin.(q)' * sin.(q)
    nothing
end


at each step? I assumed that would be slower and allocating since I’m creating the vector at each stage, though maybe the compiler is smarter than what I’m assuming.

Edit: Wow, you’re right:

julia> nonstatic = @benchmark $do_serial($du_serial, $u,$k) # original
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max):  4.700 μs … 615.286 μs  ┊ GC (min … max): 0.00% … 97.72%
Time  (median):     4.900 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   6.193 μs ±  16.163 μs  ┊ GC (mean ± σ):  8.08% ±  3.09%

██▆▅▃▂▁▁       ▄▂            ▃▃▄▂▂▁                         ▂
████████████▇█▇███▆▇▇▆▆▆▆▅▅▅███████▇▇█▆▆▆▅▇▆▆▆▅▆▆▆▅▆▆▇▇▆▆▆▆ █
4.7 μs       Histogram: log(frequency) by time      11.6 μs <

Memory estimate: 9.91 KiB, allocs estimate: 121.

julia> static = @benchmark $do_serial_sv($du_serial, $u,$k) # with new update_u!
BenchmarkTools.Trial: 10000 samples with 141 evaluations.
Range (min … max):  692.199 ns …   3.310 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     717.021 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   775.339 ns ± 184.877 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▁██▃▁                                       ▃▂                ▁
█████▆▆▆▇▆▆▆▆▅▅▃▅▅▃▃▄▄▅▅▅▄▅▄▄▄▄▄▃▄▅▃▅▅▄▅▅▇▇▆████▆▅▆▆▅▄▄▂▅▄▄▃▂ █
692 ns        Histogram: log(frequency) by time       1.43 μs <

Memory estimate: 0 bytes, allocs estimate: 0.


Thanks for that suggestion!

When we say allocations, we mean heap allocations. Static arrays live on the stack and so don’t count towards allocations, they just make the code a lot nicer to write/read.
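A small way to see this directly (with StaticArrays loaded; the function names here are just for illustration): @allocated reports zero heap bytes for an SVector built inside a function, while the plain Vector version allocates on every call.

```julia
using StaticArrays

# SVector lives on the stack; the plain Vector is heap-allocated each call.
f_static(k, n) = sum(@SVector [k, (k + 1) % n + 1, (k + 7) % n + 1])
f_dynamic(k, n) = sum([k, (k + 1) % n + 1, (k + 7) % n + 1])

f_static(3, 10); f_dynamic(3, 10)   # warm up so compilation isn't measured
@show @allocated f_static(3, 10)    # 0 bytes
@show @allocated f_dynamic(3, 10)   # > 0 bytes (the temporary Vector)
```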

I don’t use FLoops myself, but I imagine the macro doesn’t see the race condition and doesn’t compensate for it.

What you really want is some form of reduction, but the dependencies between the indices make that non-trivial; I’ll have a think and get back to you. How large do you plan on making n? I ask because going parallel will likely only pay off for much larger n.

Ah, makes sense - thank you.

I initially tried using FLoops.@reduce, although it wasn’t working for individual array elements (or I just couldn’t figure it out, I guess). n will typically be around 10,000 - 25,000. The parts of my actual application that I have successfully parallelised run about 20x faster at these sizes.

I’ve been able to run the code now, and it seems like the results from do_parallel are mostly correct, up to machine epsilon (~2e-16 for Float64). You can check the comparison with:

isapprox.(du_serial, du_parallel)


And you can see this is mostly true, but when n gets larger it sometimes stops being true. Since the floating-point operations are not guaranteed to happen in the same order as in serial, they may return different results, because floating-point math is (EDIT after @Oscar_Smith’s kind correction) non-associative (i.e. (a + b) + c != a + (b + c) in general). There is also the race condition. I haven’t been able to get the parallel code working well, but what I do have is a more optimised serial version:

using StaticArrays

function do_serial(du::AbstractVector{T}, u, elements) where {T}
    fill!(du, zero(T))
    for k ∈ elements
        v = @SVector [k, (k + 1) % length(u) + 1, (k + 7) % length(u) + 1]
        q = @SVector [u[v[1]], u[v[3]]]
        mag_q = q' * q
        mag_sin_q = sin.(q)' * sin.(q)
        d_mag = mag_q - mag_sin_q
        # Unrolled the j = 1:3 loop: over a full sweep, each index in v
        # gains mag_q once and loses mag_sin_q once, i.e. +d_mag net.
        du[v[1]] += d_mag
        du[v[2]] += d_mag
        du[v[3]] += d_mag
    end
    nothing
end


Which is just under 3x faster for n=20000:

julia> @benchmark do_serial(du_serial, u, k)
BenchmarkTools.Trial: 9494 samples with 1 evaluation.
Range (min … max):  519.300 μs … 775.700 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     523.600 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   524.676 μs ±  11.364 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

█▂ ▄█▁
▄██▇███▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▁▂▂▁▂▁▁▁▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▂▂▂ ▃
519 μs           Histogram: frequency by time          574 μs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark do_serial_old(du_serial, u, k)
BenchmarkTools.Trial: 3533 samples with 1 evaluation.
Range (min … max):  1.403 ms …  2.173 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.407 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.414 ms ± 30.639 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

▄█▇▆▇▆▄▂▁                                ▁                 ▁
████████████▇▆▆▇▅▇▅▇▆▆▆▇▆▇█▆▆▅▇▆▅▅▅▆▆██▇███▇▆▅▃▅▃▃▆█▇██▇▇▆ █
1.4 ms       Histogram: log(frequency) by time     1.47 ms <

Memory estimate: 0 bytes, allocs estimate: 0.

Thanks for that. Unfortunately, the differences between du_serial and du_parallel are a lot more severe in my actual application, with norm(du_serial - du_parallel) typically being around 150 rather than machine epsilon. That serial code improvement is nice though, cheers. Hopefully the solution can be found somewhere in FLoops; I’ll keep looking.

Just a minor note. Floating point math is commutative, but not associative.
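To make that concrete (plain Base Julia, nothing package-specific): swapping the operands of a single + gives exactly the same float, but regrouping a chain of +s can change the rounding.

```julia
a, b, c = 0.1, 0.2, 0.3

@show a + b == b + a              # true: commutative
@show (a + b) + c                 # 0.6000000000000001
@show a + (b + c)                 # 0.6
@show (a + b) + c == a + (b + c)  # false: not associative
```

This is exactly why a parallel sum, which regroups the additions across tasks, can differ from the serial sum by a few ulps.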

Still haven’t been able to get anywhere. I think the only way must be to somehow define reductions for individual vector entries, although I can’t find any examples of that yet.
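The closest pattern I’ve found so far is FLoops’ @reduce() do form with a whole-vector accumulator, along the lines of the sketch below (it reuses the earlier observation that each index in v nets mag_q - mag_sin_q per k). It avoids the race, but it allocates an n-vector per iteration, so treat it as a starting point rather than an optimised solution.

```julia
using FLoops

# Sketch: express the per-element updates as a reduction into a whole-vector
# accumulator.  `@reduce() do (du = zeros(n); local_du)` gives each task its
# own `du` initialised to zeros(n); the same `.+=` body that folds in a
# `local_du` also merges two task-local accumulators, as FLoops requires.
function do_floop(u, elements)
    n = length(u)
    @floop for k in elements
        v = (k, (k + 1) % n + 1, (k + 7) % n + 1)
        q1, q2 = u[v[1]], u[v[3]]
        d = (q1^2 + q2^2) - (sin(q1)^2 + sin(q2)^2)  # net change per index in v
        local_du = zeros(n)   # NOTE: one allocation per iteration -- slow
        for i in v
            local_du[i] += d
        end
        @reduce() do (du = zeros(n); local_du)
            du .+= local_du
        end
    end
    return du   # the reduction result is bound after the loop
end
```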