Problem converting serial code to parallel code with FLoops

I have the following code which loops over some indices and adds values to two separate parts of an array.


using FLoops, Random
function update_u!(du, u, k, q, j)
    # the three (circularly wrapped) du indices associated with element k
    v = [k, (k + 1) % length(u) + 1, (k + 7) % length(u) + 1]
    q .= [u[v[1]], u[v[3]]]                  # scratch buffer, fully overwritten here
    du[v[j]] += q' * q                       # add q'q to one of the three indices
    du[v[(j%3)+1]] -= sin.(q)' * sin.(q)     # subtract sin.(q)'sin.(q) from the cyclically next index
    nothing
end
function do_serial(du::AbstractVector{T}, u, elements) where {T}
    q = zeros(2, length(elements), 3)
    fill!(du, zero(T))
    for k ∈ elements
        for j in 1:3
            @views update_u!(du, u, k, q[:, k, j], j)
        end
    end
    nothing
end
function do_parallel(du::AbstractVector{T}, u, elements) where {T}
    q = zeros(2, length(elements), 3)
    @floop for k ∈ elements
        for j in 1:3
            @views update_u!(du, u, k, q[:, k, j], j)
        end
    end
    nothing
end
n = 10
u = rand(n)
du_serial = zero(u)
du_parallel = zero(u)
k = randperm(n)
do_serial(du_serial, u, k)
do_parallel(du_parallel, u, k)
[du_serial du_parallel abs.(du_serial .- du_parallel)]
10×3 Matrix{Float64}:
 0.316201   0.316201  0.0
 0.126995   0.126995  2.77556e-17
 0.472551   0.472551  0.0
 ⋮
 0.0436279  0.230509  0.186881
 0.574818   0.574818  0.0
 0.0795132  0.370962  0.291448

I want these final results to be identical (i.e. to get the same results when computing serially and with @floop). How can I modify the code so that this works with FLoops.jl? The code above does sometimes work, but for the larger examples that I actually need (n between 10,000 and 50,000), the errors accumulate too significantly to handle.

I can't seem to identify where the issue is actually happening, since I already extended the array q to have a column for each k and for each j to avoid data races there. Is it possible to build reductions on individual vector elements?

du still has a race condition: the three du indices mutated by iteration k are k, k+2 and k+8 (ignoring the wrap-around), so iteration k and iteration k+2 can both be updating index k+2 of du at the same time if they run on different threads.
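
To see the overlap concretely, here is a small check (my own illustration, not from the original post; touched is just a throwaway helper) listing the du indices written by two different iterations and intersecting them:

n = 10
touched(k) = [k, (k + 1) % n + 1, (k + 7) % n + 1]  # same index set as v in update_u!

touched(1)                         # [1, 3, 9]
touched(3)                         # [3, 5, 1]
intersect(touched(1), touched(3))  # [1, 3]

If FLoops schedules iterations k = 1 and k = 3 on different threads, both can update du[1] and du[3] concurrently.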


Also, take a look at StaticArrays.jl, which would let you use the array notation for v and q without allocations or requiring a temp array for storage.


I see, thank you. How would I rewrite it to fix that? I thought that might not be a worry with @floop since I'm just subtracting pre-computed values, not relying on other du values.

Also, re StaticArrays.jl: Do you mean to use something like

using StaticArrays  # needed for @SVector

function update_u!(du, u, k, q, j)
    v = @SVector[k, (k + 1) % length(u) + 1, (k + 7) % length(u) + 1]
    q = @SVector[u[v[1]], u[v[3]]]
    du[v[j]] += q' * q
    du[v[(j%3)+1]] -= sin.(q)' * sin.(q)
    nothing
end

at each step? I assumed that would be slower and allocating, since I'm creating the vector at each stage, though maybe the compiler is smarter than I'm assuming.

Edit: Wow, you're right:

julia> nonstatic = @benchmark $do_serial($du_serial, $u, $k) # original
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.700 μs … 615.286 μs  ┊ GC (min … max): 0.00% … 97.72%
 Time  (median):     4.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.193 μs ±  16.163 μs  ┊ GC (mean ± σ):  8.08% ±  3.09%

  ██▆▅▃▂▁▁       ▄▂            ▃▃▄▂▂▁                         ▂
  ████████████▇█▇███▆▇▇▆▆▆▆▅▅▅███████▇▇█▆▆▆▅▇▆▆▆▅▆▆▆▅▆▆▇▇▆▆▆▆ █
  4.7 μs       Histogram: log(frequency) by time      11.6 μs <

 Memory estimate: 9.91 KiB, allocs estimate: 121.

julia> static = @benchmark $do_serial_sv($du_serial, $u, $k) # with new update_u!
BenchmarkTools.Trial: 10000 samples with 141 evaluations.
 Range (min … max):  692.199 ns …   3.310 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     717.021 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   775.339 ns ± 184.877 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁██▃▁                                       ▃▂                ▁
  █████▆▆▆▇▆▆▆▆▅▅▃▅▅▃▃▄▄▅▅▅▄▅▄▄▄▄▄▃▄▅▃▅▅▄▅▅▇▇▆████▆▅▆▆▅▄▄▂▅▄▄▃▂ █
  692 ns        Histogram: log(frequency) by time       1.43 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thanks for that suggestion!


When we say allocations, we mean heap allocations. Static arrays live on the stack and so don't count towards allocations; they also make the code a lot nicer to write/read.
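
A quick way to see the difference (a minimal illustration; make_vec and make_svec are just throwaway helpers):

using StaticArrays, BenchmarkTools

make_vec()  = [1.0, 2.0, 3.0]           # ordinary Vector: heap-allocated
make_svec() = @SVector [1.0, 2.0, 3.0]  # SVector: lives on the stack

@btime make_vec()    # reports 1 heap allocation
@btime make_svec()   # reports 0 allocations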

I don't use FLoops myself, but I imagine the macro doesn't see the race condition, and doesn't compensate for it.

What you really want is some form of reduction, but the dependencies make it non-trivial; I'll have a think and get back to you. How large do you plan on making n? I ask because going parallel will likely only be faster for much larger n.


Ah, makes sense - thank you.

I initially tried using FLoops.@reduce, although it wasn't working for individual array elements (or I just couldn't figure it out, I guess). n will typically be around 10,000-25,000. The parts of my actual application that I have successfully parallelised give about 20x faster results for these values.


I've been able to run the code now, and it seems like the results from do_parallel are mostly correct, up to machine epsilon (~2e-16 for Float64). You can check the comparison with:

isapprox.(du_serial, du_parallel)

And you can see this is mostly true, but when n gets larger it sometimes stops being true. Since the floating point operations are not guaranteed to happen in the same order as in the serial version, they may return different results, because floating point math is (EDIT after @Oscar_Smith's kind correction) non-associative (i.e. (a + b) + c != a + (b + c) in general). There is also the race condition. I haven't been able to get the parallel code working well, but what I do have is a more optimised serial version:

function do_serial(du::AbstractVector{T}, u, elements) where {T}
    fill!(du, zero(T))
    for k ∈ elements
        v = @SVector [k, (k + 1) % length(u) + 1, (k + 7) % length(u) + 1]
        q = @SVector [u[v[1]], u[v[3]]]
        mag_q = q' * q
        mag_sin_q = sin.(q)' * sin.(q)
        d_mag = mag_q - mag_sin_q
        # Unrolled the j=1:3 loop
        du[v[1]] += d_mag
        du[v[2]] += d_mag
        du[v[3]] += d_mag
    end
    nothing
end
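
As a quick sanity check (assuming the original version is kept around as do_serial_old, as in the benchmarks below), the unrolled loop gives the same answer as the original up to floating-point rounding, since the additions are only re-associated:

du_old = zero(u); du_new = zero(u)
do_serial_old(du_old, u, k)   # original j = 1:3 version
do_serial(du_new, u, k)       # unrolled version above
du_old ≈ du_new               # true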

The unrolled version is just under 3x faster for n=20000:

julia> @benchmark do_serial(du_serial, u, k)
BenchmarkTools.Trial: 9494 samples with 1 evaluation.
 Range (min … max):  519.300 μs … 775.700 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     523.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   524.676 μs ±  11.364 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▂ ▄█▁
  ▄██▇███▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▁▂▂▁▂▁▁▁▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▂▂▂ ▃
  519 μs           Histogram: frequency by time          574 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

do_serial_old (generic function with 1 method)

julia> @benchmark do_serial_old(du_serial, u, k)
BenchmarkTools.Trial: 3533 samples with 1 evaluation.
 Range (min … max):  1.403 ms …  2.173 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.407 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.414 ms ± 30.639 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇▆▇▆▄▂▁                                ▁                 ▁
  ████████████▇▆▆▇▅▇▅▇▆▆▆▇▆▇█▆▆▅▇▆▅▅▅▆▆██▇███▇▆▅▃▅▃▃▆█▇██▇▇▆ █
  1.4 ms       Histogram: log(frequency) by time     1.47 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thanks for that. Unfortunately these differences between du_serial and du_parallel are a lot more severe in my actual application, with norm(du_serial - du_parallel) typically being around 150 rather than machine epsilon. That serial code improvement is nice though, cheers. Hopefully the solution can be found somewhere in FLoops; I'll keep looking.


Just a minor note. Floating point math is commutative, but not associative.
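
For example:

(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)  # false: the two groupings round differently
0.1 + 0.2 == 0.2 + 0.1                  # true: commutativity does hold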


Still haven't been able to get anywhere. I think the only way must be to somehow define reductions for individual vector entries, although I can't find any examples of that as of yet.
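
For reference, one direction that might be worth exploring is the "give each task its own copy of du and sum the copies at the end" idea mentioned above. The sketch below is my own rough, untested take on it, using plain Threads.@spawn rather than FLoops.@reduce; do_parallel_buffered and the chunking are invented for illustration, not an established API:

function do_parallel_buffered(du::AbstractVector{T}, u, elements) where {T}
    # one chunk of `elements` per thread; each task accumulates into a private copy of du
    chunksize = max(1, cld(length(elements), Threads.nthreads()))
    chunks = collect(Iterators.partition(elements, chunksize))
    partials = [zeros(T, length(du)) for _ in chunks]
    tasks = map(eachindex(chunks)) do i
        Threads.@spawn begin
            q = zeros(eltype(u), 2)          # private scratch; update_u! overwrites it fully
            for k in chunks[i], j in 1:3
                update_u!(partials[i], u, k, q, j)
            end
        end
    end
    foreach(wait, tasks)
    fill!(du, zero(T))
    for p in partials                        # combine the private copies in a fixed order
        du .+= p
    end
    nothing
end

This removes the race on du, and for a fixed number of threads the combination order is fixed, so results are reproducible from run to run; they still won't be bit-for-bit identical to the serial loop, since the additions are re-associated.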