I have the following code which loops over some indices and adds values to two separate parts of an array.
using FLoops, Random
function update_u!(du, u, k, q, j)
    v = [k, (k + 1) % length(u) + 1, (k + 7) % length(u) + 1]
    q .= [u[v[1]], u[v[3]]]
    du[v[j]] += q' * q
    du[v[(j%3)+1]] -= sin.(q)' * sin.(q)
    nothing
end
function do_serial(du::AbstractVector{T}, u, elements) where {T}
    q = zeros(2, length(elements), 3)
    fill!(du, zero(T))
    for k โ elements
        for j in 1:3
            @views update_u!(du, u, k, q[:, k, j], j)
        end
    end
    nothing
end
function do_parallel(du::AbstractVector{T}, u, elements) where {T}
    q = zeros(2, length(elements), 3)
    @floop for k โ elements
        for j in 1:3
            @views update_u!(du, u, k, q[:, k, j], j)
        end
    end
    nothing
end
n = 10
u = rand(n)
du_serial = zero(u)
du_parallel = zero(u)
k = randperm(n)
do_serial(du_serial, u, k)
do_parallel(du_parallel, u, k)
[du_serial du_parallel abs.(du_serial .- du_parallel)]
10ร3 Matrix{Float64}:
 0.316201   0.316201  0.0
 0.126995   0.126995  2.77556e-17
 0.472551   0.472551  0.0
 โฎ
 0.0436279  0.230509  0.186881
 0.574818   0.574818  0.0
 0.0795132  0.370962  0.291448
I should want these final results to be identical (i.e. have the same results when computing serially and with @floops. How can I modify the code so that this works with FLoops.jl? The code above does sometimes work, but for the larger examples that I do need (n \in (10000, 50000)), the errors accumulate too significantly to handle.
I canโt seem to identify where the issue is actually happening, since I already extended the array q to have a column for each k and for each j to avoid data races there. Is it possible to build reductions on individual vector elements?