# Incorrect output when using Base.Threads

I am trying to write a code which has a computationally expensive triply nested loop. I am trying to parallelize the top lop using the `@threads` macro from `Base.Threads`. The loops involve calling to two functions `GFC` and `GHT`, the functions essentially take many matrices as parameters and return complex numbers. Below is the main part of my code that deals with allocation of variables and the loop

``````function calc()
Δ = 2.02 # in eV
dd = 3                               # Change this to modify number of points
t =  range(-5*10^-12,stop=5*10^-12,length=dd)
t =  collect(t);
x =  zeros(ComplexF64,dd)
yr =  zeros(Float64,dd)
yi =  zeros(Float64,dd)
Temp=300
kT = Temp * 1.380649 * 10^-23  # Boltzmann constant times temperature in Joules
kT= kT * 2.29371227840*10^17         # in atomic units
b = 1/kT
Ωf = Matrix(Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))))*4.55633*10^-6   # Final state  frequencies in atomic units
Ωi = Matrix(Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))))*4.55633*10^-6  # Initial state  frequencies in atomic units
sysize = size(Ωi)[1]
P = zeros(Float64,sysize,sysize)
for i in range(1,sysize)
P[i,i] = (2*sinh(b*Ωi[i,i]/2))
end
T = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
T = T*T';
D = readdlm("d.txt", Float64); # Displacement Matrix
J = readdlm("j.txt", Float64); # Duchinsky Matrix
Sf = zeros(ComplexF64,sysize,sysize)
Si = zeros(ComplexF64,sysize,sysize)
Bf = zeros(ComplexF64,sysize,sysize)
Bi = zeros(ComplexF64,sysize,sysize)
X11 = zeros(ComplexF64,sysize,sysize)
X12 = zeros(ComplexF64,sysize,sysize)
X21 = zeros(ComplexF64,sysize,sysize)
X22 = zeros(ComplexF64,sysize,sysize)
Y1 = zeros(ComplexF64,sysize,1)
Y2 = zeros(ComplexF64,sysize,1)
t[Int64((dd+1)/2)] = 10e-25

################## Nested Loop Over Time ###################################
s=0
println("The value of s is \$s")
gfcpart=GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)[1]
for m in range(1,sysize)
for mp in range(1,sysize)
s=s+    (GHT(t[i],b,sysize,Ωf,Ωi,J,D,m,mp,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2)[1]*gfcpart*T[m,mp])
end
end
x[i]=s
end
yr[i] = real(x[i])
yi[i] = imag(x[i])
end
end
``````

I have used multiple `println()` statements to figure out where the problem lies. Whenever I run it with a single thread a get a single (correct) output. But whenever I use more than one thread, I get a different wrong output every time.

I know that the feature exhibited by the output represents a classic case of incorrect parallelization. I know that each iteration of `i` is independent of others so there should be no problem. Also I tried declared `x` `yr` `yi` all as `SharedArrays` , that also did not help, from the `println()` statements I concluded that the variable `gfcpart` is being calculated incorrectly when using multiple threads.

Any help towards what I have to do to correct this problem is greatly appreciated.
PS - I have also triend to parallelize it using `Distributed` , declaring the commonly accessed arrays as `SharedArrays` and distributing the upper loop using `@sync @distributed` macros. That reproduced the correct serially attained output everytime. I am trying to do the same with `Threads` because as per the Julia manual it has the least overhead, and I plan to run my code on a single machine.

Your code suggests that the `GFC` function uses its arguments Si, Sf, etc as scratch-space, i.e. mutates these arguments. That would be a race condition.

2 Likes

Picking up on what @foobar_lv2 said. If `GFC` and `GHT` modify some of their inputs, then different threads interfere with each other. In your example the loop you want to parallelize only runs 3 times but I assume in reality it will be much more.

The easiest thing to remedy the race condition is to split the workload into chunks and preallocate the things that will be mutated for each chunk. This can be done quite conveniently with `ChunkSplitters.jl`. Here is the outline:

``````using ChunkSplitters
function calc()
# define the variables, do precomputations, allocate output arrays

################## Nested Loop Over Time ###################################
# @sync tells Julia to wait for any tasks started inside it
@sync for (xrange, ichunk) in chunks(range(1,dd), nchunks)
# xrange are the indices this chunk contains, ichunk is just the index of the chunk, which we don't need in this scenario
Threads.@spawn begin # this line tells Julia to start a new thread with the contents of the begin block
# now preallocate the workspaces here
for i in xrange
# rest of code for the computation
# ...

end # loop
Also, consider that using `@threads` is no longer considered the best way to parallelize code.