I am trying to write a code which has a computationally expensive triply nested loop. I am trying to parallelize the top lop using the `@threads`

macro from `Base.Threads`

. The loops involve calling to two functions `GFC`

and `GHT`

, the functions essentially take many matrices as parameters and return complex numbers. Below is the main part of my code that deals with allocation of variables and the loop

```
function calc()
Δ = 2.02 # in eV
dd = 3 # Change this to modify number of points
t = range(-5*10^-12,stop=5*10^-12,length=dd)
t = collect(t);
x = zeros(ComplexF64,dd)
yr = zeros(Float64,dd)
yi = zeros(Float64,dd)
Temp=300
kT = Temp * 1.380649 * 10^-23 # Boltzmann constant times temperature in Joules
kT= kT * 2.29371227840*10^17 # in atomic units
b = 1/kT
Ωf = Matrix(Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))))*4.55633*10^-6 # Final state frequencies in atomic units
Ωi = Matrix(Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))))*4.55633*10^-6 # Initial state frequencies in atomic units
sysize = size(Ωi)[1]
P = zeros(Float64,sysize,sysize)
for i in range(1,sysize)
P[i,i] = (2*sinh(b*Ωi[i,i]/2))
end
T = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
T = T*T';
D = readdlm("d.txt", Float64); # Displacement Matrix
J = readdlm("j.txt", Float64); # Duchinsky Matrix
Sf = zeros(ComplexF64,sysize,sysize)
Si = zeros(ComplexF64,sysize,sysize)
Bf = zeros(ComplexF64,sysize,sysize)
Bi = zeros(ComplexF64,sysize,sysize)
X11 = zeros(ComplexF64,sysize,sysize)
X12 = zeros(ComplexF64,sysize,sysize)
X21 = zeros(ComplexF64,sysize,sysize)
X22 = zeros(ComplexF64,sysize,sysize)
Y1 = zeros(ComplexF64,sysize,1)
Y2 = zeros(ComplexF64,sysize,1)
t[Int64((dd+1)/2)] = 10e-25
################## Nested Loop Over Time ###################################
@threads for i in range(1,dd)
# println("ITERATION NUMBER = $i, started by thread $(Threads.threadid())")
s=0
println("The value of s is $s")
gfcpart=GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)[1]
println("THE VALUE OF GFC is $gfcpart calculated by thread $(Threads.threadid()), t = $(t[i])")
for m in range(1,sysize)
for mp in range(1,sysize)
s=s+ (GHT(t[i],b,sysize,Ωf,Ωi,J,D,m,mp,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2)[1]*gfcpart*T[m,mp])
end
end
x[i]=s
println("G(t) AT TIME t = $(t[i]) is calculated by thread $(Threads.threadid()) = $(x[i])")
# println("ITERATION NUMBER = $i, ended by thread $(Threads.threadid())")
# println("THE VALUE OF S is $s calculated by thread $(Threads.threadid())")
end
@threads for i in range(1,dd)
yr[i] = real(x[i])
yi[i] = imag(x[i])
end
end
```

I have used multiple `println()`

statements to figure out where the problem lies. Whenever I run it with a single thread a get a single (correct) output. But whenever I use more than one thread, I get a different wrong output every time.

**I know that the feature exhibited by the output represents a classic case of incorrect parallelization.** I know that each iteration of `i`

is independent of others so there should be no problem. Also I tried declared `x`

`yr`

`yi`

all as `SharedArrays`

, that also did not help, from the `println()`

statements I concluded that the variable `gfcpart`

is being calculated incorrectly when using multiple threads.

Any help towards what I have to do to correct this problem is greatly appreciated.

PS - I have also triend to parallelize it using `Distributed`

, declaring the commonly accessed arrays as `SharedArrays`

and distributing the upper loop using `@sync @distributed`

macros. That reproduced the correct serially attained output everytime. I am trying to do the same with `Threads`

because as per the Julia manual it has the least overhead, and I plan to run my code on a single machine.