I am trying to write a code which has a computationally expensive triply nested loop. I am trying to parallelize the top lop using the @threads
macro from Base.Threads
. The loops involve calling to two functions GFC
and GHT
, the functions essentially take many matrices as parameters and return complex numbers. Below is the main part of my code that deals with allocation of variables and the loop
function calc()
Δ = 2.02 # in eV
dd = 3 # Change this to modify number of points
t = range(-5*10^-12,stop=5*10^-12,length=dd)
t = collect(t);
x = zeros(ComplexF64,dd)
yr = zeros(Float64,dd)
yi = zeros(Float64,dd)
Temp=300
kT = Temp * 1.380649 * 10^-23 # Boltzmann constant times temperature in Joules
kT= kT * 2.29371227840*10^17 # in atomic units
b = 1/kT
Ωf = Matrix(Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))))*4.55633*10^-6 # Final state frequencies in atomic units
Ωi = Matrix(Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))))*4.55633*10^-6 # Initial state frequencies in atomic units
sysize = size(Ωi)[1]
P = zeros(Float64,sysize,sysize)
for i in range(1,sysize)
P[i,i] = (2*sinh(b*Ωi[i,i]/2))
end
T = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
T = T*T';
D = readdlm("d.txt", Float64); # Displacement Matrix
J = readdlm("j.txt", Float64); # Duchinsky Matrix
Sf = zeros(ComplexF64,sysize,sysize)
Si = zeros(ComplexF64,sysize,sysize)
Bf = zeros(ComplexF64,sysize,sysize)
Bi = zeros(ComplexF64,sysize,sysize)
X11 = zeros(ComplexF64,sysize,sysize)
X12 = zeros(ComplexF64,sysize,sysize)
X21 = zeros(ComplexF64,sysize,sysize)
X22 = zeros(ComplexF64,sysize,sysize)
Y1 = zeros(ComplexF64,sysize,1)
Y2 = zeros(ComplexF64,sysize,1)
t[Int64((dd+1)/2)] = 10e-25
################## Nested Loop Over Time ###################################
@threads for i in range(1,dd)
# println("ITERATION NUMBER = $i, started by thread $(Threads.threadid())")
s=0
println("The value of s is $s")
gfcpart=GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)[1]
println("THE VALUE OF GFC is $gfcpart calculated by thread $(Threads.threadid()), t = $(t[i])")
for m in range(1,sysize)
for mp in range(1,sysize)
s=s+ (GHT(t[i],b,sysize,Ωf,Ωi,J,D,m,mp,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2)[1]*gfcpart*T[m,mp])
end
end
x[i]=s
println("G(t) AT TIME t = $(t[i]) is calculated by thread $(Threads.threadid()) = $(x[i])")
# println("ITERATION NUMBER = $i, ended by thread $(Threads.threadid())")
# println("THE VALUE OF S is $s calculated by thread $(Threads.threadid())")
end
@threads for i in range(1,dd)
yr[i] = real(x[i])
yi[i] = imag(x[i])
end
end
I have used multiple println()
statements to figure out where the problem lies. Whenever I run it with a single thread a get a single (correct) output. But whenever I use more than one thread, I get a different wrong output every time.
I know that the feature exhibited by the output represents a classic case of incorrect parallelization. I know that each iteration of i
is independent of others so there should be no problem. Also I tried declared x
yr
yi
all as SharedArrays
, that also did not help, from the println()
statements I concluded that the variable gfcpart
is being calculated incorrectly when using multiple threads.
Any help towards what I have to do to correct this problem is greatly appreciated.
PS - I have also triend to parallelize it using Distributed
, declaring the commonly accessed arrays as SharedArrays
and distributing the upper loop using @sync @distributed
macros. That reproduced the correct serially attained output everytime. I am trying to do the same with Threads
because as per the Julia manual it has the least overhead, and I plan to run my code on a single machine.