Probable data race condition causing problems when trying to parallelize a loop used to populate an array

Uranium238 · August 1, 2024, 2:25pm

I am trying to parallelize a part of my code where different values of an array x are populated asynchronously. The serial function of my full code is as follows

using Distributed
using MKL
using DelimitedFiles
using LinearAlgebra
using LaTeXStrings
using TimerOutputs
using Plots
using Plots.PlotMeasures
using FFTW
using SharedArrays
BLAS.set_num_threads(1)
const to = TimerOutput()
println("Number of processes = 1")
open("ISC-SERIAL.TXT", "a") do f
    write(f, "\n\n\n====================================================\n\n\n")
end

function GDSO(
    t::Float64,
    b::Float64,
    Δ::Float64,
    sysize::Int64,
    Ωf::Diagonal{Float64},
    Ωi::Diagonal{Float64},
    J::Matrix{Float64},
    D::Matrix{Float64},
    ρ::Diagonal{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64}
)
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
        Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
        Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
        Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
    end
    BB = (Bf + J' * Bi * J)
    AA = (Sf + J' * Si * J)

    W = [BB -AA
        -AA BB] #We wont calculate the 2Nx2N determinant directly.

    # Wapp = BB*(BB-  AA*(BB^-1)*AA  )
    Wapp = BB * (BB - AA * (BB \ AA))
    U = Bi - Si
    V = [J' * U * D
        J' * U * D]
    # F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))
    # F1 = sqrt(det(Si*Sf*ρ^2)/det(Wapp))
    F1 = sqrt(det(Si * Sf * (Wapp \ ρ^2)))
    F2 = (-(im / 2) * (transpose(V) * (W \ V))) + ((im) * (transpose(D) * U * D))
    g = F1 * exp(F2)
    Δ = Δ * 0.0367493  # in atomic units
    g = g * exp(im * Δ * t)
    η = 2 * 4.55633e-6  # 10 cm^-1 to hatree energy
    #g = g * exp(-η * abs(t))     # Adding the damping factor Lorrentzian
    g = g * exp(-η * t^2)     # Adding the damping factor Gaussian
    # println("t = $t was evaluated by process $(myid())")
    return g[1]
end

@views function GSV(
    t::Float64,
    b::Float64,
    sysize::Int64,
    Ωf::Diagonal{Float64},
    Ωi::Diagonal{Float64},
    J::Matrix{Float64},
    D::Matrix{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64},
    X11::Matrix{ComplexF64},
    X12::Matrix{ComplexF64},
    X21::Matrix{ComplexF64},
    X22::Matrix{ComplexF64},
    Y1::Matrix{ComplexF64},
    Y2::Matrix{ComplexF64},
    gdsopart::ComplexF64,
    T::Matrix{Float64},       ## Product if NAC and SO 
    Hso::Float64
)

    ################################## Preliminary Part , Common for both Frank Condon and Herzberg Teller ##############################
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
        Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
        Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
        Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
    end
    W = [(Bf+J'*Bi*J) -(Sf + J' * Si * J)
        -(Sf + J' * Si * J) (Bf+J'*Bi*J)]
    U = Bi - Si
    V = [J' * U * D
        J' * U * D]

    ########## Defining the X and Y Matrices ##################################


    ################# Final Calculation and returning value ##################
    s = zero(ComplexF64)
    MJSJ = J' * Si * J
    MJBJ = J' * Bi * J
    MJUD = J' * U * D
    Win = inv(W)
    WinV = Win * V
    X = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Y = zeros(ComplexF64, 2 * sysize, 1)
    XWin = similar(Win)
    for m in 1:sysize
        for mp in 1:sysize
            X11[m, :] .= (MJSJ)[mp, :] .* (-Bf[m, m])
            X12[m, :] .= (MJBJ)[mp, :] .* (Bf[m, m])
            X21[m, :] .= (MJSJ)[mp, :] .* (Sf[m, m])
            X22[m, :] .= (MJBJ)[mp, :] .* (-Sf[m, m])
            #            X = [X11 X12
            #                X21 X22]
            X[1:sysize, 1:sysize] .= X11
            X[1:sysize, sysize+1:2*sysize] .= X12
            X[sysize+1:2*sysize, 1:sysize] .= X21
            X[sysize+1:2*sysize, sysize+1:2*sysize] .= X22

            Y1[m, :] .= Bf[m, m] * (MJUD)[mp, :]
            Y2[m, :] .= -Sf[m, m] * (MJUD)[mp, :]
            Y[1:sysize, 1] .= Y1
            Y[sysize+1:2*sysize, 1] .= Y2
            g = im * (tr(mul!(XWin, X, Win))  + (transpose(WinV)*X*WinV)[1] - (transpose(Y)*WinV)[1])
            s = s + g * gdsopart * T[m, mp] 
        end
        X11[m, 1:sysize] .= 0
        X12[m, 1:sysize] .= 0
        X21[m, 1:sysize] .= 0
        X22[m, 1:sysize] .= 0
        Y1[m, 1] = 0
        Y2[m, 1] = 0
    end
    s = s + abs(Hso)^2 * gdsopart   #### LATEST MODIFICATION
    return s
end

const hso_s1t3 = 1.49 * 0.0000046  ## SOC between S1 and T3 = 0.48 cm-1, converting to atomic units
const Delta = 0.11 # in eV
const Delta_s1t3 = 0.091 * 0.037 # Difference between S1 and T3 = 0.091 eV in atomic units
const dd = 22                            # Change this to modify number of points
const Temp = 300
kT_temp = Temp * 1.380649e-23  # Boltzmann constant times temperature in Joules
const kT = kT_temp * 2.29371227840e17         # in atomic units
const b = 1 / kT
const omegaf = Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))) * 4.55633e-6   # Final state  frequencies in atomic units
const omegai = Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))) * 4.55633e-6  # Initial state  frequencies in atomic units
const sysize = size(omegai)[1]
const P = @. 2 * sinh(b * omegai / 2)
T_temp = readdlm("nac.txt", '\t', Float64, '\n') # NAC matrix
const T = (T_temp * T_temp')* (hso_s1t3^2/Delta_s1t3^2)
const D = readdlm("d.txt", Float64) # Displacement Matrix
const J = readdlm("j.txt", Float64) # Duchinsky Matrix
const t = collect(range(-5e-12, stop=5e-12, length=dd))
t[div(dd + 1, 2)] = 10e-25

function calc(t, b, Δ, sysize, Ωf, Ωi, J, D, P)
    # @sync tells Julia to wait for any tasks started inside it
    open("ISC-SERIAL.TXT", "a") do f
        write(f, "STARTING LOOP WITH NUMBER OF CORES = 1 \n====================================================\n")
    end
    x = zeros(ComplexF64, dd)
    Sf = Diagonal(zeros(ComplexF64, sysize))
    Si = Diagonal(zeros(ComplexF64, sysize))
    Bf = Diagonal(zeros(ComplexF64, sysize))
    Bi = Diagonal(zeros(ComplexF64, sysize))
    X11 = zeros(ComplexF64, sysize, sysize)
    X12 = zeros(ComplexF64, sysize, sysize)
    X21 = zeros(ComplexF64, sysize, sysize)
    X22 = zeros(ComplexF64, sysize, sysize)
    Y1 = zeros(ComplexF64, sysize, 1)
    Y2 = zeros(ComplexF64, sysize, 1)
    Hso = 0.48 * 0.0000046  ## SOC between S1 and T2 = 0.48 cm-1, converting to atomic units
    @timeit to "GT calculation" for i in 1:dd
        gdsopart = GDSO(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)
        gtotal = GSV(t[i], b, sysize, Ωf, Ωi, J, D, Si, Sf, Bi, Bf, X11, X12, X21, X22, Y1, Y2, gdsopart, T,Hso)
        x[i] = gtotal
        open("ISC-SERIAL.TXT", "a") do f
            write(f, "ITERATION $i COMPLETED BY PROCESS 1 \n")
        end
    end # loop
    open("ISC-SERIAL.TXT", "a") do f
        write(f, "LOOP COMPLETE \n====================================================")
    end
    yr = real.(x)
    yi = imag.(x)
    p = plot(t, yr, linewidth=2, color="blue", title="Δ = $Δ", label="Real part", grid=false)
    p = plot!(t, yi, linewidth=2, color="red", label="Imaginary part", size=(1000, 800), xlims=(-30e-15, 30e-15))
    savefig(p, "gt.svg")
    begin # This part does FFTW and finds final rate constant

        fs = abs(t[1] - t[2])^-1
        signal = yr
        p = plan_fft(signal)
        F = fftshift(p * signal)
        freqs = fftshift(fftfreq(length(t), fs))  # X axis has frequencies in s^-1
        #Covert freqs to eV
        freqs = freqs * 4.13558e-15
        F = F * 6.57e15
        time_domain = plot(t, signal, title=L"G(t)", legend=false, color="orange", xlabel=L"\mathrm{Time}\ (s)", ylabel=L"\mathrm{A.u}")
        freq_domain = plot(freqs, abs.(F), title=L"\mathrm{Fourier\ transformed\ g(\bar{\nu}))}", legend=false, xlabel=L"\mathrm{Energy\ (eV)}", ylabel=L"\mathrm{k_{IC}\ (s^{-1})}", xlims=(0, 4))
        p = plot(time_domain, freq_domain, layout=(2, 1), size=(1000, 800), right_margin=10mm)
        savefig(p, "gt-fft.svg")
        ee = Δ# * 8065.73  #Converting eV to cm^-1
        index = findmin(abs.(freqs .- ee))[2]  # Find the index of freq closest to ee
        freqs[index]
        println("The rate constant is ", abs.(F)[index])
        rc = abs.(F)[index]
    end
#################################################This part writes the output to a file#######################################################
    begin 
        open("ISC-SERIAL.TXT", "a") do f
            write(f, "-------------------------\n")
            write(f, "NUMBER OF CORES = 1\n")
            write(f, "The value of Δ is $Δ\n")
            write(f, "Number of time points = $dd\n")
            write(f, "The rate constant calculated from Julia FFTW = $rc\n")
            write(f, "-------------------------\n")
            write(f, "The values of t,Re(g(t)),Im(g(t)) are \n\n")
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end

        open("GT-VALUES-SERIAL.txt", "w") do f
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end

    end

end


calc(t, b, Delta, sysize, omegaf, omegai, J, D, P)
display(to)

Which I have parralelized by declaring all the functions and common variables as @everywhere and adding @sync @distributed macro to the intended loop.

using Distributed
@everywhere begin
using MKL
using DelimitedFiles
using LinearAlgebra
using LaTeXStrings
using TimerOutputs
using Plots
using Plots.PlotMeasures
using FFTW
using SharedArrays
BLAS.set_num_threads(1)
end
const to = TimerOutput()
println("Number of processes = $(Distributed.nprocs())")
open("ISC-DISTRIBTUED.TXT", "a") do f
    write(f, "\n\n\n====================================================\n\n\n")
end

@everywhere function GDSO(
    t::Float64,
    b::Float64,
    Δ::Float64,
    sysize::Int64,
    Ωf::Diagonal{Float64},
    Ωi::Diagonal{Float64},
    J::Matrix{Float64},
    D::Matrix{Float64},
    ρ::Diagonal{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64}
)
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
        Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
        Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
        Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
    end
    BB = (Bf + J' * Bi * J)
    AA = (Sf + J' * Si * J)

    W = [BB -AA
        -AA BB] #We wont calculate the 2Nx2N determinant directly.

    # Wapp = BB*(BB-  AA*(BB^-1)*AA  )
    Wapp = BB * (BB - AA * (BB \ AA))
    U = Bi - Si
    V = [J' * U * D
        J' * U * D]
    # F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))
    # F1 = sqrt(det(Si*Sf*ρ^2)/det(Wapp))
    F1 = sqrt(det(Si * Sf * (Wapp \ ρ^2)))
    F2 = (-(im / 2) * (transpose(V) * (W \ V))) + ((im) * (transpose(D) * U * D))
    g = F1 * exp(F2)
    Δ = Δ * 0.0367493  # in atomic units
    g = g * exp(im * Δ * t)
    η = 2 * 4.55633e-6  # 10 cm^-1 to hatree energy
    #g = g * exp(-η * abs(t))     # Adding the damping factor Lorrentzian
    g = g * exp(-η * t^2)     # Adding the damping factor Gaussian
    # println("t = $t was evaluated by process $(myid())")
    return g[1]
end

@everywhere @views function GSV(
    t::Float64,
    b::Float64,
    sysize::Int64,
    Ωf::Diagonal{Float64},
    Ωi::Diagonal{Float64},
    J::Matrix{Float64},
    D::Matrix{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64},
    X11::Matrix{ComplexF64},
    X12::Matrix{ComplexF64},
    X21::Matrix{ComplexF64},
    X22::Matrix{ComplexF64},
    Y1::Matrix{ComplexF64},
    Y2::Matrix{ComplexF64},
    gdsopart::ComplexF64,
    T::Matrix{Float64},       ## Product if NAC and SO 
    Hso::Float64
)

    ################################## Preliminary Part , Common for both Frank Condon and Herzberg Teller ##############################
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
        Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
        Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
        Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
    end
    W = [(Bf+J'*Bi*J) -(Sf + J' * Si * J)
        -(Sf + J' * Si * J) (Bf+J'*Bi*J)]
    U = Bi - Si
    V = [J' * U * D
        J' * U * D]

    ########## Defining the X and Y Matrices ##################################


    ################# Final Calculation and returning value ##################
    s = zero(ComplexF64)
    MJSJ = J' * Si * J
    MJBJ = J' * Bi * J
    MJUD = J' * U * D
    Win = inv(W)
    WinV = Win * V
    X = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Y = zeros(ComplexF64, 2 * sysize, 1)
    XWin = similar(Win)
    for m in 1:sysize
        for mp in 1:sysize
            X11[m, :] .= (MJSJ)[mp, :] .* (-Bf[m, m])
            X12[m, :] .= (MJBJ)[mp, :] .* (Bf[m, m])
            X21[m, :] .= (MJSJ)[mp, :] .* (Sf[m, m])
            X22[m, :] .= (MJBJ)[mp, :] .* (-Sf[m, m])
            #            X = [X11 X12
            #                X21 X22]
            X[1:sysize, 1:sysize] .= X11
            X[1:sysize, sysize+1:2*sysize] .= X12
            X[sysize+1:2*sysize, 1:sysize] .= X21
            X[sysize+1:2*sysize, sysize+1:2*sysize] .= X22

            Y1[m, :] .= Bf[m, m] * (MJUD)[mp, :]
            Y2[m, :] .= -Sf[m, m] * (MJUD)[mp, :]
            Y[1:sysize, 1] .= Y1
            Y[sysize+1:2*sysize, 1] .= Y2
            g = im * (tr(mul!(XWin, X, Win))  + (transpose(WinV)*X*WinV)[1] - (transpose(Y)*WinV)[1])
            s = s + g * gdsopart * T[m, mp] 
        end
        X11[m, 1:sysize] .= 0
        X12[m, 1:sysize] .= 0
        X21[m, 1:sysize] .= 0
        X22[m, 1:sysize] .= 0
        Y1[m, 1] = 0
        Y2[m, 1] = 0
    end
    s = s + abs(Hso)^2 * gdsopart   #### LATEST MODIFICATION
    return s
end
@everywhere begin
const hso_s1t3 = 1.49 * 0.0000046  ## SOC between S1 and T3 = 0.48 cm-1, converting to atomic units
const Delta = 0.11 # in eV
const Delta_s1t3 = 0.091 * 0.037 # Difference between S1 and T3 = 0.091 eV in atomic units
const dd = 22                              # Change this to modify number of points
const Temp = 300
kT_temp = Temp * 1.380649e-23  # Boltzmann constant times temperature in Joules
const kT = kT_temp * 2.29371227840e17         # in atomic units
const b = 1 / kT
const omegaf = Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))) * 4.55633e-6   # Final state  frequencies in atomic units
const omegai = Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))) * 4.55633e-6  # Initial state  frequencies in atomic units
const sysize = size(omegai)[1]
const P = @. 2 * sinh(b * omegai / 2)
T_temp = readdlm("nac.txt", '\t', Float64, '\n') # NAC matrix
const T = (T_temp * T_temp')* (hso_s1t3^2/Delta_s1t3^2)
const D = readdlm("d.txt", Float64) # Displacement Matrix
const J = readdlm("j.txt", Float64) # Duchinsky Matrix
const t = collect(range(-5e-12, stop=5e-12, length=dd))
t[div(dd + 1, 2)] = 10e-25
end

function calc(t, b, Δ, sysize, Ωf, Ωi, J, D, P)
    # @sync tells Julia to wait for any tasks started inside it
    open("ISC-DISTRIBTUED.TXT", "a") do f
        write(f, "STARTING LOOP WITH NUMBER OF CORES = $(Distributed.nprocs()) \n====================================================\n")
    end
    x = zeros(ComplexF64, dd)
    Sf = Diagonal(zeros(ComplexF64, sysize))
    Si = Diagonal(zeros(ComplexF64, sysize))
    Bf = Diagonal(zeros(ComplexF64, sysize))
    Bi = Diagonal(zeros(ComplexF64, sysize))
    X11 = zeros(ComplexF64, sysize, sysize)
    X12 = zeros(ComplexF64, sysize, sysize)
    X21 = zeros(ComplexF64, sysize, sysize)
    X22 = zeros(ComplexF64, sysize, sysize)
    Y1 = zeros(ComplexF64, sysize, 1)
    Y2 = zeros(ComplexF64, sysize, 1)
    Hso = 0.48 * 0.0000046  ## SOC between S1 and T2 = 0.48 cm-1, converting to atomic units
    @timeit to "GT calculation" @sync @distributed for i in 1:dd
        gdsopart = GDSO(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)
        gtotal = GSV(t[i], b, sysize, Ωf, Ωi, J, D, Si, Sf, Bi, Bf, X11, X12, X21, X22, Y1, Y2, gdsopart, T,Hso)
        x[i] = gtotal
        open("ISC-DISTRIBTUED.TXT", "a") do f
            write(f, "ITERATION $i COMPLETED BY PROCESS $(Distributed.myid()) \n")
        end
    end # loop
    open("ISC-DISTRIBTUED.TXT", "a") do f
        write(f, "LOOP COMPLETE \n====================================================")
    end
    yr = real.(x)
    yi = imag.(x)
    p = plot(t, yr, linewidth=2, color="blue", title="Δ = $Δ", label="Real part", grid=false)
    p = plot!(t, yi, linewidth=2, color="red", label="Imaginary part", size=(1000, 800), xlims=(-30e-15, 30e-15))
    savefig(p, "gt.svg")
    begin # This part does FFTW and finds final rate constant

        fs = abs(t[1] - t[2])^-1
        signal = yr
        p = plan_fft(signal)
        F = fftshift(p * signal)
        freqs = fftshift(fftfreq(length(t), fs))  # X axis has frequencies in s^-1
        #Covert freqs to eV
        freqs = freqs * 4.13558e-15
        F = F * 6.57e15
        time_domain = plot(t, signal, title=L"G(t)", legend=false, color="orange", xlabel=L"\mathrm{Time}\ (s)", ylabel=L"\mathrm{A.u}")
        freq_domain = plot(freqs, abs.(F), title=L"\mathrm{Fourier\ transformed\ g(\bar{\nu}))}", legend=false, xlabel=L"\mathrm{Energy\ (eV)}", ylabel=L"\mathrm{k_{IC}\ (s^{-1})}", xlims=(0, 4))
        p = plot(time_domain, freq_domain, layout=(2, 1), size=(1000, 800), right_margin=10mm)
        savefig(p, "gt-fft.svg")
        ee = Δ# * 8065.73  #Converting eV to cm^-1
        index = findmin(abs.(freqs .- ee))[2]  # Find the index of freq closest to ee
        freqs[index]
        println("The rate constant is ", abs.(F)[index])
        rc = abs.(F)[index]
    end
#################################################This part writes the output to a file#######################################################
    begin 
        open("ISC-DISTRIBTUED.TXT", "a") do f
            write(f, "-------------------------\n")
            write(f, "NUMBER OF CORES = 1\n")
            write(f, "The value of Δ is $Δ\n")
            write(f, "Number of time points = $dd\n")
            write(f, "The rate constant calculated from Julia FFTW = $rc\n")
            write(f, "-------------------------\n")
            write(f, "The values of t,Re(g(t)),Im(g(t)) are \n\n")
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end

        open("GT-VALUES-DISTRIBTUED.txt", "w") do f
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end

    end

end


calc(t, b, Delta, sysize, omegaf, omegai, J, D, P)
display(to)

However, it seems the array x is not being populated at all as it’s values are still 0.0 after the parallel loop is ran as indicated by the output GT-VALUES which reads

-5.0e-12  0.0  0.0
-4.0e-12  0.0  0.0
-2.9999999999999997e-12  0.0  0.0
-2.0e-12  0.0  0.0
-1.0e-12  0.0  0.0
1.0e-24  6.394092795554566e-12  4.2656647577696985e-16
1.0e-12  0.0  0.0
2.0e-12  0.0  0.0
2.9999999999999997e-12  0.0  0.0
4.0e-12  0.0  0.0
5.0e-12  0.0  0.0

There is probably some data race condition going on that I am unable to detect which might be causing the problem. Please help me if possible in identifying and isolating the same.

abraemer · August 1, 2024, 3:27pm

Uranium238:

    X11 = zeros(ComplexF64, sysize, sysize)
    X12 = zeros(ComplexF64, sysize, sysize)
    X21 = zeros(ComplexF64, sysize, sysize)
    X22 = zeros(ComplexF64, sysize, sysize)
    Y1 = zeros(ComplexF64, sysize, 1)
    Y2 = zeros(ComplexF64, sysize, 1)
    Hso = 0.48 * 0.0000046  ## SOC between S1 and T2 = 0.48 cm-1, converting to atomic units
    @timeit to "GT calculation" @sync @distributed for i in 1:dd
        # (1) this line writes to all X matrices
        gdsopart = GDSO(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)
        # (2) this line uses them
        gtotal = GSV(t[i], b, sysize, Ωf, Ωi, J, D, Si, Sf, Bi, Bf, X11, X12, X21, X22, Y1, Y2, gdsopart, T,Hso)
        # (3) this does not work at all!
        x[i] = gtotal
        #[..]
    end # loop

The issue is not a race condition. I didn’t see one during a quick read. You have misunderstood the fundamental difference between using processes (via Distributed.jl) and threads. You cannot just slap @distributed on the code and expect it to just work. You need to think about moving data between workers.

Let me try to explain:
Workers have totally different memory spaces on your machine. In fact for all practical purposes they could live on physically different machines. So when you preallocate the X123 matrices outside the loop this happens on the main process. When you access a variable inside @distributed (line (1) that is declared outside, Distributed.jl copies it to the process and writes into the local copy. Thus you have no data race. But the information is local to the process! But it also means that line (3) does not move data back to the main process. What happens is that the empty array x is copied to each process and each process writes something into its local version and then it get garbage collected because you never transport it back.

So what should you do:
1.) Partition your tasks (e.g. using ChunkSplitters.jl) into as many chunks as you have workers
2.) use @distributed to distribute the chunks
3.) Preallocate the necessary arrays X123 on each worker
4.) Perform the workload on each worker
5.) Write the results to a SharedArray from SharedArrays.jl
6.) If there are large constant arrays used as input for the calculations, then consider using a SharedArray for that as well to avoid copying it to every worker.

Side note:
I think that you could likely optimize your algebra code quite a bit and thought that you got some input on that in an earlier thread here. Did you perhaps base this on an older version of your code?

Uranium238:

    W = [(Bf+J'*Bi*J) -(Sf + J' * Si * J)
        -(Sf + J' * Si * J) (Bf+J'*Bi*J)]
    U = Bi - Si
    V = [J' * U * D
        J' * U * D]

    ########## Defining the X and Y Matrices ##################################


    ################# Final Calculation and returning value ##################
    s = zero(ComplexF64)
    MJSJ = J' * Si * J
    MJBJ = J' * Bi * J
    MJUD = J' * U * D
    Win = inv(W)
    WinV = Win * V

Uranium238 · August 1, 2024, 4:56pm

I understand where the problem was. Indeed x (the zero array) is being copied to each process each time, they are filling the local entries (which just goes to garbage). However, I do not understand of employing ChunkSplitters.jl here. For example (it was compulsory for the threaded version as in threads all the threads use the same memory space), a simple change

   x = SharedArray{ComplexF64}(dd)

seems to now produce the correct result (matching with the serial version) in the distributed version.

As for your side note, I did incorporate most of the suggestions from the old posts (which eventually led to the version that used threads). I do not understand what else there is to optimize at this point (without getting into completely changing how the original equations are expressed). For example, all the inverses were calculated outside the loop, @views was used wherever a splicing operation was about to take place etc.

abraemer · August 2, 2024, 8:01am

For almost the same reason you’d use it when employing threads. Right now you preallocate the X123 arrays on the main process where they won’t be used (I think) and then copy the empty data to all worker processes. This is of course much much slower, than if each worker simply allocated a new empty array to use.
It could be that the matrices are even copied multiple times. I don’t think this happens but not sure whether I fully understand the implications of how lowering and Distributed work jointly.

A small caveat: SharedArray works as long as all workers live on the same physical machine. If you want to have multi-node distributed computing you need to do something else.

Ah right I remember. I think this was the reason the code was not optimized more. I think it would be possible to totally avoid constructing these composite matrices (which are allocated anew on each call)

    W = [(Bf+J'*Bi*J) -(Sf + J' * Si * J)
        -(Sf + J' * Si * J) (Bf+J'*Bi*J)]
    U = Bi - Si
    V = [J' * U * D
        J' * U * D]

Here are some thoughts/observations:

There is also a bit of strange choice with regard to which things you preallocate outside of GSV and which you allocate inside, e.g. you pass in Si,Sf,Bi,Bf which are all Diagonal so amount to 4*sysize complex numbers which is a negligible amount of memory compared to the allocations done later. So I think, you could at least optimize the readability of the code a bit without losing noticeable performance.
I just noticed that my previous comment was not correct with respect to the X12 matrices (but the message doesn’t change). Why do you even have these X11,X12 matrices? You don’t really use them at all I think. In GSV you write to them and then copy them to a locally allocated larger X which you then use. But you could just write to that directly, no? The same is true for Y1 and Y2. That would simplify the code quite a bit and should also give some speed/reduce memory consumption. Instead preallocate X and Y on the outside to save allocations. Or alternatively rewrite your computations to use X11 and the like instead of using the larger X,Y and W
In GDSO and GSV you perform partially the same computations to setup some values. E.g. the W matrix and its inverse/factorization
If I see it right, in GDSO you form the larger objects W = [BB -AA; -AA BB] and V=[J'U*D;J'U*D] only to compute transpose(V)*(W\V). This you could be simplified (but modifies the structure I guess)
If I see it right, then passing in Hso and gdsopart to GSV is unnecessary. The result should be the same if you set Hso=0 and gdsopart=1 and then transform the result (GSV(...)+abs2(Hso))*gdsopart

nilshg · August 2, 2024, 8:46am

Even if you don’t want to change the whole structure there are still very easy wins like:

julia> using Chairmarks

julia> f(J, U, D) = [J' * U * D; J' * U * D]
f (generic function with 1 method)

julia> function g(J, U, D)
           x = J' * U * D
           [x; x]
       end
g (generic function with 1 method)

julia> j = rand(100, 200); u = rand(100, 100); d = rand(100, 100);

julia> @b f($j, $u, $d)
277.700 μs (15 allocs: 781.641 KiB)

julia> @b g($j, $u, $d)
142.000 μs (9 allocs: 547.109 KiB)

Julia doesn’t perform common subexpression elimination for you so you should avoid calculating the same thing multiple times.

Uranium238 · August 2, 2024, 9:12am

You are right. There is no sense in having four more matrices whose only purpose is to define X, so I modified the code in a way that does not need those four matrices anymore and still produces the correct result by redefining GSV as follows

@views function GSV(
    t::Float64,
    b::Float64,
    sysize::Int64,
    Ωf::Diagonal{Float64},
    Ωi::Diagonal{Float64},
    J::Matrix{Float64},
    D::Matrix{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64},
    gdsopart::ComplexF64,
    T::Matrix{Float64},       ## Product if NAC and SO 
    Hso::Float64
)

    ################################## Preliminary Part , Common for both Frank Condon and Herzberg Teller ##############################
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
        Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
        Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
        Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
    end
    W = [(Bf+J'*Bi*J) -(Sf + J' * Si * J)
        -(Sf + J' * Si * J) (Bf+J'*Bi*J)]
    U = Bi - Si
    V = [J' * U * D
        J' * U * D]

    ########## Defining the X and Y Matrices ##################################


    ################# Final Calculation and returning value ##################
    s = zero(ComplexF64)
    MJSJ = J' * Si * J
    MJBJ = J' * Bi * J
    MJUD = J' * U * D
    Win = inv(W)
    WinV = Win * V
    X = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Y = zeros(ComplexF64, 2 * sysize, 1)
    XWin = similar(Win)
    for m in 1:sysize
        for mp in 1:sysize

            X[m, 1:sysize] .= (MJSJ)[mp, :] .* (-Bf[m, m])
            X[m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (Bf[m, m])
            X[sysize+m, 1:sysize] .= (MJSJ)[mp, :] .* (Sf[m, m])
            X[sysize+m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (-Sf[m, m])

            Y[m, 1] = Bf[m, m] * (MJUD)[mp, 1]
            Y[m+sysize, 1] = -Sf[m, m] * (MJUD)[mp, 1]

            g = im * (tr(mul!(XWin, X, Win))  + (transpose(WinV)*X*WinV)[1] - (transpose(Y)*WinV)[1])
            s = s + g * gdsopart * T[m, mp]

        end
    end
    s = s + abs(Hso)^2 * gdsopart   #### LATEST MODIFICATION
    return s
end

However

The results are different now, even though you can see the way the matrix X is defined in both the cases seem to be equivalent. They differ in the value which is closest to zero. For the old code

1.0e-24  6.394092795554566e-12  4.2656647577696985e-16

For the new code

1.0e-24  6.4464092866808745e-12  4.3093426033023286e-16

The improvement in speed is also minimal (although I guess this was expected)
For new GSV

Number of processes = 1
GKS: Possible loss of precision in routine SET_WINDOW
The rate constant is 42352.90901349334
 ───────────────────────────────────────────────────────────────────────────
                                   Time                    Allocations      
                          ───────────────────────   ────────────────────────
     Tot / % measured:         71.6s /  85.2%           1.51GiB /  28.3%    

 Section          ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────
 GT calculation        1    60.9s  100.0%   60.9s    437MiB  100.0%   437MiB

For old GSV

anik@anik-MSI:~/Desktop/Swapan Sir/generating-function-codes/julia-codes/isc$ julia serial.jl 
Number of processes = 1
GKS: Possible loss of precision in routine SET_WINDOW
The rate constant is 42009.18966679349
 ───────────────────────────────────────────────────────────────────────────
                                   Time                    Allocations      
                          ───────────────────────   ────────────────────────
     Tot / % measured:         74.6s /  85.1%           1.58GiB /  28.3%    

 Section          ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────
 GT calculation        1    63.5s  100.0%   63.5s    456MiB  100.0%   456MiB
 ───────────────────────────────────────────────────────────────────────────

abraemer · August 2, 2024, 9:22am

This is just floating point imprecision. If these difference mean anything to you, then you need to rethink your whole approach and/or use more precise floats.

Also I think that these codes are subtly different because you don’t reset the X and Y matrices between iterations of m which you did previously. That might explain the differences in your values actually, since I think otherwise the code should really be identically down to the very floating point operations (all we removed was an superfluous copy).

Uranium238 · August 2, 2024, 9:53am

You were right, the matrices need to be reset after the mp loop, not after the entire m,mp loop. That seemed to be causing the difference. Making the following small change now produces identical results…

    for m in 1:sysize
        for mp in 1:sysize

            X[m, 1:sysize] .= (MJSJ)[mp, :] .* (-Bf[m, m])
            X[m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (Bf[m, m])
            X[sysize+m, 1:sysize] .= (MJSJ)[mp, :] .* (Sf[m, m])
            X[sysize+m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (-Sf[m, m])

            Y[m, 1] = Bf[m, m] * (MJUD)[mp, 1]
            Y[m+sysize, 1] = -Sf[m, m] * (MJUD)[mp, 1]

            g = im * (tr(mul!(XWin, X, Win))  + (transpose(WinV)*X*WinV)[1] - (transpose(Y)*WinV)[1])
            s = s + g * gdsopart * T[m, mp]

        end
        X[m, 1:sysize] .= 0
        X[m, sysize+1:2*sysize] .= 0
        X[sysize+m, 1:sysize] .= 0
        X[sysize+m, sysize+1:2*sysize] .= 0
        Y[m, 1] = 0
        Y[m+sysize, 1] = 0
    end

That is indeed the case. The aim of using distributed here was to attain multi node parallelization, (the threaded version works as long as all the job is running on the same CPU). Then what can be done in this case? Isn’t there a way to split the output storing array x itself into chunks and make each worker populate its corresponding part ?

If we ignore the multi node caveat for now, working on your advice :

I tried to use ChunkSplitters.jl as below, this does give the correct result. But is this the optimal thing to do in this situation ? Or are there some things that should be changed within this to make it better?

@everywhere function worker_function(xrange, t, b, Δ, sysize, Ωf, Ωi, J, D, P, T, Hso, x)
    Sf = Diagonal(zeros(ComplexF64, sysize))
    Si = Diagonal(zeros(ComplexF64, sysize))
    Bf = Diagonal(zeros(ComplexF64, sysize))
    Bi = Diagonal(zeros(ComplexF64, sysize))
    X = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Y = zeros(ComplexF64, 2 * sysize, 1)
    for i in xrange
        gdsopart = GDSO(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)
        gtotal = GSV(t[i], b, sysize, Ωf, Ωi, J, D, Si, Sf, Bi, Bf, gdsopart, T, Hso, X, Y)
        x[i] = gtotal
        open("ISC-DISTRIBUTED.TXT", "a") do f
            write(f, "ITERATION $i COMPLETED BY PROCESS $(Distributed.myid()) \n")
        end
    end
end

function calc(t, b, Δ, sysize, Ωf, Ωi, J, D, P; nchunks=Distributed.nworkers())
    # @sync tells Julia to wait for any tasks started inside it
    open("ISC-DISTRIBUTED.TXT", "w") do f
        write(f, "STARTING LOOP WITH NUMBER OF CORES = $(Distributed.nworkers()) \n====================================================\n")
    end
    x = SharedArray{ComplexF64}(dd)
    Hso = 0.48 * 0.0000046  ## SOC between S1 and T2 = 0.48 cm-1, converting to atomic units
    @timeit to "GT Calculation" @sync for (xrange, ichunk) in chunks(range(1, dd), nchunks)
        @distributed for i in xrange
            worker_function([i], t, b, Δ, sysize, Ωf, Ωi, J, D, P, T, Hso, x)
        end
    end
    open("ISC-DISTRIBUTED.TXT", "a") do f
        write(f, "LOOP COMPLETE \n====================================================")
    end
    yr = real.(x)
    yi = imag.(x)
    p = plot(t, yr, linewidth=2, color="blue", title="Δ = $Δ", label="Real part", grid=false)
    p = plot!(t, yi, linewidth=2, color="red", label="Imaginary part", size=(1000, 800), xlims=(-30e-15, 30e-15))
    savefig(p, "gt.svg")
    begin # This part does FFTW and finds final rate constant

        fs = abs(t[1] - t[2])^-1
        signal = yr
        p = plan_fft(signal)
        F = fftshift(p * signal)
        freqs = fftshift(fftfreq(length(t), fs))  # X axis has frequencies in s^-1
        #Covert freqs to eV
        freqs = freqs * 4.13558e-15
        F = F * 6.57e15
        time_domain = plot(t, signal, title=L"G(t)", legend=false, color="orange", xlabel=L"\mathrm{Time}\ (s)", ylabel=L"\mathrm{A.u}")
        freq_domain = plot(freqs, abs.(F), title=L"\mathrm{Fourier\ transformed\ g(\bar{\nu}))}", legend=false, xlabel=L"\mathrm{Energy\ (eV)}", ylabel=L"\mathrm{k_{IC}\ (s^{-1})}", xlims=(0, 4))
        p = plot(time_domain, freq_domain, layout=(2, 1), size=(1000, 800), right_margin=10mm)
        savefig(p, "gt-fft.svg")
        ee = Δ# * 8065.73  #Converting eV to cm^-1
        index = findmin(abs.(freqs .- ee))[2]  # Find the index of freq closest to ee
        freqs[index]
        println("The rate constant is ", abs.(F)[index])
        rc = abs.(F)[index]
    end

abraemer · August 2, 2024, 12:45pm

Uranium238:

But is this the optimal thing to do in this situation ?

    @timeit to "GT Calculation" @sync for (xrange, ichunk) in chunks(range(1, dd), nchunks)
        @distributed for i in xrange
            worker_function([i], t, b, Δ, sysize, Ωf, Ωi, J, D, P, T, Hso, x)
        end
    end

Almost . You need to use @distributed on the chunks and then it would be perfect:

    @timeit to "GT Calculation" @sync @distributed for (xrange, ichunk) in chunks(range(1, dd), nchunks)
        worker_function(xrange, t, b, Δ, sysize, Ωf, Ωi, J, D, P, T, Hso, x)
    end

In this case you should rewrite your worker_function to also allocate the part of the resulting array x and simply return that chunk. Then you can use the @distributed macro with vcat as reducer to get the full array. Consider this simple example:

julia> using Distributed; addprocs(3);
julia> @everywhere function foo(range)
           return [r^myid() for r in range]
       end
julia> @distributed vcat for xrange in [1:4,5:7,8:10]
           foo(xrange)
       end
10-element Vector{Int64}:
...

Uranium238 · August 3, 2024, 7:16am

I am guessing this is what you were suggesting ? Instead of sending each successive value to a different worker now we seem to be sending chunks of the values to different workers directly, reducing the number of times Si etc matrices are allocated.

    xrange_chunknum = chunks(range(1, dd), nchunks)
    @timeit to "GT Calculation" @sync @distributed for ichunk in 1:nchunks
        xrange,chunknum = xrange_chunknum[ichunk]
        worker_function(xrange, t, b, Δ, sysize, Ωf, Ωi, J, D, P, T, Hso, x)
    end

Now moving on to making it compatible with multi-node parallel computing ,

Following your toy example, I have modified my worker function as below

@everywhere function worker_function(xrange, t, b, Δ, sysize, Ωf, Ωi, J, D, P, T, Hso)
    Sf = Diagonal(zeros(ComplexF64, sysize))
    Si = Diagonal(zeros(ComplexF64, sysize))
    Bf = Diagonal(zeros(ComplexF64, sysize))
    Bi = Diagonal(zeros(ComplexF64, sysize))
    X = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Y = zeros(ComplexF64, 2 * sysize, 1)
    x_worker = zeros(ComplexF64, length(xrange))
    for i in xrange
        gdsopart = GDSO(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)
        gtotal = GSV(t[i], b, sysize, Ωf, Ωi, J, D, Si, Sf, Bi, Bf, gdsopart, T, Hso, X, Y)
        # x[i] = gtotal
        x_worker[i-xrange[1]+1] = gtotal
        open("ISC-DISTRIBUTED.TXT", "a") do f
            write(f, "ITERATION $i COMPLETED BY PROCESS $(Distributed.myid()) \n")
        end
    end
    return x_worker
end

Here x_worker is the local part of the array that each worker would be dealing with. Now a small issue is that : x_worker would have indices starting from 1 and would have length(xrange) number of elements, but xrange itself will not always start from 1 as it indicates the indices of the larger array that has been chunked. We have to somehow make the correspondence between the two. Right now I have accomplished it by doing x_worker[i-xrange[1]+1] = gtotal , however this statement makes the code quite difficult to understand as it’s purpose is not obvious. Are there any more elegant methods of achieving the same?

Besides, I also modified the distributing loop as follows

    x = zeros(ComplexF64, dd)
    Hso = 0.48 * 0.0000046  ## SOC between S1 and T2 = 0.48 cm-1, converting to atomic units
    xrange_chunknum = chunks(range(1, dd), nchunks)
    @timeit to "GT Calculation" x = @distributed vcat for ichunk in 1:nchunks
        xrange,chunknum = xrange_chunknum[ichunk]
        worker_function(xrange, t, b, Δ, sysize, Ωf, Ωi, J, D, P, T, Hso)
    end

x now no longer needs to be a SharedArray I am guessing. This produces the expected correct results. Please let me know if there is anything more that catches your eye that might be deteriorating the performance of the code.

abraemer · August 3, 2024, 7:54am

How about:

for (xind, i) in enumerate(xrange)
    # ...
    x_worker[xind] = gtotal
    #...
end

You don’t need to allocate x at all now. And yes with multi-node parallelism you can’t use SharedArrays.

The structure looks good to me. I still think you could optimize the numerics itself quite a bit but it likely would require some substantial changes to the calculation which you disliked. Should you wish to optimize it further, I suggest you open a new thread an post a self-contained MWE in the Performance category here

Uranium238 · August 3, 2024, 8:33am

Before getting into that, I did a quick comparison with the threaded version of the same code given below

Threaded version

using MKL
using DelimitedFiles
using LinearAlgebra
using LaTeXStrings
using TimerOutputs
using Plots
using Plots.PlotMeasures
using FFTW
using Base.Threads
using SharedArrays
using ChunkSplitters
BLAS.set_num_threads(1)
to = TimerOutput()
println("Number of threads = $(Threads.nthreads())")
open("ISC-THREADED.TXT", "a") do f
    write(f, "\n\n\n====================================================\n\n\n")
end

function GDSO(
    t::Float64,
    b::Float64,
    Δ::Float64,
    sysize::Int64,
    Ωf::Diagonal{Float64},
    Ωi::Diagonal{Float64},
    J::Matrix{Float64},
    D::Matrix{Float64},
    ρ::Diagonal{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64}
)
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
        Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
        Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
        Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
    end
    BB = (Bf + J' * Bi * J)
    AA = (Sf + J' * Si * J)

    W = [BB -AA
        -AA BB] #We wont calculate the 2Nx2N determinant directly.

    # Wapp = BB*(BB-  AA*(BB^-1)*AA  )
    Wapp = BB * (BB - AA * (BB \ AA))
    U = Bi - Si
    V = [J' * U * D
        J' * U * D]
    # F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))
    # F1 = sqrt(det(Si*Sf*ρ^2)/det(Wapp))
    F1 = sqrt(det(Si * Sf * (Wapp \ ρ^2)))
    F2 = (-(im / 2) * (transpose(V) * (W \ V))) + ((im) * (transpose(D) * U * D))
    g = F1 * exp(F2)
    Δ = Δ * 0.0367493  # in atomic units
    g = g * exp(im * Δ * t)
    η = 2 * 4.55633e-6  # 10 cm^-1 to hatree energy
    #g = g * exp(-η * abs(t))     # Adding the damping factor Lorrentzian
    g = g * exp(-η * t^2)     # Adding the damping factor Gaussian
    # println("t = $t was evaluated by process $(myid())")
    return g[1]
end

@views function GSV(
    t::Float64,
    b::Float64,
    sysize::Int64,
    Ωf::Diagonal{Float64},
    Ωi::Diagonal{Float64},
    J::Matrix{Float64},
    D::Matrix{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64},
    gdsopart::ComplexF64,
    T::Matrix{Float64},       ## Product if NAC and SO 
    Hso::Float64,
    X::Matrix{ComplexF64},
    Y::Matrix{ComplexF64}
)

    ################################## Preliminary Part , Common for both Frank Condon and Herzberg Teller ##############################
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
        Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
        Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
        Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
    end
    W = [(Bf+J'*Bi*J) -(Sf + J' * Si * J)
        -(Sf + J' * Si * J) (Bf+J'*Bi*J)]
    # U = Bi - Si
    V = [J' * (Bi-Si) * D
        J' * (Bi-Si) * D]

    ########## Defining the X and Y Matrices ##################################


    ################# Final Calculation and returning value ##################
    s = zero(ComplexF64)
    MJSJ = J' * Si * J
    MJBJ = J' * Bi * J
    MJUD = J' * (Bi-Si) * D
    Win = inv(W)
    WinV = Win * V
    XWin = similar(Win)
    for m in 1:sysize
        for mp in 1:sysize

            X[m, 1:sysize] .= (MJSJ)[mp, :] .* (-Bf[m, m])
            X[m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (Bf[m, m])
            X[sysize+m, 1:sysize] .= (MJSJ)[mp, :] .* (Sf[m, m])
            X[sysize+m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (-Sf[m, m])

            Y[m, 1] = Bf[m, m] * (MJUD)[mp, 1]
            Y[m+sysize, 1] = -Sf[m, m] * (MJUD)[mp, 1]

            g = im * (tr(mul!(XWin, X, Win))  + (transpose(WinV)*X*WinV)[1] - (transpose(Y)*WinV)[1])
            s = s + g * gdsopart * T[m, mp]

        end
        X[m, 1:sysize] .= 0
        X[m, sysize+1:2*sysize] .= 0
        X[sysize+m, 1:sysize] .= 0
        X[sysize+m, sysize+1:2*sysize] .= 0
        Y[m, 1] = 0
        Y[m+sysize, 1] = 0
    end
    s = s + abs(Hso)^2 * gdsopart   #### LATEST MODIFICATION
    return s
end

const hso_s1t3 = 1.49 * 0.0000046  ## SOC between S1 and T3 = 0.48 cm-1, converting to atomic units
const Delta = 0.11 # in eV
const Delta_s1t3 = 0.091 * 0.037 # Difference between S1 and T3 = 0.091 eV in atomic units
const dd = 41                             # Change this to modify number of points
const Temp = 300
kT_temp = Temp * 1.380649e-23  # Boltzmann constant times temperature in Joules
const kT = kT_temp * 2.29371227840e17         # in atomic units
const b = 1 / kT
const omegaf = Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))) * 4.55633e-6   # Final state  frequencies in atomic units
const omegai = Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))) * 4.55633e-6  # Initial state  frequencies in atomic units
const sysize = size(omegai)[1]
const P = @. 2 * sinh(b * omegai / 2)
T_temp = readdlm("nac.txt", '\t', Float64, '\n') # NAC matrix
const T = (T_temp * T_temp')* (hso_s1t3^2/Delta_s1t3^2)
const D = readdlm("d.txt", Float64) # Displacement Matrix
const J = readdlm("j.txt", Float64) # Duchinsky Matrix
const t = collect(range(-5e-12, stop=5e-12, length=dd))
t[div(dd + 1, 2)] = 10e-25

function calc(t, b, Δ, sysize, Ωf, Ωi, J, D, P; nchunks=Threads.nthreads())
    # @sync tells Julia to wait for any tasks started inside it
    open("ISC-THREADED.TXT", "w") do f
        write(f, "STARTING LOOP WITH NUMBER OF THREADS = $(Threads.nthreads()) \n====================================================\n")
    end
    x = zeros(ComplexF64, dd)
    @timeit to "GT Calculation" @sync for (xrange, ichunk) in chunks(range(1, dd), nchunks)
        # xrange are the indices this chunk contains, ichunk is just the index of the chunk, which we don't need in this scenario
        Threads.@spawn begin # this line tells Julia to start a new thread with the contents of the begin block
            Sf = Diagonal(zeros(ComplexF64, sysize))
            Si = Diagonal(zeros(ComplexF64, sysize))
            Bf = Diagonal(zeros(ComplexF64, sysize))
            Bi = Diagonal(zeros(ComplexF64, sysize))
            X = zeros(ComplexF64, 2 * sysize, 2 * sysize)
            Y = zeros(ComplexF64, 2 * sysize, 1)
            Hso = 0.48 * 0.0000046  ## SOC between S1 and T2 = 0.48 cm-1, converting to atomic units
            for i in xrange
                gdsopart = GDSO(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)
                gtotal = GSV(t[i], b, sysize, Ωf, Ωi, J, D, Si, Sf, Bi, Bf,gdsopart, T,Hso,X,Y)
                x[i] = gtotal
                open("ISC-THREADED.TXT", "a") do f
                    write(f, "ITERATION $i COMPLETED BY THREAD $(Threads.threadid()) \n")
                end
            end # loop
        end # thread work load
    end # loop over chunks, Julia will wait until all task are completed
    open("ISC-THREADED.TXT", "a") do f
        write(f, "LOOP COMPLETE \n====================================================")
    end
    yr = real.(x)
    yi = imag.(x)
    p = plot(t, yr, linewidth=2, color="blue", title="Δ = $Δ", label="Real part", grid=false)
    p = plot!(t, yi, linewidth=2, color="red", label="Imaginary part", size=(1000, 800), xlims=(-30e-15, 30e-15))
    savefig(p, "gt.svg")
    begin # This part does FFTW and finds final rate constant

        fs = abs(t[1] - t[2])^-1
        signal = yr
        p = plan_fft(signal)
        F = fftshift(p * signal)
        freqs = fftshift(fftfreq(length(t), fs))  # X axis has frequencies in s^-1
        #Covert freqs to eV
        freqs = freqs * 4.13558e-15
        F = F * 6.57e15
        time_domain = plot(t, signal, title=L"G(t)", legend=false, color="orange", xlabel=L"\mathrm{Time}\ (s)", ylabel=L"\mathrm{A.u}")
        freq_domain = plot(freqs, abs.(F), title=L"\mathrm{Fourier\ transformed\ g(\bar{\nu}))}", legend=false, xlabel=L"\mathrm{Energy\ (eV)}", ylabel=L"\mathrm{k_{IC}\ (s^{-1})}", xlims=(0, 4))
        p = plot(time_domain, freq_domain, layout=(2, 1), size=(1000, 800), right_margin=10mm)
        savefig(p, "gt-fft.svg")
        ee = Δ# * 8065.73  #Converting eV to cm^-1
        index = findmin(abs.(freqs .- ee))[2]  # Find the index of freq closest to ee
        freqs[index]
        println("The rate constant is ", abs.(F)[index])
        rc = abs.(F)[index]
    end
#################################################This part writes the output to a file#######################################################
    begin 
        open("ISC-THREADED.TXT", "a") do f
            write(f, "-------------------------\n")
            write(f, "NUMBER OF CORES = $(Threads.nthreads())\n")
            write(f, "The value of Δ is $Δ\n")
            write(f, "Number of time points = $dd\n")
            write(f, "The rate constant calculated from Julia FFTW = $rc\n")
            write(f, "-------------------------\n")
            write(f, "The values of t,Re(g(t)),Im(g(t)) are \n\n")
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end

        open("GT-VALUES-THREADED.txt", "w") do f
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end

    end

end


calc(t, b, Delta, sysize, omegaf, omegai, J, D, P; nchunks=Threads.nthreads())
display(to)

Setting the number of points dd = 41, this is how the Base.Threads and Distributed.jl versions compare :

Threads :

Number of threads = 11
GKS: Possible loss of precision in routine SET_WINDOW
The rate constant is 42009.18966679349
 ───────────────────────────────────────────────────────────────────────────
                                   Time                    Allocations      
                          ───────────────────────   ────────────────────────
     Tot / % measured:         39.0s /  89.0%           1.89GiB /  71.9%    

 Section          ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────
 GT Calculation        1    34.7s  100.0%   34.7s   1.36GiB  100.0%  1.36GiB
 ───────────────────────────────────────────────────────────────────────────

And Distributed :

Number of processes = 11
GKS: Possible loss of precision in routine SET_WINDOW
The rate constant is 42009.18966679349
 ───────────────────────────────────────────────────────────────────────────
                                   Time                    Allocations      
                          ───────────────────────   ────────────────────────
     Tot / % measured:         51.3s /  85.5%            655MiB /   7.4%    

 Section          ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────
 GT Calculation        1    43.8s  100.0%   43.8s   48.3MiB  100.0%  48.3MiB
 ───────────────────────────────────────────────────────────────────────────

That’s quite a large difference in time taken (about 20% !). I understand that threads is supposed to be slightly faster due to less overhead and using the same memory space. However given that most of the structure of the code is the same (including the usage of ChunkSplitters.jl), is this a normal amount of discrepancy in times? Or is there some underlying inefficiency in the parallelization implemented in the Distributed.jl version that still has to be tackled.

abraemer · August 3, 2024, 2:55pm

Another thing is compilation. Each Julia process needs to (pre)compile the methods anew. That could add a significant (but constant) overhead. Could you try again with an a bit larger workload?
Also note that the TimerOutputs don’t measure useful data besides the total time.

Uranium238 · August 4, 2024, 6:37am

Then what would you suggest is to be used ? @btime from BenchmarkTools.jl also gives a single time value as output. I will try a larger computation to check how the differences scale. However, I had put the @timeit macro only on top of the outermost loop involved with the expensive part of the code. Why would that include compile time ? Doesn’t compilation occur at the very beginning of the program ?

abraemer · August 4, 2024, 9:35am

Each worker is a separate Julia process and data movement is not automatic. So when you try to something, then the timing occurs on the main Julia thread, so you don’t see allocation happening on workers. Even if you copy the TimerOutputs object to the workers, then still you’d see nothing because the data is never transferred back to the main process if you don’t take care of this.
For the same reason there is additional compilation overhead: Each worker needs to to load all of the libraries and compile all of the code for themselves. So with 16 processes you perform the same compilations 16x. Now on physically different machines there is simply no way around this. But this is one of the reasons why one usually would recommend threads for parallelisation within a single machine. Especially if workloads are short (<1min).

Topic		Replies	Views
How to convert a thread-parallelized code into a core-parallelized code? Julia at Scale multithreading , linearalgebra , distributed , threads , matrix	3	326	May 19, 2024
How to improve the scaling of Julia code aimed at multi-node parallelization? Julia at Scale linearalgebra , distributed	38	742	August 14, 2024
Parallel is very slow General Usage parallel	16	4744	March 9, 2018
Strange comportment of @distributed Performance question , distributed , loops	4	692	March 25, 2020
Data structures for threaded computing Performance	23	2922	October 23, 2019

Probable data race condition causing problems when trying to parallelize a loop used to populate an array

Related topics