How to improve performance in a function that repeatedly defines and multiplies matrices

The Discourse browser window does not support \xor-TAB to generate ⊻. Even if it did, I am not going to learn or look up the specific tab completions for the glyphs used in a random Discourse post.

Copy-pasting every single occurrence of e.g. Δ whenever I want to add a line using that variable is annoying. Hand leaves keyboard, touches mouse, typing speed cut by 10-100x; I won't do it.

Some code editors support tab completions, some don’t.

Most people (80% ish?) have either higher tolerance for these annoyances, or better tooling (what tooling do you use?).

The only currently realistic way for me to interact with non-ASCII code is to refactor it to remove all the non-ASCII cruft.

I would absolutely love it if Julia decided that e.g. \xor becomes an alias for ⊻, at the same parser level that normalizes Unicode aliases (i.e. 3 \xor 4 would create the same parse tree as 3 ⊻ 4, the same way that 'µ' (Unicode U+00B5) and 'μ' (Unicode U+03BC) alias).

I.e. make it possible to normalize Julia files to ASCII without changing the parse tree. (Of course the normalization needs to parse the file, same as the current normalization; this applies only to symbols, not to string or character literals! Those would be normalized to a Unicode escape sequence, like "\u00B5".)
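For reference, that parser-level aliasing already exists for a handful of confusable characters, which is easy to check in the REPL:

julia> µ = 1;   # typed as U+00B5 (micro sign)

julia> μ        # typed as U+03BC (Greek small mu); the parser normalizes it to the same identifier
1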


I can see how this is a problem when replying on a phone, but on a computer wouldn’t one use a code editor for writing code?

What sort of code editors don't support completions? 🤨


Personally I don’t mind copying Unicode symbols, even when I’m going to do a Web search to find the symbol before I copy it. But largely I don’t copy them one-by-one, rather I copy an entire multi-line code segment at once.

VS Code, with OpenVSX. I’d prefer something not controlled by Microsoft, but Code seems to be the best option for me right now. And sometimes, for quick edits, Vim, though I don’t remember ever trying to use Unicode character completion in Vim.

I don’t see that happening, but perhaps an external tool, using JuliaSyntax.jl, would be viable?
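The mapping itself is easy to get at, since the REPL ships its completion table; a minimal sketch (not parse-aware, so purely illustrative):

import REPL
# Invert the REPL's tab-completion table: "⊻" => "\\xor", "α" => "\\alpha", ...
to_ascii = Dict(v => k for (k, v) in REPL.REPLCompletions.latex_symbols)
to_ascii["⊻"]   # returns "\\xor"

A real tool would still have to parse the file (e.g. with JuliaSyntax.jl) so that string and character literals are left alone.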

GitHub - JuliaEditorSupport/julia-vim (Vim support for Julia) provides Unicode completion for Vim.


Here is the cleaned-up version of the code:

using Distributed
addprocs(11)
@everywhere begin
    using LoopVectorization
    using MKL
    using DelimitedFiles
    using LinearAlgebra
    using LaTeXStrings
    using TimerOutputs
    BLAS.set_num_threads(1)
    using Plots
    using Plots.PlotMeasures
    using FFTW
    using Base.Threads
    using SharedArrays
end
@everywhere function GFC(t::Float64,b::Float64,Δ::Float64,sysize::Int64,Ωf::Diagonal{Float64},Ωi::Diagonal{Float64},J::Matrix{Float64},D::Matrix{Float64},ρ::Diagonal{Float64},Si::Diagonal{ComplexF64},Sf::Diagonal{ComplexF64},Bi::Diagonal{ComplexF64},Bf::Diagonal{ComplexF64})
    t = t* 4.1341373335*10^16                     # converting time to atomic units
    tp = - im*b - t
    for i = 1:sysize
        Sf[i,i]=Ωf[i,i]/sin(Ωf[i,i]*t)
        Si[i,i]=Ωi[i,i]/sin(Ωi[i,i]*tp)
        Bf[i,i]=Ωf[i,i]/tan(Ωf[i,i]*t)
        Bi[i,i]=Ωi[i,i]/tan(Ωi[i,i]*tp)
    end
    BB = (Bf+J'*Bi*J)
    AA = (Sf+J'*Si*J)

    W = [BB  -AA
        -AA  BB] # We won't calculate the 2N x 2N determinant directly.
    
    # Wapp = BB*(BB-  AA*(BB^-1)*AA  )
    Wapp = BB*(BB-  AA*(BB\AA)  )
    U = Bi-Si
    V = [J'*U*D
        J'*U*D]
    # F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))
    # F1 = sqrt(det(Si*Sf*ρ^2)/det(W))
    F1 = sqrt(det(Si*Sf*  (Wapp\ρ^2)))
    F2 = (-(im/2)*(transpose(V)* (W\V)))    +  ((im)*(transpose(D)*U*D))
    g = F1*exp(F2)
    Δ = Δ * 0.0367493  # in atomic units
    g = g * exp(im * Δ * t)
    η = 2 * 4.55633 * 10^-6  # 10 cm^-1 to hartree energy
    #g = g * exp(-η * abs(t))     # Adding the damping factor Lorentzian
    g = g * exp(-η * t^2)     # Adding the damping factor Gaussian
    # println("t = $t was evaluated by process $(myid())")
    # println("GFC at t = $t   is     $g")
    # println("F1 at t = $t         = $F1")
    return g[1]
end
@everywhere function GHT(t::Float64,b::Float64,sysize::Int64,Ωf::Diagonal{Float64},Ωi::Diagonal{Float64},J::Matrix{Float64},D::Matrix{Float64},Si::Diagonal{ComplexF64},Sf::Diagonal{ComplexF64},Bi::Diagonal{ComplexF64},Bf::Diagonal{ComplexF64},X11::Matrix{ComplexF64},X12::Matrix{ComplexF64},X21::Matrix{ComplexF64},X22::Matrix{ComplexF64},Y1::Matrix{ComplexF64},Y2::Matrix{ComplexF64},gfcpart::ComplexF64,T::Matrix{Float64})




    ################################## Preliminary part, common for both Franck-Condon and Herzberg-Teller ##############################
    t = t* 4.1341373335*10^16                     # converting time to atomic units
    tp = - im*b - t
    for i = 1:sysize
        Sf[i,i]=Ωf[i,i]/sin(Ωf[i,i]*t)
        Si[i,i]=Ωi[i,i]/sin(Ωi[i,i]*tp)
        Bf[i,i]=Ωf[i,i]/tan(Ωf[i,i]*t)
        Bi[i,i]=Ωi[i,i]/tan(Ωi[i,i]*tp)
    end
    W = [(Bf+J'*Bi*J)  -(Sf+J'*Si*J)
        -(Sf+J'*Si*J)  (Bf+J'*Bi*J)]
    U = Bi-Si
    V = [J'*U*D
        J'*U*D]

    ########## Defining the X and Y Matrices ##################################


    ################# Final Calculation and returning value ##################
    s=0

    for m in range(1,sysize)
        for mp in range(1,sysize)
            @. X11 = 0.0
            @. X12 = 0.0
            @. X21 = 0.0
            @. X22 = 0.0
            X11[m,:]=(J'*Si*J)[mp,:]*(-Bf[m,m])
            X12[m,:]=(J'*Bi*J)[mp,:]*(Bf[m,m])
            X21[m,:]=(J'*Si*J)[mp,:]*(Sf[m,m])
            X22[m,:]=(J'*Bi*J)[mp,:]*(-Sf[m,m])
            X = [X11 X12
                X21 X22]
            @. Y1 = 0.0
            @. Y2 = 0.0
            Y1[m,:] = Bf[m,m]*(J'*U*D)[mp,:] 
            Y2[m,:] = -Sf[m,m]*(J'*U*D)[mp,:]
            Y = [Y1
                Y2]
            g = im*tr(W\X)+    (transpose(W\V)*X*(W\V))[1]      -    (transpose(Y)*(W\V))[1]
            s = s+g[1]*gfcpart*T[m,mp]
        end
    end
    return s[1]
end
Delta::Float64 = 2.02 # in eV
dd::Int64 = 11                               # Change this to modify number of points
Temp::Int64=300
kT = Temp * 1.380649 * 10^-23  # Boltzmann constant times temperature in Joules
kT= kT * 2.29371227840*10^17         # in atomic units
b ::Float64= 1/kT
omegaf ::Diagonal{Float64}= Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n')))*4.55633*10^-6   # Final state  frequencies in atomic units
omegai ::Diagonal{Float64}= Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n')))*4.55633*10^-6  # Initial state  frequencies in atomic units
sysize ::Int64 = size(omegai)[1]
P = @. 2*sinh(b*omegai/2)
T ::Matrix{Float64}= readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
T = T*T';
D ::Matrix{Float64}= readdlm("d.txt", Float64); # Displacement Matrix
J ::Matrix{Float64}= readdlm("j.txt", Float64); # Duchinsky Matrix
t ::Vector{Float64}=  collect(range(-5*10^-12,stop=5*10^-12,length=dd))
t[div(dd+1,2)] = 10e-25

function calc(t,b,Δ,sysize,Ωf,Ωi,J,D,P)
    # allocate workspaces
    Sf ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
    Si ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
    Bf ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
    Bi ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
    X11 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
    X12 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
    X21 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
    X22 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
    Y1  :: Matrix{ComplexF64} =  zeros(ComplexF64,sysize,1)
    Y2  :: Matrix{ComplexF64} =  zeros(ComplexF64,sysize,1)
    x = zeros(ComplexF64,dd)
    open("CORE-PARALLEL-OUTPUT-RV6.TXT", "a") do f
        write(f, "STARTING LOOP\n====================================================\n")
    end
    @time for i in 1:dd
        gfcpart = GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)
        gtotal = GHT(t[i],b,sysize,Ωf,Ωi,J,D,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2,gfcpart,T)
        x[i] = gtotal
        open("CORE-PARALLEL-OUTPUT-RV6.TXT", "a") do f
            write(f, "ITERATION $i COMPLETED BY PROCESS $(myid()) \n")
        end
    end
    open("CORE-PARALLEL-OUTPUT-RV5.TXT", "a") do f
        write(f, "LOOP COMPLETE \n====================================================")
    end
    yr = real.(x)
    yi = imag.(x)
    p = plot(t,yr,linewidth = 2,color = "blue",title = "Δ = $Δ",label = "Real part",grid=false)
    p =plot!(t,yi,linewidth = 2,color = "red",label = "Imaginary part",size=(1000,800),xlims=(-30*10^-15,30*10^-15))
    savefig(p,"gt.svg")
    begin # This part does FFTW and finds final rate constant

        fs = abs(t[1]-t[2])^-1
        signal = yr
        p = plan_fft(signal)
        F = fftshift(p*signal)
        freqs = fftshift(fftfreq(length(t), fs))  # X axis has frequencies in s^-1
        # Convert freqs to eV
        freqs = freqs * 4.13558 * 10^-15
        F = F * 6.57 * 10^15
        time_domain = plot(t, signal, title = L"G(t)",legend=false,color="orange",xlabel=L"\mathrm{Time}\ (s)",ylabel=L"\mathrm{A.u}")
        freq_domain = plot(freqs, abs.(F), title = L"\mathrm{Fourier\ transformed\ g(\bar{\nu}))}",legend=false,xlabel=L"\mathrm{Energy\ (eV)}",ylabel=L"\mathrm{k_{IC}\ (s^{-1})}",xlims=(0,4))
        p = plot(time_domain, freq_domain, layout = (2,1), size = (1000, 800),right_margin = 10mm)
        savefig(p,"gt-fft.svg")
        ee = Δ# * 8065.73  #Converting eV to cm^-1
        index = findmin(abs.(freqs .- ee))[2]  # Find the index of freq closest to ee
        freqs[index]
        println("The rate constant is ",abs.(F)[index])
        rc = abs.(F)[index]
    end
    begin # This part writes the output to a file
        open("CORE-PARALLEL-OUTPUT-RV6.TXT", "a") do f
            write(f, "-------------------------\n")
            write(f,"NUMBER OF CORES = $(nworkers())\n")
            write(f,"The value of Δ is $Δ\n")
            write(f,"Number of time points = $dd\n")
            write(f, "The rate constant is $rc\n")
            write(f,"-------------------------\n")
            write(f,"The values of t,Re(g(t)),Im(g(t)) are \n\n")
            for i in range(1,dd)
                write(f,"$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end
    end

end
calc(t,b,Delta,sysize,omegaf,omegai,J,D,P)

I have tried to incorporate most of the suggestions posted on this thread. Please feel free to post any additional changes you think would be good!

EDIT 1 - I have attached the latest working version of the full code instead of just the computationally expensive part.

EDIT 2 - The data files are quite elaborate and I do not know how else to type them into the code; the output is very sensitive to the order of magnitude of each element, so the best way out here is to attach the data files themselves. The main runnable code is also there under the name dist-rv6.jl.

Indexing with a range makes a copy. @views is your friend.
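A minimal illustration:

A = rand(100, 100)
b = A[1, :]          # allocates: copies the row into a new Vector
@views c = A[1, :]   # no copy: c is a SubArray into A's memory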

Use 1:sysize instead of range(1, sysize).

Slightly simpler is X11 .= 0.0, or clearer: fill!(X11, 0.0)

t *= 4.1341373335e16 is more idiomatic.

Most of the type annotations are superfluous.

The function calc accesses the non-const global variable dd, but in this case it's likely not serious.

I wouldn’t pass sysize to the functions. If each function retrieves the size from one of the Matrix arguments, then there is one less chance for an inconsistency error to arise.
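A sketch of what I mean, with a hypothetical helper extracted from your loop:

using LinearAlgebra

function fill_diagonals!(Sf, Si, Ωf, Ωi, t, tp)
    sysize = size(Ωf, 1)   # single source of truth: derived from the argument
    for i in 1:sysize
        Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
        Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
    end
    return nothing
end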

Can you please provide some dummy input data? We don’t have these files:

omegaf ::Diagonal{Float64}= Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n')))*4.55633*10^-6   # Final state  frequencies in atomic units
omegai ::Diagonal{Float64}= Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n')))*4.55633*10^-6  # Initial state  frequencies in atomic units
# [...]
T ::Matrix{Float64}= readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
D ::Matrix{Float64}= readdlm("d.txt", Float64); # Displacement Matrix
J ::Matrix{Float64}= readdlm("j.txt", Float64); # Duschinsky Matrix
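Even hypothetical stand-ins with the right shapes (inferred from how the code uses the arrays) would be enough, e.g.:

using LinearAlgebra
sysize = 100                                  # made-up size
omegaf = Diagonal(rand(sysize)) * 4.55633e-6  # dummy final-state frequencies
omegai = Diagonal(rand(sysize)) * 4.55633e-6  # dummy initial-state frequencies
T = rand(sysize, 1); T = T * T'               # NAC column vector -> sysize x sysize
D = rand(sysize, 1)                           # dummy displacement column
J = rand(sysize, sysize)                      # dummy Duschinsky matrix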

Also, I haven’t looked into distributed processing in years, and even then only briefly, but I don’t really see where you are parallelizing this. Don’t you need to somehow do the function call on the various processes, not just define the functions @everywhere?

And if you’re not doing multi-threading, why do you need to set BLAS threads to 1?

In fact, I cannot find any function call at all. How do you call the main function, calc?

It’s really great if the code is set up so that everyone else can just copy-paste it directly into their REPL, and it just works.


To nitpick, ρ^2 * Wapp^-1 is not the same as Wapp \ ρ^2, but should be translated to ρ^2 / Wapp. Inside the determinant, it doesn’t matter much, though it leads to some floating point differences.
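A quick check of the distinction:

using LinearAlgebra
A = rand(3, 3); B = rand(3, 3)
A / B ≈ A * inv(B)        # true: right division
B \ A ≈ inv(B) * A        # true: left division
A / B ≈ B \ A             # false in general: matrices don't commute
det(A / B) ≈ det(B \ A)   # true: both equal det(A)/det(B)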

For performance (in addition to my doubts about the parallelization), there are many things that can be sped up. The most glaring one is perhaps the inner double loop of GHT, where the products J' * Si * J etc. are performed over and over instead of being computed once outside the loop. Also, slicing along rows is slower than slicing along columns, the slices allocate unnecessary temporary arrays, and the entire matrices X11, etc. are zeroed out on each iteration even though only one row has been touched.
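Schematically, with a toy size and a placeholder loop body:

using LinearAlgebra
n = 4
J = rand(n, n)
Si = Diagonal(rand(ComplexF64, n))
MJSJ = J' * Si * J                      # hoisted: computed once instead of n^2 times
s = let acc = zero(ComplexF64)
    for m in 1:n, mp in 1:n
        @views acc += sum(MJSJ[mp, :])  # reuse the product; @views avoids the slice copy
    end
    acc
end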

So there’s good news: at least GHT should be possible to speed up a lot 🙂


The code I provided was not a working example. I have edited it to reflect a current working version of the code.

EDIT 1 - I have also attached the data files needed.

So here is what I have in mind:
Instead of using X11[m,:] to slice the array, I can manually use a for-loop to assign the elements of the mth row. And instead of zeroing out the entire array at the start of each mp iteration, I zero out only the used row of the workspace arrays X and Y at the end of each mp iteration. Let me know if you think that would be a step in the right direction.

EDIT 1 - I tried computing the J'... matrix products outside the loop and only clearing the used part of the X and Y matrices. This gave a speed-up of about 20 seconds!

My short-term goal - To parallelize my code such that it can be run on a supercomputer using 20-40 cores of a single CPU.

Long-term goal - To generalize the parallelization so that we can run the code using multiple CPUs, for example 2 CPUs with 40 cores each.

I tried to incorporate the advice you gave on the other thread, using a naive port of your model code into my program, as follows:

println("Number of threads = $(Threads.nthreads())")
begin
    using LoopVectorization
    using MKL
    using DelimitedFiles
    using LinearAlgebra
    using LaTeXStrings
    using TimerOutputs
    BLAS.set_num_threads(1)
    using Plots
    using Plots.PlotMeasures
    using FFTW
    using Base.Threads
    using SharedArrays
    using ChunkSplitters
end
function GFC(t::Float64,b::Float64,Δ::Float64,sysize::Int64,Ωf::Diagonal{Float64},Ωi::Diagonal{Float64},J::Matrix{Float64},D::Matrix{Float64},ρ::Diagonal{Float64},Si::Diagonal{ComplexF64},Sf::Diagonal{ComplexF64},Bi::Diagonal{ComplexF64},Bf::Diagonal{ComplexF64})
    t = t* 4.1341373335*10^16                     # converting time to atomic units
    tp = - im*b - t
    for i = 1:sysize
        Sf[i,i]=Ωf[i,i]/sin(Ωf[i,i]*t)
        Si[i,i]=Ωi[i,i]/sin(Ωi[i,i]*tp)
        Bf[i,i]=Ωf[i,i]/tan(Ωf[i,i]*t)
        Bi[i,i]=Ωi[i,i]/tan(Ωi[i,i]*tp)
    end
    BB = (Bf+J'*Bi*J)
    AA = (Sf+J'*Si*J)

    W = [BB  -AA
        -AA  BB] # We won't calculate the 2N x 2N determinant directly.
    
    # Wapp = BB*(BB-  AA*(BB^-1)*AA  )
    Wapp = BB*(BB-  AA*(BB\AA)  )
    U = Bi-Si
    V = [J'*U*D
        J'*U*D]
    # F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))
    # F1 = sqrt(det(Si*Sf*ρ^2)/det(Wapp))
    F1 = sqrt(det(Si*Sf*  (Wapp\ρ^2)))
    F2 = (-(im/2)*(transpose(V)* (W\V)))    +  ((im)*(transpose(D)*U*D))
    g = F1*exp(F2)
    Δ = Δ * 0.0367493  # in atomic units
    g = g * exp(im * Δ * t)
    η = 2 * 4.55633 * 10^-6  # 10 cm^-1 to hartree energy
    #g = g * exp(-η * abs(t))     # Adding the damping factor Lorentzian
    g = g * exp(-η * t^2)     # Adding the damping factor Gaussian
    # println("t = $t was evaluated by process $(myid())")
    # println("GFC at t = $t   is     $g")
    # println("F1 at t = $t         = $F1")
    return g[1]
end
function GHT(t::Float64,b::Float64,sysize::Int64,Ωf::Diagonal{Float64},Ωi::Diagonal{Float64},J::Matrix{Float64},D::Matrix{Float64},Si::Diagonal{ComplexF64},Sf::Diagonal{ComplexF64},Bi::Diagonal{ComplexF64},Bf::Diagonal{ComplexF64},X11::Matrix{ComplexF64},X12::Matrix{ComplexF64},X21::Matrix{ComplexF64},X22::Matrix{ComplexF64},Y1::Matrix{ComplexF64},Y2::Matrix{ComplexF64},gfcpart::ComplexF64,T::Matrix{Float64})




    ################################## Preliminary part, common for both Franck-Condon and Herzberg-Teller ##############################
    t = t* 4.1341373335*10^16                     # converting time to atomic units
    tp = - im*b - t
    for i = 1:sysize
        Sf[i,i]=Ωf[i,i]/sin(Ωf[i,i]*t)
        Si[i,i]=Ωi[i,i]/sin(Ωi[i,i]*tp)
        Bf[i,i]=Ωf[i,i]/tan(Ωf[i,i]*t)
        Bi[i,i]=Ωi[i,i]/tan(Ωi[i,i]*tp)
    end
    W = [(Bf+J'*Bi*J)  -(Sf+J'*Si*J)
        -(Sf+J'*Si*J)  (Bf+J'*Bi*J)]
    U = Bi-Si
    V = [J'*U*D
        J'*U*D]

    ########## Defining the X and Y Matrices ##################################


    ################# Final Calculation and returning value ##################
    s=0
    MJSJ = J'*Si*J
    MJBJ = J'*Bi*J
    MJUD = J'*U*D
    for m in range(1,sysize)
        for mp in range(1,sysize)
            # @. X11 = 0.0
            # @. X12 = 0.0
            # @. X21 = 0.0
            # @. X22 = 0.0
            X11[m,:]=(MJSJ)[mp,:]*(-Bf[m,m])
            X12[m,:]=(MJBJ)[mp,:]*(Bf[m,m])
            X21[m,:]=(MJSJ)[mp,:]*(Sf[m,m])
            X22[m,:]=(MJBJ)[mp,:]*(-Sf[m,m])
            X = [X11 X12
                X21 X22]
            # @. Y1 = 0.0
            # @. Y2 = 0.0
            Y1[m,:] = Bf[m,m]*(MJUD)[mp,:] 
            Y2[m,:] = -Sf[m,m]*(MJUD)[mp,:]
            Y = [Y1
                Y2]
            g = im*tr(W\X)+    (transpose(W\V)*X*(W\V))[1]      -    (transpose(Y)*(W\V))[1]
            s = s+g[1]*gfcpart*T[m,mp]
        end
        for i in 1:sysize
            X11[m,i]=0
            X12[m,i]=0
            X21[m,i]=0
            X22[m,i]=0
            Y1[m,1]=0
            Y2[m,1]=0
        end
    end
    return s[1]
end

begin ## Define all constant variables
    const Delta::Float64 = 2.02 # in eV
    const dd::Int64 = 11                               # Change this to modify number of points
    const Temp::Int64=300
    kT = Temp * 1.380649 * 10^-23  # Boltzmann constant times temperature in Joules
    kT= kT * 2.29371227840*10^17         # in atomic units
    const b ::Float64= 1/kT
    const omegaf ::Diagonal{Float64}= Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n')))*4.55633*10^-6   # Final state  frequencies in atomic units
    const omegai ::Diagonal{Float64}= Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n')))*4.55633*10^-6  # Initial state  frequencies in atomic units
    const sysize ::Int64 = size(omegai)[1]
    const P = @. 2*sinh(b*omegai/2)
    T ::Matrix{Float64}= readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
    T = T*T';
    const D ::Matrix{Float64}= readdlm("d.txt", Float64); # Displacement Matrix
    const J ::Matrix{Float64}= readdlm("j.txt", Float64); # Duschinsky Matrix
    t ::Vector{Float64}=  collect(range(-5*10^-12,stop=5*10^-12,length=dd))
    t[div(dd+1,2)] = 10e-25
end

function calc(t,b,Δ,sysize,Ωf,Ωi,J,D,P)
    nchunks = Threads.nthreads() # define number of chunks e.g. equal to number of threads
    # @sync tells Julia to wait for any tasks started inside it
    open("CORE-PARALLEL-OUTPUT-RV6.TXT", "a") do f
        write(f, "STARTING LOOP WITH NUMBER OF THREADS = $(Threads.nthreads()) \n====================================================\n")
    end
    @sync for (xrange, ichunk) in chunks(range(1,dd), nchunks)
        # xrange are the indices this chunk contains, ichunk is just the index of the chunk, which we don't need in this scenario
        Threads.@spawn begin # this line tells Julia to start a new thread with the contents of the begin block
            Sf ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            Si ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            Bf ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            Bi ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            X11 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            X12 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            X21 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            X22 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            Y1  :: Matrix{ComplexF64} =  zeros(ComplexF64,sysize,1)
            Y2  :: Matrix{ComplexF64} =  zeros(ComplexF64,sysize,1)
            x = zeros(ComplexF64,dd)
            for i in xrange
                @time for i in 1:dd
                    gfcpart = GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)
                    gtotal = GHT(t[i],b,sysize,Ωf,Ωi,J,D,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2,gfcpart,T)
                    x[i] = gtotal
                    open("CORE-PARALLEL-OUTPUT-RV6.TXT", "a") do f
                        write(f, "ITERATION $i COMPLETED BY PROCESS $(Threads.threadid()) \n")
                    end
                end  
            end # loop
        end # thread work load
    end # loop over chunks, Julia will wait until all task are completed
    open("CORE-PARALLEL-OUTPUT-RV6.TXT", "a") do f
        write(f, "LOOP COMPLETE \n====================================================")
    end
    yr = real.(x)
    yi = imag.(x)
    p = plot(t,yr,linewidth = 2,color = "blue",title = "Δ = $Δ",label = "Real part",grid=false)
    p =plot!(t,yi,linewidth = 2,color = "red",label = "Imaginary part",size=(1000,800),xlims=(-30*10^-15,30*10^-15))
    savefig(p,"gt.svg")
    begin # This part does FFTW and finds final rate constant

        fs = abs(t[1]-t[2])^-1
        signal = yr
        p = plan_fft(signal)
        F = fftshift(p*signal)
        freqs = fftshift(fftfreq(length(t), fs))  # X axis has frequencies in s^-1
        # Convert freqs to eV
        freqs = freqs * 4.13558 * 10^-15
        F = F * 6.57 * 10^15
        time_domain = plot(t, signal, title = L"G(t)",legend=false,color="orange",xlabel=L"\mathrm{Time}\ (s)",ylabel=L"\mathrm{A.u}")
        freq_domain = plot(freqs, abs.(F), title = L"\mathrm{Fourier\ transformed\ g(\bar{\nu}))}",legend=false,xlabel=L"\mathrm{Energy\ (eV)}",ylabel=L"\mathrm{k_{IC}\ (s^{-1})}",xlims=(0,4))
        p = plot(time_domain, freq_domain, layout = (2,1), size = (1000, 800),right_margin = 10mm)
        savefig(p,"gt-fft.svg")
        ee = Δ# * 8065.73  #Converting eV to cm^-1
        index = findmin(abs.(freqs .- ee))[2]  # Find the index of freq closest to ee
        freqs[index]
        println("The rate constant is ",abs.(F)[index])
        rc = abs.(F)[index]
    end
    begin # This part writes the output to a file
        open("CORE-PARALLEL-OUTPUT-RV6.TXT", "a") do f
            write(f, "-------------------------\n")
            write(f,"NUMBER OF CORES = $(Threads.nthreads())\n")
            write(f,"The value of Δ is $Δ\n")
            write(f,"Number of time points = $dd\n")
            write(f, "The rate constant is $rc\n")
            write(f,"-------------------------\n")
            write(f,"The values of t,Re(g(t)),Im(g(t)) are \n\n")
            for i in range(1,dd)
                write(f,"$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end
    end

end


timeneed = @elapsed calc(t,b,Delta,sysize,omegaf,omegai,J,D,P)
open("CORE-PARALLEL-OUTPUT-RV6.TXT", "a") do f
    write(f, "\n\n\n===============================================================\nThe time elapsed for main loop [INCLUDING GC + C + RC]=  $timeneed seconds \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n")
end

I tried running it with julia -t 1, but I noticed that it seems to go into an infinite loop; the output (directly generated by the above code) is as follows:

STARTING LOOP WITH NUMBER OF THREADS = 1 
====================================================
ITERATION 1 COMPLETED BY PROCESS 1 
ITERATION 2 COMPLETED BY PROCESS 1 
ITERATION 3 COMPLETED BY PROCESS 1 
ITERATION 4 COMPLETED BY PROCESS 1 
ITERATION 5 COMPLETED BY PROCESS 1 
ITERATION 6 COMPLETED BY PROCESS 1 
ITERATION 7 COMPLETED BY PROCESS 1 
ITERATION 8 COMPLETED BY PROCESS 1 
ITERATION 9 COMPLETED BY PROCESS 1 
ITERATION 10 COMPLETED BY PROCESS 1 
ITERATION 11 COMPLETED BY PROCESS 1 
ITERATION 1 COMPLETED BY PROCESS 1 
ITERATION 2 COMPLETED BY PROCESS 1 
ITERATION 3 COMPLETED BY PROCESS 1 
ITERATION 4 COMPLETED BY PROCESS 1 
ITERATION 5 COMPLETED BY PROCESS 1

After completing 11 iterations it starts again from the first, even when I set the number of threads to 1. I also do not understand the purpose (or lack thereof) of the variable ichunk.

You made a small mistake and copied one loop too many.

for i in xrange
    #### this inner loop was one too many:
    # @time for i in 1:dd
    gfcpart = GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)
    gtotal = GHT(t[i],b,sysize,Ωf,Ωi,J,D,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2,gfcpart,T)
    x[i] = gtotal
    open("CORE-PARALLEL-OUTPUT-RV6.TXT", "a") do f
        write(f, "ITERATION $i COMPLETED BY PROCESS $(Threads.threadid()) \n")
    end
    # end   # matching end of the removed inner loop
end # loop

I commented on the other thread that ichunk is the number of the chunk, but you don’t need it here. It is useful if you want to preallocate output space for each chunk.
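For illustration, with the same ChunkSplitters API as above (newer versions use a keyword argument instead) and a made-up work function:

using ChunkSplitters, Base.Threads
work(i) = sum(sin, 1:i)                 # hypothetical per-item workload
N = 100
nchunks = Threads.nthreads()
partial = zeros(nchunks)                # one preallocated slot per chunk
@sync for (xrange, ichunk) in chunks(1:N, nchunks)
    Threads.@spawn begin
        acc = 0.0
        for i in xrange
            acc += work(i)
        end
        partial[ichunk] = acc           # no race: each task writes only its own slot
    end
end
total = sum(partial)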

I corrected the mistake and rewrote the code accordingly:

println("Number of threads = $(Threads.nthreads())")
open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
    write(f, "\n\n\n====================================================\n\n\n")
end
begin
    using LoopVectorization
    using MKL
    using DelimitedFiles
    using LinearAlgebra
    using LaTeXStrings
    using TimerOutputs
    BLAS.set_num_threads(1)
    using Plots
    using Plots.PlotMeasures
    using FFTW
    using Base.Threads
    using SharedArrays
    using ChunkSplitters
end
function GFC(t::Float64,b::Float64,Δ::Float64,sysize::Int64,Ωf::Diagonal{Float64},Ωi::Diagonal{Float64},J::Matrix{Float64},D::Matrix{Float64},ρ::Diagonal{Float64},Si::Diagonal{ComplexF64},Sf::Diagonal{ComplexF64},Bi::Diagonal{ComplexF64},Bf::Diagonal{ComplexF64})
    t = t* 4.1341373335*10^16                     # converting time to atomic units
    tp = - im*b - t
    for i = 1:sysize
        Sf[i,i]=Ωf[i,i]/sin(Ωf[i,i]*t)
        Si[i,i]=Ωi[i,i]/sin(Ωi[i,i]*tp)
        Bf[i,i]=Ωf[i,i]/tan(Ωf[i,i]*t)
        Bi[i,i]=Ωi[i,i]/tan(Ωi[i,i]*tp)
    end
    BB = (Bf+J'*Bi*J)
    AA = (Sf+J'*Si*J)

    W = [BB  -AA
        -AA  BB] # We won't calculate the 2N x 2N determinant directly.
    
    # Wapp = BB*(BB-  AA*(BB^-1)*AA  )
    Wapp = BB*(BB-  AA*(BB\AA)  )
    U = Bi-Si
    V = [J'*U*D
        J'*U*D]
    # F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))
    # F1 = sqrt(det(Si*Sf*ρ^2)/det(Wapp))
    F1 = sqrt(det(Si*Sf*  (Wapp\ρ^2)))
    F2 = (-(im/2)*(transpose(V)* (W\V)))    +  ((im)*(transpose(D)*U*D))
    g = F1*exp(F2)
    Δ = Δ * 0.0367493  # in atomic units
    g = g * exp(im * Δ * t)
    η = 2 * 4.55633 * 10^-6  # 10 cm^-1 to hartree energy
    #g = g * exp(-η * abs(t))     # Adding the damping factor Lorentzian
    g = g * exp(-η * t^2)     # Adding the damping factor Gaussian
    # println("t = $t was evaluated by process $(myid())")
    # println("GFC at t = $t   is     $g")
    # println("F1 at t = $t         = $F1")
    return g[1]
end
function GHT(t::Float64,b::Float64,sysize::Int64,Ωf::Diagonal{Float64},Ωi::Diagonal{Float64},J::Matrix{Float64},D::Matrix{Float64},Si::Diagonal{ComplexF64},Sf::Diagonal{ComplexF64},Bi::Diagonal{ComplexF64},Bf::Diagonal{ComplexF64},X11::Matrix{ComplexF64},X12::Matrix{ComplexF64},X21::Matrix{ComplexF64},X22::Matrix{ComplexF64},Y1::Matrix{ComplexF64},Y2::Matrix{ComplexF64},gfcpart::ComplexF64,T::Matrix{Float64})




    ################################## Preliminary part, common for both Franck-Condon and Herzberg-Teller ##############################
    t = t* 4.1341373335*10^16                     # converting time to atomic units
    tp = - im*b - t
    for i = 1:sysize
        Sf[i,i]=Ωf[i,i]/sin(Ωf[i,i]*t)
        Si[i,i]=Ωi[i,i]/sin(Ωi[i,i]*tp)
        Bf[i,i]=Ωf[i,i]/tan(Ωf[i,i]*t)
        Bi[i,i]=Ωi[i,i]/tan(Ωi[i,i]*tp)
    end
    W = [(Bf+J'*Bi*J)  -(Sf+J'*Si*J)
        -(Sf+J'*Si*J)  (Bf+J'*Bi*J)]
    U = Bi-Si
    V = [J'*U*D
        J'*U*D]

    ########## Defining the X and Y Matrices ##################################


    ################# Final Calculation and returning value ##################
    s=0
    MJSJ = J'*Si*J
    MJBJ = J'*Bi*J
    MJUD = J'*U*D
    for m in range(1,sysize)
        for mp in range(1,sysize)
            # @. X11 = 0.0
            # @. X12 = 0.0
            # @. X21 = 0.0
            # @. X22 = 0.0
            X11[m,:]=(MJSJ)[mp,:]*(-Bf[m,m])
            X12[m,:]=(MJBJ)[mp,:]*(Bf[m,m])
            X21[m,:]=(MJSJ)[mp,:]*(Sf[m,m])
            X22[m,:]=(MJBJ)[mp,:]*(-Sf[m,m])
            X = [X11 X12
                X21 X22]
            # @. Y1 = 0.0
            # @. Y2 = 0.0
            Y1[m,:] = Bf[m,m]*(MJUD)[mp,:] 
            Y2[m,:] = -Sf[m,m]*(MJUD)[mp,:]
            Y = [Y1
                Y2]
            g = im*tr(W\X)+    (transpose(W\V)*X*(W\V))[1]      -    (transpose(Y)*(W\V))[1]
            s = s+g[1]*gfcpart*T[m,mp]
        end
        for i in 1:sysize
            X11[m,i]=0
            X12[m,i]=0
            X21[m,i]=0
            X22[m,i]=0
            Y1[m,1]=0
            Y2[m,1]=0
        end
    end
    return s[1]
end

begin ## Define all constant variables
    const Delta::Float64 = 2.02 # in eV
    const dd ::Int64 = 16                              # Change this to modify number of points
    const Temp::Int64=300
    kT = Temp * 1.380649 * 10^-23  # Boltzmann constant times temperature in Joules
    kT= kT * 2.29371227840*10^17         # in atomic units
    const b ::Float64= 1/kT
    const omegaf ::Diagonal{Float64}= Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n')))*4.55633*10^-6   # Final state  frequencies in atomic units
    const omegai ::Diagonal{Float64}= Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n')))*4.55633*10^-6  # Initial state  frequencies in atomic units
    const sysize ::Int64 = size(omegai)[1]
    const P = @. 2*sinh(b*omegai/2)
    T ::Matrix{Float64}= readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
    T = T*T';
    const D ::Matrix{Float64}= readdlm("d.txt", Float64); # Displacement Matrix
    const J ::Matrix{Float64}= readdlm("j.txt", Float64); # Duschinsky Matrix
    t ::Vector{Float64}=  collect(range(-5*10^-12,stop=5*10^-12,length=dd))
    t[div(dd+1,2)] = 10e-25
end

function calc(t,b,Δ,sysize,Ωf,Ωi,J,D,P)
    nchunks = Threads.nthreads() # define number of chunks e.g. equal to number of threads
    # @sync tells Julia to wait for any tasks started inside it
    open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
        write(f, "STARTING LOOP WITH NUMBER OF THREADS = $(Threads.nthreads()) \n====================================================\n")
    end
    x = zeros(ComplexF64,dd)
    @time @sync for (xrange, ichunk) in chunks(range(1,dd), nchunks)
        # xrange are the indices this chunk contains, ichunk is just the index of the chunk, which we don't need in this scenario
        Threads.@spawn begin # this line tells Julia to start a new thread with the contents of the begin block
            Sf ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            Si ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            Bf ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            Bi ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            X11 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            X12 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            X21 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            X22 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            Y1  :: Matrix{ComplexF64} =  zeros(ComplexF64,sysize,1)
            Y2  :: Matrix{ComplexF64} =  zeros(ComplexF64,sysize,1)
            for i in xrange
                gfcpart = GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)
                gtotal = GHT(t[i],b,sysize,Ωf,Ωi,J,D,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2,gfcpart,T)
                x[i] = gtotal
                open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
                    write(f, "ITERATION $i COMPLETED BY THREAD $(Threads.threadid()) \n")
                end
            end # loop
        end # thread work load
    end # loop over chunks, Julia will wait until all task are completed
    open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
        write(f, "LOOP COMPLETE \n====================================================")
    end
    yr = real.(x)
    yi = imag.(x)
    p = plot(t,yr,linewidth = 2,color = "blue",title = "Δ = $Δ",label = "Real part",grid=false)
    p =plot!(t,yi,linewidth = 2,color = "red",label = "Imaginary part",size=(1000,800),xlims=(-30*10^-15,30*10^-15))
    savefig(p,"gt.svg")
    begin # This part does FFTW and finds final rate constant

        fs = abs(t[1]-t[2])^-1
        signal = yr
        p = plan_fft(signal)
        F = fftshift(p*signal)
        freqs = fftshift(fftfreq(length(t), fs))  # X axis has frequencies in s^-1
        # Convert freqs to eV
        freqs = freqs * 4.13558 * 10^-15
        F = F * 6.57 * 10^15
        time_domain = plot(t, signal, title = L"G(t)",legend=false,color="orange",xlabel=L"\mathrm{Time}\ (s)",ylabel=L"\mathrm{A.u}")
        freq_domain = plot(freqs, abs.(F), title = L"\mathrm{Fourier\ transformed\ g(\bar{\nu}))}",legend=false,xlabel=L"\mathrm{Energy\ (eV)}",ylabel=L"\mathrm{k_{IC}\ (s^{-1})}",xlims=(0,4))
        p = plot(time_domain, freq_domain, layout = (2,1), size = (1000, 800),right_margin = 10mm)
        savefig(p,"gt-fft.svg")
        ee = Δ# * 8065.73  #Converting eV to cm^-1
        index = findmin(abs.(freqs .- ee))[2]  # Find the index of freq closest to ee
        freqs[index]
        println("The rate constant is ",abs.(F)[index])
        rc = abs.(F)[index]
    end
    begin # This part writes the output to a file
        open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
            write(f, "-------------------------\n")
            write(f,"NUMBER OF CORES = $(Threads.nthreads())\n")
            write(f,"The value of Δ is $Δ\n")
            write(f,"Number of time points = $dd\n")
            write(f, "The rate constant is $rc\n")
            write(f,"-------------------------\n")
            write(f,"The values of t,Re(g(t)),Im(g(t)) are \n\n")
            for i in range(1,dd)
                write(f,"$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end
    end

end


timeneed = @elapsed calc(t,b,Delta,sysize,omegaf,omegai,J,D,P)
open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
    write(f, "\n\n\n===============================================================\nThe time elapsed for main loop [INCLUDING GC + C + RC]=  $timeneed seconds \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n")
end

When running with 1 thread, it takes 174 s with dd = 16. When I bump the number of threads up to 16, the total time for the expensive loop drops to 42 seconds!

I know that we can never achieve perfect scaling, i.e. an n-times speed boost from n threads. Although this scaling is about twice as good as what I got with Distributed.jl, it is still far from ideal: I am getting a ~4.17x speed boost from 16x the threads. Is this kind of scaling expected from Base.Threads, or is there something still to be done to improve it?

I would also like to mention another peculiar observation: before the optimizations I posted about today, the serial code took about 170 seconds on my laptop. When I ran the same code on the supercomputer, which uses Intel Xeon CPUs, it took 303 seconds. Now, I know that single-core performance is not always a supercomputer's selling point. But when I ran our Fortran code on the same supercomputer with the same number of points, it took 59 seconds, whereas on my PC the Fortran code took 148 seconds. In short, moving to the supercomputer sped up the Fortran code but slowed down my Julia code. This seems very counterintuitive.

You can improve the scaling by improving the GFC and GHT functions. Right now they still allocate quite a bit. If you reduce that you will get better scaling with threads.

Another complementary approach on Julia 1.10 could be to play with the number of GC threads. I have not experimented with this myself, so I can't give good tips, nor do I know what can be expected. Maybe someone else can comment on that.
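If you want to try it, I believe the relevant startup flag looks like this (check the 1.10 docs):

julia --gcthreads=8 -t 16 script.jl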

Anyhow, I think we are at the end of the easy programmatic improvements, and now it is time to put in some work and improve the performance of the computation itself.

I would advise you to stop thinking about parallelization, whether via multithreading or multiprocessing, at this point. Your code is still quite large and messy, so I might have misread something, but while you still have stuff like:

W = [(Bf+J'*Bi*J)  -(Sf+J'*Si*J)
        -(Sf+J'*Si*J)  (Bf+J'*Bi*J)]

or

g = im*tr(W\X) + (transpose(W\V)*X*(W\V))[1] - (transpose(Y)*(W\V))[1]
s = s + g[1]*gfcpart*T[m,mp]

where you repeatedly calculate things (and in the second example also seemingly throw away most of what you compute) I would expect the gain from actually optimizing the bottlenecks in your code for simple single-threaded execution to be much larger than the gain from fiddling with the number of gc threads.

I would take GFC and GHT in turn, clear them up as much as possible (i.e. remove unnecessary comments etc.), provide some dummy data for each of them (i.e. generate random inputs of roughly the size you are working with, say Ωf = Diagonal(rand(100))) and then post all of that here so that people can check for themselves how fast the current implementation is and improve it.


These lines create unnecessary temporary arrays; do

X11[m,:] .= (MJSJ)[mp,:] .* (-Bf[m,m])

etc. instead. Note the dots. Also, spacing around operators/commas/etc makes the code easier to read.
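Combined with the earlier @views advice, the right-hand-side slice stops allocating too; a toy-sized sketch:

using LinearAlgebra
n = 4
MJSJ = rand(ComplexF64, n, n)
X11 = zeros(ComplexF64, n, n)
Bf = Diagonal(rand(ComplexF64, n))
m, mp = 1, 2
@views X11[m, :] .= MJSJ[mp, :] .* (-Bf[m, m])   # RHS slice becomes a view: no temporary array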

This does repeated work: W \ V appears three times and W \ X once, i.e. four solves involving W per iteration, even though W is the same for all iterations of the loop. I would pre-calculate either inv(W) or some appropriate factorization outside the loop and reuse that.
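Something along these lines (a sketch with dummy sizes):

using LinearAlgebra
n = 6
W = rand(ComplexF64, 2n, 2n)
V = rand(ComplexF64, 2n, 1)
Wfact = lu(W)      # factor once, O(n^3)
WV = Wfact \ V     # the solve W \ V, done a single time
# Inside the m, mp loop, reuse Wfact and WV:
#   g = im * tr(Wfact \ X) + (transpose(WV) * X * WV)[1] - (transpose(Y) * WV)[1]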

Emphatically agreed.

@Uranium238 Optimize single-threaded performance, with a particular view to reducing the number of allocations, first. Then start thinking about multi-threading.

Not only are single-threaded optimizations more likely to be significant, but parallelization will scale better as a result. Julia parallelization is not great for GC-heavy code.


The outcome of the line g = im*tr(W\X) + (transpose(W\V)*X*(W\V))[1] - (transpose(Y)*(W\V))[1] is a 1x1 matrix, so I assumed writing g[1] would not be a problem; I am essentially converting a Matrix of dimensions (1,1) to a number.
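A quick check, using Base's only to make that intent explicit:

g = ones(1, 1)
g[1]       # 1.0: linear indexing into the 1x1 matrix
only(g)    # 1.0 as well, and it errors if g ever has more than one element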


Thank you for your excellent advice. I did it, and it sped up my code by a huge margin! It now takes only 41 seconds for dd = 10 instead of >100. I have yet to compare it with the Fortran code on a common machine, but it seems like it has surpassed the speed of the Fortran code!