How to improve the scaling of Julia code aimed at multi-node parallelization?

It seems the β€œMost Expensive Loop” is the part with the majority of the allocations, which is very strange because I followed your suggestion so that no allocations are made within the doubly nested loop.

    @timeit to "Most Expensive Loop" begin
    for m in 1:sysize
        for mp in 1:sysize
            ########## Defining the X and Y Matrices ##################################
            X[m, 1:sysize] .= (MJSJ)[mp, :] .* (-Bf[m, m])
            X[m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (Bf[m, m])
            X[sysize+m, 1:sysize] .= (MJSJ)[mp, :] .* (Sf[m, m])
            X[sysize+m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (-Sf[m, m])

            Y[m, 1] = Bf[m, m] * (MJUD)[mp, 1]
            Y[m+sysize, 1] = -Sf[m, m] * (MJUD)[mp, 1]

            g = im * (tr(mul!(XWin, X, Win))  + (transpose(WinV)*X*WinV)[1] - (transpose(Y)*WinV)[1])
            s = s + g * gdsopart * T[m, mp]

        end
        X[m, 1:sysize] .= 0
        X[m, sysize+1:2*sysize] .= 0
        X[sysize+m, 1:sysize] .= 0
        X[sysize+m, sysize+1:2*sysize] .= 0
        Y[m, 1] = 0
        Y[m+sysize, 1] = 0
    end
end

As you can see from the results

anik@anik-MSI:~/Desktop/Swapan Sir/generating-function-codes/julia-codes/isc$ julia 2serial.jl 
Number of processes = 1
GKS: Possible loss of precision in routine SET_WINDOW
The rate constant is 42009.188313308856
 ────────────────────────────────────────────────────────────────────────────────────
                                            Time                    Allocations      
                                   ───────────────────────   ────────────────────────
         Tot / % measured:              65.6s /  88.9%           1.26GiB /  28.0%    

 Section                   ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 GT calculation                 1    58.3s  100.0%   58.3s    361MiB  100.0%   361MiB
   GSV                         22    58.3s  100.0%   2.65s    338MiB   93.6%  15.3MiB
     Most Expensive Loop       22    58.2s   99.9%   2.65s    312MiB   86.4%  14.2MiB
   GDSO                        22   23.6ms    0.0%  1.07ms   22.9MiB    6.4%  1.04MiB
 ────────────────────────────────────────────────────────────────────────────────────

And as you can see from my earlier code snippets, I have applied the @views macro to the entire GSV function, so the slicing operations should not create any new allocations!

Personally I recommend MPI.jl; we have had great success with it over the years. But it would be interesting to understand whether you encountered an β€œimplementation”/algorithmic scaling bottleneck or an β€œinfrastructure” scaling bottleneck.

The thread-safety issues that Dagger has stem from Distributed.jl itself not being fully thread-safe.

As for profiling, you can use VTune (with the Julia environment variable ENABLE_JITPROFILING=1 set) to understand where time is being spent in a multi-process setting. IntelITT.jl allows you to annotate code regions.

To be honest, one of the things I am trying to achieve is a Julia code that is faster than the Fortran equivalent while staying as readable as possible. I am not very familiar with MPI.jl, but it seems that it would take major refactoring to get the code to work with it, and even then the readability might suffer greatly.

These lines allocate:

  • transpose(WinV)*X*WinV
  • transpose(Y)*WinV

I think WinV and Y are supposed to be vectors, right?
If you give them the correct shape (i.e. make them actual Vectors), then you can simply use:

  • dot(WinV, X, WinV)
  • dot(Y, WinV)
    Neither of which allocates.

I don’t see any other allocations.
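
A quick way to check the difference (a minimal sketch with an arbitrary problem size; note that dot conjugates its first argument, so for complex vectors it computes WinV' * X * WinV rather than transpose(WinV) * X * WinV):

    using LinearAlgebra

    # f_old mirrors the original allocating expression, f_new the suggested replacement
    f_old(X, v, y) = transpose(v) * X * v - transpose(y) * v
    f_new(X, v, y) = dot(v, X, v) - dot(y, v)

    function compare_allocations(n)
        X = rand(ComplexF64, 2n, 2n)
        v = rand(ComplexF64, 2n)       # a true Vector, not a 2n-by-1 Matrix
        y = rand(ComplexF64, 2n)
        f_old(X, v, y); f_new(X, v, y) # warm up so compilation is not measured
        println("old: ", @allocated(f_old(X, v, y)), " bytes")
        println("new: ", @allocated(f_new(X, v, y)), " bytes")   # expected: 0
    end

    compare_allocations(100)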

Yeah, that’s why I am asking whether it is algorithmic scalability, i.e. did you implement the same parallelization strategy as the Fortran code? The answer, as much as we might not like it, could be that the way you implemented your algorithm works well with shared-memory parallelization but not with multi-node. So while it is tempting to blame Distributed.jl, the bottleneck may well be algorithmic.

Well written MPI can be very readable, and I have come to appreciate the explicitness that it promotes.

Making these changes finally seemed to get rid of most allocations, with the result now stating

anik@anik-MSI:~/Desktop/Swapan Sir/generating-function-codes/julia-codes/isc$ julia 2serial.jl 
Number of processes = 1
GKS: Possible loss of precision in routine SET_WINDOW
The rate constant is 42009.19302545984
 ────────────────────────────────────────────────────────────────────────────────────────
                                                Time                    Allocations      
                                       ───────────────────────   ────────────────────────
           Tot / % measured:                65.9s /  89.2%           0.95GiB /   4.3%    

 Section                       ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────────
 GT calculation                     1    58.8s  100.0%   58.8s   41.4MiB  100.0%  41.4MiB
   GSV                             22    58.8s   99.9%   2.67s   18.5MiB   44.6%   859KiB
     Most Expensive Loop           22    58.8s   99.9%   2.67s   1.47KiB    0.0%    68.4B
       Final Multiplications     124k    58.6s   99.6%   474ΞΌs     0.00B    0.0%    0.00B
       X Y Matrices Writing      124k    113ms    0.2%   909ns     0.00B    0.0%    0.00B
     Pre Loop Allocations          22   21.8ms    0.0%   989ΞΌs   18.5MiB   44.6%   859KiB
   GDSO                            22   35.9ms    0.1%  1.63ms   22.9MiB   55.3%  1.04MiB
 ────────────────────────────────────────────────────────────────────────────────────────

However, at the serial level we have not seen any performance increase (if anything, performance has become a bit worse). It might not be wise to conclude much from such a small benchmark, though. I shall try to repeat the dd = 20001 case on the cluster and report the differences.

As far as I understand, the Fortran code parallelizes a bit differently. @Uranium238 has the parallelization one layer higher, essentially over the loop over different times, while the Fortran code only parallelizes the outer loop in GSV:

NP1=MYID+1                    !MYID = PROCESSOR ID, NP1=TEMPORARY PROCESSOR ID
    DO K=NP1,N,NP
    DO M=1,N                  ! N should be our sysize
    ! rest of the GSV computation, i.e. setting up X,Y 
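
In Julia terms, that Fortran partitioning corresponds roughly to the following sketch (with placeholder values for NP, the number of processes, and MYID, the process id):

    # Round-robin split of the outer GSV loop across NP processes:
    # process MYID handles m = MYID+1, MYID+1+NP, MYID+1+2*NP, ...
    sysize, NP, MYID = 100, 4, 0          # placeholder values for illustration
    for m in (MYID + 1):NP:sysize
        for mp in 1:sysize
            # rest of the GSV computation, i.e. setting up X, Y for this (m, mp)
        end
    end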

Is that supposed to cause this big of a difference? Rather, shouldn't that involve more data copying? Also, given the current structure of my code, is there any way to achieve the equivalent type of parallelization without a complete overhaul of the code?

I agree that your way of parallelizing seems better to me - at the very least I would not expect a huge difference, but perhaps @vchuravy has a more substantiated opinion on this.
There could be a bit of a difference, since the MPI approach calculates/constructs the matrices from scratch on each process whereas Distributed.jl transfers them to the other processes. However, since each process works on different values, I think your approach overall needs much less redundant work, so it should be faster.


β€œCompute is free, memory bandwidth is not”, or so the saying goes πŸ™‚ It is a common-ish tradeoff to redo computation in order to avoid memory copies.

One thing to remember is that Distributed.jl is a primary-worker model vs the SPMD MPI model.

In a primary-worker model, Amdahl’s law matters a lot for the possible speedup: what is the time spent on the serial computation vs the parallel computation? E.g. you can’t get infinite speedup by focusing only on the parallel section; instead you need to figure out whether you can parallelize the serial part. The serial part includes the movement of data from one process to the other.

In MPI we instead look at the overhead cost of data exchange and whether we can hide the data movement while the code does other useful things.

I glanced at the code, and as far as I can tell you essentially have a parallel for-loop? What’s the ratio of time that for-loop takes vs the rest of it? And how much time is the data movement without any work?
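
A crude way to estimate that last number (a sketch assuming at least one worker process has been added, and that the arrays of interest are J, D and T from the script) is to ship the data to a worker and do nothing with it:

    using Distributed

    # A no-op that still forces its arguments to be serialized and sent to the worker
    @everywhere ship_only(args...) = nothing

    w = first(workers())
    @time remotecall_fetch(ship_only, w, J, D, T)   # roughly the cost of data movement alone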


I have set up a computation with dd = 20001; obviously it will take a long time to complete. But as of now I can see (from the intermediate files) that it has completed 1600 iterations in approximately 148 minutes. Thus we can project that the full run will take about 30 hours. I think we can be sure that allocations were not the culprit behind the bad scaling, as we have eliminated almost all unnecessary allocations.

What ratio are you referring to here? My apologies, I am very unfamiliar with the other things that you mentioned in your replies.

Let’s say we have a program that consists of two parts, A and B.
We speed up B by a factor of 2. How much faster will the resulting program be?
This is referred to as β€œAmdahl’s Law”: https://web.engr.oregonstate.edu/~mjb/cs575/Handouts/speedups.and.amdahls.law.1pp.pdf
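
As a concrete worked example: if B accounts for a fraction p of the total runtime and is sped up by a factor s, the overall speedup is S = 1 / ((1 - p) + p / s). With p = 0.5 and s = 2 this gives S = 1 / (0.5 + 0.25) β‰ˆ 1.33, and no matter how much faster B becomes, the total speedup can never exceed 1 / (1 - p) = 2Γ—.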

Understanding where your program spends time will tell you where you should focus your attention.

MPI is parallel by default and you introduce communication; Distributed.jl is almost serial by default and you then try to parallelize the program.

So in MPI, duplicated computation is kind of free and we often don’t worry about it, but we pay more attention to the cost of communication.

So in order to understand your program better you need to establish where the actual costs are. I would recommend running the program through at least the Julia profiler.
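
A minimal sketch of doing that, reusing the calc entry point from the serial script shown below (ProfileView.jl or StatProfilerHTML.jl can be used to visualise the result instead of the text report):

    using Profile

    calc(t, b, Delta, sysize, omegaf, omegai, J, D, P)   # warm-up run so compilation is not profiled
    Profile.clear()
    @profile calc(t, b, Delta, sysize, omegaf, omegai, J, D, P)
    Profile.print(mincount = 100)    # flat text report; hides lines with fewer than 100 samples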


Although I have not used the specialized profiling packages you mentioned, I have run TimerOutputs.jl on the serial version of the code, which gives a very good idea of exactly which part of the code is the most expensive in terms of time.

Serial Code
using MKL
using DelimitedFiles
using LinearAlgebra
using LaTeXStrings
using TimerOutputs
using Plots
using Plots.PlotMeasures
using FFTW
using SharedArrays
using TensorOperations
BLAS.set_num_threads(1)
const to = TimerOutput()
println("Number of processes = 1")
open("ISC-SERIAL.TXT", "a") do f
    write(f, "\n\n\n====================================================\n\n\n")
end

@views function GDSO(
    t::Float64,
    b::Float64,
    Ξ”::Float64,
    sysize::Int64,
    Ξ©f::Diagonal{Float64},
    Ξ©i::Diagonal{Float64},
    J::Matrix{Float64},
    D::Vector{Float64},
    ρ::Diagonal{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64},
    JpBi::Matrix{ComplexF64},
    JpSi::Matrix{ComplexF64},
    BB::Matrix{ComplexF64},
    AA::Matrix{ComplexF64},
    W::Matrix{ComplexF64},
    Wapp::Matrix{ComplexF64},
    Wapp_i1::Matrix{ComplexF64},
    V::Vector{ComplexF64},
    V_int1::Vector{ComplexF64},
    V_int2::Matrix{ComplexF64},
    BimSi::Diagonal{ComplexF64}
    
)
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ξ©f[i, i] / sin(Ξ©f[i, i] * t)
        Si[i, i] = Ξ©i[i, i] / sin(Ξ©i[i, i] * tp)
        Bf[i, i] = Ξ©f[i, i] / tan(Ξ©f[i, i] * t)
        Bi[i, i] = Ξ©i[i, i] / tan(Ξ©i[i, i] * tp)
    end
    # BB = (Bf + J' * Bi * J)
    # AA = (Sf + J' * Si * J)
    mul!(JpBi,J',Bi)
    mul!(BB,JpBi,J)
    BB .+= Bf
    mul!(JpSi,J',Si)
    mul!(AA,JpSi,J)
    AA .+= Sf

    # W = [BB -AA
    #     -AA BB] 
    W[1:sysize,1:sysize] .= BB
    W[1:sysize,sysize+1:2*sysize] .= -AA
    W[sysize+1:2*sysize,1:sysize] .= -AA
    W[sysize+1:2*sysize,sysize+1:2*sysize] .= BB

    #We wont calculate the 2Nx2N determinant directly.
    # Wapp = BB * (BB - AA * (BB \ AA))
    BBinvAA = BB \ AA
    mul!(Wapp_i1,AA,BBinvAA,-1,0)
    Wapp_i1 .+= BB
    mul!(Wapp,BB,Wapp_i1)

    # V = [J' * (Bi-Si) * D
    #      J' * (Bi-Si) * D]
    BimSi .= Bi - Si
    mul!(V_int1,BimSi,D)
    mul!(view(V, 1:sysize), J', V_int1)
    V[sysize+1:end] .=  V[1:sysize]
    # mul!(V_int2,J',V_int1)
    # V[1:sysize,1] .= V_int2
    # V[sysize+1:2*sysize,1] .= V_int2

    F1 = sqrt(det(Si * Sf * (Wapp \ ρ^2)))
    F2 = (-(im / 2) * (transpose(V) * (W \ V))) + ((im) * (transpose(D) * (Bi-Si) * D))
    g = F1 * exp(F2)
    Ξ” = Ξ” * 0.0367493  # in atomic units
    g = g * exp(im * Ξ” * t)
    Ξ· = 2 * 4.55633e-6  # 10 cm^-1 to hartree energy
    g = g * exp(-Ξ· * abs(t))
    return g[1]
end

@views function GSV(
    t::Float64,
    b::Float64,
    sysize::Int64,
    Ξ©f::Diagonal{Float64},
    Ξ©i::Diagonal{Float64},
    J::Matrix{Float64},
    D::Vector{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64},
    gdsopart::ComplexF64,
    T::Matrix{Float64},       ## Product of NAC and SO 
    Hso::Float64,
    X::Matrix{ComplexF64},
    Y::Vector{ComplexF64},
    W::Matrix{ComplexF64},
    V::Vector{ComplexF64},
    WinV::Vector{ComplexF64},
    XWin::Matrix{ComplexF64}
)

    ################################## Preliminary Part, Common for both Franck-Condon and Herzberg-Teller ##############################
    t = t * 4.1341373335e16                     # converting time to atomic units
    # tp = -im * b - t
    # for i = 1:sysize
    #     Sf[i, i] = Ξ©f[i, i] / sin(Ξ©f[i, i] * t)
    #     Si[i, i] = Ξ©i[i, i] / sin(Ξ©i[i, i] * tp)
    #     Bf[i, i] = Ξ©f[i, i] / tan(Ξ©f[i, i] * t)
    #     Bi[i, i] = Ξ©i[i, i] / tan(Ξ©i[i, i] * tp)
    # end
    # W = [(Bf+J'*Bi*J) -(Sf + J' * Si * J)
    #     -(Sf + J' * Si * J) (Bf+J'*Bi*J)]
    # # U = Bi - Si
    # V = [J' * (Bi-Si) * D
    #     J' * (Bi-Si) * D]




    ################# Final Calculation and returning value ##################
    @timeit to "Pre Loop Allocations" begin
    s = zero(ComplexF64)
    MJSJ = J' * Si * J
    MJBJ = J' * Bi * J
    MJUD = J' * (Bi-Si) * D
    Win = inv(W)
    mul!(WinV,Win,V)
    end
    @timeit to "Most Expensive Loop" begin
    for m in 1:sysize
        for mp in 1:sysize
            ########## Defining the X and Y Matrices ##################################
            @timeit to "X Y Matrices Writing" begin
            X[m, 1:sysize] .= (MJSJ)[mp, :] .* (-Bf[m, m])
            X[m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (Bf[m, m])
            X[sysize+m, 1:sysize] .= (MJSJ)[mp, :] .* (Sf[m, m])
            X[sysize+m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (-Sf[m, m])

            Y[m, 1] = Bf[m, m] * (MJUD)[mp, 1]
            Y[m+sysize, 1] = -Sf[m, m] * (MJUD)[mp, 1]
            end
            @timeit to "Final Multiplications" begin

            # g = im * (tr(mul!(XWin, X, Win))  + (transpose(WinV)*X*WinV)[1] - (transpose(Y)*WinV)[1])
            g = im * (tr(mul!(XWin, X, Win))  + dot(WinV, X, WinV) - dot(Y, WinV))
            


            s = s + g * gdsopart * T[m, mp]
            end

        end
        X[m, 1:sysize] .= 0
        X[m, sysize+1:2*sysize] .= 0
        X[sysize+m, 1:sysize] .= 0
        X[sysize+m, sysize+1:2*sysize] .= 0
        Y[m, 1] = 0
        Y[m+sysize, 1] = 0
    end
end
    s = s + abs(Hso)^2 * gdsopart   #### LATEST MODIFICATION
    return s
end

const hso_s1t3 = 1.49 * 0.0000046  ## SOC between S1 and T3 = 0.48 cm-1, converting to atomic units
const Delta = 0.11 # in eV
const Delta_s1t3 = 0.091 * 0.037 # Difference between S1 and T3 = 0.091 eV in atomic units
const dd = 22                            # Change this to modify number of points
const Temp = 300
kT_temp = Temp * 1.380649e-23  # Boltzmann constant times temperature in Joules
const kT = kT_temp * 2.29371227840e17         # in atomic units
const b = 1 / kT
const omegaf = Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))) * 4.55633e-6   # Final state  frequencies in atomic units
const omegai = Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))) * 4.55633e-6  # Initial state  frequencies in atomic units
const sysize = size(omegai)[1]
const P = @. 2 * sinh(b * omegai / 2)
T_temp = readdlm("nac.txt", '\t', Float64, '\n') # NAC matrix
const T = (T_temp * T_temp')* (hso_s1t3^2/Delta_s1t3^2)
const D = readdlm("d.txt", Float64)[:,1] # Displacement Matrix
const J = readdlm("j.txt", Float64) # Duschinsky Matrix
const t = collect(range(-5e-12, stop=5e-12, length=dd))
t[div(dd + 1, 2)] = 10e-25

function calc(t, b, Ξ”, sysize, Ξ©f, Ξ©i, J, D, P)
    # @sync tells Julia to wait for any tasks started inside it
    open("ISC-SERIAL.TXT", "w") do f
        write(f, "STARTING LOOP WITH NUMBER OF CORES = 1 \n====================================================\n")
    end
    x = zeros(ComplexF64, dd)
    Sf = Diagonal(zeros(ComplexF64, sysize))
    Si = Diagonal(zeros(ComplexF64, sysize))
    Bf = Diagonal(zeros(ComplexF64, sysize))
    Bi = Diagonal(zeros(ComplexF64, sysize))
    X = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Y = zeros(ComplexF64, 2 * sysize)
    BB = zeros(ComplexF64, sysize, sysize)
    AA = zeros(ComplexF64, sysize, sysize)
    JpBi = zeros(ComplexF64, sysize, sysize)
    JpSi = zeros(ComplexF64, sysize, sysize)
    W = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Wapp = zeros(ComplexF64, sysize, sysize)
    Wapp_i1 = zeros(ComplexF64, sysize, sysize)
    V =  zeros(ComplexF64, 2 * sysize)
    V_int1 = zeros(ComplexF64, sysize)
    V_int2 = zeros(ComplexF64, sysize,1)
    BimSi = Diagonal(zeros(ComplexF64, sysize, sysize))
    WinV = zeros(ComplexF64, 2 * sysize)
    WinV_transpose = zeros(ComplexF64, 1, 2 * sysize)
    XWin = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Hso = 0.48 * 0.0000046  ## SOC between S1 and T2 = 0.48 cm-1, converting to atomic units
    @timeit to "GT calculation" for i in 1:dd
        @timeit to "GDSO" gdsopart = GDSO(t[i], b, Ξ”, sysize, Ξ©f, Ξ©i, J, D, P, Si, Sf, Bi, Bf,JpBi,JpSi,BB,AA,W,Wapp,Wapp_i1,V,V_int1,V_int2,BimSi)
        @timeit to "GSV" gtotal = GSV(t[i], b, sysize, Ξ©f, Ξ©i, J, D, Si, Sf, Bi, Bf,gdsopart, T,Hso,X,Y,W,V,WinV,XWin)
        x[i] = gtotal
        open("ISC-SERIAL.TXT", "a") do f
            write(f, "ITERATION $i COMPLETED BY PROCESS 1 \n")
        end
    end # loop
    open("ISC-SERIAL.TXT", "a") do f
        write(f, "LOOP COMPLETE \n====================================================")
    end
    yr = real.(x)
    yi = imag.(x)
    p = plot(t, yr, linewidth=2, color="blue", title="Ξ” = $Ξ”", label="Real part", grid=false)
    p = plot!(t, yi, linewidth=2, color="red", label="Imaginary part", size=(1000, 800), xlims=(-30e-15, 30e-15))
    savefig(p, "gt.svg")
    begin # This part does FFTW and finds final rate constant

        fs = abs(t[1] - t[2])^-1
        signal = yr
        p = plan_fft(signal)
        F = fftshift(p * signal)
        freqs = fftshift(fftfreq(length(t), fs))  # X axis has frequencies in s^-1
        #Covert freqs to eV
        freqs = freqs * 4.13558e-15
        F = F * 6.57e15
        time_domain = plot(t, signal, title=L"G(t)", legend=false, color="orange", xlabel=L"\mathrm{Time}\ (s)", ylabel=L"\mathrm{A.u}")
        freq_domain = plot(freqs, abs.(F), title=L"\mathrm{Fourier\ transformed\ g(\bar{\nu}))}", legend=false, xlabel=L"\mathrm{Energy\ (eV)}", ylabel=L"\mathrm{k_{IC}\ (s^{-1})}", xlims=(0, 4))
        p = plot(time_domain, freq_domain, layout=(2, 1), size=(1000, 800), right_margin=10mm)
        savefig(p, "gt-fft.svg")
        ee = Ξ”# * 8065.73  #Converting eV to cm^-1
        index = findmin(abs.(freqs .- ee))[2]  # Find the index of freq closest to ee
        freqs[index]
        println("The rate constant is ", abs.(F)[index])
        rc = abs.(F)[index]
    end
#################################################This part writes the output to a file#######################################################
    begin 
        open("ISC-SERIAL.TXT", "a") do f
            write(f, "-------------------------\n")
            write(f, "NUMBER OF CORES = 1\n")
            write(f, "The value of Ξ” is $Ξ”\n")
            write(f, "Number of time points = $dd\n")
            write(f, "The rate constant calculated from Julia FFTW = $rc\n")
            write(f, "-------------------------\n")
            write(f, "The values of t,Re(g(t)),Im(g(t)) are \n\n")
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end

        open("GT-VALUES-SERIAL.txt", "w") do f
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end

    end

end
reset_timer!(to)


calc(t, b, Delta, sysize, omegaf, omegai, J, D, P)
display(to)

Here are the results

anik@anik-MSI:~/Desktop/Swapan Sir/generating-function-codes/julia-codes/isc/version august 2024$ julia 2serial.jl 
Number of processes = 1
The rate constant is 42004.491770445566
 ────────────────────────────────────────────────────────────────────────────────────────
                                                Time                    Allocations      
                                       ───────────────────────   ────────────────────────
           Tot / % measured:                67.1s /  89.2%           0.95GiB /   4.3%    

 Section                       ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────────
 GT calculation                     1    59.9s  100.0%   59.9s   41.4MiB  100.0%  41.4MiB
   GSV                             22    59.9s   99.9%   2.72s   18.5MiB   44.6%   859KiB
     Most Expensive Loop           22    59.9s   99.9%   2.72s   1.47KiB    0.0%    68.4B
       Final Multiplications     124k    59.7s   99.6%   482ΞΌs     0.00B    0.0%    0.00B
       X Y Matrices Writing      124k    111ms    0.2%   897ns     0.00B    0.0%    0.00B
     Pre Loop Allocations          22   21.9ms    0.0%   994ΞΌs   18.5MiB   44.6%   859KiB
   GDSO                            22   37.4ms    0.1%  1.70ms   22.9MiB   55.3%  1.04MiB
 ────────────────────────────────────────────────────────────────────────────────────────

It seems that without a doubt we can be sure that the final multiplications under GSV are where the code spends most of its time.

This tells you where your code is spending its time, but not what it is spending its time on. So the question of how much time is being spent sending data to each of the processes is hard to answer from that.

Would not the @profile macro from the Profile module provide the same?

Would it be possible for you to please provide a toy example of what code that uses MPI.jl to parallelize the same loop would look like?
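
For reference, a minimal toy sketch of that pattern with MPI.jl might look like the following (f(i) is a placeholder for the per-time-point GDSO/GSV work; run with something like mpiexec -n 4 julia script.jl):

    using MPI
    MPI.Init()

    comm  = MPI.COMM_WORLD
    rank  = MPI.Comm_rank(comm)
    nproc = MPI.Comm_size(comm)

    dd = 22
    f(i) = sum(sin, 1:1000i)                 # placeholder for the work at time point i

    myindices = (rank + 1):nproc:dd          # this rank's round-robin share of 1:dd
    local_results = [f(i) for i in myindices]

    gathered = MPI.gather(local_results, comm; root = 0)
    if rank == 0
        x = zeros(dd)
        for (r, res) in enumerate(gathered)  # rank r-1 handled indices r:nproc:dd
            x[r:nproc:dd] .= res
        end
        println("collected ", length(x), " results")
    end
    MPI.Finalize()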

With @profile, by default you are profiling the primary process, so you would need to trigger the profiler on all processes and then look at the results together.
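
A minimal sketch of doing that with Distributed.jl (my_workload and the output file names are placeholders; having each worker write its own report avoids sending profile data between processes):

    using Distributed

    @everywhere using Profile

    # placeholder for the real per-worker computation
    @everywhere my_workload() = sum(sin.(rand(10^7)))

    # trigger the profiler on every worker and let each one write its own report
    @everywhere workers() begin
        Profile.clear()
        @profile my_workload()
        open("profile_$(myid()).txt", "w") do io
            Profile.print(io)
        end
    end
    # the primary process itself can still be profiled in the usual way with @profile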

In the meantime I have gotten an implementation ready that uses MPI.jl exclusively, and it seems to produce the correct output.

Using MPI.jl



using MPI
MPI.Init()
using MKL
using DelimitedFiles
using LinearAlgebra
using LaTeXStrings
using TimerOutputs
using Plots
using Plots.PlotMeasures
using FFTW
using SharedArrays
using TensorOperations
using ChunkSplitters

BLAS.set_num_threads(1)
const to = TimerOutput()


@views function GDSO(
    t::Float64,
    b::Float64,
    Ξ”::Float64,
    sysize::Int64,
    Ξ©f::Diagonal{Float64},
    Ξ©i::Diagonal{Float64},
    J::Matrix{Float64},
    D::Vector{Float64},
    ρ::Diagonal{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64},
    JpBi::Matrix{ComplexF64},
    JpSi::Matrix{ComplexF64},
    BB::Matrix{ComplexF64},
    AA::Matrix{ComplexF64},
    W::Matrix{ComplexF64},
    Wapp::Matrix{ComplexF64},
    Wapp_i1::Matrix{ComplexF64},
    V::Vector{ComplexF64},
    V_int1::Vector{ComplexF64},
    V_int2::Matrix{ComplexF64},
    BimSi::Diagonal{ComplexF64}
    
)
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ξ©f[i, i] / sin(Ξ©f[i, i] * t)
        Si[i, i] = Ξ©i[i, i] / sin(Ξ©i[i, i] * tp)
        Bf[i, i] = Ξ©f[i, i] / tan(Ξ©f[i, i] * t)
        Bi[i, i] = Ξ©i[i, i] / tan(Ξ©i[i, i] * tp)
    end
    # BB = (Bf + J' * Bi * J)
    # AA = (Sf + J' * Si * J)
    mul!(JpBi,J',Bi)
    mul!(BB,JpBi,J)
    BB .+= Bf
    mul!(JpSi,J',Si)
    mul!(AA,JpSi,J)
    AA .+= Sf

    # W = [BB -AA
    #     -AA BB] 
    W[1:sysize,1:sysize] .= BB
    W[1:sysize,sysize+1:2*sysize] .= -AA
    W[sysize+1:2*sysize,1:sysize] .= -AA
    W[sysize+1:2*sysize,sysize+1:2*sysize] .= BB

    #We wont calculate the 2Nx2N determinant directly.
    # Wapp = BB * (BB - AA * (BB \ AA))
    BBinvAA = BB \ AA
    mul!(Wapp_i1,AA,BBinvAA,-1,0)
    Wapp_i1 .+= BB
    mul!(Wapp,BB,Wapp_i1)

    # V = [J' * (Bi-Si) * D
    #      J' * (Bi-Si) * D]
    BimSi .= Bi - Si
    mul!(V_int1,BimSi,D)
    mul!(view(V, 1:sysize), J', V_int1)
    V[sysize+1:end] .=  V[1:sysize]
    # mul!(V_int2,J',V_int1)
    # V[1:sysize,1] .= V_int2
    # V[sysize+1:2*sysize,1] .= V_int2

    F1 = sqrt(det(Si * Sf * (Wapp \ ρ^2)))
    F2 = (-(im / 2) * (transpose(V) * (W \ V))) + ((im) * (transpose(D) * (Bi-Si) * D))
    g = F1 * exp(F2)
    Ξ” = Ξ” * 0.0367493  # in atomic units
    g = g * exp(im * Ξ” * t)
    Ξ· = 2 * 4.55633e-6  # 10 cm^-1 to hartree energy
    g = g * exp(-Ξ· * abs(t))
    return g[1]
end

@views function GSV(
    t::Float64,
    b::Float64,
    sysize::Int64,
    Ξ©f::Diagonal{Float64},
    Ξ©i::Diagonal{Float64},
    J::Matrix{Float64},
    D::Vector{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64},
    gdsopart::ComplexF64,
    T::Matrix{Float64},       ## Product of NAC and SO 
    Hso::Float64,
    X::Matrix{ComplexF64},
    Y::Vector{ComplexF64},
    W::Matrix{ComplexF64},
    V::Vector{ComplexF64},
    WinV::Vector{ComplexF64},
    XWin::Matrix{ComplexF64}
)

    ################################## Preliminary Part, Common for both Franck-Condon and Herzberg-Teller ##############################
    t = t * 4.1341373335e16                     # converting time to atomic units
    # tp = -im * b - t
    # for i = 1:sysize
    #     Sf[i, i] = Ξ©f[i, i] / sin(Ξ©f[i, i] * t)
    #     Si[i, i] = Ξ©i[i, i] / sin(Ξ©i[i, i] * tp)
    #     Bf[i, i] = Ξ©f[i, i] / tan(Ξ©f[i, i] * t)
    #     Bi[i, i] = Ξ©i[i, i] / tan(Ξ©i[i, i] * tp)
    # end
    # W = [(Bf+J'*Bi*J) -(Sf + J' * Si * J)
    #     -(Sf + J' * Si * J) (Bf+J'*Bi*J)]
    # # U = Bi - Si
    # V = [J' * (Bi-Si) * D
    #     J' * (Bi-Si) * D]




    ################# Final Calculation and returning value ##################
    @timeit to "Pre Loop Allocations" begin
    s = zero(ComplexF64)
    MJSJ = J' * Si * J
    MJBJ = J' * Bi * J
    MJUD = J' * (Bi-Si) * D
    Win = inv(W)
    mul!(WinV,Win,V)
    end
    @timeit to "Most Expensive Loop" begin
    for m in 1:sysize
        for mp in 1:sysize
            ########## Defining the X and Y Matrices ##################################
            @timeit to "X Y Matrices Writing" begin
            X[m, 1:sysize] .= (MJSJ)[mp, :] .* (-Bf[m, m])
            X[m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (Bf[m, m])
            X[sysize+m, 1:sysize] .= (MJSJ)[mp, :] .* (Sf[m, m])
            X[sysize+m, sysize+1:2*sysize] .= (MJBJ)[mp, :] .* (-Sf[m, m])

            Y[m, 1] = Bf[m, m] * (MJUD)[mp, 1]
            Y[m+sysize, 1] = -Sf[m, m] * (MJUD)[mp, 1]
            end
            @timeit to "Final Multiplications" begin

            # g = im * (tr(mul!(XWin, X, Win))  + (transpose(WinV)*X*WinV)[1] - (transpose(Y)*WinV)[1])
            g = im * (tr(mul!(XWin, X, Win))  + dot(WinV, X, WinV) - dot(Y, WinV))
            


            s = s + g * gdsopart * T[m, mp]
            end

        end
        X[m, 1:sysize] .= 0
        X[m, sysize+1:2*sysize] .= 0
        X[sysize+m, 1:sysize] .= 0
        X[sysize+m, sysize+1:2*sysize] .= 0
        Y[m, 1] = 0
        Y[m+sysize, 1] = 0
    end
end
    s = s + abs(Hso)^2 * gdsopart   #### LATEST MODIFICATION
    return s
end

const hso_s1t3 = 1.49 * 0.0000046  ## SOC between S1 and T3 = 0.48 cm-1, converting to atomic units
const Delta = 0.11 # in eV
const Delta_s1t3 = 0.091 * 0.037 # Difference between S1 and T3 = 0.091 eV in atomic units
const dd = 22                            # Change this to modify number of points
const Temp = 300
kT_temp = Temp * 1.380649e-23  # Boltzmann constant times temperature in Joules
const kT = kT_temp * 2.29371227840e17         # in atomic units
const b = 1 / kT
const omegaf = Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))) * 4.55633e-6   # Final state  frequencies in atomic units
const omegai = Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))) * 4.55633e-6  # Initial state  frequencies in atomic units
const sysize = size(omegai)[1]
const P = @. 2 * sinh(b * omegai / 2)
T_temp = readdlm("nac.txt", '\t', Float64, '\n') # NAC matrix
const T = (T_temp * T_temp')* (hso_s1t3^2/Delta_s1t3^2)
const D = readdlm("d.txt", Float64)[:,1] # Displacement Matrix
const J = readdlm("j.txt", Float64) # Duschinsky Matrix
const t = collect(range(-5e-12, stop=5e-12, length=dd))
t[div(dd + 1, 2)] = 10e-25


function worker_function(xrange, t, b, Ξ”, sysize, Ξ©f, Ξ©i, J, D, P, T, Hso,mpi_rank,mpi_size)
    Sf = Diagonal(zeros(ComplexF64, sysize))
    Si = Diagonal(zeros(ComplexF64, sysize))
    Bf = Diagonal(zeros(ComplexF64, sysize))
    Bi = Diagonal(zeros(ComplexF64, sysize))
    X = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Y = zeros(ComplexF64, 2 * sysize)
    BB = zeros(ComplexF64, sysize, sysize)
    AA = zeros(ComplexF64, sysize, sysize)
    JpBi = zeros(ComplexF64, sysize, sysize)
    JpSi = zeros(ComplexF64, sysize, sysize)
    W = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Wapp = zeros(ComplexF64, sysize, sysize)
    Wapp_i1 = zeros(ComplexF64, sysize, sysize)
    V =  zeros(ComplexF64, 2 * sysize)
    V_int1 = zeros(ComplexF64, sysize)
    V_int2 = zeros(ComplexF64, sysize,1)
    BimSi = Diagonal(zeros(ComplexF64, sysize, sysize))
    WinV = zeros(ComplexF64, 2 * sysize)
    WinV_transpose = zeros(ComplexF64, 1, 2 * sysize)
    XWin = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Hso = 0.48 * 0.0000046  ## SOC between S1 and T2 = 0.48 cm-1, converting to atomic units
    x_worker = zeros(ComplexF64, length(xrange))
    for (local_xind,main_xind) in enumerate(xrange)
        gdsopart = GDSO(t[main_xind], b, Ξ”, sysize, Ξ©f, Ξ©i, J, D, P, Si, Sf, Bi, Bf, JpBi, JpSi, BB, AA, W, Wapp, Wapp_i1, V, V_int1, V_int2, BimSi)
        gtotal = GSV(t[main_xind], b, sysize, Ξ©f, Ξ©i, J, D, Si, Sf, Bi, Bf, gdsopart, T, Hso, X, Y, W, V, WinV, XWin)
        x_worker[local_xind] = gtotal
        open("ISC-MPI.TXT", "a") do f
            write(f, "ITERATION $main_xind COMPLETED BY PROCESS $(mpi_rank) \n")
        end
    end
    return x_worker
end




function calc(t, b, Ξ”, sysize, Ξ©f, Ξ©i, J, D, P)
    comm = MPI.COMM_WORLD
    mpi_rank = MPI.Comm_rank(comm)
    mpi_size = MPI.Comm_size(comm)
    open("ISC-MPI.TXT", "w") do f
        write(f, "STARTING LOOP WITH NUMBER OF CORES = $(mpi_size) \n====================================================\n")
    end

    Hso = 0.48 * 0.0000046  ## SOC between S1 and T2 = 0.48 cm-1, converting to atomic units
    xrange_chunknum = chunks(range(1, dd), mpi_size)
    @timeit to "GT Calculation" begin  
        # for ichunk in 1:nchunks
        #     xrange,chunknum = xrange_chunknum[ichunk]
        #     inter_result = worker_function(xrange, t, b, Ξ”, sysize, Ξ©f, Ξ©i, J, D, P, T, Hso,mpi_rank,mpi_size)
        # end
        xrange,chunknum = xrange_chunknum[mpi_rank+1]
        inter_result = worker_function(xrange, t, b, Ξ”, sysize, Ξ©f, Ξ©i, J, D, P, T, Hso,mpi_rank,mpi_size)
        MPI.Barrier(comm)
        x = MPI.gather(inter_result,comm,root=0)

    end


    if mpi_rank == 0
        x = vcat(x...)
        println("LOOP COMPLETE")
        open("ISC-MPI.TXT", "a") do f
            write(f, "LOOP COMPLETE \n====================================================")
        end
        yr = real.(x)
        yi = imag.(x)
    



    begin 
        open("ISC-MPI.TXT", "a") do f
            write(f, "-------------------------\n")
            write(f, "NUMBER OF CORES = $(1)\n")
            write(f, "The value of Ξ” is $Ξ”\n")
            write(f, "Number of time points = $dd\n")
            write(f, "Central Value = $(yr[div(dd + 1, 2)])\n")
            write(f, "-------------------------\n")
            write(f, "The values of t,Re(g(t)),Im(g(t)) are \n\n")
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end

        open("GT-MPI.TXT", "w") do f
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end

    end
    display(to)
    end
end


calc(t, b, Delta, sysize, omegaf, omegai, J, D, P)
MPI.Finalize()


However, the initial benchmarks indicate that it is quite a bit slower than the version using Distributed.jl. The first output below is for the MPI.jl implementation and the second is for Distributed.jl.

anik@anik-MSI:~/Desktop/Swapan Sir/generating-function-codes/julia-codes/isc/version august 2024$ mpiexec --use-hwthread-cpus -n 11 julia mpi_parallel_timeloop.jl 
LOOP COMPLETE
 ────────────────────────────────────────────────────────────────────────────────
                                        Time                    Allocations      
                               ───────────────────────   ────────────────────────
       Tot / % measured:             173s /  90.1%            903MiB /   4.6%    

 Section               ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────
 GT Calculation             1     156s  100.0%    156s   41.1MiB  100.0%  41.1MiB
   Most Expensive Loop     19     114s   73.1%   6.01s   1.47KiB    0.0%    79.2B
     Final Multipli...   107k     114s   72.8%  1.06ms     0.00B    0.0%    0.00B
     X Y Matrices W...   107k    336ms    0.2%  3.14ΞΌs     0.00B    0.0%    0.00B
   Pre Loop Allocat...     19   43.1ms    0.0%  2.27ms   15.9MiB   38.8%   859KiB
 ────────────────────────────────────────────────────────────────────────────────
anik@anik-MSI:~/Desktop/Swapan Sir/generating-function-codes/julia-codes/isc/version august 2024$ julia -p 11 2distribute.jl 
Number of processes = 11
The rate constant is 41993.588419884196
 ───────────────────────────────────────────────────────────────────────────
                                   Time                    Allocations      
                          ───────────────────────   ────────────────────────
     Tot / % measured:          149s /  95.2%            656MiB /   7.4%    

 Section          ncalls     time    %tot     avg     alloc    %tot      avg
 ───────────────────────────────────────────────────────────────────────────
 GT Calculation        1     142s  100.0%    142s   48.3MiB  100.0%  48.3MiB
 ───────────────────────────────────────────────────────────────────────────

I will try in the meantime to get a full dd=20001 benchmark running on my cluster to compare the results, as I have seen that the order of speeds can vary severely between my personal computer and the cluster.

As for this:

How exactly can I trigger the profiler on all processes?

EDIT 1 - There seems to be a problem with getting MPI.jl to play nicely with my cluster; I have opened another thread about the error I am facing.

Not sure. Possibly this post on Slack.