Julia code becomes slower when running on supercomputers and does not scale well when parallelized with Base.Threads

Could this be a related issue: [OpenBLAS_jll] Use new build compiled with GCC 11 by giordano · Pull Request #52928 · JuliaLang/julia · GitHub ?


Thank you for your reply. I am sorry, but I am extremely new to such developments. Could you please explain what you are referring to?

New in the master branch of Julia is support for the Sapphire Rapids CPU family from Intel, which covers the latest Xeon CPUs released in 2023. You still have not reported which exact CPU your supercomputer is using, but if it is one of the latest Xeon CPUs, Julia from master might be worth a try…

I’m sorry for that, I did not realize the importance of quoting the exact CPU. As per the website here, we are using the Intel Xeon SKL G-6148.

I’m unsure if this is related, but there are known issues with garbage collection when running multithreaded code:

The massive slowdown when using the Xeon CPUs occurs regardless of whether I am running single- or multi-threaded.

Isn’t it just that the Fortran code is faster?

Personally, I would focus on improving the performance of the Julia code. Have you profiled it to see where it is spending most of its time? If it is in the linear-algebra calls, then one can focus on improving that.

You have some slices and non-constant global variables there, and they might be a problem. But, without a profile, these are just guesses.
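
If you have not profiled it yet, a minimal starting point with the Profile standard library could look like this (the workload below is just a placeholder standing in for your calc call; @profview in VS Code gives a graphical view of the same data):

using Profile
using LinearAlgebra

# Placeholder workload standing in for the real calculation.
work(n) = sum(tr(inv(rand(n, n))) for _ in 1:20)

work(100)                                      # warm up so compilation does not dominate the profile
Profile.clear()
@profile work(500)                             # collect samples while the workload runs
Profile.print(format=:flat, sortedby=:count)   # hottest lines first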

This is not the topic of this thread. The topic of this thread is why the same code is faster than Fortran on a laptop CPU and much slower than Fortran on a supercomputer CPU.

But that can only be about the Fortran code and its compilers. There is no clear evidence that it has anything to do with Julia.

Also, the thread is about why the Julia code does not scale well, and that is something to investigate by optimizing the Julia code.


@lmiq As you rightly pointed out in your first reply, the primary focus should be to optimize the Julia code as much as possible. We have tried to do that, as described in this thread here. We have significantly improved the performance of the serial Julia code. On observing that the Julia code runs faster than the serial Fortran code on the i7 x86 architecture, we concluded that we had reached the end of the “general programming optimization” we could do without rewriting the matrix formalism that we used.

Would it be possible to provide this evidence?

The CPUs you are using are Skylake:
Intel Xeon Gold 6148 Processor 27.5M Cache 2.40 GHz Product Specifications

You still refer to massive slowdowns. Have you tried running htop during a job run, as I proposed?
Honestly, some simple diagnostics may show things you don’t expect.

Also you refer to MPI and threaded code. Are you trying to compare these?

My goal is to compare the parallel performance of the Fortran code with the Julia one. For Julia I am using Base.Threads to get thread-level parallelization, as it is the approach with the least overhead (as stated in the Julia manual).

For Fortran we are using MPI to parallelize the code since, as far as I know, it does not have native parallelization packages the way Julia does. The comparison might not be “fair” in the truest sense, but we are more concerned with getting the Julia code to run faster than the Fortran code in both serial and parallel (which we achieved successfully on the i7 x86 but which is failing on the Xeon chip).

Also, your cluster has two sockets of 20 physical cores. The documentation says there are 80 cores per server, so hyperthreading is enabled.
I would advise the following:
You will not be able to have hyperthreading switched off in the BIOS (your sysadmins will probably refuse to do this).
You could disable all the even-numbered cores at run time - again, the sysadmins will probably refuse this.

Could you try running on only the odd-numbered cores? I think hyperthreading may be having an influence here:
GitHub - carstenbauer/ThreadPinning.jl: Readily pin Julia threads to CPU processors
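
A minimal sketch of how that could look with ThreadPinning.jl (pinthreads and threadinfo are from that package; the explicit CPU-ID range below is only an illustration for a hypothetical 2 × 20-core node and would need to match your machine’s numbering):

using ThreadPinning

threadinfo()          # show the current thread-to-CPU mapping

# Pin one Julia thread per physical core, avoiding hyperthread siblings.
pinthreads(:cores)

# Alternatively, pin to an explicit list of CPU IDs, e.g. only odd-numbered
# cores (hypothetical numbering, adjust for your node):
# pinthreads(1:2:39)

threadinfo()          # verify the new pinning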

I shall try my best to do this and provide the result (as stated above, I do not have direct access to the cluster in question). However, the slowdown occurs even when I am comparing the serial codes. Besides, when running for 20000 points, the Julia code comes out about 5 minutes slower than the Fortran code (the Fortran code isn’t scaling perfectly either), so I suspect that this is the result of the accumulated serial slowdown over many iterations.

Please look at OpenMP
Using OpenMP with Fortran — Research Computing University of Colorado Boulder documentation (curc.readthedocs.io)

OpenMP | LLNL HPC Tutorials

You are comparing threaded code with MPI code.
Fortran can use MPI, or OpenMP for threaded code.
Julia can use MPI or threads.
I would say you should compare MPI to MPI or threaded to threaded.
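
For an MPI-to-MPI comparison, Julia has MPI.jl. A minimal sketch of a rank-based split of the work (just the skeleton, not your actual calculation) could be:

using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

ntotal    = 20_000                    # e.g. number of time points
myindices = (rank + 1):nprocs:ntotal  # strided slice owned by this rank

println("Rank $rank of $nprocs computes $(length(myindices)) points")
# ... compute the local contributions here, then combine them across ranks,
# e.g. with MPI.Reduce or MPI.Gather.
MPI.Finalize()

This would be launched with something like mpiexec -n 40 julia --project script.jl (or via the mpiexecjl wrapper that MPI.jl provides).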

I would really advise you to look at thread pinning here
ThreadPinning · ThreadPinning.jl (carstenbauer.github.io)

Autochecking BLAS Thread Settings · ThreadPinning.jl (carstenbauer.github.io)
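
Independently of that, you can query and set the BLAS thread count directly from LinearAlgebra; nested BLAS threads inside Julia threads (oversubscription) is a common cause of poor scaling:

using LinearAlgebra

@show BLAS.get_num_threads()   # threads BLAS will use inside each call
BLAS.set_num_threads(1)        # avoid oversubscription when the Julia threads
                               # already occupy all cores
@show BLAS.get_num_threads()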

Also you say you do not have direct access to the cluster. Try the following:

ssh to a login node
Submit your job
Use qstat to see which compute node is in use
ssh to that compute node
Often HPC clusters stop you from logging in to compute nodes - but you CAN do this while you have a job running on that node

Also look at running an interactive job - which batch queue system are you using?

After profiling the code (with @profview in VSCode), I can see that vcat operations were taking a lot of time, in these lines:

#            X = [X11 X12
#                X21 X22]
and
#            Y = [Y1
#                Y2]

By allocating these intermediate X and Y outside the loop and updating their values in place instead of recreating the arrays, the allocated memory is cut in half:
Before:

leandro@m3g:~/Downloads/uranium% julia -t 1 --project pal-rv6.jl 
Number of threads = 1
 31.559063 seconds (9.76 M allocations: 42.744 GiB, 1.54% gc time, 14.28% compilation time)
GKS: Possible loss of precision in routine SET_WINDOW
The rate constant is 2.1751344328561788e9

After:

leandro@m3g:~/Downloads/uranium% julia -t 1 --project pal-rv6_2.jl 
Number of threads = 1
 29.952532 seconds (9.86 M allocations: 21.572 GiB, 1.35% gc time, 15.68% compilation time)
GKS: Possible loss of precision in routine SET_WINDOW
The rate constant is 2.1751344328561788e9
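
For reference, the change boils down to this pattern (a self-contained sketch with a small dummy size, not the actual data):

sysize = 4   # small dummy size for illustration
X11 = rand(ComplexF64, sysize, sysize); X12 = rand(ComplexF64, sysize, sysize)
X21 = rand(ComplexF64, sysize, sysize); X22 = rand(ComplexF64, sysize, sysize)

# Old pattern: a fresh 2N x 2N matrix is allocated on every loop iteration
X_old = [X11 X12
         X21 X22]

# New pattern: allocate X once outside the loop ...
X = zeros(ComplexF64, 2 * sysize, 2 * sysize)

# ... and overwrite its blocks in place inside the loop
X[1:sysize, 1:sysize]                   .= X11
X[1:sysize, sysize+1:2*sysize]          .= X12
X[sysize+1:2*sysize, 1:sysize]          .= X21
X[sysize+1:2*sysize, sysize+1:2*sysize] .= X22

@assert X == X_old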

Next I note that a line inside GHT which performs matrix multiplications is taking most of the time, and it calls similar internally. So I preallocate the result of the matrix multiplication and use mul!(WinX, Win, X) instead, and get:

leandro@m3g:~/Downloads/uranium% julia -t 1 --project pal-rv6_2.jl 
Number of threads = 1
 30.032092 seconds (9.74 M allocations: 847.634 MiB, 0.86% gc time, 16.09% compilation time)
GKS: Possible loss of precision in routine SET_WINDOW
The rate constant is 2.1751344328561788e9

Note that allocations are now down to about 847 MiB, and mostly due to compilation. In a second run I get:

julia> @time calc(t, b, Delta, sysize, omegaf, omegai, J, D, P; nchunks=1)
 25.172865 seconds (311.39 k allocations: 228.236 MiB, 0.05% gc time)
The rate constant is 2.1751344328561788e9

and with 8 threads:

@time calc(t, b, Delta, sysize, omegaf, omegai, J, D, P; nchunks=8)
  4.791953 seconds (311.52 k allocations: 230.700 MiB)
The rate constant is 2.1751344328561788e9

I think this is a better starting point to check the scalability in the cluster. Excessive allocations can really be a problem for that.
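
For reference, the mul! change described above is essentially this pattern (again a self-contained sketch with a dummy size):

using LinearAlgebra

n   = 4                      # dummy size for illustration
Win = rand(ComplexF64, n, n)
X   = rand(ComplexF64, n, n)

# Old pattern: Win * X allocates a new result matrix on every call
g_old = im * tr(Win * X)

# New pattern: preallocate the result once and reuse it
WinX  = similar(Win)
g_new = im * tr(mul!(WinX, Win, X))   # mul! writes Win*X into WinX in place

@assert g_old ≈ g_new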

Here is the modified code I’m running. It has some other small differences which didn’t make a difference.

code
using MKL
using DelimitedFiles
using LinearAlgebra
using LaTeXStrings
using TimerOutputs
using Plots
using Plots.PlotMeasures
using FFTW
using Base.Threads
using SharedArrays
using ChunkSplitters
BLAS.set_num_threads(1)

println("Number of threads = $(Threads.nthreads())")
open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
    write(f, "\n\n\n====================================================\n\n\n")
end

function GFC(
    t::Float64,
    b::Float64,
    Δ::Float64,
    sysize::Int64,
    Ωf::Diagonal{Float64},
    Ωi::Diagonal{Float64},
    J::Matrix{Float64},
    D::Matrix{Float64},
    ρ::Diagonal{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64}
)
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
        Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
        Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
        Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
    end
    BB = (Bf + J' * Bi * J)
    AA = (Sf + J' * Si * J)

    W = [BB -AA
        -AA BB] #We wont calculate the 2Nx2N determinant directly.

    # Wapp = BB*(BB-  AA*(BB^-1)*AA  )
    Wapp = BB * (BB - AA * (BB \ AA))
    U = Bi - Si
    V = [J' * U * D
        J' * U * D]
    # F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))
    # F1 = sqrt(det(Si*Sf*ρ^2)/det(Wapp))
    F1 = sqrt(det(Si * Sf * (Wapp \ ρ^2)))
    F2 = (-(im / 2) * (transpose(V) * (W \ V))) + ((im) * (transpose(D) * U * D))
    g = F1 * exp(F2)
    Δ = Δ * 0.0367493  # in atomic units
    g = g * exp(im * Δ * t)
    η = 2 * 4.55633e-6  # 10 cm^-1 to hatree energy
    #g = g * exp(-η * abs(t))     # Adding the damping factor Lorrentzian
    g = g * exp(-η * t^2)     # Adding the damping factor Gaussian
    # println("t = $t was evaluated by process $(myid())")
    # println("GFC at t = $t   is     $g")
    # println("F1 at t = $t         = $F1")
    return g[1]
end

@views function GHT(
    t::Float64,
    b::Float64,
    sysize::Int64,
    Ωf::Diagonal{Float64},
    Ωi::Diagonal{Float64},
    J::Matrix{Float64},
    D::Matrix{Float64},
    Si::Diagonal{ComplexF64},
    Sf::Diagonal{ComplexF64},
    Bi::Diagonal{ComplexF64},
    Bf::Diagonal{ComplexF64},
    X11::Matrix{ComplexF64},
    X12::Matrix{ComplexF64},
    X21::Matrix{ComplexF64},
    X22::Matrix{ComplexF64},
    Y1::Matrix{ComplexF64},
    Y2::Matrix{ComplexF64},
    gfcpart::ComplexF64,
    T::Matrix{Float64}
)

    ################################## Preliminary part, common for both Franck-Condon and Herzberg-Teller ##############################
    t = t * 4.1341373335e16                     # converting time to atomic units
    tp = -im * b - t
    for i = 1:sysize
        Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
        Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
        Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
        Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
    end
    W = [(Bf+J'*Bi*J) -(Sf + J' * Si * J)
        -(Sf + J' * Si * J) (Bf+J'*Bi*J)]
    U = Bi - Si
    V = [J' * U * D
        J' * U * D]

    ########## Defining the X and Y Matrices ##################################


    ################# Final Calculation and returning value ##################
    s = zero(ComplexF64)
    MJSJ = J' * Si * J
    MJBJ = J' * Bi * J
    MJUD = J' * U * D
    Win = inv(W)
    WinV = Win * V
    X = zeros(ComplexF64, 2 * sysize, 2 * sysize)
    Y = zeros(ComplexF64, 2 * sysize, 1)
    WinX = similar(Win)
    for m in 1:sysize
        for mp in 1:sysize
            X11[m, :] .= (MJSJ)[mp, :] .* (-Bf[m, m])
            X12[m, :] .= (MJBJ)[mp, :] .* (Bf[m, m])
            X21[m, :] .= (MJSJ)[mp, :] .* (Sf[m, m])
            X22[m, :] .= (MJBJ)[mp, :] .* (-Sf[m, m])
            #            X = [X11 X12
            #                X21 X22]
            X[1:sysize, 1:sysize] .= X11
            X[1:sysize, sysize+1:2*sysize] .= X12
            X[sysize+1:2*sysize, 1:sysize] .= X21
            X[sysize+1:2*sysize, sysize+1:2*sysize] .= X22

            Y1[m, :] .= Bf[m, m] * (MJUD)[mp, :]
            Y2[m, :] .= -Sf[m, m] * (MJUD)[mp, :]
            Y[1:sysize, 1] .= Y1
            Y[sysize+1:2*sysize, 1] .= Y2
            #            Y = [Y1
            #                Y2]
            # g = im*tr(W\X)+    (transpose(W\V)*X*(W\V))[1]      -    (transpose(Y)*(W\V))[1]
            # g = im*tr(Win*X)+    (transpose(Win*V)*X*(Win*V))[1]      -    (transpose(Y)*(Win*V))[1]
            g = im * tr(mul!(WinX, Win, X)) + (transpose(WinV)*X*(WinV))[1] - (transpose(Y)*(WinV))[1]
            s = s + g * gfcpart * T[m, mp]
        end
        X11[m, 1:sysize] .= 0
        X12[m, 1:sysize] .= 0
        X21[m, 1:sysize] .= 0
        X22[m, 1:sysize] .= 0
        Y1[m, 1] = 0
        Y2[m, 1] = 0
    end
    return s
end

const Delta = 2.02 # in eV
const dd = 11                              # Change this to modify number of points
const Temp = 300
kT_temp = Temp * 1.380649e-23  # Boltzmann constant times temperature in Joules
const kT = kT_temp * 2.29371227840e17         # in atomic units
const b = 1 / kT
const omegaf = Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))) * 4.55633e-6   # Final state  frequencies in atomic units
const omegai = Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))) * 4.55633e-6  # Initial state  frequencies in atomic units
const sysize = size(omegai)[1]
const P = @. 2 * sinh(b * omegai / 2)
T_temp = readdlm("nac.txt", '\t', Float64, '\n') # NAC matrix
const T = T_temp * T_temp'
const D = readdlm("d.txt", Float64) # Displacement Matrix
const J = readdlm("j.txt", Float64) # Duchinsky Matrix
const t = collect(range(-5e-12, stop=5e-12, length=dd))
t[div(dd + 1, 2)] = 10e-25

function calc(t, b, Δ, sysize, Ωf, Ωi, J, D, P; nchunks=Threads.nthreads())
    # @sync tells Julia to wait for any tasks started inside it
    open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
        write(f, "STARTING LOOP WITH NUMBER OF THREADS = $(Threads.nthreads()) \n====================================================\n")
    end
    x = zeros(ComplexF64, dd)
    @time @sync for (xrange, ichunk) in chunks(range(1, dd), nchunks)
        # xrange are the indices this chunk contains, ichunk is just the index of the chunk, which we don't need in this scenario
        Threads.@spawn begin # this line tells Julia to start a new thread with the contents of the begin block
            Sf = Diagonal(zeros(ComplexF64, sysize))
            Si = Diagonal(zeros(ComplexF64, sysize))
            Bf = Diagonal(zeros(ComplexF64, sysize))
            Bi = Diagonal(zeros(ComplexF64, sysize))
            X11 = zeros(ComplexF64, sysize, sysize)
            X12 = zeros(ComplexF64, sysize, sysize)
            X21 = zeros(ComplexF64, sysize, sysize)
            X22 = zeros(ComplexF64, sysize, sysize)
            Y1 = zeros(ComplexF64, sysize, 1)
            Y2 = zeros(ComplexF64, sysize, 1)
            for i in xrange
                gfcpart = GFC(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)
                gtotal = GHT(t[i], b, sysize, Ωf, Ωi, J, D, Si, Sf, Bi, Bf, X11, X12, X21, X22, Y1, Y2, gfcpart, T)
                x[i] = gtotal
                open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
                    write(f, "ITERATION $i COMPLETED BY THREAD $(Threads.threadid()) \n")
                end
            end # loop
        end # thread work load
    end # loop over chunks, Julia will wait until all task are completed
    open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
        write(f, "LOOP COMPLETE \n====================================================")
    end
    yr = real.(x)
    yi = imag.(x)
    p = plot(t, yr, linewidth=2, color="blue", title="Δ = $Δ", label="Real part", grid=false)
    p = plot!(t, yi, linewidth=2, color="red", label="Imaginary part", size=(1000, 800), xlims=(-30e-15, 30e-15))
    savefig(p, "gt.svg")
    begin # This part does FFTW and finds final rate constant

        fs = abs(t[1] - t[2])^-1
        signal = yr
        p = plan_fft(signal)
        F = fftshift(p * signal)
        freqs = fftshift(fftfreq(length(t), fs))  # X axis has frequencies in s^-1
        #Covert freqs to eV
        freqs = freqs * 4.13558e-15
        F = F * 6.57e15
        time_domain = plot(t, signal, title=L"G(t)", legend=false, color="orange", xlabel=L"\mathrm{Time}\ (s)", ylabel=L"\mathrm{A.u}")
        freq_domain = plot(freqs, abs.(F), title=L"\mathrm{Fourier\ transformed\ g(\bar{\nu}))}", legend=false, xlabel=L"\mathrm{Energy\ (eV)}", ylabel=L"\mathrm{k_{IC}\ (s^{-1})}", xlims=(0, 4))
        p = plot(time_domain, freq_domain, layout=(2, 1), size=(1000, 800), right_margin=10mm)
        savefig(p, "gt-fft.svg")
        ee = Δ# * 8065.73  #Converting eV to cm^-1
        index = findmin(abs.(freqs .- ee))[2]  # Find the index of freq closest to ee
        freqs[index]
        println("The rate constant is ", abs.(F)[index])
        rc = abs.(F)[index]
    end
    begin # This part writes the output to a file
        open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
            write(f, "-------------------------\n")
            write(f, "NUMBER OF CORES = $(Threads.nthreads())\n")
            write(f, "The value of Δ is $Δ\n")
            write(f, "Number of time points = $dd\n")
            write(f, "The rate constant is $rc\n")
            write(f, "-------------------------\n")
            write(f, "The values of t,Re(g(t)),Im(g(t)) are \n\n")
            for i in 1:dd
                write(f, "$(t[i])  $(yr[i])  $(yi[i])\n")
            end
        end
    end

end


#timeneed = @elapsed calc(t, b, Delta, sysize, omegaf, omegai, J, D, P; nchunks=Threads.nthreads())
#open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
#    write(f, "\n\n\n===============================================================\nThe time elapsed for main loop [INCLUDING GC + C + RC]=  $timeneed seconds \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n")
#end

The performance of the final code improved a little, but the matrix multiplications are still the bottleneck, and I don’t think you can improve much further there in the serial version. But these better allocation profiles can effectively improve the parallel version.

A different question is why the Fortran code on your machine is so slow. I don’t have an answer to that, but from your original post it seemed that the Fortran code compiled on the cluster had roughly similar performance to the Julia one (serial version), which is consistent with the matrix multiplication being the bottleneck.

edit: If you go to larger and larger problems, of course those repetitive allocations can become a problem, and they may explain why your Julia code gets slower as you increase the problem size.
