Julia code becomes slower on running on supercomputers and does not scale well when parallelizing with Base.Threads

This question is a continuation / generalization of my previous question. My end goal is to write a parallelized code where the main time taking part is Matrix multiplication and inversion. As per the advice of @ufechner7 @DNF @nilshg @abraemer and many other helpful community members, I have been able to optimize my code to the extent that when it is being run on a single thread, it outperforms the Fortran equivalent of the code by quite a significant margin ! [The Fortran code took about 149 seconds whereas my code took 47 seconds on using dd = 11, which is essentially the number of points on the top-level loop]. The latest version of my code is given below :

println("Number of threads = $(Threads.nthreads())")
    write(f, "\n\n\n====================================================\n\n\n")
    using MKL
    using DelimitedFiles
    using LinearAlgebra
    using LaTeXStrings
    using TimerOutputs
    using Plots
    using Plots.PlotMeasures
    using FFTW
    using Base.Threads
    using SharedArrays
    using ChunkSplitters
function GFC(t::Float64,b::Float64,Δ::Float64,sysize::Int64,Ωf::Diagonal{Float64},Ωi::Diagonal{Float64},J::Matrix{Float64},D::Matrix{Float64},ρ::Diagonal{Float64},Si::Diagonal{ComplexF64},Sf::Diagonal{ComplexF64},Bi::Diagonal{ComplexF64},Bf::Diagonal{ComplexF64})
    t = t* 4.1341373335*10^16                     # converting time to atomic units
    tp = - im*b - t
    for i = 1:sysize
    BB = (Bf+J'*Bi*J)
    AA = (Sf+J'*Si*J)

    W = [BB  -AA
        -AA  BB] #We wont calculate the 2Nx2N determinant directly.
    # Wapp = BB*(BB-  AA*(BB^-1)*AA  )
    Wapp = BB*(BB-  AA*(BB\AA)  )
    U = Bi-Si
    V = [J'*U*D
    # F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))
    # F1 = sqrt(det(Si*Sf*ρ^2)/det(Wapp))
    F1 = sqrt(det(Si*Sf*  (Wapp\ρ^2)))
    F2 = (-(im/2)*(transpose(V)* (W\V)))    +  ((im)*(transpose(D)*U*D))
    g = F1*exp(F2)
    Δ = Δ * 0.0367493  # in atomic units
    g = g * exp(im * Δ * t)
    η = 2 * 4.55633 * 10^-6  # 10 cm^-1 to hatree energy
    #g = g * exp(-η * abs(t))     # Adding the damping factor Lorrentzian
    g = g * exp(-η * t^2)     # Adding the damping factor Gaussian
    # println("t = $t was evaluated by process $(myid())")
    # println("GFC at t = $t   is     $g")
    # println("F1 at t = $t         = $F1")
    return g[1]
function GHT(t::Float64,b::Float64,sysize::Int64,Ωf::Diagonal{Float64},Ωi::Diagonal{Float64},J::Matrix{Float64},D::Matrix{Float64},Si::Diagonal{ComplexF64},Sf::Diagonal{ComplexF64},Bi::Diagonal{ComplexF64},Bf::Diagonal{ComplexF64},X11::Matrix{ComplexF64},X12::Matrix{ComplexF64},X21::Matrix{ComplexF64},X22::Matrix{ComplexF64},Y1::Matrix{ComplexF64},Y2::Matrix{ComplexF64},gfcpart::ComplexF64,T::Matrix{Float64})

    ################################## Preliminary Part , Common for both Frank Condon and Herzberg Teller ##############################
    t = t* 4.1341373335*10^16                     # converting time to atomic units
    tp = - im*b - t
    for i = 1:sysize
    W = [(Bf+J'*Bi*J)  -(Sf+J'*Si*J)
        -(Sf+J'*Si*J)  (Bf+J'*Bi*J)]
    U = Bi-Si
    V = [J'*U*D

    ########## Defining the X and Y Matrices ##################################

    ################# Final Calculation and returning value ##################
    MJSJ = J'*Si*J
    MJBJ = J'*Bi*J
    MJUD = J'*U*D
    Win = inv(W)
    WinV = Win*V
    for m in range(1,sysize)
        for mp in range(1,sysize)
            X11[m,:] .= (MJSJ)[mp,:] .* (-Bf[m,m])
            X12[m,:] .= (MJBJ)[mp,:] .* (Bf[m,m])
            X21[m,:] .= (MJSJ)[mp,:] .* (Sf[m,m])
            X22[m,:] .= (MJBJ)[mp,:] .* (-Sf[m,m])
            X = [X11 X12
                X21 X22]
            Y1[m,:] = Bf[m,m]*(MJUD)[mp,:] 
            Y2[m,:] = -Sf[m,m]*(MJUD)[mp,:]
            Y = [Y1
            # g = im*tr(W\X)+    (transpose(W\V)*X*(W\V))[1]      -    (transpose(Y)*(W\V))[1]
            # g = im*tr(Win*X)+    (transpose(Win*V)*X*(Win*V))[1]      -    (transpose(Y)*(Win*V))[1]
            g = im*tr(Win*X)+    (transpose(WinV)*X*(WinV))[1]      -    (transpose(Y)*(WinV))[1]
            s = s+g[1]*gfcpart*T[m,mp]
        for i in 1:sysize
    return s[1]

begin ## Define all constant variables
    const Delta::Float64 = 2.02 # in eV
    const dd ::Int64 = 11                              # Change this to modify number of points
    const Temp::Int64=300
    kT = Temp * 1.380649 * 10^-23  # Boltzmann constant times temperature in Joules
    kT= kT * 2.29371227840*10^17         # in atomic units
    const b ::Float64= 1/kT
    const omegaf ::Diagonal{Float64}= Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n')))*4.55633*10^-6   # Final state  frequencies in atomic units
    const omegai ::Diagonal{Float64}= Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n')))*4.55633*10^-6  # Initial state  frequencies in atomic units
    const sysize ::Int64 = size(omegai)[1]
    const P = @. 2*sinh(b*omegai/2)
    T ::Matrix{Float64}= readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
    T = T*T';
    const D ::Matrix{Float64}= readdlm("d.txt", Float64); # Displacement Matrix
    const J ::Matrix{Float64}= readdlm("j.txt", Float64); # Duchinsky Matrix
    t ::Vector{Float64}=  collect(range(-5*10^-12,stop=5*10^-12,length=dd))
    t[div(dd+1,2)] = 10e-25

function calc(t,b,Δ,sysize,Ωf,Ωi,J,D,P)
    nchunks = Threads.nthreads() # define number of chunks e.g. equal to number of threads
    # @sync tells Julia to wait for any tasks started inside it
    open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
        write(f, "STARTING LOOP WITH NUMBER OF THREADS = $(Threads.nthreads()) \n====================================================\n")
    x = zeros(ComplexF64,dd)
    @time @sync for (xrange, ichunk) in chunks(range(1,dd), nchunks)
        # xrange are the indices this chunk contains, ichunk is just the index of the chunk, which we don't need in this scenario
        Threads.@spawn begin # this line tells Julia to start a new thread with the contents of the begin block
            Sf ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            Si ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            Bf ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            Bi ::Diagonal{ComplexF64}= Diagonal(zeros(ComplexF64,sysize))
            X11 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            X12 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            X21 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            X22 :: Matrix{ComplexF64}  = zeros(ComplexF64,sysize,sysize)
            Y1  :: Matrix{ComplexF64} =  zeros(ComplexF64,sysize,1)
            Y2  :: Matrix{ComplexF64} =  zeros(ComplexF64,sysize,1)
            for i in xrange
                gfcpart = GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)
                gtotal = GHT(t[i],b,sysize,Ωf,Ωi,J,D,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2,gfcpart,T)
                x[i] = gtotal
                open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
                    write(f, "ITERATION $i COMPLETED BY THREAD $(Threads.threadid()) \n")
            end # loop
        end # thread work load
    end # loop over chunks, Julia will wait until all task are completed
    open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
        write(f, "LOOP COMPLETE \n====================================================")
    yr = real.(x)
    yi = imag.(x)
    p = plot(t,yr,linewidth = 2,color = "blue",title = "Δ = $Δ",label = "Real part",grid=false)
    p =plot!(t,yi,linewidth = 2,color = "red",label = "Imaginary part",size=(1000,800),xlims=(-30*10^-15,30*10^-15))
    begin # This part does FFTW and finds final rate constant

        fs = abs(t[1]-t[2])^-1
        signal = yr
        p = plan_fft(signal)
        F = fftshift(p*signal)
        freqs = fftshift(fftfreq(length(t), fs))  # X axis has frequencies in s^-1
        #Covert freqs to eV
        freqs = freqs * 4.13558 * 10^-15
        F = F * 6.57 * 10^15
        time_domain = plot(t, signal, title = L"G(t)",legend=false,color="orange",xlabel=L"\mathrm{Time}\ (s)",ylabel=L"\mathrm{A.u}")
        freq_domain = plot(freqs, abs.(F), title = L"\mathrm{Fourier\ transformed\ g(\bar{\nu}))}",legend=false,xlabel=L"\mathrm{Energy\ (eV)}",ylabel=L"\mathrm{k_{IC}\ (s^{-1})}",xlims=(0,4))
        p = plot(time_domain, freq_domain, layout = (2,1), size = (1000, 800),right_margin = 10mm)
        ee = Δ# * 8065.73  #Converting eV to cm^-1
        index = findmin(abs.(freqs .- ee))[2]  # Find the index of freq closest to ee
        println("The rate constant is ",abs.(F)[index])
        rc = abs.(F)[index]
    begin # This part writes the output to a file
        open("THREAD-PARALLEL-OUTPUT-RV6.TXT", "a") do f
            write(f, "-------------------------\n")
            write(f,"NUMBER OF CORES = $(Threads.nthreads())\n")
            write(f,"The value of Δ is $Δ\n")
            write(f,"Number of time points = $dd\n")
            write(f, "The rate constant is $rc\n")
            write(f,"The values of t,Re(g(t)),Im(g(t)) are \n\n")
            for i in range(1,dd)
                write(f,"$(t[i])  $(yr[i])  $(yi[i])\n")


timeneed = @elapsed calc(t,b,Delta,sysize,omegaf,omegai,J,D,P)
    write(f, "\n\n\n===============================================================\nThe time elapsed for main loop [INCLUDING GC + C + RC]=  $timeneed seconds \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n")

In case anybody wants to run the code on their own computers, the data files are given here.

Current Problems

  • As stated above, my code runs faster than the fortran code on my computer ( Intel i5 , 8th Generation Processor). But when I submit the same code for running on our supercomputer (Intel Xeon) serially, the Fortran code becomes faster, with it’s time taken becoming 49 seconds from 149 seconds. Whereas my code becomes slower ! taking 52 seconds instead of 47 seconds. As my end goal is to run the code on such supercomputers in a parallelized fashion, I think it is of paramount importance that we sort out this discrepency in the serial code performance.

  • With increasing the number of processors, my code does not speed up as much as the Fortran code. The Fortran code uses MPI parallelization, whereas my Julia code relies on the features of Base.Threads . When I say serial performance, I mean just running this code with julia -t 1 xxx.jl command such that it uses only one thread. To give an example of the current scaling scenario , when on my personal PC I use 10 threads and ~1000 points I end up with a time of 690 seconds instead of an ideally projected 381 seconds. I know ideal scaling is never possible but is the scaling I am getting expected ? I also realize that the comparison I am making is not completely fair since I am not using MPI.jl that I should , as the Fortran code uses it.

Any help / suggestions / pointers that could help me solve these problems, i.e prevent my code from getting slowed down on the supercomputer and making it scale better with increasing the number of threads would be greatly appreciated.


Do you use the same Fortran compiler on your PC and the Supercomputer ? Often on the supercomputer you have access to better compilers than on your PC, which could explain why migrating the Fortran code onto the supercomputer gets faster.

Also, similarly, is the same Julia version used on both machines ?

1 Like

Can you share the output of

cat /proc/cpuinfo

from your target machine?

1 Like

I can assure the Julia version and all the package versions are same as I installed them myself. I can’t say the same about Fortran compiler as it along with the required libraries were already present.

But can you share the output of

gfortran --version

or the equivalent command for the Fortran compilers used?

It is quite large and I am very new to this, so I might have copy-pasted more than neccessary. But here is the output. This is not the exact supercomputer where I reported the above statistics, but the same trend holds here also. [Note - there is a character limit, so I attached the lowermost part of the output]

processor	: 39
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
stepping	: 1
microcode	: 0xb00003e
cpu MHz		: 1197.795
cache size	: 25600 KB
physical id	: 1
siblings	: 20
core id		: 12
cpu cores	: 10
apicid		: 57
initial apicid	: 57
fpu		: yes
fpu_exception	: yes
cpuid level	: 20
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 4391.93
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:
gfortran --version
GNU Fortran (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO

Here is the output

and you compiled yourself ? With gfortran on both machines or maybe with ifort on the supercomputer ?

Yes I compiled it myself using gfortran on both machines (Xeon and i7x86). Apologies for the late reply, I was travelling.

What is the reported GC time?
Your code makes lots of allocations, especially those in tight loops this can make the efficiency drop, especially so in multithreaded code (GC only uses up to 4 cores as of 1.10)
I recommend @views to try and get rid of the allocations in GHT and such.
Of course there can be a problem outside of this but this is the first I would try

1 Like

I don’t think this is true. You can set the number of GC threads like this:

julia --gcthreads=8,1

If you want to have 8+1 for example…


It is about 2.84%.
As for using @views , I cannot find a way to make it work because if I use @views to define another matrix, it shows some incompatibility with the native matrix multiplication that I have employed.

I have tried to reuse the same memory as much as possible as you can see from my code. I use X11 etc. as workspaces and just clean out the entries used after each iteration. I pass them as parameters instead of defining a new blank workspace in each iteration of the m and mp nested loop.

Well, if that is correct garbage collection would not be the limiting factor.


Perhaps you can create a very small toy example to compare the performance of Blas/ MKL on both machines? Perhaps one of these libraries is not recognizing the Xeon CPU correctly?

Will do that ! However I am gonna need some pieces of advice before I run the same on the supercomputer because being an undergraduate student I only have limited access to it.

  1. Would a single iteration of the same program do the job ? Or maybe should I try something as simple as a matrix inversion maybe ?

  2. Should I run it on single or multi threaded mode ?

I just tried:

# using MKL
using BenchmarkTools

@btime inv($m)

I run it with:

julia --project bench.jl

With MKL I get 9.1s execution time, without 7.9 s while running on battery.

In both cases 8 threads (the number of cores) are used.

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7840U w/ Radeon  780M Graphics
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 1 on 16 virtual cores

The current paradigm for ‘supercomputers’ is just normal servers linked with a low latency network. (Hint - I build these things)
Usually the servers have two CPU sockets and multicore CPUs.
It is also quite usual these days to have GPUs.
The low latency interconnect can be Infiniband, Omnipath or ROCE Ethernet (OK NVlink also)

Lets start slowly. Submit an interactive job on one server on the supercomputer.
While your code is running run ‘top’ or better still ‘htop’


About problem 1:

Serial Julia getting a bit slower on the cluster is expected as cluster CPUs usually are a bit slower per core but make up for it by having a lot of cores.
I suspect that the Fortran code did some things in parallel here perhaps some linear algebra routines. Maybe you need to set some environment variables to constrain the underlying routines to be serial as well.
Generally the Fortran code might use a different parallelization strategy than your Julia code.This could also be the reason for the second problem you mentioned:

Some further comments:

I actually think this is fine given the inefficiencies in your current code.

If anything, using threads has less performance overhead than interprocess communication stuff. So using threads over processes is actually a plus for Julia.

All in all I would just focus on optimizing your Julia code and less on the raw time comparison with Fortran. However, it could be useful to analyze the Fortran code and see what strategies were employed over there, e.g. how did they break up the workload into chunks for parallel computation, how did they structure the linear algebra stuff and so on.

1 Like

Sure ! , the Fortran code is publicly available here .

In the past I have seen discussions here where performance is ultimately been determined by the BLAS library used.
Maybe you could try to find out the libraries being used.

Also on a multicore machien you should look at process pinning. That was what I was hinting at by running htop - are processes hopping between CPUs?