How to improve performance in a function that repeatedly defines and multiplies matrices

I am trying to write code that involves running 3 nested loops and calling a computationally expensive function within the loop. The expense mainly comes from defining, multiplying and inverting large (~100×100) matrices. I define the variables that will be repeatedly needed in the global scope as follows:

    Temp = 300
    kT = Temp * 1.380649e-23              # Boltzmann constant times temperature, in Joules
    kT = kT * 2.29371227840e17            # converted to atomic units
    b = 1 / kT
    Ωf = readdlm("wf.txt", '\t', Float64, '\n')  # final-state frequencies
    Ωi = readdlm("wi.txt", '\t', Float64, '\n')  # initial-state frequencies
    J = readdlm("j.txt", Float64);               # J matrix
    D = readdlm("d.txt", Float64);               # Duschinsky matrix
    Ωf = Matrix(Diagonal(vec(Ωf)))
    Ωi = Matrix(Diagonal(vec(Ωi)))
    sysize = size(Ωi, 1)
    Ωf = Ωf .* 4.55633e-6    # converting from cm^-1 to atomic energy units, i.e. hartree
    Ωi = Ωi .* 4.55633e-6;   # same as above
    ρ = zeros(Float64, sysize, sysize)
    for i in 1:sysize
        ρ[i, i] = 2 * sinh(b * Ωi[i, i] / 2)
    end
    T = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
    T = T * T';
    Sf = zeros(ComplexF64, sysize, sysize)
    Si = zeros(ComplexF64, sysize, sysize)
    Bf = zeros(ComplexF64, sysize, sysize)
    Bi = zeros(ComplexF64, sysize, sysize)

After that, I define two of my functions:

    function GFC(t::Float64, T::Float64, Δ::Float64, sysize::Int64,
                 Ωf::Matrix{Float64}, Ωi::Matrix{Float64},
                 J::Matrix{Float64}, D::Matrix{Float64})
        t = t * 4.1341373335e16    # converting time to atomic units
        tp = -im * b - t
        for i = 1:sysize
            Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
            Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
            Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
            Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
        end
        BB = Bf + J' * Bi * J
        AA = Sf + J' * Si * J

        W = [BB  -AA
             -AA  BB]    # we won't calculate the 2N×2N determinant directly

        Wapp = BB * (BB - AA * (BB^-1) * AA)
        U = Bi - Si
        V = [J' * U * D
             J' * U * D]
        F1 = sqrt(det(Si * Sf * ρ^2 * Wapp^-1))
        F2 = -(im / 2) * (V' * W^-1 * V) + im * (D' * U * D)
        g = F1 * exp(F2)
        return g
    end

and another function GHT that has similar computational complexity. I am calling them inside a loop as follows:

    function calc()
        @sync @distributed for i in 1:dd
            s = 0
            if t[i] == 0
                t[i] = 1e-35
            end
            gfcpart = GFC(t[i], 300.0, Δ, sysize, Ωf, Ωi, J, D)[1]
            for m in 1:sysize
                for mp in 1:sysize
                    s = s + GHT(t[i], 300.0, Δ, sysize, Ωf, Ωi, J, D, m, mp)[1] * gfcpart * T[m, mp]
                end
            end
            x[i] = s
            # x[i] = (GHT(t[i],300.0,Δ,sysize,Ωf,Ωi,J,D,1,30))[1]*GFC(t[i],300.0,Δ,sysize,Ωf,Ωi,J,D)[1]*T[1,1]*det(ρ)
        end
        @sync @distributed for i in 1:dd
            yr[i] = real(x[i])
            yi[i] = imag(x[i])
        end
        p = plot(t, yr, linewidth = 2, color = "blue", title = "Δ = $Δ", label = "Real part", grid = false)
        p = plot!(t, yi, linewidth = 2, color = "red", label = "Imaginary part", size = (1000, 800), xlims = (-30e-15, 30e-15))
        savefig(p, "gt.svg")
    end

Here I have tried to distribute the first loop among different processes in the hope of gaining some performance. I have used SharedArrays for variables like x, yr and yi that may be accessed by all the different processes. I have checked that the distributed code gives the same results as the serial code, so I know there isn't any logical error in the parallelization I have implemented.

**The problem:** the code scales horribly as the matrices get larger, and the parallelization is not giving as much of a performance boost as expected. For example, when the dimension of the matrices was 30, calc() took 778 seconds to complete with dd = 1000 points in the outer loop when running serially. When using 10 processes it took 178 seconds instead, only about a 4.4x speedup. Is there any way to attain better scaling by fixing inefficiencies in the code that affect both the serial and parallel computations?

See the Performance Tips in the Manual: Performance Tips · The Julia Language

Don’t use global variables. As a quick fix you could just add const to improve performance, but the best practice (not just in Julia, but in programming in general) would be to pass them as parameters.
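For example (a sketch based on the code above, not a drop-in replacement):

    # Quick fix: declare the globals const so the compiler knows their types
    const b = 1 / kT
    const Sf = zeros(ComplexF64, sysize, sysize)

    # Better: pass everything GFC currently reads from global scope
    # (b and the scratch matrices Sf, Si, Bf, Bi) as explicit arguments
    function GFC(t, b, Sf, Si, Bf, Bi, sysize, Ωf, Ωi, J, D)
        # ... same body as before, now using only local names
    end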


Thank you for your reply! Suppose I want to pass them as parameters; I would still have to define them somewhere, right? How do you suggest I go about it? Should I define a function f() whose entire purpose is to define these variables and call the other functions?

You can still define them globally. Yet, when passing them as parameters, the compiler will know the types and can specialize the method accordingly (it cannot do that for globally accessed variables, as their type might change, unless you declare them const or give them a type annotation).
With so many parameters, I would try to structure them a bit, e.g. by putting them into suitable structs:

    struct PhysicalConstants
        kT::Float64
        b::Float64
        ...
    end

    const physical_constants = PhysicalConstants(kT, 1/kT, ...)

    struct ProblemState
        Ωf::Matrix{Float64}
        Ωi::Matrix{Float64}
        J::Matrix{Float64}
        ...
    end

    const myproblem = ProblemState(readdlm("wf.txt", '\t', Float64, '\n'), ...)

That should have no overhead while making the code more structured. Further, in a second step, you can make the code more generic by dropping all explicit types and parameterizing your structures, e.g.:

    struct PhysicalConstants{U,V}
        kT::U
        b::U      # assuming it always must have the same type as kT
        Temp::V
        ...
    end
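A call site could then look something like this (hypothetical sketch, using the names defined above):

    # Hypothetical: GFC takes the two structs instead of many loose arguments
    function GFC(t::Float64, c::PhysicalConstants, p::ProblemState)
        tp = -im * c.b - t    # constants come from the struct
        # ... rest of the body, reading p.Ωf, p.Ωi, p.J, p.D
    end

    g = GFC(t, physical_constants, myproblem)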

I tried declaring the constant variables as const, as you suggested, in the way described below:

    Wf = readdlm("wf.txt", '\t', Float64, '\n')  # final-state frequencies
    Wi = readdlm("wi.txt", '\t', Float64, '\n')  # initial-state frequencies
    J = readdlm("j.txt", Float64);               # J matrix
    D = readdlm("d.txt", Float64);               # Duschinsky matrix
    Wf = Matrix(Diagonal(vec(Wf)))
    Wi = Matrix(Diagonal(vec(Wi)))
    const sysize = size(Wi, 1)
    Wf = Wf .* 4.55633e-6    # converting from cm^-1 to atomic energy units, i.e. hartree
    Wi = Wi .* 4.55633e-6;   # same as above
    P = zeros(Float64, sysize, sysize)
    for i in 1:sysize
        P[i, i] = 2 * sinh(b * Wi[i, i] / 2)
    end
    const Ωf = Wf
    const Ωi = Wi
    const ρ = P
    NAC = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
    NAC = NAC * NAC';
    const T = NAC;

However, the execution time now increased instead of decreasing. Was I supposed to use the const keyword in some other way? I understand that I have declared more variables now, which might take up more time, but how else am I supposed to go about it?

It’s probably the code inside the loop that benefits most from const. But, in general, put your code into functions, especially loops; loops should not be in global scope.
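For example, the global-scope loop that fills ρ could be moved into a function (a sketch based on the code you posted):

    # Sketch: build ρ inside a function instead of at global scope
    function make_rho(b, Wi, sysize)
        P = zeros(Float64, sysize, sysize)
        for i in 1:sysize
            P[i, i] = 2 * sinh(b * Wi[i, i] / 2)
        end
        return P
    end

    const ρ = make_rho(b, Wi, sysize)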

Can’t all of Wf, Wi and P be Diagonal? Is it necessary to convert to dense Matrix?
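That is, something like the following sketch, which keeps everything Diagonal throughout (multiplication, inversion and det on a Diagonal are all O(n)):

    using LinearAlgebra, DelimitedFiles

    # Sketch: Diagonal stores only the diagonal entries
    Wf = Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))) * 4.55633e-6
    Wi = Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))) * 4.55633e-6
    P  = Diagonal(2 .* sinh.(b .* diag(Wi) ./ 2))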


Thank you for your suggestion. You are right! Wf, Wi and P do not need to be dense matrices. However, I have tried using them as Vector{Float64} and did not observe any change in the speed of the code.

First, put all your code into functions.

Then, did you try MKL.jl?


Is it really necessary to invert them? Most of the time it is not.


If the data are small enough, the Matrix fits entirely in cache, so iteration is efficient anyway. But you should still use Diagonal, since it will scale better with data size.

I am afraid it is indeed absolutely necessary to do so.

Could you show an example of how you use the inverted matrices? From the code you posted I do not see why you need the inversion. For example, in the line `Wapp = BB*(BB - AA*(BB^-1)*AA)` you should not invert BB.

I would love to know how I can get through the code block you quoted without inverting.

Now, to answer your question, consider the line `F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))`. We know that det(A*B^-1) is the same as det(A)/det(B); however, because of the nature of the problem, the latter is numerically unstable, leaving me no choice but to use inversion instead of division.

One way is to instead do:

    Wapp = BB * (BB - AA * (BB \ AA))

Solving linear systems is often better than explicit inversion, both for performance and numerical stability. I cannot, however, guarantee that this will help in your particular case, since you are doing matrix-matrix products. For the matrix-vector case, you should definitely do A \ x instead of inv(A) * x.
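Applied to your F2 line, that might look like this (untested sketch; lu is one possible factorization for a general complex W):

    # Sketch: replace W^-1 * V with a linear solve (needs LinearAlgebra)
    F2 = -(im / 2) * (V' * (W \ V)) + im * (D' * U * D)

    # If W is reused for several right-hand sides, factorize it once:
    Wfac = lu(W)
    F2 = -(im / 2) * (V' * (Wfac \ V)) + im * (D' * U * D)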

As far as I can tell, though, your code has a lot of mathematical structure that you can probably carry a bit further on paper, like the quoted expression, and the 2N×2N block matrix W = [BB -AA; -AA BB], where Bf, Sf etc. are actually diagonal matrices. Here you can surely exploit the structure of W to invert or factorize it more efficiently.
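For instance, with S = [I I; I -I]/√2 (real, symmetric and orthogonal) one has S*W*S = blockdiag(BB - AA, BB + AA), so the 2N×2N determinant and solve split into two N×N problems; and since your V stacks the same vector twice, the quadratic form collapses further. A sketch (please verify against your actual definitions):

    # Sketch: exploit the block structure W = [BB -AA; -AA BB]
    # det(W) = det(BB - AA) * det(BB + AA)
    detW = det(BB - AA) * det(BB + AA)

    # With V = [v; v] and v = J'*U*D, one gets
    # V' * inv(W) * V == 2 * v' * ((BB - AA) \ v)
    v = J' * U * D
    quadform = 2 * (v' * ((BB - AA) \ v))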

I am also a bit skeptical about whether you are choosing the correct parallelization strategy. You turn off BLAS multi-threading, but have you also done the same in Matlab? Matlab does a lot of implicit multi-threading that is not visible to the user; do you have control over that? Perhaps you can turn off multi-threading for both Matlab and Julia and compare single-threaded performance, also making sure that BLAS threading is comparable.

Your code is a bit complex to analyze at a glance; could you identify the bottlenecks precisely to help the other posters? And also provide dummy data so others can run your code? Due to Christmas/New Year I am personally a bit lazy right now and respond slowly.

Also, one more thing: your problem is possibly smack in the middle of the Matlab sweet spot, so large speed-ups in other languages may not be realistic, at least not without some extra effort on exploiting matrix structure, in-place operations, etc.


I did try using the \ operator instead of calling inv() explicitly. It worked pretty well for the vector-matrix and matrix-matrix products and gave me some boost in performance: the code that took 393 seconds with the inv() calls now takes about 230 seconds!

However, the speed is nowhere near that of the Fortran equivalent of the same program. To answer your second point: yes, maybe we can use some algebraic properties of the BB and AA matrices to compute the determinant of W more efficiently, but we have not done that in our Fortran code either. I am trying to compare the speed of the two languages while keeping the code structure the same to as great an extent as possible.

Well, in Julia you have methods to optimize your code that you do not have in Fortran, e.g. LoopVectorization.jl.

Most of the time you would not use the same code structure for Julia and Fortran…

And did you try MKL.jl? It is very easy to use: just put using MKL at the beginning of your program. Also, how much of your execution time is spent in the garbage collector? Manually specifying the number of GC threads when starting Julia might be worth it (Julia 1.10 only).
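For example (hypothetical invocation; the --gcthreads flag exists from Julia 1.10 on, and the script name is a placeholder):

    # Start Julia with 4 GC threads and 10 worker processes
    julia --gcthreads=4 -p 10 myscript.jl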


I did try putting using MKL at the beginning of my program (keeping everything else the same). It ran successfully but gave exactly the same performance as it did without MKL. I have not tried LoopVectorization.jl yet; I will try to implement that and keep the thread updated!

Also, how exactly can I see how much time is spent in the garbage collector? Either way, I am mainly using Julia 1.9.3 for my code, but I will switch to Julia 1.10 if it gives a big increase in performance.

Did you try to put everything into a function?

    function mywrap()
        # your code here
    end
    mywrap()

    @time mywrap()

@time mywrap() tells you how much time is spent on compilation, execution and garbage collection. If you call it a second time, the compilation time should disappear.

For small functions use BenchmarkTools.jl.
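For example (a sketch; interpolate globals with $ so the benchmark measures only the call itself):

    using BenchmarkTools
    # t0 is a hypothetical sample time point
    @btime GFC($t0, 300.0, $Δ, $sysize, $Ωf, $Ωi, $J, $D)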

But do you think you can create a reproducible example that we can copy-paste and run? With the data being read from files, that is currently not possible.

Improving code runtime doesn't necessarily require mathematical improvements, but reusing memory and in-place operations will probably be required.
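For example, BB could be built into a preallocated buffer with mul! instead of allocating new matrices on every call (a sketch, assuming the buffers are created once outside the hot loop):

    using LinearAlgebra

    # Sketch: in-place construction of BB = Bf + J'*Bi*J
    function makeBB!(BB, tmp, Bf, Bi, J)
        mul!(tmp, Bi, J)      # tmp = Bi*J, written in place
        mul!(BB, J', tmp)     # BB = J'*(Bi*J), written in place
        BB .+= Bf             # add Bf without allocating a new matrix
        return BB
    end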

Also, could you comment on my question regarding multi-threading?
