How to improve performance in a function that repeatedly defines and multiplies matrices

I am trying to write code that involves running 3 nested loops and calling a computationally expensive function within the loop. The expense mainly comes from defining, multiplying and inverting large (~100×100) matrices. I define the variables that will be repeatedly needed in the global scope as follows:

    Temp = 300
    kT = Temp * 1.380649e-23              # Boltzmann constant times temperature, in Joules
    kT = kT * 2.29371227840e17            # converted to atomic units
    b = 1 / kT
    Ωf = readdlm("wf.txt", '\t', Float64, '\n')  # final-state frequencies
    Ωi = readdlm("wi.txt", '\t', Float64, '\n')  # initial-state frequencies
    J = readdlm("j.txt", Float64);               # J matrix
    D = readdlm("d.txt", Float64);               # Duschinsky matrix
    Ωf = Matrix(Diagonal(vec(Ωf)))
    Ωi = Matrix(Diagonal(vec(Ωi)))
    sysize = size(Ωi, 1)
    Ωf = Ωf .* 4.55633e-6    # converting from cm^-1 to atomic energy units, i.e. hartree
    Ωi = Ωi .* 4.55633e-6;   # same as above
    ρ = zeros(Float64, sysize, sysize)
    for i in 1:sysize
        ρ[i, i] = 2 * sinh(b * Ωi[i, i] / 2)
    end
    T = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
    T = T * T';
    Sf = zeros(ComplexF64, sysize, sysize)
    Si = zeros(ComplexF64, sysize, sysize)
    Bf = zeros(ComplexF64, sysize, sysize)
    Bi = zeros(ComplexF64, sysize, sysize)

After that, I define two of my functions:

    function GFC(t::Float64, T::Float64, Δ::Float64, sysize::Int64,
                 Ωf::Matrix{Float64}, Ωi::Matrix{Float64},
                 J::Matrix{Float64}, D::Matrix{Float64})
        t = t * 4.1341373335e16    # converting time to atomic units
        tp = -im * b - t
        for i = 1:sysize
            Sf[i, i] = Ωf[i, i] / sin(Ωf[i, i] * t)
            Si[i, i] = Ωi[i, i] / sin(Ωi[i, i] * tp)
            Bf[i, i] = Ωf[i, i] / tan(Ωf[i, i] * t)
            Bi[i, i] = Ωi[i, i] / tan(Ωi[i, i] * tp)
        end
        BB = Bf + J' * Bi * J
        AA = Sf + J' * Si * J

        W = [BB  -AA
             -AA  BB]    # we won't calculate the 2N×2N determinant directly

        Wapp = BB * (BB - AA * (BB^-1) * AA)
        U = Bi - Si
        V = [J' * U * D
             J' * U * D]
        F1 = sqrt(det(Si * Sf * ρ^2 * Wapp^-1))
        F2 = -(im / 2) * (V' * W^-1 * V) + im * (D' * U * D)
        g = F1 * exp(F2)
        return g
    end

and another function GHT that has similar computational complexity. I am calling them inside a loop as follows:

    function calc()
        @sync @distributed for i in 1:dd
            s = 0
            if t[i] == 0
                t[i] = 1e-35
            end
            gfcpart = GFC(t[i], 300.0, Δ, sysize, Ωf, Ωi, J, D)[1]
            for m in 1:sysize
                for mp in 1:sysize
                    s = s + GHT(t[i], 300.0, Δ, sysize, Ωf, Ωi, J, D, m, mp)[1] * gfcpart * T[m, mp]
                end
            end
            x[i] = s
            # x[i] = (GHT(t[i],300.0,Δ,sysize,Ωf,Ωi,J,D,1,30))[1]*GFC(t[i],300.0,Δ,sysize,Ωf,Ωi,J,D)[1]*T[1,1]*det(ρ)
        end
        @sync @distributed for i in 1:dd
            yr[i] = real(x[i])
            yi[i] = imag(x[i])
        end
        p = plot(t, yr, linewidth = 2, color = "blue", title = "Δ = $Δ", label = "Real part", grid = false)
        p = plot!(t, yi, linewidth = 2, color = "red", label = "Imaginary part", size = (1000, 800), xlims = (-30e-15, 30e-15))
        savefig(p, "gt.svg")
    end

Here I have tried to distribute the first loop among different processes in the hope of gaining some performance. I have used SharedArrays for variables like x, yr and yi that may be accessed by all the different processes. I have checked that the distributed code gives the same results as the serial code, so I know there isn't any logical error in the parallelization I have implemented.

**The problem:** the code scales horribly as the matrices get larger, and the parallelization is not giving as much of a performance boost as expected. For example, when the dimension of the matrices was 30, calc() took 778 seconds to complete with dd = 1000 points in the outer loop when running serially. When using 10 processes it took 178 seconds instead, only about a 4.4x speedup. Is there any way to attain better scaling by fixing inefficiencies in the code that affect both the serial and parallel computations?

See the Performance Tips in the Manual: Performance Tips · The Julia Language

Don’t use global variables. As a quick fix you could just add const to improve performance, but the best practice (not just in Julia, but in programming in general) would be to pass them as parameters.
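For example (a sketch based on the code above, not a drop-in replacement):

    # Quick fix: declare the globals const so the compiler knows their types
    const b = 1 / kT
    const Sf = zeros(ComplexF64, sysize, sysize)

    # Better: pass everything GFC currently reads from global scope
    # (b and the scratch matrices Sf, Si, Bf, Bi) as explicit arguments
    function GFC(t, b, Sf, Si, Bf, Bi, sysize, Ωf, Ωi, J, D)
        # ... same body as before, now using only local names
    end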


Thank you for your reply! Suppose I want to pass them as parameters; I would still have to define them somewhere, right? How do you suggest I go about it? Should I define a function f() whose entire purpose is to define these variables and call the other functions?

You can still define them globally. Yet, when passing them as parameters, the compiler will know the types and can specialize the method accordingly (it cannot do that for globally accessed variables, as their type might change, unless you declare them const or give them a type annotation).
With so many parameters, I would try to structure them a bit, e.g. by putting them into suitable structs:

    struct PhysicalConstants
        kT::Float64
        b::Float64
        ...
    end

    const physical_constants = PhysicalConstants(kT, 1/kT, ...)

    struct ProblemState
        Ωf::Matrix{Float64}
        Ωi::Matrix{Float64}
        J::Matrix{Float64}
        ...
    end

    const myproblem = ProblemState(readdlm("wf.txt", '\t', Float64, '\n'), ...)

That should have no overhead while making the code more structured. Further, in a second step, you can make the code more generic by dropping all explicit types and parameterizing your structures, e.g.:

    struct PhysicalConstants{U,V}
        kT::U
        b::U      # assuming it always must have the same type as kT
        Temp::V
        ...
    end
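A call site could then look something like this (hypothetical sketch, using the names defined above):

    # Hypothetical: GFC takes the two structs instead of many loose arguments
    function GFC(t::Float64, c::PhysicalConstants, p::ProblemState)
        tp = -im * c.b - t    # constants come from the struct
        # ... rest of the body, reading p.Ωf, p.Ωi, p.J, p.D
    end

    g = GFC(t, physical_constants, myproblem)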

I tried declaring the constant variables as const, as you suggested, in the way described below:

    Wf = readdlm("wf.txt", '\t', Float64, '\n')  # final-state frequencies
    Wi = readdlm("wi.txt", '\t', Float64, '\n')  # initial-state frequencies
    J = readdlm("j.txt", Float64);               # J matrix
    D = readdlm("d.txt", Float64);               # Duschinsky matrix
    Wf = Matrix(Diagonal(vec(Wf)))
    Wi = Matrix(Diagonal(vec(Wi)))
    const sysize = size(Wi, 1)
    Wf = Wf .* 4.55633e-6    # converting from cm^-1 to atomic energy units, i.e. hartree
    Wi = Wi .* 4.55633e-6;   # same as above
    P = zeros(Float64, sysize, sysize)
    for i in 1:sysize
        P[i, i] = 2 * sinh(b * Wi[i, i] / 2)
    end
    const Ωf = Wf
    const Ωi = Wi
    const ρ = P
    NAC = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
    NAC = NAC * NAC';
    const T = NAC;

However, the execution time now increased instead of decreasing. Was I supposed to use the const keyword in some other way? I understand that I have declared more variables now, which might take up more time, but how else am I supposed to go about it?

It’s probably the code inside the loop that benefits most from const. But, in general, put your code into functions, especially loops; loops should not be in global scope.
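For example, the global-scope loop that fills ρ could be moved into a function (a sketch based on the code you posted):

    # Sketch: build ρ inside a function instead of at global scope
    function make_rho(b, Wi, sysize)
        P = zeros(Float64, sysize, sysize)
        for i in 1:sysize
            P[i, i] = 2 * sinh(b * Wi[i, i] / 2)
        end
        return P
    end

    const ρ = make_rho(b, Wi, sysize)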

Can’t all of Wf, Wi and P be Diagonal? Is it necessary to convert to dense Matrix?
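That is, something like the following sketch, which keeps everything Diagonal throughout (multiplication, inversion and det on a Diagonal are all O(n)):

    using LinearAlgebra, DelimitedFiles

    # Sketch: Diagonal stores only the diagonal entries
    Wf = Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))) * 4.55633e-6
    Wi = Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))) * 4.55633e-6
    P  = Diagonal(2 .* sinh.(b .* diag(Wi) ./ 2))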


Thank you for your suggestion. You are right! Wf, Wi and P do not need to be dense matrices. However, I have tried using them as Vector{Float64} and did not observe any change in the speed of the code.

First, put all your code into functions.

Then, did you try MKL.jl?


Is it really necessary to invert them? Most of the time it is not.


If the data are small enough, the Matrix fits entirely in cache, so iteration is efficient anyway. But you should still use Diagonal, since it will scale better with data size.

I am afraid it is indeed absolutely necessary to do so.

Could you show an example of how you use the inverted matrices? From the code you posted I do not see why you need the inversion. For example, in the line `Wapp = BB*(BB - AA*(BB^-1)*AA)` you should not invert BB.

I would love to know how I can get through the code block you quoted without inverting.

Now, to answer your question, consider the line `F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))`. We know that det(A*B^-1) is the same as det(A)/det(B); however, because of the nature of the problem, the latter is numerically unstable, leaving me no choice but to use inversion instead of division.

One way is to instead do:

    Wapp = BB * (BB - AA * (BB \ AA))

Solving linear systems is often better than explicit inversion, both for performance and numerical stability. I cannot, however, guarantee that this will help in your particular case, since you are doing matrix-matrix products. For the matrix-vector case, you should definitely do A \ x instead of inv(A) * x.
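Applied to your F2 line, that might look like this (untested sketch; lu is one possible factorization for a general complex W):

    # Sketch: replace W^-1 * V with a linear solve (needs LinearAlgebra)
    F2 = -(im / 2) * (V' * (W \ V)) + im * (D' * U * D)

    # If W is reused for several right-hand sides, factorize it once:
    Wfac = lu(W)
    F2 = -(im / 2) * (V' * (Wfac \ V)) + im * (D' * U * D)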

As far as I can tell, though, your code has a lot of mathematical structure that you can probably carry a bit further on paper, like the quoted expression, and the 2N×2N block matrix W = [BB -AA; -AA BB], where Bf, Sf etc. are actually diagonal matrices. Here you can surely exploit the structure of W to invert or factorize it more efficiently.
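For instance, with S = [I I; I -I]/√2 (real, symmetric and orthogonal) one has S*W*S = blockdiag(BB - AA, BB + AA), so the 2N×2N determinant and solve split into two N×N problems; and since your V stacks the same vector twice, the quadratic form collapses further. A sketch (please verify against your actual definitions):

    # Sketch: exploit the block structure W = [BB -AA; -AA BB]
    # det(W) = det(BB - AA) * det(BB + AA)
    detW = det(BB - AA) * det(BB + AA)

    # With V = [v; v] and v = J'*U*D, one gets
    # V' * inv(W) * V == 2 * v' * ((BB - AA) \ v)
    v = J' * U * D
    quadform = 2 * (v' * ((BB - AA) \ v))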

I am also a bit skeptical about whether you are choosing the correct parallelization strategy. You turn off BLAS multi-threading, but have you also done the same in Matlab? Matlab does a lot of implicit multi-threading that is not visible to the user; do you have control over that? Perhaps you can turn off multi-threading for both Matlab and Julia and compare single-threaded performance, also making sure that BLAS threading is comparable.

Your code is a bit complex to analyze at a glance; could you identify the bottlenecks precisely to help the other posters? And also provide dummy data so others can run your code? Due to Christmas/New Year I am personally a bit lazy right now and respond slowly.

Also, one more thing: your problem is possibly smack in the middle of the Matlab sweet spot, so large speed-ups in other languages may not be realistic, at least not without some extra effort on exploiting matrix structure, in-place operations, etc.


I did try using the \ operator instead of calling inv() explicitly. It worked pretty well for the vector-matrix and matrix-matrix products and gave me some boost in performance: the code that took 393 seconds with the inv() calls now takes about 230 seconds!

However, the speed is nowhere near that of the Fortran equivalent of the same program. To answer your second point: yes, maybe we can use some algebraic properties of the BB and AA matrices to compute the determinant of W more efficiently, but we have not done that in our Fortran code either. I am trying to compare the speed of the two languages while keeping the code structure the same to as great an extent as possible.

Well, in Julia you have methods to optimize your code that you do not have in Fortran, e.g. LoopVectorization.jl.

Most of the time you would not use the same code structure for Julia and Fortran…

And did you try MKL.jl? It is very easy to use: just put using MKL at the beginning of your program. Also, how much of your execution time is spent in the garbage collector? Manually specifying the number of GC threads when starting Julia might be worth it (Julia 1.10 only).
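For example (hypothetical invocation; the --gcthreads flag exists from Julia 1.10 on, and the script name is a placeholder):

    # Start Julia with 4 GC threads and 10 worker processes
    julia --gcthreads=4 -p 10 myscript.jl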


I did try putting using MKL at the beginning of my program (keeping everything else the same). It ran successfully but gave exactly the same performance as it did without MKL. I have not tried LoopVectorization.jl yet; I will try to implement that and keep the thread updated!

Also, how exactly can I see how much time is spent in the garbage collector? Either way, I am mainly using Julia 1.9.3 for my code, but I will switch to Julia 1.10 if it gives a big increase in performance.

Did you try to put everything into a function?

    function mywrap()
        # your code here
    end
    mywrap()

    @time mywrap()

@time mywrap() tells you how much time is spent on compilation, execution and garbage collection. If you call it a second time, the compilation time should disappear.

For small functions use BenchmarkTools.jl.
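For example (a sketch; interpolate globals with $ so the benchmark measures only the call itself):

    using BenchmarkTools
    # t0 is a hypothetical sample time point
    @btime GFC($t0, 300.0, $Δ, $sysize, $Ωf, $Ωi, $J, $D)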

But do you think you can create a reproducible example that we can copy-paste and run? With the data being read from files, that is currently not possible.

Improving code runtime doesn't necessarily require mathematical improvements, but reusing memory and in-place operations will probably be required.
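For example, BB could be built into a preallocated buffer with mul! instead of allocating new matrices on every call (a sketch, assuming the buffers are created once outside the hot loop):

    using LinearAlgebra

    # Sketch: in-place construction of BB = Bf + J'*Bi*J
    function makeBB!(BB, tmp, Bf, Bi, J)
        mul!(tmp, Bi, J)      # tmp = Bi*J, written in place
        mul!(BB, J', tmp)     # BB = J'*(Bi*J), written in place
        BB .+= Bf             # add Bf without allocating a new matrix
        return BB
    end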

Also, could you comment on my question regarding multi-threading?
