# How to improve performance in a function that repeatedly defines and multiplies matrices

I am trying to write a code that involves running 3 nested loops, and calling a computationally expensive function within the loop. The expense mainly comes from defining , multiplying and inverting large [~100x100] matrices. I am defining the variables that will be repeatedly needed in the global scope as follows

``````    Temp=300
kT = Temp * 1.380649 * 10^-23  # Boltzmann constant times temperature in Joules
kT= kT * 2.29371227840*10^17         # in atomic units
b = 1/kT
Ωf = readdlm("wf.txt", '\t', Float64, '\n')  # Final state  frequencies
Ωi = readdlm("wi.txt", '\t', Float64, '\n') # Initial state  frequencies
J = readdlm("j.txt", Float64);              # J matrix
D = readdlm("d.txt", Float64);              # Duchinsky matrix
Ωf = Matrix(Diagonal(vec(Ωf)))
Ωi = Matrix(Diagonal(vec(Ωi)))
sysize = size(Ωi)[1]
Ωf = Ωf.*4.55633*10^-6     #Converting to cm^-1 atomic energy units, i.e hatree
Ωi = Ωi.*4.55633*10^-6 ;   #Same as above
ρ = zeros(Float64,sysize,sysize)
for i in range(1,sysize)
ρ[i,i] = (2*sinh(b*Ωi[i,i]/2))
end
T = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
T = T*T';
Sf = zeros(ComplexF64,sysize,sysize)
Si = zeros(ComplexF64,sysize,sysize)
Bf = zeros(ComplexF64,sysize,sysize)
Bi = zeros(ComplexF64,sysize,sysize)
``````

after that I define two of my functions

``````function GFC(t::Float64,T::Float64,Δ::Float64,sysize::Int64,Ωf::Matrix{Float64},Ωi::Matrix{Float64},J::Matrix{Float64},D::Matrix{Float64})
t = t* 4.1341373335*10^16                     # converting time to atomic units
tp = - im*b - t
for i = 1:sysize
Sf[i,i]=Ωf[i,i]/sin(Ωf[i,i]*t)
Si[i,i]=Ωi[i,i]/sin(Ωi[i,i]*tp)
Bf[i,i]=Ωf[i,i]/tan(Ωf[i,i]*t)
Bi[i,i]=Ωi[i,i]/tan(Ωi[i,i]*tp)
end
BB = (Bf+J'*Bi*J)
AA = (Sf+J'*Si*J)

W = [BB  -AA
-AA  BB] #We wont calculate the 2Nx2N determinant directly.

Wapp = BB*(BB-  AA*(BB^-1)*AA  )
U = Bi-Si
V = [J'*U*D
J'*U*D]
F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))
F2 = (-(im/2)*(V'*W^-1*V))    +  ((im)*(D'*U*D))
g = F1*exp(F2)
return g
``````

and another function `GHT` that has similar computational complexity. I am calling them inside a loop as follows:

``````     function calc()
@sync @distributed for i in range(1,dd)
s=0
if t[i]==0
t[i]=10^-35
end
gfcpart=GFC(t[i],300.0,Δ,sysize,Ωf,Ωi,J,D)[1]
for m in range(1,sysize)
for mp in range(1,sysize)
s=s+    (GHT(t[i],300.0,Δ,sysize,Ωf,Ωi,J,D,m,mp)[1]*gfcpart*T[m,mp])
end
x[i]=s
end
# x[i] = (GHT(t[i],300.0,Δ,sysize,Ωf,Ωi,J,D,1,30))[1]*GFC(t[i],300.0,Δ,sysize,Ωf,Ωi,J,D)[1]*T[1,1]*det(ρ)
end
@sync @distributed for i in range(1,dd)
yr[i] = real(x[i])
yi[i] = imag(x[i])
end
p = plot(t,yr,linewidth = 2,color = "blue",title = "Δ = \$Δ",label = "Real part",grid=false)
p =plot!(t,yi,linewidth = 2,color = "red",label = "Imaginary part",size=(1000,800),xlims=(-30*10^-15,30*10^-15))
savefig(p,"gt.svg")
end
``````

Where I have tried to distribute the first loop among different processors in hope of gaining some performance. I have `SharedArray` for variables like x,yr,yi that might be used by all the different processors. I have checked that the distributed code is giving concurrent results as the serial code, so I know there isn’t any logical fallacy within the parallelization I have implemented.

The problem The code is scaling horribly as the matrices get larger. And the parallelization is not giving as much of a performance boost as expected. For example, when my the dimension of matrices was `30` the code took `778` seconds for the `calc()` to complete using `dd=1000` points in the upper loop when running serially. When using `10` processors, it took `178` seconds instead. Is there any way to attain better scaling by fixing any inefficiencies in the code that might affect both the serial and parallel computations ?

See the Performance Tips in the Manual: Performance Tips · The Julia Language

Don’t use global variables. As a quick fix you could just add `const` to improve performance, but the best practice (not just in Julia, in programming in general) would be to pass those as parameters.

3 Likes

Thank you for your reply ! Suppose I want to pass them as parameters, I would still have to define them somewhere right ? How do you suggest I go about it ? Should I define a function `f()` whose entire purpose is to define these variables and call the other functions ?

You can still define them globally. Yet, when passing them as parameters the compiler will know the types and specialize the method accordingly (it cannot do that for globally accessed variables as their type might change – unless you declare them `const` or type them).
With so many parameters, I would try to structure them a bit, e.g., by putting into suitable structs:

``````struct PhysicalConstants
kT::Float64
b::Float64
...
end

const physical_constants = PhysicalConstants(kT, 1/kT, ...)

struct ProblemState
Ωf::Matrix{Float64}
Ωi::Matrix{Float64}
J::Matrix{Float64}
...
end

const myproblem = ProblemState(readdlm("wf.txt", '\t', Float64, '\n'), ...)
``````

That should have no overhead while making the code more structured. Further, in a second step, you can make the code more generic by dropping all explicit types and parameterizing your structures, e.g.,

``````struct PhysicalConstants{U,V}
kT::U
b::U   # assuming it always must have the same type as kT
Temp::V
...
end
``````
3 Likes

I tried to declare the constant variables as `const` as suggested by you in the way described below

``````    Wf = readdlm("wf.txt", '\t', Float64, '\n')  # Final state  frequencies
Wi = readdlm("wi.txt", '\t', Float64, '\n') # Initial state  frequencies
J = readdlm("j.txt", Float64);              # J matrix
D = readdlm("d.txt", Float64);              # Duchinsky matrix
Wf = Matrix(Diagonal(vec(Wf)))
Wi = Matrix(Diagonal(vec(Wi)))
const sysize = size(Wi)[1]
Wf = Wf.*4.55633*10^-6     #Converting to cm^-1 atomic energy units, i.e hatree
Wi = Wi.*4.55633*10^-6 ;   #Same as above
P = zeros(Float64,sysize,sysize)
for i in range(1,sysize)
P[i,i] = (2*sinh(b*Wi[i,i]/2))
end
const Ωf = Wf
const Ωi = Wi
const ρ = P
NAC = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
NAC = NAC*NAC';
const T = NAC;
``````

However, the time taken for the execution now increased instead of decreasing. Was I supposed to use `const` keyword in some other way ? I understand that I have declared more variables now that might take up more time, but then how else am I supposed to go about it ?

It’s probably the stuff inside the loop that mostly benefits from `const`. But, in general, put your code in functions, especially loops. Loops should not be in global scope.

Can’t all of `Wf`, `Wi` and `P` be `Diagonal`? Is it necessary to convert to dense `Matrix`?

4 Likes

Thank you for your suggestion. You are right ! `Wf` `Wi` or `P` do not need to be `Matrix`. However I have tried using them as `Vector{Float64}` , I did not observe any change in the speed of the code.

First, put all your code into functions.

Then, did you try MKL.jl ?

1 Like

Is it really necessary to invert them? Most of the time it is not.

1 Like

If the data are small enough, the `Matrix` is all in cache, so the iteration is anyway efficient. But you should anyway use `Diagonal`, since it will scale better with data size.

I am afraid it is indeed absolutely necessary to do so.

Could you show an example of how you use the inverted matrices? From the code you posted I do not see why you need the inversion. For example, here

you should not invert `BB`.

I would love to know how I can get through the code-block you quoted without inverting.

`F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))` , we know that `det(A*B^-1)` is the same as `det(A)/det(B)` , however because of the nature of the problem, the later is numerically unstable , leaving me no choice but to use inversion instead of division.

One way is to instead do

``````Wapp = BB * (BB - AA * (BB \ AA))
``````

Solving linear systems is often better than doing direct inversion, both for performance and numerical stability. I can, however, not guarantee that this will help in your particular case, since you are doing matrix-matrix products. For the vector-matrix case, you should definitely do `A \ x` instead of `inv(A) * x`.

As far as I can tell, though, your code has a lot of mathematical structure, that you can probably carry a bit further on paper, like the quoted expression, and

where `Bf`/`Sf` etc. are actually diagonal matrices? Here you can surely exploit the structure of `W` to better invert or factorize it.

I am also a bit skeptical whether you are choosing the correct parallelization strategy. You turn off BLAS multi-threading, but have you also done the same in Matlab? Matlab does a lot of implicit multi-threading that is not visible to the user, do you have control over that? Perhaps you can turn off multi-threading for both Matlab and Julia, and compare single threaded performance, also making sure that BLAS-threading is comparable.

Your code is a bit complex to analyze at a glance, could you identify the bottlenecks precisely to help the other posters? And also provide dummy data so others can run your code? Due to xmas/newyears I personally am a bit lazy right now, and respond slowly.

Also, one more thing, your problem is possibly smack in the middle of the Matlab sweet spot, so large speed-ups in other languages may not be realistic, at least not without some extra effort on exploiting matrix structure, in-place operations, etc.

4 Likes

I did try using the `\` operator instead of calling explicit `inv()` . It worked pretty well for the vector-matrix , matrix-matrix products. This gave me some boost in performance. For the code that took `393` seconds when using the `inv()` call, now it takes about `230` seconds !

However the speed is nowhere near as the Fortran equivalent of the same program. To answer your second point, yes, maybe we can use some algebraic properties of the `BB` and `AA` matrices to find out the determinants of `W` more efficiently, but we have not done that in our Fortran code. I am trying to compare the speed of the two languages while using the same code structure to as much of an extent as possible.

Well, in Julia you have methods to optimize your code that you do not have in Fortan, e.g. LoopVectorization.jl

Most of the time you would not use the same code structure for Julia and Fortan…

And did you try MKL.jl ? It is very easy to use, just put `using MKL` at the beginning of your program. And how much of your execution time is spent in the garbage collector? Manually specifying the GC threads when starting Julia might be worth it (only Julia 1.10).

1 Like

I did try putting `using MKL` at the begining of my program (keeping everything else the same). It ran successfully and gave the same exact performance as it did without usning `MKL`. I have not tried LoopVectorization.jl yet, will try to implement that and keep the thread updated !

Also, how exactly can I see how much time is spent in the Garbage collector ? Eitherway, I am mainly using Julia 1.9.3 for my code; but I will switch to Julia 1.10 if it gives a huge increase in performance.

Did you try to put everything into a function ?

``````function mywrap()
end
mywrap()

``````
``````@time myfunction()
``````

Tells you how much time is spent for compilation, execution and garbage collection.
If you call it a second time the compilation time should disappear.

For small functions use `BenchmarkTools.jl`.

But do you think you can create a reproducible example, so we can copy-paste and run the code? Reading data from files is just not possible.

Improving code runtime doesn’t necessarily require mathematical improvements, but reusing memory and in-place operations are probably required.

Also, could you comment on my question regarding multi-threading?

3 Likes