How to improve performance in a function that repeatedly defines and multiplies matrices

I did ! It improved the performance by about 5-10 seconds when the order of time was ~300 seconds.

I did as you told, 0.02% gc time is what I got. I would assume that it is negligible (?)

Well, in Julia you have methods to optimize your code that you do not have in Fortan, e.g. LoopVectorization.jl

I tried using LoopVectorization.jl , I tried applying the @turbo macro to the outermost loop. It showed an error

ERROR: LoadError: "Indexing into the following expression was not recognized: GFC(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)"

A quick chat session with Github copilot indicated that we the type of function call I am trying to do is not supported under the @turbo macro.

It is possible that LoopVectorization.jl is not a good tool for your type of problem… Other people will know more about that… It was meant as an example…

1 Like

I just noticed a very interesting phenomenon … while playing around with my code I moved the for m in 1:sysize and for mp in 1:sysize loops inside the GHT function. That combined with using \ instead of inv() and using MKL instead of OpenBLAS for all LinearAlgebra operations, I arrived at a a gross time need of 141 seconds. As compared to ~150 seconds needed for the Fortran code.

Although i am very happy about this small success I have a few follow-up questions,

  1. If \ is better than inv() in terms of speed. Causes no numerical instability, and is natively supported in Julia. Then why would someone ever use the inv() call? Doesn’t that become redundant ? Are there any scenarios when inv() must be neccesssary ?

  2. Why would just relocating the two nested loops within the GHT function increase performance ?

  3. On the same line as the first point, if MKL is so much faster than the default OpenBLAS and is as simple as just putting a single line of code on the top-level scope. Why would anyone ever use the default library ?

  1. This is more of a math question than a Julia question. Sometimes, someone will need the inverse, so it’s included for completeness. But, perhaps more importantly, including the inv function with Julia creates an interface that can be extended and used from user packages. Suppose, for example, that someone creates a package with some type where it makes sense to define the inverse for a value of that type. The package creator may overload the inv function from Base, which is much more nice/useful for users than if the package creator just created a new function with the same functionality.

  2. Julia compiles code that’s within a function. See the Performance tips that I pointed you to in the first reply.

  3. Not sure, maybe for legal reasons? Also, I’m not sure how cross-platform MKL is, and I think Intel intentionally cripples MKL performance when running on AMD hardware?

1 Like

I’ll expand slightly on the second point of @nsajko .
All of Julia’s code is compiled but that does not mean it is always the same “speed”. If the compiler can for some teason not infer the types of variables, it cannot emit efficient code. This is called type-instability. So if there is some variable of unknown (better uninferrible) type, everytime you do something with it Julia checks at runtime what type it actually has and performs a dynamic dispatch.

The interesting thing to realize is: Type instability is not contagious! After performing the dynamic dispatch to another function, all types are now known inside that function! This is essentially what the performance tipps call a function barrier.

In your code, I think the type instability stems from loading the matrices from a file. So everytime you call GFC and GHT you will have a dynamic dispatch. If you move the loops inside of those functions you call them many times less, so you have less dynamic dispatch and runtime improves. Note, that this is essentially caused by having a rather unclean code: You load the matrices inside the function. It would be better design to load the input matrices at toplevel scope and then give to the function as arguments. That way no type-instability occurs.

Disclaimer: I only briefly looked over the code and might have missed other sources of type-instability. Usually I/O is type unstable but it is just a guess of mine here. You can check using the @code_warntype calc() macro but its output can be a bit overwhelming especially with such long functions. If you see red, that means the compiler cannot infer the type and a function barrier will help performance.

2 Likes

You can check using the @code_warntype calc() macro but its output can be a bit overwhelming especially with such long functions. If you see red, that means the compiler cannot infer the type and a function barrier will help performance.

I tried running that, as expected it gave an overwhelming amount of output. But I was able to decipher that I have not explicitly mentioned the type of most variables like Ωi etc. So I explicityly typed them as follows

function calc()
    @everywhere begin
        Δ::Float64 = 2.02 # in eV
        dd::Int64 = 11                               # Change this to modify number of points
        Temp::Int64=300
        kT = Temp * 1.380649 * 10^-23  # Boltzmann constant times temperature in Joules
        kT= kT * 2.29371227840*10^17         # in atomic units
        b ::Float64= 1/kT
        Ωf ::Matrix{Float64}= Matrix(Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))))*4.55633*10^-6   # Final state  frequencies in atomic units
        Ωi ::Matrix{Float64}= Matrix(Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))))*4.55633*10^-6  # Initial state  frequencies in atomic units
        sysize ::Int64 = size(Ωi)[1]
        P = zeros(Float64,sysize,sysize)
        for i in range(1,sysize)
            P[i,i] = (2*sinh(b*Ωi[i,i]/2))
        end
        T ::Matrix{Float64}= readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
        T = T*T';
        D ::Matrix{Float64}= readdlm("d.txt", Float64); # Displacement Matrix
        J ::Matrix{Float64}= readdlm("j.txt", Float64); # Duchinsky Matrix
        Sf :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,sysize)
        Si :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,sysize)
        Bf :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,sysize)
        Bi :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,sysize)
        X11 ::Matrix{ComplexF64} = zeros(ComplexF64,sysize,sysize)
        X12 ::Matrix{ComplexF64} = zeros(ComplexF64,sysize,sysize)
        X21 ::Matrix{ComplexF64} = zeros(ComplexF64,sysize,sysize)
        X22 ::Matrix{ComplexF64} = zeros(ComplexF64,sysize,sysize)
        Y1 :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,1)
        Y2 :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,1)
    end
    t ::Vector{Float64}=  collect(range(-5*10^-12,stop=5*10^-12,length=dd))
    t[Int64((dd+1)/2)] = 10e-25
    x =  SharedArray(zeros(ComplexF64,dd))
    yr = SharedArray( zeros(Float64,dd))
    yi = SharedArray( zeros(Float64,dd))

    ################## Nested Loop Over Time ###################################
    @time @sync @distributed for i in range(1,dd)
        gfcpart=GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)[1]
        gtotal = GHT(t[i],b,sysize,Ωf,Ωi,J,D,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2,gfcpart,T)
        x[i]=gtotal
        println("ITERATION $i COMPLETED BY PROCESS $(myid())")
    end
    for i in range(1,dd)
        yr[i] = real(x[i])
        yi[i] = imag(x[i])
    end

As you can see, I also employed the advice of not using any global variables here by declaring all the variables within the calc() function and then passing them as parameters, I modified the definitions of GFC and GHT accordingly so that they reuse the memory rather than allocating memory on every iteration. To my surprise, the time taken increased from 39 seconds to 50 seconds ! (I excluded the compilation time by runnning the function twice).

Well, explicit types for variables in functions are usually not needed. Explicit types are mainly useful within struct definitions.

1 Like

Indeed, in most cases you don’t want to use inv, but there are cases where you do. For example, if you plan to re-use the same inv(A) over and over many times, then the one-time cost of inversion may be worth it (though, then you may still be better off using some matrix factorization). Or perhaps the inverse is of interest in its own right, not just as an intermediate step while solving equations.

It’s related to software licenses, MKL’s license is not fully compatible with the MIT license favored in the Julia community. You are free to use it, though.

1 Like

Well I am not sure what else you changed around. Your code is quite messy right now. Also I don’t think your attempt at using multiple processes to parallelize the computation works as using @everywhere inside a function does not make sense.

Let’s maybe remove the attempted parallelization for now and first focus on having a single-threaded code that works well, does not have too much type-instability, does not allocate too much, etc. If it is structured well, parallelization will be much easier.

Here I tried to disentangle your code a bit, please check whether it still works correctly:

Δ = 2.02 # in eV
dd = 11                               # Change this to modify number of points
Temp=300
kT = Temp * 1.380649 * 10^-23  # Boltzmann constant times temperature in Joules
kT = kT * 2.29371227840*10^17         # in atomic units
b = 1/kT
# let's keep these as Diagonal
Ωf = Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n')))*4.55633*10^-6   # Final state  frequencies in atomic units
Ωi = Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n')))*4.55633*10^-6  # Initial state  frequencies in atomic units
sysize = size(Ωi)[1]
# P = zeros(Float64,sysize,sysize)
# for i in range(1,sysize)
#     P[i,i] = (2*sinh(b*Ωi[i,i]/2))
# end
P = @. 2*sinh(b*Ωi/2) # use broadcasting, this is likely a Diagonal now as well
T = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
T = T*T';
D = readdlm("d.txt", Float64); # Displacement Matrix
J = readdlm("j.txt", Float64); # Duchinsky Matrix
t = collect(range(-5*10^-12,stop=5*10^-12,length=dd))
t[div(dd+1,2)] = 10e-25 # div is integer division

function calc(t,b,Δ,sysize,Ωf,Ωi,J,D,P,T)
    # allocate workspaces
    Sf = zeros(ComplexF64,sysize,sysize)
    Si = zeros(ComplexF64,sysize,sysize)
    Bf = zeros(ComplexF64,sysize,sysize)
    Bi = zeros(ComplexF64,sysize,sysize)
    X11 = zeros(ComplexF64,sysize,sysize)
    X12 = zeros(ComplexF64,sysize,sysize)
    X21 = zeros(ComplexF64,sysize,sysize)
    X22 = zeros(ComplexF64,sysize,sysize)
    Y1 = zeros(ComplexF64,sysize,1)
    Y2 = zeros(ComplexF64,sysize,1)
    
    # Result array
    x =  zeros(ComplexF64,dd)

    ################## Nested Loop Over Time ###################################
    for (i, tpoint) in enumerate(t)
        gfcpart = GFC(tpoint,b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)[1]
        gtotal = GHT(tpoint,b,sysize,Ωf,Ωi,J,D,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2,gfcpart,T)
        x[i] = gtotal
    end
    yr = real.(x)
    yi = imag.(x)
    return yi, yr # could also just return x
end

@time calc(t,b,Δ,sysize,Ωf,Ωi,J,D,P,T)
@time calc(t,b,Δ,sysize,Ωf,Ωi,J,D,P,T)

Notes:

  1. Annotating variables is most of the times unnecessary. Usually structuring the code properly does more for performance.
  2. I load all the stuff in global scope and that’s fine. Global scope is not evil, it just means code will likely not be as fast as it could but that does not matter for a bit of setup. See whether I correctly guessed which matrices are “parameters” and which are used as workspaces.
  3. I made some small changes to the code to make it a bit more idiomatic. Note that usually there is no need to convert a Diagonal to a dense Matrix and doing so means losing information that could lead to more efficient routines. See if your code still works otherwise keep converting them to dense.
  4. calc should now also be type-stable.
  5. Why did you want to use Distributed.jl to use processes to parallelize? Usually threads are just fine and much easier to handle with lower overhead. You really only should use Distributed.jl if you plan on using multiple physical compute nodes and even then it oftentimes easier to just split the parameters and run different parts of the computation in parallel like that.
2 Likes

I load all the stuff in global scope and that’s fine. Global scope is not evil, it just means code will likely not be as fast as it could but that does not matter for a bit of setup. See whether I correctly guessed which matrices are “parameters” and which are used as workspaces.

In the latest version of my code that I shall attach shortly, I have defined the matrices which are just parameters like J D etc, within the calc() , and I as you correctly guessed, X11 etc are all work-spaces. Where my code diverges from your suggestion is when you did not send the workspace as a function parameter. My function definitinon looks like

function GHT(t::Float64,b::Float64,sysize::Int64,Ωf::Matrix{Float64},Ωi::Matrix{Float64},J::Matrix{Float64},D::Matrix{Float64},Si::Matrix{ComplexF64},Sf::Matrix{ComplexF64},Bi::Matrix{ComplexF64},Bf::Matrix{ComplexF64},X11::Matrix{ComplexF64},X12::Matrix{ComplexF64},X21::Matrix{ComplexF64},X22::Matrix{ComplexF64},Y1::Matrix{ComplexF64},Y2::Matrix{ComplexF64},gfcpart::ComplexF64,T::Matrix{Float64})

I did this because as many people suggested ,

Don’t use global variables. As a quick fix you could just add const to improve performance, but the best practice (not just in Julia, in programming in general) would be to pass those as parameters.

so I passed them as parameters.

Why did you want to use Distributed.jl to use processes to parallelize? Usually threads are just fine and much easier to handle with lower overhead. You really only should use Distributed.jl if you plan on using multiple physical compute nodes and even then it oftentimes easier to just split the parameters and run different parts of the computation in parallel like that.

This is an excellent point ! I did not want to. But when I used Base.Threads , I lose the ability to define the workspace variable for each worker as I can do in Distributed , this led to a data race condition when using the @threads macro on the top for-loop as every thread wanted to access the variables X11 X12 etc.

Also to add, before it came to this, in a very primitive version of the code I did the following

  • Define all variables within the GFC and GHT functions including the workspaces.

  • Check both @threads versus @sync @distributed performance, while declaring yr and yi as SharedArrays in the later one.

The @threads seemed to provide inferior performance than the later. That is why I was discouraged from using that.

Globals are only problematic if you access them repeatedly. Note in my code, that I only ever access them once in the moment where I call calc. calc itself does not access globals and GHT and GFC also shouldn’t. That is the very important difference!

@threads is not the only way to get thread-based parallelism and is really only a convenience tool. In your other thread I explained how to use Threads.@spawn together with ChunkSplitters.jl to split the workload into chunks and preallocate workspaces for each chunk.

Addendum: Also if you use things from LinearAlgebra.jl that use BLAS, then this opens another can of worms regarding threads…

I am using BLAS operations. I have explicitly set BLAS.set_num_threads(1) to prevent it from oversubscribing the threads. I have not tried out the solution you proposed using ChunkSplitters.jl yet, I shall post on this thread after I am able to do so.

Also,

Globals are only problematic if you access them repeatedly. Note in my code, that I only ever access them once in the moment where I call calc. calc itself does not access globals and GHT and GFC also shouldn’t. That is the very important difference!

Every time when GHT and GFC are called, aren’t the globals like J D etc accessed by them ? Or is it different because they act as parameters for calc now. Also, would adding const before the globals be a good idea in this regard as per previous advice ?

Your code has not yet reached sufficient quality to make such measurements meaningful. You first need to get the core loops right.

For better interactions on this forum, I recommend cutting out all non-ascii glyphs. You may be happy to type them, but a significant minority is not. Is your triangle instead of delta really worth losing ~ 20% of potentially helpful input here? (this doesn’t apply to “private” code that only you will ever see or run)

I recommend putting a struct around the workspace stuff, for readability.

Since your time is spent on the GFC / GHT functions, maybe post these functions again, in a cleaned up fashion?

Like

function GFC(t::Float64,
T::Float64,
delta::Float64, 
sysize::Int64, # roughly 100
omega_f::Matrix{Float64}, #diagonal, sysize x sysize
omega_i::Matrix{Float64}, #diagonal, sysize x sysize
J::Matrix{Float64},
D::Matrix{Float64},
rho #oops, forgot that in the initial posting. What is this? diagonal Float64, sysize x sysize ?
)

Then we can look at the code and fix bad decisions like gratuitous use of inv or e.g.

 F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))

to

F1 = sqrt(det_Si * det_Sf * det_rho^2 /  det(Wapp))

where det_Si / det_Sf / det_rho are computed by multiplying the diagonal entries.

But generally, this looks like you took a formula / representation intended for mathematical proofs, and you used it instead for computations? This is a recipe for very bad performance.

Mathematically, a diagonal matrix is a matrix. Practically, the number of elements scales like N for the diagonal, and scales like N^2 if you insist on viewing it as a matrix. Of course your scaling will suck!

2 Likes

Yes exactly. I pass e.g. J as global into the function calc but then the J inside calc is a local variable. The local J and the global J just happen to share the same name. One would say “the global J is shadowed by the local J”. Another way to see this: you can rename J inside calc to anything you want and everything still works the same. The Jinside calc is fully independent from the J outside.

As the globals are not accessed often this should not make a noteworthy difference.

I agree with @foobar_lv2 . Please cleanup the code a bit and post the whole thing again, so we can help you improve it.

1 Like

I agree that I should have considered the fact that Si is a diagonal matrix, so I should not have declared it as a dense matrix. My cleaned up code will reflect this change. However, the F1 expression as you have written is numerrically unstable. It is essential that we use the properties of det to bring Wapp within the det() and not keep it as a division. I can still declared Si as a diagonal and not a dense matrix, but will it give any benefit in form of speed given the fact that I must multiply it with a dense matrix before taking the det() ?

As noted by others, there might be applications in which you are interested in the inverse and its elements itself. I think applications like that are rare. Having inverses available can be helpful for teaching or learning linear algebra. However if you want the inverse to multiply by a vector or matrix as you did in your code, you are solving a system, or multiple systems, of equations. For that use, working with a factorization is better in every way, both speed and stability. Julia provides helpful structs for all the common matrix factorizations (a great decision, I think), which even eliminates any advantage the inverse might have as something you can save to speed up system solving at various places in your code; a factorization works just as well for that. Avoiding inverses is a basic principle of numerical linear algebra. If you do this sort of thing a lot, you might want to pick up a book on the topic. There are a number of things covered in non-numerical linear algebra courses that don’t work well in computations.

By way of illustrating a couple of points:

using LinearAlgebra
n = 10
A = [1/(i+j-1) for j in 1:n, i in 1:n]
x = ones(Float64, n)
b = A*x

F = lu(A)
xf = F\b
@show norm(A*xf - b)

xi = inv(A)*b
@show norm(A*xi - b)

gives output

norm(A * xf - b) = 5.438959822042073e-16
norm(xf - x) = 0.0005243960103392659
norm(A * xi - b) = 0.00020718582151808406
norm(xi - x) = 0.0010553752154937755

I have set up an ill-conditioned system with known solution (all 1s). The solutions using a factorization and inversion achieve comparable forward errors xf-x and xi-x, although the factorization is slightly better, possibly because inversion involves more computation and more rounding. Both of these achieve accuracy that is about as good as could be expected given the ill conditioning of A. However, xf achieves a much smaller residual. They are not all that different in terms of accuracy of the solution, but the errors in xf are mostly in directions that are damped out by multiplication by A. So if your goal is to have a solution that when you form A*x you get something close to b, factorization is preferable. If you just want to look at x as your final result, the inverse might be OK, since the errors are comparable. But if you do further computations with x (or BB\AA in your case), large residuals can sometimes create stability problems with the later computations. Given that factorization is faster and similarly reusable, there’s rarely any reason to use numerical inverses unless computing the inverse itself is your final goal.

4 Likes

Could you expand on this? Is it an issue with displaying the glyphs, or with input? If the latter, why not just copy the code, instead of typing it out?

…what?

I.e. why is it more stable for your specific matrices to use det(A*B^-1) instead of det(A) / det(B)?

If you have a very specific setting where det(A*C) is good, and det(A)*det(C) is bad, e.g. because A and C are individually very bad matrices that are almost inverses of each other, of because the multiply mathematically kills some singularities, then you should:

  1. document that, and the physical reasons that this is the case!
  2. think very hard about whether floating-point multiplied A*C is still good (e.g. the multiplication might mathematically kill off some singularities – that doesn’t guarantee that floating-point matmul will kill off these singularities as well)
  3. think very hard whether these physical reasons suggest a different algorithm that is altogether better and that avoids ever writing down A or C

I understand that it is hard work to turn a mathematical description into an algorithm. The first naive attempts will suck. And if you want the results to be good, you will need to look at this thing holistically, and use all your domain knowledge on whatever you’re trying to compute. Ideally there would be an existing paper where the original matrix formulation is almost completely unrecognizable. But I am assuming that you are trying to write that paper?

I think your problem is not so much performance, it is that you need to mentally cut away all abstractions until you actually understand what you’re trying to compute, and then start writing down the actual computations.

A much harder example in your code is the matrix exponential. Matrix exponential can mean solving a linear ODE. Maybe an ODE solver is better for the job? Maybe you only need the leading eigenvalues? Maybe you don’t need them at full precision?

I can’t say without seeing the GHT function where you use the matrix exponential. But ultimately, this is a job for your domain knowledge.

2 Likes