I did! It improved the performance by about 5-10 seconds when the total runtime was on the order of ~300 seconds.
I did as you suggested; 0.02% gc time is what I got. I would assume that is negligible(?)
Well, in Julia you have methods to optimize your code that you do not have in Fortran, e.g. LoopVectorization.jl …
I tried using LoopVectorization.jl by applying the @turbo macro to the outermost loop. It produced an error:
ERROR: LoadError: "Indexing into the following expression was not recognized: GFC(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)"
A quick chat session with GitHub Copilot indicated that the type of function call I am trying to make is not supported under the @turbo macro.
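For reference, a minimal sketch of the kind of loop @turbo does seem to handle: plain array indexing and arithmetic in the body, with no calls to user functions like GFC. The axpy-style function below is just a hypothetical example, not from my code:
using LoopVectorization

# @turbo works on loops whose body is plain indexing + arithmetic
function axpy_turbo!(y, a, x)
    @turbo for i in eachindex(x)
        y[i] = a * x[i] + y[i]
    end
    return y
end

axpy_turbo!(zeros(100), 2.0, rand(100))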
It is possible that LoopVectorization.jl is not a good tool for your type of problem… Other people will know more about that… It was meant as an example…
I just noticed a very interesting phenomenon… while playing around with my code I moved the for m in 1:sysize and for mp in 1:sysize loops inside the GHT function. That, combined with using \ instead of inv() and using MKL instead of OpenBLAS for all LinearAlgebra operations, got me to a gross runtime of 141 seconds, compared to the ~150 seconds needed by the Fortran code.
Although I am very happy about this small success, I have a few follow-up questions:
- If \ is better than inv() in terms of speed, causes no numerical instability, and is natively supported in Julia, then why would someone ever use the inv() call? Doesn't that become redundant? Are there any scenarios where inv() is necessary?
- Why would just relocating the two nested loops within the GHT function increase performance?
- On the same line as the first point, if MKL is so much faster than the default OpenBLAS and using it is as simple as putting a single line of code in the top-level scope, why would anyone ever use the default library?
- This is more of a math question than a Julia question. Sometimes, someone will need the inverse, so it's included for completeness. But, perhaps more importantly, including the inv function with Julia creates an interface that can be extended and used from user packages. Suppose, for example, that someone creates a package with some type where it makes sense to define the inverse for a value of that type. The package creator may overload the inv function from Base, which is much nicer and more useful for users than if the package creator just created a new function with the same functionality.
- Julia compiles code that's within a function. See the Performance tips that I pointed you to in the first reply.
- Not sure, maybe for legal reasons? Also, I'm not sure how cross-platform MKL is, and I think Intel intentionally cripples MKL performance when running on AMD hardware.
I'll expand slightly on the second point of @nsajko.
All of Julia's code is compiled, but that does not mean it is always the same "speed". If the compiler cannot, for some reason, infer the types of variables, it cannot emit efficient code. This is called type instability. So if there is some variable of unknown (better: uninferrible) type, every time you do something with it Julia checks at runtime what type it actually has and performs a dynamic dispatch.
The interesting thing to realize is: type instability is not contagious! After performing the dynamic dispatch to another function, all types are known inside that function! This is essentially what the performance tips call a function barrier.
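A minimal sketch of such a function barrier (the kernel/outer functions below are hypothetical, with readdlm standing in for the kind of file loading that the compiler cannot infer):
using DelimitedFiles

function kernel(A)          # compiled and specialized for the concrete type of A
    s = 0.0
    for x in A
        s += x^2
    end
    return s
end

function outer(path)
    A = readdlm(path)       # the type of A is not inferred precisely here
    return kernel(A)        # one dynamic dispatch, then fast specialized code runs inside kernel
end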
In your code, I think the type instability stems from loading the matrices from a file. So every time you call GFC and GHT you get a dynamic dispatch. If you move the loops inside of those functions, you call them many fewer times, so you have less dynamic dispatch and runtime improves. Note that this is essentially caused by having rather unclean code: you load the matrices inside the function. It would be better design to load the input matrices at top-level scope and then pass them to the function as arguments. That way no type instability occurs.
Disclaimer: I only briefly looked over the code and might have missed other sources of type instability. Usually I/O is type-unstable, but that is just a guess of mine here. You can check using the @code_warntype calc() macro, but its output can be a bit overwhelming, especially with such long functions. If you see red, that means the compiler cannot infer the type and a function barrier will help performance.
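As a hypothetical toy example of what the red output means (not taken from the posted code):
# The element type of `data` is Any, so the result of first(data) + first(data)
# cannot be inferred; @code_warntype shows it as ::Any (in red).
data = Any[1, 2.0, 3]

addfirst(v) = first(v) + first(v)

@code_warntype addfirst(data)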
You can check using the @code_warntype calc() macro, but its output can be a bit overwhelming, especially with such long functions. If you see red, that means the compiler cannot infer the type and a function barrier will help performance.
I tried running that; as expected, it gave an overwhelming amount of output. But I was able to decipher that I have not explicitly mentioned the type of most variables like Ωi etc. So I explicitly typed them as follows:
function calc()
@everywhere begin
Δ::Float64 = 2.02 # in eV
dd::Int64 = 11 # Change this to modify number of points
Temp::Int64=300
kT = Temp * 1.380649 * 10^-23 # Boltzmann constant times temperature in Joules
kT= kT * 2.29371227840*10^17 # in atomic units
b ::Float64= 1/kT
Ωf ::Matrix{Float64}= Matrix(Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n'))))*4.55633*10^-6 # Final state frequencies in atomic units
Ωi ::Matrix{Float64}= Matrix(Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n'))))*4.55633*10^-6 # Initial state frequencies in atomic units
sysize ::Int64 = size(Ωi)[1]
P = zeros(Float64,sysize,sysize)
for i in range(1,sysize)
P[i,i] = (2*sinh(b*Ωi[i,i]/2))
end
T ::Matrix{Float64}= readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
T = T*T';
D ::Matrix{Float64}= readdlm("d.txt", Float64); # Displacement Matrix
J ::Matrix{Float64}= readdlm("j.txt", Float64); # Duchinsky Matrix
Sf :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,sysize)
Si :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,sysize)
Bf :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,sysize)
Bi :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,sysize)
X11 ::Matrix{ComplexF64} = zeros(ComplexF64,sysize,sysize)
X12 ::Matrix{ComplexF64} = zeros(ComplexF64,sysize,sysize)
X21 ::Matrix{ComplexF64} = zeros(ComplexF64,sysize,sysize)
X22 ::Matrix{ComplexF64} = zeros(ComplexF64,sysize,sysize)
Y1 :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,1)
Y2 :: Matrix{ComplexF64}= zeros(ComplexF64,sysize,1)
end
t ::Vector{Float64}= collect(range(-5*10^-12,stop=5*10^-12,length=dd))
t[Int64((dd+1)/2)] = 10e-25
x = SharedArray(zeros(ComplexF64,dd))
yr = SharedArray( zeros(Float64,dd))
yi = SharedArray( zeros(Float64,dd))
################## Nested Loop Over Time ###################################
@time @sync @distributed for i in range(1,dd)
gfcpart=GFC(t[i],b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)[1]
gtotal = GHT(t[i],b,sysize,Ωf,Ωi,J,D,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2,gfcpart,T)
x[i]=gtotal
println("ITERATION $i COMPLETED BY PROCESS $(myid())")
end
for i in range(1,dd)
yr[i] = real(x[i])
yi[i] = imag(x[i])
end
As you can see, I also followed the advice of not using any global variables by declaring all the variables within the calc() function and then passing them as parameters. I modified the definitions of GFC and GHT accordingly so that they reuse the memory rather than allocating memory on every iteration. To my surprise, the time taken increased from 39 seconds to 50 seconds! (I excluded the compilation time by running the function twice.)
Well, explicit types for variables in functions are usually not needed. Explicit types are mainly useful within struct definitions.
Indeed, in most cases you don't want to use inv, but there are cases where you do. For example, if you plan to re-use the same inv(A) over and over many times, then the one-time cost of inversion may be worth it (though even then you may be better off using a matrix factorization). Or perhaps the inverse is of interest in its own right, not just as an intermediate step in solving equations.
It's related to software licenses: MKL's license is not fully compatible with the MIT license favored in the Julia community. You are free to use it, though.
Well, I am not sure what else you changed around; your code is quite messy right now. Also, I don't think your attempt at using multiple processes to parallelize the computation works, as using @everywhere inside a function does not make sense.
Let's maybe remove the attempted parallelization for now and first focus on having single-threaded code that works well, does not have too much type instability, does not allocate too much, etc. If it is structured well, parallelization will be much easier.
Here I tried to disentangle your code a bit, please check whether it still works correctly:
Δ = 2.02 # in eV
dd = 11 # Change this to modify number of points
Temp=300
kT = Temp * 1.380649 * 10^-23 # Boltzmann constant times temperature in Joules
kT = kT * 2.29371227840*10^17 # in atomic units
b = 1/kT
# let's keep these as Diagonal
Ωf = Diagonal(vec(readdlm("wf.txt", '\t', Float64, '\n')))*4.55633*10^-6 # Final state frequencies in atomic units
Ωi = Diagonal(vec(readdlm("wi.txt", '\t', Float64, '\n')))*4.55633*10^-6 # Initial state frequencies in atomic units
sysize = size(Ωi)[1]
# P = zeros(Float64,sysize,sysize)
# for i in range(1,sysize)
# P[i,i] = (2*sinh(b*Ωi[i,i]/2))
# end
P = @. 2*sinh(b*Ωi/2) # use broadcasting, this is likely a Diagonal now as well
T = readdlm("nac.txt", '\t', Float64, '\n'); # NAC matrix
T = T*T';
D = readdlm("d.txt", Float64); # Displacement Matrix
J = readdlm("j.txt", Float64); # Duchinsky Matrix
t = collect(range(-5*10^-12,stop=5*10^-12,length=dd))
t[div(dd+1,2)] = 10e-25 # div is integer division
function calc(t,b,Δ,sysize,Ωf,Ωi,J,D,P,T)
# allocate workspaces
Sf = zeros(ComplexF64,sysize,sysize)
Si = zeros(ComplexF64,sysize,sysize)
Bf = zeros(ComplexF64,sysize,sysize)
Bi = zeros(ComplexF64,sysize,sysize)
X11 = zeros(ComplexF64,sysize,sysize)
X12 = zeros(ComplexF64,sysize,sysize)
X21 = zeros(ComplexF64,sysize,sysize)
X22 = zeros(ComplexF64,sysize,sysize)
Y1 = zeros(ComplexF64,sysize,1)
Y2 = zeros(ComplexF64,sysize,1)
# Result array
x = zeros(ComplexF64,dd)
################## Nested Loop Over Time ###################################
for (i, tpoint) in enumerate(t)
gfcpart = GFC(tpoint,b,Δ,sysize,Ωf,Ωi,J,D,P,Si,Sf,Bi,Bf)[1]
gtotal = GHT(tpoint,b,sysize,Ωf,Ωi,J,D,Si,Sf,Bi,Bf,X11,X12,X21,X22,Y1,Y2,gfcpart,T)
x[i] = gtotal
end
yr = real.(x)
yi = imag.(x)
return yi, yr # could also just return x
end
@time calc(t,b,Δ,sysize,Ωf,Ωi,J,D,P,T)
@time calc(t,b,Δ,sysize,Ωf,Ωi,J,D,P,T)
Notes:
- Annotating variables is most of the time unnecessary. Usually structuring the code properly does more for performance.
- I load all the stuff in global scope and that's fine. Global scope is not evil, it just means code will likely not be as fast as it could, but that does not matter for a bit of setup. See whether I correctly guessed which matrices are "parameters" and which are used as workspaces.
- I made some small changes to the code to make it a bit more idiomatic. Note that usually there is no need to convert a Diagonal to a dense Matrix, and doing so means losing information that could lead to more efficient routines (see the sketch after this list). See if your code still works; otherwise keep converting them to dense. calc should now also be type-stable.
- Why did you want to use Distributed.jl to parallelize with processes? Usually threads are just fine and much easier to handle, with lower overhead. You really only should use Distributed.jl if you plan on using multiple physical compute nodes, and even then it is oftentimes easier to just split the parameters and run different parts of the computation in parallel like that.
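As a small illustration of the Diagonal point (hypothetical sizes, not your actual matrices):
using LinearAlgebra

w  = rand(100)
Dd = Diagonal(w)        # stores only the 100 diagonal entries
Dm = Matrix(Dd)         # dense 100x100 copy of the same data

v = rand(100)
Dd * v                  # O(N): elementwise scaling via a specialized method
Dm * v                  # O(N^2): generic dense matrix-vector product
det(Dd)                 # product of the diagonal entries, no factorization needed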
I load all the stuff in global scope and that's fine. Global scope is not evil, it just means code will likely not be as fast as it could, but that does not matter for a bit of setup. See whether I correctly guessed which matrices are "parameters" and which are used as workspaces.
In the latest version of my code, which I shall attach shortly, I have defined the matrices which are just parameters, like J, D, etc., within calc(), and, as you correctly guessed, X11 etc. are all workspaces. Where my code diverges from your suggestion is that you did not pass the workspaces as function parameters. My function definition looks like
function GHT(t::Float64,b::Float64,sysize::Int64,Ωf::Matrix{Float64},Ωi::Matrix{Float64},J::Matrix{Float64},D::Matrix{Float64},Si::Matrix{ComplexF64},Sf::Matrix{ComplexF64},Bi::Matrix{ComplexF64},Bf::Matrix{ComplexF64},X11::Matrix{ComplexF64},X12::Matrix{ComplexF64},X21::Matrix{ComplexF64},X22::Matrix{ComplexF64},Y1::Matrix{ComplexF64},Y2::Matrix{ComplexF64},gfcpart::ComplexF64,T::Matrix{Float64})
I did this because, as many people suggested,
Don't use global variables. As a quick fix you could just add const to improve performance, but the best practice (not just in Julia, in programming in general) would be to pass those as parameters.
so I passed them as parameters.
Why did you want to use Distributed.jl to parallelize with processes? Usually threads are just fine and much easier to handle, with lower overhead. You really only should use Distributed.jl if you plan on using multiple physical compute nodes, and even then it is oftentimes easier to just split the parameters and run different parts of the computation in parallel like that.
This is an excellent point! I did not want to. But when I used Base.Threads, I lost the ability to define the workspace variables for each worker as I can in Distributed; this led to a data race when using the @threads macro on the top for-loop, as every thread wanted to access the variables X11, X12, etc.
Also to add: before it came to this, in a very primitive version of the code I did the following:
- Define all variables within the GFC and GHT functions, including the workspaces.
- Check both @threads versus @sync @distributed performance, while declaring yr and yi as SharedArrays in the latter.
The @threads version seemed to provide inferior performance compared to the latter. That is why I was discouraged from using it.
Globals are only problematic if you access them repeatedly. Note that in my code I only ever access them once, at the moment where I call calc. calc itself does not access globals, and GHT and GFC also shouldn't. That is the very important difference!
@threads is not the only way to get thread-based parallelism; it is really only a convenience tool. In your other thread I explained how to use Threads.@spawn together with ChunkSplitters.jl to split the workload into chunks and preallocate workspaces for each chunk.
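For illustration, here is a rough sketch of that per-chunk-workspace pattern using only Base (Threads.@spawn plus Iterators.partition for the chunking; ChunkSplitters.jl streamlines the same splitting). GFC/GHT and their arguments are assumed to be defined as in the code posted above:
function calc_threaded(t, b, Δ, sysize, Ωf, Ωi, J, D, P, T; nchunks = Threads.nthreads())
    x = zeros(ComplexF64, length(t))
    chunks = Iterators.partition(eachindex(t), cld(length(t), nchunks))
    tasks = map(chunks) do idxs
        Threads.@spawn begin
            # one set of workspaces per task, so tasks never share X11, X12, ...
            Sf  = zeros(ComplexF64, sysize, sysize); Si  = zeros(ComplexF64, sysize, sysize)
            Bf  = zeros(ComplexF64, sysize, sysize); Bi  = zeros(ComplexF64, sysize, sysize)
            X11 = zeros(ComplexF64, sysize, sysize); X12 = zeros(ComplexF64, sysize, sysize)
            X21 = zeros(ComplexF64, sysize, sysize); X22 = zeros(ComplexF64, sysize, sysize)
            Y1  = zeros(ComplexF64, sysize, 1);      Y2  = zeros(ComplexF64, sysize, 1)
            for i in idxs
                gfcpart = GFC(t[i], b, Δ, sysize, Ωf, Ωi, J, D, P, Si, Sf, Bi, Bf)[1]
                x[i] = GHT(t[i], b, sysize, Ωf, Ωi, J, D, Si, Sf, Bi, Bf,
                           X11, X12, X21, X22, Y1, Y2, gfcpart, T)
            end
        end
    end
    foreach(wait, tasks)
    return x
end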
Addendum: also, if you use things from LinearAlgebra.jl that use BLAS, then this opens another can of worms regarding threads…
I am using BLAS operations. I have explicitly set BLAS.set_num_threads(1) to prevent it from oversubscribing the threads. I have not tried out the solution you proposed using ChunkSplitters.jl yet; I shall post in this thread once I am able to do so.
Also,
Globals are only problematic if you access them repeatedly. Note that in my code I only ever access them once, at the moment where I call calc. calc itself does not access globals, and GHT and GFC also shouldn't. That is the very important difference!
Every time GHT and GFC are called, aren't the globals like J, D, etc. accessed by them? Or is it different because they act as parameters for calc now? Also, would adding const before the globals be a good idea in this regard, as per the previous advice?
Your code has not yet reached sufficient quality to make such measurements meaningful. You first need to get the core loops right.
For better interactions on this forum, I recommend cutting out all non-ASCII glyphs. You may be happy to type them, but a significant minority of readers is not. Is your triangle instead of delta really worth losing ~20% of potentially helpful input here? (This doesn't apply to "private" code that only you will ever see or run.)
I recommend putting a struct around the workspace stuff, for readability.
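A rough sketch of such a workspace struct, with the field list taken from the code posted above:
struct Workspace
    Sf::Matrix{ComplexF64}
    Si::Matrix{ComplexF64}
    Bf::Matrix{ComplexF64}
    Bi::Matrix{ComplexF64}
    X11::Matrix{ComplexF64}
    X12::Matrix{ComplexF64}
    X21::Matrix{ComplexF64}
    X22::Matrix{ComplexF64}
    Y1::Matrix{ComplexF64}
    Y2::Matrix{ComplexF64}
end

# convenience constructor: all buffers for a given system size
Workspace(sysize::Int) = Workspace(
    (zeros(ComplexF64, sysize, sysize) for _ in 1:8)...,
    zeros(ComplexF64, sysize, 1),
    zeros(ComplexF64, sysize, 1))

# GHT(t, b, sysize, Ωf, Ωi, J, D, ws::Workspace, gfcpart, T) would then read ws.X11 etc.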
Since your time is spent on the GFC / GHT functions, maybe post these functions again, in a cleaned up fashion?
Like
function GFC(t::Float64,
T::Float64,
delta::Float64,
sysize::Int64, # roughly 100
omega_f::Matrix{Float64}, #diagonal, sysize x sysize
omega_i::Matrix{Float64}, #diagonal, sysize x sysize
J::Matrix{Float64},
D::Matrix{Float64},
rho #oops, forgot that in the initial posting. What is this? diagonal Float64, sysize x sysize ?
)
Then we can look at the code and fix bad decisions like gratuitous use of inv, or e.g. change
F1 = sqrt(det(Si*Sf*ρ^2*Wapp^-1))
to
F1 = sqrt(det_Si * det_Sf * det_rho^2 / det(Wapp))
where det_Si / det_Sf / det_rho are computed by multiplying the diagonal entries.
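For illustration, a small sketch of that computation with hypothetical stand-ins (Si, Sf, rho diagonal; Wapp dense, as in the posted code):
using LinearAlgebra

n    = 4
Si   = Diagonal(rand(ComplexF64, n))
Sf   = Diagonal(rand(ComplexF64, n))
rho  = Diagonal(rand(n))
Wapp = rand(ComplexF64, n, n)

det_Si  = prod(diag(Si))     # determinant of a diagonal matrix = product of its diagonal
det_Sf  = prod(diag(Sf))
det_rho = prod(diag(rho))

F1 = sqrt(det_Si * det_Sf * det_rho^2 / det(Wapp))
# if stored as Diagonal, det(Si) etc. already specialize to this O(N) product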
But generally, this looks like you took a formula / representation intended for mathematical proofs, and you used it instead for computations? This is a recipe for very bad performance.
Mathematically, a diagonal matrix is a matrix. Practically, the number of elements scales like N for the diagonal, and scales like N^2 if you insist on viewing it as a matrix. Of course your scaling will suck!
Yes, exactly. I pass e.g. J as a global into the function calc, but the J inside calc is then a local variable. The local J and the global J just happen to share the same name. One would say "the global J is shadowed by the local J". Another way to see this: you can rename J inside calc to anything you want and everything still works the same. The J inside calc is fully independent from the J outside.
As the globals are not accessed often this should not make a noteworthy difference.
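A minimal sketch of the shadowing, reusing the readdlm call from the posted code (the trivial body is just for illustration):
using DelimitedFiles

J = readdlm("j.txt", Float64)     # global J, read exactly once below

function calc(J)                  # this J is a brand-new local variable
    return sum(J)                 # only the local J is used in here
end

calc(J)                           # the global J is passed in at this single call site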
I agree with @foobar_lv2. Please clean up the code a bit and post the whole thing again, so we can help you improve it.
I agree that I should have considered the fact that Si is a diagonal matrix, so I should not have declared it as a dense matrix. My cleaned-up code will reflect this change. However, the F1 expression as you have written it is numerically unstable. It is essential that we use the properties of det to bring Wapp inside the det() and not keep it as a division. I can still declare Si as a Diagonal and not a dense matrix, but will that give any speed benefit given that I must multiply it with a dense matrix before taking the det()?
As noted by others, there might be applications in which you are interested in the inverse and its elements themselves. I think applications like that are rare. Having inverses available can be helpful for teaching or learning linear algebra. However, if you want the inverse in order to multiply it by a vector or matrix, as you did in your code, you are solving a system, or multiple systems, of equations. For that use, working with a factorization is better in every way, both in speed and stability. Julia provides helpful structs for all the common matrix factorizations (a great decision, I think), which even eliminates any advantage the inverse might have as something you can save to speed up system solving at various places in your code; a factorization works just as well for that. Avoiding inverses is a basic principle of numerical linear algebra. If you do this sort of thing a lot, you might want to pick up a book on the topic. There are a number of things covered in non-numerical linear algebra courses that don't work well in computations.
By way of illustrating a couple of points:
using LinearAlgebra
n = 10
A = [1/(i+j-1) for j in 1:n, i in 1:n]
x = ones(Float64, n)
b = A*x
F = lu(A)
xf = F\b
@show norm(A*xf - b)
@show norm(xf - x)
xi = inv(A)*b
@show norm(A*xi - b)
@show norm(xi - x)
gives output
norm(A * xf - b) = 5.438959822042073e-16
norm(xf - x) = 0.0005243960103392659
norm(A * xi - b) = 0.00020718582151808406
norm(xi - x) = 0.0010553752154937755
I have set up an ill-conditioned system with known solution (all 1s). The solutions using a factorization and inversion achieve comparable forward errors xf-x and xi-x, although the factorization is slightly better, possibly because inversion involves more computation and more rounding. Both of these achieve accuracy that is about as good as could be expected given the ill-conditioning of A. However, xf achieves a much smaller residual. They are not all that different in terms of accuracy of the solution, but the errors in xf are mostly in directions that are damped out by multiplication by A. So if your goal is to have a solution such that when you form A*x you get something close to b, factorization is preferable. If you just want to look at x as your final result, the inverse might be OK, since the errors are comparable. But if you do further computations with x (or BB\AA in your case), large residuals can sometimes create stability problems in the later computations. Given that factorization is faster and similarly reusable, there's rarely any reason to use numerical inverses unless computing the inverse itself is your final goal.
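To illustrate the reusability point, a small sketch with a hypothetical A: factor once, then solve for as many right-hand sides as needed:
using LinearAlgebra

A  = rand(100, 100)
F  = lu(A)                 # one-time O(n^3) cost, like computing inv(A)
b1 = rand(100)
b2 = rand(100)
x1 = F \ b1                # each solve is a cheap pair of triangular solves
x2 = F \ b2                # the stored factorization replaces a stored inverse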
Could you expand on this? Is it an issue with displaying the glyphs, or with input? If the latter, why not just copy the code, instead of typing it out?
…what?
I.e. why is it more stable for your specific matrices to use det(A*B^-1) instead of det(A) / det(B)?
If you have a very specific setting where det(A*C) is good and det(A)*det(C) is bad, e.g. because A and C are individually very bad matrices that are almost inverses of each other, or because the multiplication mathematically kills some singularities, then you should:
- document that, and the physical reasons why this is the case!
- think very hard about whether the floating-point product A*C is still good (the multiplication might mathematically kill off some singularities, but that doesn't guarantee that floating-point matmul will kill off these singularities as well)
- think very hard about whether these physical reasons suggest a different algorithm that is altogether better and that avoids ever writing down A or C
I understand that it is hard work to turn a mathematical description into an algorithm. The first naive attempts will suck. And if you want the results to be good, you will need to look at this thing holistically and use all your domain knowledge on whatever you're trying to compute. Ideally there would be an existing paper where the original matrix formulation is almost completely unrecognizable. But I am assuming that you are trying to write that paper?
I think your problem is not so much performance; it is that you need to mentally cut away all abstractions until you actually understand what you're trying to compute, and then start writing down the actual computations.
A much harder example in your code is the matrix exponential. A matrix exponential can mean solving a linear ODE. Maybe an ODE solver is better for the job? Maybe you only need the leading eigenvalues? Maybe you don't need them at full precision?
I can't say without seeing the GHT function where you use the matrix exponential. But ultimately, this is a job for your domain knowledge.