Try changing the number of BLAS threads using LinearAlgebra.BLAS.set_num_threads
for Hermitian
matrices, you can make it faster by calling A = Hermitian(A + A');
before the call to eigen
Thanks for the replies!
@baggepinnen What number do you recommend? With OpenBLAS, using one thread reduces the time modestly compared to the default number, as can be seen in the linked thread above. If I set the number of BLAS threads to one while using the MKL package, the computation time appears to increase.
@Oscar_Smith Using Hermitian
doesn’t improve the computation time for me.
If you use only one Julia thread, you should probably set the number of mkl threads to the number of physical or virtual cores.
For anyone about to look into this, yes MATLAB also uses double precision by default.
I guess that with one or no output argument Matlab’s eig just computes the eigenvalues and does not particularly compute/store the eigenvectors. In contrast, in Julia the eigen computes full eigendecomposition, that is, both eigenvalues and eigenvectors. If only eigenvalues are required, use eigvals instead. The performance of Matlab and Julia are then pretty much identical on my old laptop (running Linux). In fact, Julia even a bit faster.
In Matlab 2022a (and yes, I did run the code a few times):
>> A = rand(1000, 1000) + 1i * rand(1000, 1000); A = A + A';
>> tic, eig(A); toc
Elapsed time is 0.401051 seconds.
>> maxNumCompThreads
ans =
2
and in Julia 1.7.2
julia> using LinearAlgebra
julia> using MKL
julia> using BenchmarkTools
julia> A = rand(1000, 1000) + im * rand(1000, 1000);
julia> A = A + A';
julia> A = Hermitian(A);
julia> LinearAlgebra.BLAS.set_num_threads(2)
julia> @btime eigen(A);
628.196 ms (16 allocations: 46.57 MiB)
julia> @btime eigvals(A);
297.065 ms (13 allocations: 16.05 MiB)
Indeed, I am afraid that even the title of the post mislead us a bit because in your Matlab code you are really just computing the eigenvalues, which are not quite sufficient to perform diagonalization of a matrix.
So MATLAB is able to change the computation in eig() based on the number of outputs it expects. Julia has no such feature for its eigensolvers. Instead, to compare diagonalization in Julia vs. MATLAB (i.e. both eigenvectors and eigenvalues are computed and stored), we should compare performance of the following lines in Julia
julia> using LinearAlgebra
using MKL
using BenchmarkTools
A = rand(1000, 1000) + im * rand(1000, 1000);
A = A + A';
@btime((T, V) = eigen(A));
469.777 ms (18 allocations: 46.57 MiB)
and in MATLAB (this is a representative example of many trials)
>> A=rand(1000,1000)+1i*rand(1000,1000);
A=A+A';
tic
[T,V]=eig(A);
toc
Elapsed time is 0.543862 seconds.
Indeed, the performance on my machine is about the same, with Julia edging out a win, and therefore my choice of example is a misleading illustration. Unfortunately, however, my issue is not resolved. My program which involves diagonalizing Hermitian matrices in a for loop (with two outputs from eig() in the MATLAB version) still takes more than twice the time in Julia compared to MATLAB. I still think the eigensolver is the issue since timing each evaluation of them (by surrounding the line with tic toc
in MATLAB and using @elapsed
in Julia) gives the following times in Julia (in this case the size of the matrices is 288 x 288)
julia> MyProgram()
0.0362041
0.0379768
0.0364687
0.0407733
0.0351825
0.0347786
0.0336019
...
and MATLAB
>> MyProgram
Elapsed time is 0.018229 seconds.
Elapsed time is 0.035258 seconds.
Elapsed time is 0.020117 seconds.
Elapsed time is 0.016069 seconds.
Elapsed time is 0.007733 seconds.
Elapsed time is 0.015377 seconds.
Elapsed time is 0.011883 seconds.
Elapsed time is 0.011649 seconds.
Elapsed time is 0.011468 seconds.
...
I guessed that the Julia eigen() performance suffers when it is put into a for
loop. So I tried to construct an example. In Julia, I defined a function to compute the average time for eigen()
function myfunc()
@eval using LinearAlgebra
@eval using MKL
@eval using BenchmarkTools
@eval using Statistics
M = 288
time = Array{Float64}(undef, 0)
for i in 1:100
A = rand(M, M) + im * rand(M, M)
A = A + A'
time = push!(time, @elapsed((T, V) = eigen(A)))
end
println(mean(time))
end
julia> myfunc()
0.020339547999999995
And similarly in MATLAB
function [time] = myfunc()
M = 288;
time = 0;
for i = 1:100
A = rand(M ,M) + 1i * rand(M ,M);
A = A + A';
tic
[T, V] = eig(A);
elapsed = toc;
time = time + elapsed/100;
end
end
>> myfunc()
ans =
0.0144
Others can verify whether this is a fair comparison. The MATLAB times are consistent across examples, but the Julia times are slower than MATLAB and slowest when running my program. Any idea what might be happening here? Again, happy to provide more details!
If you are putting it in a loop like your MWE, it is probably most efficient to create a temporary scratch array before the loop and then call eigen!
First, I think it is not common in Julia to import (using the using
) a package within a function definition. But I confess I am still not quite fluent in this. Anyway, I would move it out of the function definition
using LinearAlgebra
using MKL
using BenchmarkTools
using Statistics
function myfunc()
M = 288
time = Array{Float64}(undef, 0)
for i in 1:100
A = rand(M, M) + im * rand(M, M)
A = A + A'
time = push!(time, @elapsed((T, V) = eigen(A)))
end
println(mean(time))
end
Second, when running the two codes, I get essentially identical timing
In Matlab:
>> myfunc
ans =
0.0215
In Julia:
julia> myfunc()
0.020213900150000002
I tried it on my friend’s nice Lenovo laptop running Windows (my computer also runs Windows), and here are the representative results
julia> myfunc()
0.018389479
>> myfunc
ans =
0.0129
But again, the more worrying thing is that when I time eigen
in my program, the performance suffers quite a bit.
There can be a large number of things that may be non-intuitive to a new user, but without knowing the code we can only help with what we’re given - in this case the eigen
problem itself (which I think has been solved, judging from the time difference there?).
So if you need help with a larger performance problem, feel free to share more details/code!
Here is, I think, the problematic portion of my code. Please see comments therein.
for iky in 1:length(KY), ikx in 1:length(KX) #KX and KY are ranges defined earlier
HBdG = [Hp[ikx, iky] Delta; Delta' -conj(Hm[ikx, iky])] #= This builds the matrix to be
diagonalized using Hp and Hm, which are arrays of 144 x 144 matrices and Delta, which is a
144 x 144 matrix. All of these matrices are complex-valued. =#
time = @elapsed((Ep, Up) = eigen(HBdG)) # I time the diagonalization.
println(time) # The results of the timing are printed. For results, see post 8 in this thread.
#=The rest is just number-cruching the eigenvalues and eigenvectors. I assume it's
not important for this discussion, but I include it anyway. =#
Up11 = @view Up[1:4*M*N, 1:4*M*N] #M and N are Int64's previously defined.
Up21 = @view Up[4*M*N+1:8*M*N, 1:4*M*N]
Up12 = @view Up[1:4*M*N, 4*M*N+1:8*M*N]
Up22 = @view Up[4*M*N+1:8*M*N, 4*M*N+1:8*M*N]
Um = conj([Up22 Up21; Up12 Up22])
Etop = @view Ep[4*M*N+1:8*M*N]
Ebott = @view Ep[1:4*M*N]
Em = -[Etop; Ebott]
Gp = transpose(Up * (diagm(Ep) .< 0) * Up')
Gm = transpose(Um * (diagm(Em) .< 0) * Um')
deltanew += U / L^2 * (diag(Gp[5*M*N+1:6*M*N, 1:M*N]) + diag(Gm[5*M*N+1:6*M*N, 1:M*N])) # U is a Float64 and L is an Int64, previously defined.
end
The printed elapsed times for the diagonalization average around .33 seconds – longer than what I get if I diagonalize a single matrix alone. Based on this slowdown, my guess is that this is the bottleneck in my code. Any comments welcome!
I tried your benchmark and am getting almost identical Matlab and Julia results, about 0.012 sec on my laptop.
Nevertheless, here’s some speculation about why Matlab could be faster. In the last few years, they’ve made enormous gains in their JIT compiler (sometimes 10x anecdotally). It is possible that their compiler can figure out [T, V] = eig(A)
repeatedly over-writes T
and V
, and so they don’t need to re-allocate them. If I time allocations in a loop, e.g. V = zeros(M, M)
100 times, the first ten iterations are slow (the ~5th one is really slow), and the last 90 are at a much faster steady state. This leads me to suspect the JIT.
Julia does have an “in-place” eigen!(A)
but that only avoids allocating a copy of A
, and still allocates and returns T
and V
. It should probably save some time, but perhaps not the difference you see.
There is a ton of scientific computing where the same operations are repeated many times using the same fixed-size arrays. A lot of performance is lost from re-allocating things that could just be over-written. It would be great if there were fully non-allocating versions of key functions like eigen!(T, V, A)
, or maybe a smarter compiler, e.g.
A = rand(M, M)
T = zeros(M)
V = zeros(M,M)
@reuse T, V = eigen!(A) # over-write T, V, and A with zero allocations
Thanks a lot for your thoughts! It sounds like, if you are right, there’s nothing I can do about it at the moment and MATLAB is faster for my application. I have to say that’s disappointing. I was excited about migrating this and other programs to Julia.
But given that my MWE may not be a correct representation since MATLAB and Julia have about the same performance for folks here (but Julia weirdly seems to perform slower on my computer and my friend’s computer), I was wondering how Julia would do with this code, which is just the for
loop in my program with all the variables pre-defined. It should be runnable.
using LinearAlgebra
using MKL
using BenchmarkTools
using Statistics
M=6
N = 6
L = 10
U = 1
KX = range(0, π / N, ceil(Int, L / 2))
KY = range((-π + 2π / L) / M, π / M, L)
Hp = Matrix{Matrix{ComplexF64}}(undef, length(KX), length(KY))
Hm = Matrix{Matrix{ComplexF64}}(undef, length(KX), length(KY))
for iky in 1:length(KY), ikx in 1:length(KX)
Hp[ikx,iky] = Hermitian(rand(ComplexF64,4M*N,4M*N));
Hm[ikx,iky] = Hermitian(rand(ComplexF64,4M*N,4M*N));
end
Delta = rand(144,144);
deltanew = zeros(M*N);
for iky in 1:length(KY), ikx in 1:length(KX) #KX and KY are ranges defined earlier
HBdG = [Hp[ikx, iky] Delta; Delta' -conj(Hm[ikx, iky])]
#= This builds the matrix to be
diagonalized using Hp and Hm, which are arrays of 144 x 144 matrices and Delta, which is a 144 x 144 matrix. All of these matrices are complex-valued. =#
time = @elapsed((Ep, Up) = eigen(HBdG)) # I time the diagonalization.
println(time) # The results of the timing are printed. For results, see post 8 in this thread.
#=The rest is just number-cruching the eigenvalues and eigenvectors. I assume it's
not important for this discussion, but I include it anyway. =#
Up11 = @view Up[1:4*M*N, 1:4*M*N] #M and N are Int64's previously defined.
Up21 = @view Up[4*M*N+1:8*M*N, 1:4*M*N]
Up12 = @view Up[1:4*M*N, 4*M*N+1:8*M*N]
Up22 = @view Up[4*M*N+1:8*M*N, 4*M*N+1:8*M*N]
Um = conj([Up22 Up21; Up12 Up22])
Etop = @view Ep[4*M*N+1:8*M*N]
Ebott = @view Ep[1:4*M*N]
Em = -[Etop; Ebott]
Gp = transpose(Up * (diagm(Ep) .< 0) * Up')
Gm = transpose(Um * (diagm(Em) .< 0) * Um')
global deltanew += U / L^2 * (diag(Gp[5*M*N+1:6*M*N, 1:M*N]) .+ diag(Gm[5*M*N+1:6*M*N, 1:M*N])) # U is a Float64 and L is an Int64, previously defined.
end
The output for the @elapsed eigen
are the following
1.4806952
0.0229872
0.020562
0.0225279
0.0247384
0.0206821
0.0251862
0.0277046
...
I expect that if Julia were running at “full speed”, we would see times closer to .012 seconds or so, as others have found. I wonder if this is reproducible and if others have thoughts or tips on this. Thanks!!
Here is, I think, the analogous script for MATLAB
M = 6;
N = 6;
L = 10;
U = 1;
KX = 1:5;
KY = 1:10;
Hp=cell(size(KX));
Hm=cell(size(KX));
for iky = 1:length(KY)
for ikx = 1:length(KX)
A1 = rand(4*M*N,4*M*N) + 1i*rand(4*M*N,4*M*N);
A2 = rand(4*M*N,4*M*N) + 1i*rand(4*M*N,4*M*N);
Hp{ikx,iky} = (A1 + A1')/2;
Hm{ikx,iky} = (A2 + A2')/2;
end
end
Delta = rand(144,144);
deltanew = zeros(M*N);
for iky = 1:length(KY)
for ikx = 1:length(KX) %KX and KY are ranges defined earlier
HBdG = [Hp{ikx, iky}, Delta; Delta', -conj(Hm{ikx, iky})];
%= This builds the matrix to be
% diagonalized using Hp and Hm, which are arrays of 144 x 144 matrices and Delta, which is a 144 x 144 matrix. All of these matrices are complex-valued.
tic
[Up,Ep] = eig(HBdG); % I time the diagonalization.
toc
%The rest is just number-cruching the eigenvalues and eigenvectors. I assume it's
% not important for this discussion, but I include it anyway.
Um=conj([Up(4*M*N+1:8*M*N,4*M*N+1:8*M*N),Up(4*M*N+1:8*M*N,1:4*M*N);Up(1:4*M*N,4*M*N+1:8*M*N),Up(1:4*M*N,1:4*M*N)]); % DEFINE U(-k) using PH symmetry
Em=diag(Ep);
Em=-diag([Em(4*M*N+1:8*M*N);Em(1:4*M*N)]);
Gp=transpose(Up*double(Ep<0)*Up');
Gm=transpose(Um*double(Em<0)*Um');
deltanew=deltanew+U/L^2*(diag(Gp(5*M*N+1:6*M*N,1:M*N))+diag(Gm(5*M*N+1:6*M*N,1:M*N)));% Do a RUNNING
end
end
Elapsed time is 0.034494 seconds.
Elapsed time is 0.017346 seconds.
Elapsed time is 0.013591 seconds.
Elapsed time is 0.016380 seconds.
Elapsed time is 0.013675 seconds.
Elapsed time is 0.013416 seconds.
Elapsed time is 0.013837 seconds.
Elapsed time is 0.015104 seconds.
...
So I guess Julia takes ~ 50% longer. Slower, but not THAT much slower. Still, I took the time to learn Julia hoping that Julia would be faster than, or at least as fast as, MATLAB…
But maybe there is still a way to speed up my Julia code to match MATLAB?
Did you try with the whole code inside a function? You should not be running code in the global scope. Put everything inside a main
and just call it at the end of the script. You should not be using global variables if you are worried about the performance down to the deciseconds.
If your main work is calling into BLAS and/or LAPACK then there isn’t much speed to be gained aside from algorithmic improvements like using special matrix types. Julia does excel at that and there are packages with many structured matrix types that are much faster if they apply to your actual problem (banded matrices, block-banded matrices, etc.). The other place where Julia will be much faster is if there’s substantial work that isn’t just calling BLAS or LAPACK, which is typically the case—very few real-world workflows are just diagonalizaing a completely unstructured matrix repeatedly. Of course you may have one of those problems, in which case this is a good news/bad news situation: the good news is that you’re already getting optimal performance; the bad news is that you can’t make it any faster. Otherwise it seems like you may have pared down a real problem where Julia can help your performance (possibly structured matrices, work that isn’t just diagonalization) to the part that can’t be sped up. So the advice I’d give is to zoom out and ask for help optimizing a bigger and more realistic problem. There are some people on here with incredible insight into both mathematics and code optimization who are very willing to help!
Also, as Henrique said, don’t do everything in global scope. Try that first and see how that fares.
In my program all those variables are passed into a function which contains the for loop above. This is what you mean, right?
The results for that test are shown in post 8 above.
What code are you actually running, and how are you running it? What you’ve shown are two versions, either everything global scope (which can easily kill performance on its own), or everything in a function - including using
modules via @eval
, again making it hard for the compiler to do much of anything because that loads the required routines AFTER compilation, which thus may include more dynamic dispatch, fewer optimizations etc…
What we’ve tried to tell you was that you should put your actual code into a function, with the using
statements outside of it at the top level.
It’s unclear to me what you’re running, how you’re running the code, how you’re benchmarking, how you’re profiling (how did you identify eigen
as the expensive part in the first place? Where there other clues hinting at why that may be, instead of just “eigen
is slow by intuition/knowledge about matrix operations”?)
I think these in part are what make it hard to give specific suggestions.