Julia is slower than MATLAB at diagonalizing matrices

I’ve been learning Julia over the past week and, in the process, I have re-written one of my programs that I previously implemented in MATLAB. The Julia version was noticeably slower, and using the profiler I found that the eigensolver eigen() takes about 2x longer than MATLAB’s eig(). My program involves diagonalizing many MxM (Hermitian) matrices, where M ~ 10^2-10^3.
One can compare times for the two eigensolvers on random Hermitian matrices. In Julia:

julia> using LinearAlgebra
       using MKL
       using BenchmarkTools
       A = rand(1000, 1000) + im * rand(1000, 1000);
       A = A + A';
       @btime eigen(A);
  487.983 ms (16 allocations: 46.57 MiB)

or 0.487983 seconds. And in MATLAB:

>> A = rand(1000, 1000) + 1i * rand(1000, 1000);
A = A + A';
tic
eig(A);
toc
Elapsed time is 0.199620 seconds.

Although both runtimes are longer in this case, the discrepancy remains for fully random (i.e. not necessarily Hermitian) matrices.

Julia:

julia> A = rand(1000, 1000) + im * rand(1000, 1000);
       @btime eigen(A);
  2.722 s (24 allocations: 36.46 MiB)

MATLAB:

>> A = rand(1000, 1000) + 1i * rand(1000, 1000);
tic
eig(A);
toc
Elapsed time is 1.660253 seconds.

Note that I’ve used the MKL.jl package, which I learned about from this thread; it gives a ~2x speedup for the eigen() computation in Julia. Is there a relatively easy way to speed up the computation further in Julia? Could this issue be particular to my machine? I’m happy to provide any more details needed.

2 Likes

Try changing the number of BLAS threads using LinearAlgebra.BLAS.set_num_threads
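For example (a minimal sketch; the value 8 below is just a placeholder for your number of physical cores):

using LinearAlgebra
LinearAlgebra.BLAS.get_num_threads()   # check the current setting
LinearAlgebra.BLAS.set_num_threads(8)  # then re-run the eigen benchmark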

1 Like

For Hermitian matrices, you can make it faster by calling A = Hermitian(A + A'); before the call to eigen.
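A minimal sketch of what that looks like (the matrix size is just for illustration):

using LinearAlgebra
A = rand(1000, 1000) + im * rand(1000, 1000)
H = Hermitian(A + A')    # the wrapper lets eigen dispatch to the specialized Hermitian solver
vals, vecs = eigen(H)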

2 Likes

Thanks for the replies!

@baggepinnen What number do you recommend? With OpenBLAS, using one thread reduces the time modestly compared to the default number, as can be seen in the linked thread above. If I set the number of BLAS threads to one while using the MKL package, the computation time appears to increase.

@Oscar_Smith Using Hermitian doesn’t improve the computation time for me.

If you use only one Julia thread, you should probably set the number of MKL threads to the number of physical or virtual cores.

1 Like

For anyone about to look into this: yes, MATLAB also uses double precision by default.

3 Likes

I guess that with one or no output arguments, Matlab’s eig just computes the eigenvalues and does not compute or store the eigenvectors. In contrast, Julia’s eigen computes the full eigendecomposition, that is, both eigenvalues and eigenvectors. If only eigenvalues are required, use eigvals instead. The performance of Matlab and Julia is then pretty much identical on my old laptop (running Linux); in fact, Julia is even a bit faster.

In Matlab 2022a (and yes, I did run the code a few times):

>> A = rand(1000, 1000) + 1i * rand(1000, 1000); A = A + A';
>> tic, eig(A); toc
Elapsed time is 0.401051 seconds.
>> maxNumCompThreads

ans =

     2

and in Julia 1.7.2

julia> using LinearAlgebra

julia> using MKL

julia> using BenchmarkTools

julia> A = rand(1000, 1000) + im * rand(1000, 1000);

julia> A = A + A';

julia> A = Hermitian(A);

julia> LinearAlgebra.BLAS.set_num_threads(2)

julia> @btime eigen(A);
  628.196 ms (16 allocations: 46.57 MiB)

julia> @btime eigvals(A);
  297.065 ms (13 allocations: 16.05 MiB)

Indeed, I am afraid that even the title of the post misled us a bit, because in your Matlab code you are really just computing the eigenvalues, which are not sufficient to diagonalize a matrix.

32 Likes

So MATLAB is able to change the computation in eig() based on the number of outputs it expects; Julia’s eigensolvers have no such feature (one calls eigen or eigvals explicitly). Instead, to compare diagonalization in Julia vs. MATLAB (i.e. with both eigenvalues and eigenvectors computed and stored), we should compare the performance of the following lines in Julia

julia> using LinearAlgebra
       using MKL
       using BenchmarkTools
       A = rand(1000, 1000) + im * rand(1000, 1000);
       A = A + A';
       @btime((T, V) = eigen(A));
  469.777 ms (18 allocations: 46.57 MiB)

and in MATLAB (this is a representative example of many trials)

>> A=rand(1000,1000)+1i*rand(1000,1000);
A=A+A';
tic
[T,V]=eig(A);
toc
Elapsed time is 0.543862 seconds.

Indeed, the performance on my machine is about the same, with Julia edging out a win, so my choice of example was a misleading illustration. Unfortunately, my issue is not resolved. My program, which diagonalizes Hermitian matrices in a for loop (with two outputs from eig() in the MATLAB version), still takes more than twice as long in Julia as in MATLAB. I still think the eigensolver is the issue, since timing each call (by surrounding the line with tic/toc in MATLAB and using @elapsed in Julia) gives the following times in Julia (the matrices in this case are 288 x 288)

julia> MyProgram()
0.0362041
0.0379768
0.0364687
0.0407733
0.0351825
0.0347786
0.0336019
...

and MATLAB

>> MyProgram
Elapsed time is 0.018229 seconds.
Elapsed time is 0.035258 seconds.
Elapsed time is 0.020117 seconds.
Elapsed time is 0.016069 seconds.
Elapsed time is 0.007733 seconds.
Elapsed time is 0.015377 seconds.
Elapsed time is 0.011883 seconds.
Elapsed time is 0.011649 seconds.
Elapsed time is 0.011468 seconds.
...

I guessed that eigen()’s performance in Julia suffers when it is called inside a for loop, so I tried to construct an example. In Julia, I defined a function to compute the average time for eigen()

function myfunc()
    @eval using LinearAlgebra
    @eval using MKL
    @eval using BenchmarkTools
    @eval using Statistics
    M = 288
    time = Array{Float64}(undef, 0)
    for i in 1:100
        A = rand(M, M) + im * rand(M, M)
        A = A + A'
        time = push!(time, @elapsed((T, V) = eigen(A)))
    end
    println(mean(time))
end
julia> myfunc()
0.020339547999999995

And similarly in MATLAB

function [time] = myfunc()
M = 288;
time = 0;
for i = 1:100
    A = rand(M, M) + 1i * rand(M, M);
    A = A + A';
    tic
    [T, V] = eig(A);
    elapsed = toc;
    time = time + elapsed/100;
end
end
>> myfunc()

ans =

    0.0144

Others can verify whether this is a fair comparison. The MATLAB times are consistent across examples, but the Julia times are consistently slower than MATLAB’s, and slowest when running my actual program. Any idea what might be happening here? Again, happy to provide more details!

If you are putting it in a loop like your MWE, it is probably most efficient to create a temporary scratch array before the loop and then call eigen!
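A minimal sketch of that idea (the function name eig_loop is made up; it assumes a collection of equally sized complex Hermitian matrices):

using LinearAlgebra

function eig_loop(mats)
    scratch = similar(first(mats))      # allocate the work matrix once, before the loop
    for A in mats
        copyto!(scratch, A)             # eigen! destroys its input, so overwrite the scratch copy
        vals, vecs = eigen!(Hermitian(scratch))
        # ... use vals and vecs here ...
    end
end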

5 Likes

First, I think it is not common in Julia to load packages (with using) inside a function definition. But I confess I am still not quite fluent in this. Anyway, I would move the using statements out of the function definition

using LinearAlgebra
using MKL
using BenchmarkTools
using Statistics

function myfunc()
    M = 288
    time = Array{Float64}(undef, 0)
    for i in 1:100
        A = rand(M, M) + im * rand(M, M)
        A = A + A'
        time = push!(time, @elapsed((T, V) = eigen(A)))
    end
    println(mean(time))
end

Second, when running the two codes, I get essentially identical timing

In Matlab:

>> myfunc

ans =

    0.0215

In Julia:

julia> myfunc()
0.020213900150000002
4 Likes

I tried it on my friend’s nice Lenovo laptop running Windows (my computer also runs Windows), and here are the representative results

julia> myfunc()
0.018389479
>> myfunc

ans =

    0.0129

But again, the more worrying thing is that when I time eigen in my program, the performance suffers quite a bit.

There can be a large number of things that may be non-intuitive to a new user, but without knowing the code we can only help with what we’re given: in this case, the eigen problem itself (which I think has been solved, judging from the time difference there?).

So if you need help with a larger performance problem, feel free to share more details/code!

3 Likes

Here is, I think, the problematic portion of my code. Please see comments therein.

for iky in 1:length(KY), ikx in 1:length(KX) #KX and KY are ranges defined earlier
            HBdG = [Hp[ikx, iky] Delta; Delta' -conj(Hm[ikx, iky])] #= This builds the matrix to be 
              diagonalized using Hp and Hm, which are arrays of 144 x 144 matrices and Delta, which is a 
              144 x 144 matrix. All of these matrices are complex-valued. =#

            time = @elapsed((Ep, Up) = eigen(HBdG)) # I time the diagonalization.
            println(time) # The results of the timing are printed. For results, see post 8 in this thread.
            
            #= The rest is just number-crunching the eigenvalues and eigenvectors. I assume it's
              not important for this discussion, but I include it anyway. =#
            Up11 = @view Up[1:4*M*N, 1:4*M*N] #M and N are Int64's previously defined.
            Up21 = @view Up[4*M*N+1:8*M*N, 1:4*M*N]
            Up12 = @view Up[1:4*M*N, 4*M*N+1:8*M*N]
            Up22 = @view Up[4*M*N+1:8*M*N, 4*M*N+1:8*M*N]
            Um  = conj([Up22 Up21; Up12 Up22])
            Etop = @view Ep[4*M*N+1:8*M*N]
            Ebott = @view Ep[1:4*M*N]
            Em = -[Etop; Ebott]

            Gp = transpose(Up * (diagm(Ep) .< 0) * Up')
            Gm = transpose(Um * (diagm(Em) .< 0) * Um')

            deltanew += U / L^2 * (diag(Gp[5*M*N+1:6*M*N, 1:M*N]) + diag(Gm[5*M*N+1:6*M*N, 1:M*N])) # U is a Float64 and L is an Int64, previously defined.
end 

The printed elapsed times for the diagonalization average around .33 seconds – longer than what I get if I diagonalize a single matrix alone. Based on this slowdown, my guess is that this is the bottleneck in my code. Any comments welcome!

1 Like

I tried your benchmark and am getting almost identical Matlab and Julia results, about 0.012 sec on my laptop.

Nevertheless, here’s some speculation about why Matlab could be faster. In the last few years, they’ve made enormous gains in their JIT compiler (sometimes 10x anecdotally). It is possible that their compiler can figure out that [T, V] = eig(A) repeatedly overwrites T and V, so they don’t need to be re-allocated. If I time allocations in a loop, e.g. V = zeros(M, M) 100 times, the first ten iterations are slow (the ~5th one is really slow), and the last 90 run at a much faster steady state. This leads me to suspect the JIT.

Julia does have an “in-place” eigen!(A) but that only avoids allocating a copy of A, and still allocates and returns T and V. It should probably save some time, but perhaps not the difference you see.

There is a ton of scientific computing where the same operations are repeated many times using the same fixed-size arrays. A lot of performance is lost from re-allocating things that could just be over-written. It would be great if there were fully non-allocating versions of key functions like eigen!(T, V, A), or maybe a smarter compiler, e.g.

A = rand(M, M)
T = zeros(M)
V = zeros(M,M)
@reuse T, V = eigen!(A) # over-write T, V, and A with zero allocations
5 Likes

Thanks a lot for your thoughts! It sounds like, if you are right, there’s nothing I can do about it at the moment and MATLAB is faster for my application. I have to say that’s disappointing. I was excited about migrating this and other programs to Julia.

But given that my MWE may not be a fair representation, since MATLAB and Julia have about the same performance for folks here (though Julia weirdly seems to be slower on both my computer and my friend’s), I was wondering how Julia would do with the code below, which is just the for loop in my program with all the variables pre-defined. It should be runnable.

using LinearAlgebra
using MKL
using BenchmarkTools
using Statistics

M=6
N = 6
L = 10
U = 1
KX = range(0, π / N, ceil(Int, L / 2))
KY = range((-π + 2π / L) / M, π / M, L)
Hp = Matrix{Matrix{ComplexF64}}(undef, length(KX), length(KY))
Hm = Matrix{Matrix{ComplexF64}}(undef, length(KX), length(KY))
for iky in 1:length(KY), ikx in 1:length(KX)
    Hp[ikx,iky] = Hermitian(rand(ComplexF64,4M*N,4M*N));
    Hm[ikx,iky] = Hermitian(rand(ComplexF64,4M*N,4M*N));
end
Delta = rand(144,144);
deltanew = zeros(M*N);

for iky in 1:length(KY), ikx in 1:length(KX) #KX and KY are ranges defined earlier
    HBdG = [Hp[ikx, iky] Delta; Delta' -conj(Hm[ikx, iky])] 
    #= This builds the matrix to be 
      diagonalized using Hp and Hm, which are arrays of 144 x 144 matrices and Delta, which is a 144 x 144 matrix. All of these matrices are complex-valued. =#

    time = @elapsed((Ep, Up) = eigen(HBdG)) # I time the diagonalization.
    println(time) # The results of the timing are printed. For results, see post 8 in this thread.
    #= The rest is just number-crunching the eigenvalues and eigenvectors. I assume it's
      not important for this discussion, but I include it anyway. =#
    Up11 = @view Up[1:4*M*N, 1:4*M*N] #M and N are Int64's previously defined.
    Up21 = @view Up[4*M*N+1:8*M*N, 1:4*M*N]
    Up12 = @view Up[1:4*M*N, 4*M*N+1:8*M*N]
    Up22 = @view Up[4*M*N+1:8*M*N, 4*M*N+1:8*M*N]
    Um  = conj([Up22 Up21; Up12 Up22])
    Etop = @view Ep[4*M*N+1:8*M*N]
    Ebott = @view Ep[1:4*M*N]
    Em = -[Etop; Ebott]

    Gp = transpose(Up * (diagm(Ep) .< 0) * Up')
    Gm = transpose(Um * (diagm(Em) .< 0) * Um')

    global deltanew += U / L^2 * (diag(Gp[5*M*N+1:6*M*N, 1:M*N]) .+ diag(Gm[5*M*N+1:6*M*N, 1:M*N])) # U is a Float64 and L is an Int64, previously defined.
end

The output of the @elapsed timings around eigen is the following

1.4806952
0.0229872
0.020562
0.0225279
0.0247384
0.0206821
0.0251862
0.0277046
...

I expect that if Julia were running at “full speed”, we would see times closer to 0.012 seconds or so, as others have found. I wonder whether this is reproducible and whether others have thoughts or tips on this. Thanks!!

1 Like

Here is, I think, the analogous script for MATLAB

M = 6;
N = 6;
L = 10;
U = 1;
KX = 1:5;
KY = 1:10;
Hp=cell(size(KX));
Hm=cell(size(KX));
for iky = 1:length(KY)
    for ikx = 1:length(KX)
        A1 = rand(4*M*N,4*M*N) + 1i*rand(4*M*N,4*M*N);
        A2 = rand(4*M*N,4*M*N) + 1i*rand(4*M*N,4*M*N);
        Hp{ikx,iky} = (A1 + A1')/2;
        Hm{ikx,iky} = (A2 + A2')/2;
    end
end
Delta = rand(144,144);
deltanew = zeros(M*N);

for iky = 1:length(KY)
    for ikx = 1:length(KX) %KX and KY are ranges defined earlier
        HBdG = [Hp{ikx, iky}, Delta; Delta', -conj(Hm{ikx, iky})];
        % This builds the matrix to be
        %  diagonalized using Hp and Hm, which are arrays of 144 x 144 matrices, and Delta, which is a 144 x 144 matrix. All of these matrices are complex-valued.
        tic
        [Up,Ep] = eig(HBdG); % I time the diagonalization.
        toc
        % The rest is just number-crunching the eigenvalues and eigenvectors. I assume it's
        %  not important for this discussion, but I include it anyway.
        Um=conj([Up(4*M*N+1:8*M*N,4*M*N+1:8*M*N),Up(4*M*N+1:8*M*N,1:4*M*N);Up(1:4*M*N,4*M*N+1:8*M*N),Up(1:4*M*N,1:4*M*N)]); % DEFINE U(-k) using PH symmetry
        Em=diag(Ep);
        Em=-diag([Em(4*M*N+1:8*M*N);Em(1:4*M*N)]);
        Gp=transpose(Up*double(Ep<0)*Up'); 
        Gm=transpose(Um*double(Em<0)*Um');

        deltanew=deltanew+U/L^2*(diag(Gp(5*M*N+1:6*M*N,1:M*N))+diag(Gm(5*M*N+1:6*M*N,1:M*N)));% Do a RUNNING
    end
end
Elapsed time is 0.034494 seconds.
Elapsed time is 0.017346 seconds.
Elapsed time is 0.013591 seconds.
Elapsed time is 0.016380 seconds.
Elapsed time is 0.013675 seconds.
Elapsed time is 0.013416 seconds.
Elapsed time is 0.013837 seconds.
Elapsed time is 0.015104 seconds.
...

So I guess Julia takes ~ 50% longer. Slower, but not THAT much slower. Still, I took the time to learn Julia hoping that Julia would be faster than, or at least as fast as, MATLAB…
But maybe there is still a way to speed up my Julia code to match MATLAB?

Did you try with the whole code inside a function? You should not be running code in global scope. Put everything inside a main function and just call it at the end of the script. You should not be using global variables if you are worried about performance at the level of tenths of a second.
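For example, a minimal sketch of that restructuring (the function name compute_deltanew is made up; the argument names mirror the script above, and the post-processing body is elided):

using LinearAlgebra

function compute_deltanew(Hp, Hm, Delta, KX, KY, M, N, L, U)
    deltanew = zeros(M * N)
    for iky in eachindex(KY), ikx in eachindex(KX)
        HBdG = [Hp[ikx, iky] Delta; Delta' -conj(Hm[ikx, iky])]
        Ep, Up = eigen(HBdG)
        # ... same post-processing as in the original loop, accumulating into deltanew ...
    end
    return deltanew
end

deltanew = compute_deltanew(Hp, Hm, Delta, KX, KY, M, N, L, U)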

6 Likes

If your main work is calling into BLAS and/or LAPACK, then there isn’t much speed to be gained aside from algorithmic improvements like using special matrix types. Julia does excel at that, and there are packages with many structured matrix types that are much faster if they apply to your actual problem (banded matrices, block-banded matrices, etc.). The other place where Julia will be much faster is if there’s substantial work that isn’t just calling BLAS or LAPACK, which is typically the case; very few real-world workflows just diagonalize a completely unstructured matrix repeatedly.

Of course you may have one of those problems, in which case this is a good news/bad news situation: the good news is that you’re already getting optimal performance; the bad news is that you can’t make it any faster. Otherwise, it seems like you may have pared down a real problem where Julia can help your performance (possibly structured matrices, work that isn’t just diagonalization) to the part that can’t be sped up. So the advice I’d give is to zoom out and ask for help optimizing a bigger and more realistic problem. There are some people on here with incredible insight into both mathematics and code optimization who are very willing to help!
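As a small illustration of the structured-matrix point (purely an example; your BdG matrices may not have this structure), a symmetric tridiagonal matrix gets a dedicated, much cheaper solver:

using LinearAlgebra

n = 1000
T = SymTridiagonal(rand(n), rand(n - 1))   # stores only the diagonal and off-diagonal
vals, vecs = eigen(T)                      # dispatches to the specialized tridiagonal routine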

15 Likes

Also, as Henrique said, don’t do everything in global scope. Try that first and see how that fares.

5 Likes

In my program all those variables are passed into a function which contains the for loop above. This is what you mean, right?

The results for that test are shown in post 8 above.