Julia is slower than MATLAB at diagonalizing matrices

What code are you actually running, and how are you running it? What you’ve shown are two versions: either everything in global scope (which can easily kill performance on its own), or everything inside a function, including loading modules via @eval. The latter again makes it hard for the compiler to do much of anything, because the required routines are only loaded AFTER the function is compiled, which can mean more dynamic dispatch, fewer optimizations, and so on.

What we’ve been trying to tell you is that you should put your actual code into a function, with the using statements outside of it, at the top level.
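In code, the suggested structure looks roughly like this (a minimal sketch; run_simulation is just a stand-in name for whatever your actual computation is):

using LinearAlgebra  # using statements at the top level, so packages are loaded before your code compiles
using MKL

function run_simulation(M)  # the actual work lives inside a function
    A = rand(ComplexF64, M, M)
    H = A + A'
    return eigen(H)
end

run_simulation(288)  # first call compiles; subsequent calls run at full speed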


It’s unclear to me what you’re running, how you’re running the code, how you’re benchmarking, and how you’re profiling. (How did you identify eigen as the expensive part in the first place? Were there other clues hinting at why that may be, rather than just “eigen is slow” by intuition/knowledge of matrix operations?)

I think these are part of what makes it hard to give specific suggestions.
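For reference, the standard tools for answering the benchmarking and profiling questions look something like this (a minimal sketch, assuming the code lives in a function called main):

using BenchmarkTools, Profile

@btime main()    # wall-clock time plus allocations, measured after compilation
@profile main()  # sample where the time actually goes
Profile.print()  # e.g. to confirm whether eigen really dominates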


I checked a couple of variants I saw in this thread

using LinearAlgebra, MKL, BenchmarkTools, Random

function test1()
    M = 288
    local T, V
    for i in 1:100
        A = rand(ComplexF64, M, M)
        H = A + A'
        (T, V) = eigen(H)
    end
    return (T, V)
end

function test2()
    M = 288
    local T
    for i in 1:100
        A = rand(ComplexF64, M, M)
        H = A + A'
        T = eigvals(H)
    end
    return T
end

function test3()
    M = 288
    A = Matrix{ComplexF64}(undef, M, M)
    H = similar(A)
    local T, V
    for i in 1:100
        rand!(A)
        H .= A .+ A'
        (T, V) = eigen(H)
    end
    return (T, V)
end

function test4()
    M = 288
    A = Matrix{ComplexF64}(undef, M, M)
    H = similar(A)
    local T
    for i in 1:100
        rand!(A)
        H .= A .+ A'
        T = eigvals(H)
    end
    return T
end

function test5(uplo)
    M = 288
    A = Matrix{ComplexF64}(undef, M, M)
    H = similar(A)
    local T, V
    for i in 1:100
        rand!(A)
        H .= A .+ A'
        (T, V) = eigen(Hermitian{ComplexF64, typeof(H)}(H, uplo))
    end
    return (T, V)
end

function test6(uplo)
    M = 288
    A = Matrix{ComplexF64}(undef, M, M)
    H = similar(A)
    local T
    for i in 1:100
        rand!(A)
        H .= A .+ A'
        T = eigvals(Hermitian{ComplexF64, typeof(H)}(H, uplo))
    end
    return T
end

@btime test1() setup=(Random.seed!(42))
@btime test2() setup=(Random.seed!(42))
@btime test3() setup=(Random.seed!(42))
@btime test4() setup=(Random.seed!(42))
@btime test5('U') setup=(Random.seed!(42))
@btime test6('U') setup=(Random.seed!(42))
@btime test5('L') setup=(Random.seed!(42))
@btime test6('L') setup=(Random.seed!(42))
versioninfo()

yielding

  1.043 s (2001 allocations: 655.76 MiB)    # test1: eigen
  606.747 ms (1900 allocations: 529.21 MiB) # test2: eigvals
  1.074 s (1605 allocations: 405.16 MiB)    # test3: eigen, reduced allocs
  596.373 ms (1504 allocations: 278.60 MiB) # test4: eigvals, reduced allocs
  1.078 s (1504 allocations: 405.16 MiB)    # test5('U'): eigen, reduced allocs, Hermitian wrapper
  535.213 ms (1304 allocations: 152.04 MiB) # test6('U'): eigvals, reduced allocs, Hermitian wrapper
  1.141 s (1504 allocations: 405.16 MiB)    # test5('L'): eigen, reduced allocs, Hermitian wrapper
  578.855 ms (1304 allocations: 152.04 MiB) # test6('L'): eigvals, reduced allocs, Hermitian wrapper
Julia Version 1.9.0-DEV.167
Commit f5d15571b3 (2022-03-11 17:10 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 5 on 12 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 5

I hope this helps with the discussion.


Thanks for your replies!

At the risk of posting even more code, I defined the function with everything in it

function main()
    M = 6
    N = 6
    L = 10
    U = 1
    KX = range(0, π / N, ceil(Int, L / 2))
    KY = range((-π + 2π / L) / M, π / M, L)
    Hp = Matrix{Matrix{ComplexF64}}(undef, length(KX), length(KY))
    Hm = Matrix{Matrix{ComplexF64}}(undef, length(KX), length(KY))
    for iky in 1:length(KY), ikx in 1:length(KX)
        Hp[ikx, iky] = Hermitian(rand(ComplexF64, 4M * N, 4M * N))
        Hm[ikx, iky] = Hermitian(rand(ComplexF64, 4M * N, 4M * N))
    end
    Delta = rand(144, 144);
    deltanew = zeros(M * N);
    for iky in 1:length(KY), ikx in 1:length(KX) # KX and KY are the ranges defined above
        HBdG = [Hp[ikx, iky] Delta; Delta' -conj(Hm[ikx, iky])]
        #= This builds the matrix to be diagonalized using Hp and Hm, which are arrays of
           144 x 144 matrices, and Delta, which is a 144 x 144 matrix. All of these
           matrices are complex-valued. =#

        time = @elapsed((Ep, Up) = eigen(HBdG)) # I time the diagonalization.
        println(time) # The results of the timing are printed. For results, see post 8 in this thread.
        #= The rest is just number-crunching the eigenvalues and eigenvectors. I assume it's
           not important for this discussion, but I include it anyway. =#
        Up11 = @view Up[1:4*M*N, 1:4*M*N] # M and N are Int64's previously defined.
        Up21 = @view Up[4*M*N+1:8*M*N, 1:4*M*N]
        Up12 = @view Up[1:4*M*N, 4*M*N+1:8*M*N]
        Up22 = @view Up[4*M*N+1:8*M*N, 4*M*N+1:8*M*N]
        Um = conj([Up22 Up21; Up12 Up22])
        Etop = @view Ep[4*M*N+1:8*M*N]
        Ebott = @view Ep[1:4*M*N]
        Em = -[Etop; Ebott]

        Gp = transpose(Up * (diagm(Ep) .< 0) * Up')
        Gm = transpose(Um * (diagm(Em) .< 0) * Um')
        deltanew += U / L^2 * (diag(Gp[5*M*N+1:6*M*N, 1:M*N]) .+ diag(Gm[5*M*N+1:6*M*N, 1:M*N])) # U is a Float64 and L is an Int64, previously defined.
    end
end

and I ran

using LinearAlgebra
using MKL
using BenchmarkTools
using Statistics

main()

which gave eigen times basically identical to the above example, about 0.025 s or so. Have I made any mistakes?

If I may attempt a summary: for repeatedly diagonalizing generic matrices, which essentially just means calling LAPACK/BLAS, MATLAB seems to be faster. However, for more “interesting” problems (but not necessarily more realistic ones! This example was not contrived; indeed, many physicists use a method like this to solve self-consistent mean-field theories), Julia has an edge, since it can handle many of the surrounding operations faster. After benchmarking things here and there (aside from the eigensolver) and comparing with MATLAB, that does seem to be the case. I will continue with Julia and see if I can use it to solve something more involved.


I think if your code is just bare BLAS calls, the expectation would be that Julia has the same speed as Matlab (or indeed numpy and others), as long as the same BLAS implementation is used (here MKL). This also seems to be the case for most people in this thread who have tried running your examples, so it might be worth investigating further to figure out why you’re seeing a difference - unfortunately I don’t have a Matlab license, so I can’t help with this.
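One quick sanity check (assuming Julia 1.7 or later, where libblastrampoline is used) is to confirm that both environments really are on MKL and using comparable thread counts, since either can account for a constant-factor difference:

using LinearAlgebra, MKL

BLAS.get_config()       # should list libmkl_rt (not libopenblas) after `using MKL`
BLAS.get_num_threads()  # BLAS threads used by eigen/gemm
Threads.nthreads()      # Julia threads (a separate setting, less relevant here)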

"I assume it's not important for this discussion."

The following

using LinearAlgebra
using MKL
using BenchmarkTools
using Statistics

function main1()
    M = 6
    N = 6
    L = 10
    U = 1
    KX = range(0, π / N, ceil(Int, L / 2))
    KY = range((-π + 2π / L) / M, π / M, L)
    Hp = Matrix{Matrix{ComplexF64}}(undef, length(KX), length(KY))
    Hm = Matrix{Matrix{ComplexF64}}(undef, length(KX), length(KY))
    for iky in 1:length(KY), ikx in 1:length(KX)
      Hp[ikx, iky] = Hermitian(rand(ComplexF64, 4M * N, 4M * N))
      Hm[ikx, iky] = Hermitian(rand(ComplexF64, 4M * N, 4M * N))
    end
    Delta = rand(144, 144);
    deltanew = zeros(M * N);
    for iky in 1:length(KY), ikx in 1:length(KX) #KX and KY are ranges defined earlier
        HBdG = [Hp[ikx, iky] Delta; Delta' -conj(Hm[ikx, iky])]
        #= This builds the matrix to be 
            diagonalized using Hp and Hm, which are arrays of 144 x 144 matrices and Delta, which is a 144 x 144 matrix. All of these matrices are complex-valued. =#

        (Ep, Up) = eigen(HBdG)
        #= The rest is just number-crunching the eigenvalues and eigenvectors. I assume it's
            not important for this discussion, but I include it anyway. =#
        Up11 = @view Up[1:4*M*N, 1:4*M*N] #M and N are Int64's previously defined.
        Up21 = @view Up[4*M*N+1:8*M*N, 1:4*M*N]
        Up12 = @view Up[1:4*M*N, 4*M*N+1:8*M*N]
        Up22 = @view Up[4*M*N+1:8*M*N, 4*M*N+1:8*M*N]
        Um = conj([Up22 Up21; Up12 Up22])
        Etop = @view Ep[4*M*N+1:8*M*N]
        Ebott = @view Ep[1:4*M*N]
        Em = -[Etop; Ebott]

        Gp = transpose(Up * (diagm(Ep) .< 0) * Up')
        Gm = transpose(Um * (diagm(Em) .< 0) * Um')
        deltanew += U / L^2 * (diag(Gp[5*M*N+1:6*M*N, 1:M*N]) .+ diag(Gm[5*M*N+1:6*M*N, 1:M*N])) # U is a Float64 and L is an Int64, previously defined.
    end
end

function main2()
    M = 6
    N = 6
    L = 10
    U = 1
    KX = range(0, π / N, ceil(Int, L / 2))
    KY = range((-π + 2π / L) / M, π / M, L)
    Hp = Matrix{Matrix{ComplexF64}}(undef, length(KX), length(KY))
    Hm = Matrix{Matrix{ComplexF64}}(undef, length(KX), length(KY))
    for iky in 1:length(KY), ikx in 1:length(KX)
      Hp[ikx, iky] = Hermitian(rand(ComplexF64, 4M * N, 4M * N))
      Hm[ikx, iky] = Hermitian(rand(ComplexF64, 4M * N, 4M * N))
    end
    Delta = rand(144, 144);
    deltanew = zeros(M * N);
    for iky in 1:length(KY), ikx in 1:length(KX) #KX and KY are ranges defined earlier
        HBdG = [Hp[ikx, iky] Delta; Delta' -conj(Hm[ikx, iky])]
        #= This builds the matrix to be 
            diagonalized using Hp and Hm, which are arrays of 144 x 144 matrices and Delta, which is a 144 x 144 matrix. All of these matrices are complex-valued. =#

        (Ep, Up) = eigen(HBdG)
        #= The rest is just number-crunching the eigenvalues and eigenvectors. I assume it's
            not important for this discussion, but I include it anyway. =#
        Up11 = @view Up[1:4*M*N, 1:4*M*N] #M and N are Int64's previously defined.
        Up21 = @view Up[4*M*N+1:8*M*N, 1:4*M*N]
        Up12 = @view Up[1:4*M*N, 4*M*N+1:8*M*N]
        Up22 = @view Up[4*M*N+1:8*M*N, 4*M*N+1:8*M*N]
        Um = conj([Up22 Up21; Up12 Up22])
        Etop = @view Ep[4*M*N+1:8*M*N]
        Ebott = @view Ep[1:4*M*N]
        Em = -[Etop; Ebott]

        Gp = transpose(Up * (Diagonal{eltype(Ep)}(Ep) .< 0) * Up') # Diagonal instead of diagm keeps the projector diagonal-structured,
        Gm = transpose(Um * (Diagonal{eltype(Em)}(Em) .< 0) * Um') # so these products use column scaling + BLAS instead of generic matmul
        deltanew += U / L^2 * (diag(Gp[5*M*N+1:6*M*N, 1:M*N]) .+ diag(Gm[5*M*N+1:6*M*N, 1:M*N])) # U is a Float64 and L is an Int64, previously defined.
    end
end

@btime main1()
@btime main2()

yields

  4.036 s (5155 allocations: 808.55 MiB)  # main1 (diagm)
  1.026 s (5955 allocations: 741.95 MiB)  # main2 (Diagonal)

for me.


Nice catch! This change makes the computation run just as fast as MATLAB! Although the diagonalization is still slower according to my timing, Julia is able to make up for the difference later in the code (at least for the size considered here)! Very cool!

So I guess the issue with the original version was that I was multiplying complex arrays by a BitArray. Is that correct?

I believe diagm gives you a full matrix, whereas Diagonal gives you a proxy…

And that means that Diagonal doesn’t store a new matrix whereas diagm does, correct?

Yep, you don’t need gemm there.
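To make the difference concrete, here is a minimal sketch (E and U are just placeholders standing in for the eigenvalues and eigenvectors of the real problem):

using LinearAlgebra, BenchmarkTools

E = randn(288)                  # placeholder eigenvalue vector
U = rand(ComplexF64, 288, 288)  # placeholder eigenvector matrix

P_dense = diagm(E) .< 0     # materializes a dense 288×288 BitMatrix
P_diag  = Diagonal(E) .< 0  # stays diagonal-structured (the broadcast preserves the zero off-diagonals)

@btime $U * $P_dense * $U'; # complex × BitMatrix can't go to BLAS, falls back to generic matmul
@btime $U * $P_diag * $U';  # Diagonal just scales columns, then one BLAS gemm for the rest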

[off topic]
Sorry to digress in this thread, but IMHO when there are a lot of responses and code posted, it seems much easier to scroll through and read the different comments if the code details are hidden (as shown here). Call it discourse hygiene.


I now believe that both Julia and Matlab can run the same BLAS and get different timings. To follow up my previous comment on JIT and allocations, I ran some more tests and found Matlab about 1 ms faster.
Julia (11.8 ms):

julia> using LinearAlgebra, MKL, BenchmarkTools
julia> M = 288;
julia> @btime eigen(H) setup=(A=rand(ComplexF64,M,M);H=A+A');
  11.775 ms (16 allocations: 4.03 MiB)

Matlab (10.5 ms):

M = 288;
timee = eigtimer(M,1000);
mean(timee(10:end)) % ans = 0.0105

function time = eigtimer(M,N)
time = zeros(N,1);
for i = 1:N
    A = rand(M, M) + 1i * rand(M, M);
    A = A + A'; 
    tic
    [T, V] = eig(A);
    time(i) = toc;
end
end

Assuming both are calling a similar MKL BLAS, I previously speculated that there could be a difference in allocations. Julia’s eigen! calls LAPACK.geevx!, which includes some boilerplate that allocates seven (7) output arrays via similar (specifically wr, wi, VL, VR, scale, rconde, rcondv). eigen additionally allocates a copy of A. There is no mechanism to feed in pre-allocated storage, so all of this is allocated anew each time you call it.

Perhaps Matlab’s JIT figures out that it can re-use existing space? Let’s try just allocating a 288x288 matrix.
Julia (45 µs):

julia> @btime V=zeros(288,288);
  45.002 μs (2 allocations: 648.05 KiB)

Matlab (26.3 µs, freshly restarted to flush JIT):

M = 288;
time = zeros(1000,1);
for i = 1:1000
    tic;
    V = zeros(M,M);
    time(i) = toc;
end
mean(time(10:end))

I have trouble believing that Matlab can allocate a matrix 42% faster than Julia, so I am guessing that it figures out not to do it. Now let’s apply that to eigenvalues. For Julia, 45 µs x 8 copies x 2 re-im = 0.72 ms. If Matlab avoids those allocations, that almost makes up for the 1 ms difference.

Here’s a plot from Matlab that shows the allocations getting faster within the first ten or so iterations, and the eigenvalues similarly adapting but then reaching the steady state of about 10.5 ms.
[plot: timingtst (Matlab per-iteration timings for the zeros allocation and eig)]

Conclusion? I believe Julia leaves some performance on the table, because there’s no easy mechanism to re-use previous allocations with eigen and the like (there are 176 matches for similar in LAPACK.jl). It might help if there were a truly in-place variant, say LAPACK.geevx!!(A, ..., bufferspace). Then geevx! could retain its original functionality by allocating and then calling geevx!!, and if you know what you’re doing, you could re-use your own buffers repeatedly.
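For completeness, the closest thing to a workaround today, at least for the Hermitian case benchmarked above, is to call eigen! on a pre-allocated scratch copy. A minimal sketch (only the input copy is reused; the LAPACK output and work arrays are still allocated internally on every call):

using LinearAlgebra, MKL

function eig_loop!(scratch, H, iters)
    local vals, vecs
    for _ in 1:iters
        copyto!(scratch, H)                      # reuse one buffer for the input copy
        vals, vecs = eigen!(Hermitian(scratch))  # mutates scratch; outputs are still allocated
    end
    return vals, vecs
end

A = rand(ComplexF64, 288, 288)
H = A + A'
scratch = similar(H)
eig_loop!(scratch, H, 100)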

Also, as a side note: I wonder whether it’s a good idea for geevx! to allocate seven separate arrays with similar, or whether it would be better to allocate one larger buffer just once and take views into it?


Thanks for the deep-dive into MKL.jl, but on my machine Julia is consistently faster than MATLAB, and this applies to most MKL operations. Maybe it’s a hardware-related thing? I tested on an old-ish Haswell i7-4790K processor.

Julia:
  8.279 ms (16 allocations: 4.03 MiB)

MATLAB R2022a:
ans = 0.0096104
ans = 0.009347
ans = 0.0093659

I don’t think the eigen difference is universal, and Julia could be faster in other areas. I do believe the Julia re-allocations are unnecessary for this case and eat up some time. I am mostly speculating about Matlab, but the higher speed upon repeated evaluation or allocation points to JIT.

Just to get another data point, would you mind running my tests on your machine?

In my experience we can magnify the influence of allocations and GC if we increase the memory pressure. But in this case the sizes seem to be pretty small.

These are my results; Julia is faster again:


@btime V=zeros(288,288);
  # 20.300 μs (2 allocations: 648.05 KiB)
  # 18.800 μs (2 allocations: 648.05 KiB)
  # 20.100 μs (2 allocations: 648.05 KiB)

MATLAB R2022a:
ans = 2.4305e-05
ans = 2.3665e-05
ans = 2.3465e-05
ans = 2.3496e-05

As far as I can tell, about 1/3 of the time here is spent on the eigendecomposition, while the rest is spent on shuffling data around and allocating an enormous number of unnecessary matrices.

I’m pretty sure you can still pretty easily get a 2x speedup over main2, by streamlining the rest of the code.

BTW, is this correct?

Um = conj([Up22 Up21; Up12 Up22]) # should the last entry be Up11?

The whole function doesn’t return anything, so what is actually its objective? It also looks a bit ‘algorithmically inefficient’; there must be some algorithmic shortcuts available here. Or is it all a toy example?


@DNF: excellent advice, as usual. I only tried to identify hot spots (this time even without a profile and without verifying results). If @poopsilon is interested we could certainly follow this up.


I think it’s impressive that you can determine that the matrices are not necessary without first knowing the purpose of the code. As I continue with Julia and programming, I hope to develop your keen eye. In the meantime, let me digest what I’ve learned in this thread and see if I can speed up the computation myself first. If I have any more questions, I’ll make a new thread, and I hope to see you there.


To be fair, the entire computation consists of unnecessary matrices, since it doesn’t return anything. I’ve optimized it maximally:

function main()
    # don't return anything or have any visible side effects
end

This version is very fast:

julia> @btime main()
  1.259 ns (0 allocations: 0 bytes)

Snarky? Maybe. But the point is this: without a more realistic example, it’s impossible for anyone to actually determine what is or isn’t useless work. The Julia compiler is getting closer and closer to being able to determine that a function like this doesn’t return anything (or have any visible side effects), so in the future it could do what I just did here and delete your entire function body in a case like this. There are, however, a number of intermediate matrices allocated in your code that are definitely unnecessary, which may be what @DNF was referring to.


Agreed. I tried to test allocations and eigenvalues on matrices up to 2000x2000, and on my machine Matlab’s eigenvalues were consistently slightly faster, while Julia’s zeros was slightly faster (as @Seif_Shebl found). In any case, zeros is a poor test, because what we should really be testing is similar, which we cannot do in Matlab. So I can’t really verify whether my hypothesis is correct, and different machines seem to do slightly different things.

I guess eigen should be O(N^3) and should generally dominate any allocations. But the Julia code is pretty unimpeachable: it does very straightforward set-up, and I can’t imagine Matlab doing anything different except for the allocations. In Julia’s case, eigen does 8 allocations (and eigen! 7) that are technically not necessary when repeatedly evaluating same-size matrices. (I didn’t check whether my Matlab MKL matches Julia’s, so that’s another source of uncertainty.) However, the point remains that there are functions like eigen! that sound like they should be in-place but are only partially so.

For anyone interested, the code is here:

