Some eigenpairs from a large, sparse, nonsymmetric matrix: Julia vs Matlab

In the course of working through a book on frequency-domain finite-difference solutions for electromagnetics, I have been converting the supplied Matlab programs to Julia. For the problem of determining the modes of an optical rib waveguide, it is necessary to find the three eigenvalues nearest a given value (or, equivalently here, those with the most negative real parts) and their corresponding eigenvectors for a sparse, nonsymmetric, Float64 matrix of order 65274 with 1103319 stored entries. This constitutes the major computing task in analyzing this geometry. Comparing several Julia packages to Matlab R2022a on the same machine gave the following timings:

Package        Function       Time (s)
Matlab         eigs              1.02
KrylovKit      eigsolve          7.61
ArnoldiMethod  partialschur      4.72
Arpack         eigs              3.03

These results are from an 8-core i7-9700 CPU @ 3.00 GHz with 32 GB of RAM. The Julia 1.8.5 runs were preceded by using MKL, and Julia was started with the -t8 option to enable multithreading on all cores. Reported times are for the second run, to exclude compilation time. All methods produced accurate solutions.

The fastest Julia solution I found (Arpack) was about 3x slower than the Matlab eigs function. Is there anything I can do (or other packages to try) to achieve a faster solution?

Have you checked whether the evaluation x -> A*x is threaded in Matlab when A is sparse? In Julia it is not, unless you use a package.

Are you doing shift/invert? If so, the time probably depends mainly on how you do the matrix factorization. (Finding the eigenvalues nearest a given value, i.e. in the interior of the spectrum, is quite different from finding the eigenvalues with the most negative real parts.)
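
(For reference, shift-invert runs the Krylov iteration on the shifted and inverted operator $(A - \sigma I)^{-1}$, whose eigenvalues relate to those of $A$ by

$$\mu = \frac{1}{\lambda - \sigma}, \qquad \lambda = \sigma + \frac{1}{\mu},$$

so the eigenvalues $\lambda$ closest to the shift $\sigma$ become the largest-magnitude eigenvalues $\mu$ and converge first, at the cost of a sparse factorization and one linear solve per Krylov step.)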

(You can check this by printing an iteration count, or incrementing a counter during each call to your linear operator, to see if the numbers of matrix–vector multiplications are similar. Make sure you also set the convergence tolerances to be the same, since defaults may vary.)
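
One way to do the counting in Julia is to wrap the operator in a closure that increments a counter. A minimal sketch, using the names PQ and NMODES from the original post and KrylovKit's eigsolve (the same count is also reported as info.numops):

using LinearAlgebra, SparseArrays
using KrylovKit: eigsolve

ncalls = Ref(0)                             # counter captured by the closure
counted_op = x -> (ncalls[] += 1; PQ * x)   # count every matrix-vector product
vals, vecs, info = eigsolve(counted_op, rand(eltype(PQ), size(PQ, 2)), NMODES, :SR; tol=1e-14)
@show ncalls[] info.numops                  # the two counts should agree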

As @rveltz commented, you should also check whether the difference is due to parallelization. The simplest way to remove this variable is to run all of your comparisons single-threaded (using the -singleCompThread startup flag in Matlab).

It looks to me like both are multi-threading, with Matlab perhaps doing so more efficiently. I base this on Windows 11 Resource Monitor snapshots taken while calling Matlab's eigs in a loop 10 times, and while calling the Julia Arpack eigs function in a loop 4 times (to require approximately the same total time).

Julia is only using multithreading for dense operations like dot products, not for the sparse products/solves. You can turn this off with

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

Again, when tracking this sort of thing down, it’s a good idea to disable multi-threading completely (in both Julia and Matlab). Once you understand the serial performance, then you can look into parallel performance.

Apparently, in Matlab one now uses maxNumCompThreads(1) at the prompt. After doing this (and verifying that it returns 1 when subsequently called), the timing for calling eigs with a tol value of 1e-14 was 1.01 seconds. For Julia started with -t1 and with BLAS.set_num_threads(1), the time needed for the same tolerance setting was 2.76 seconds. So it doesn't look like threading affects either function very much. Both calls used the sigma input argument to request eigenvalues close to a given value, so I assume shift/invert is being done. I haven't implemented the instrumentation for counting iterations yet; I may not get to that until later tonight or tomorrow.

On a side note, in KrylovKit you have to perform the shift-invert yourself (this is very easy). I guess eigsolve did not converge.

I reran KrylovKit a few more times and I can’t reproduce the slow time I initially reported. Maybe I included compilation time? Or perhaps the time strongly depends on the starting random vector? Anyway, what I now get using the :SR and EigSorter options is as follows:

julia> @time D2, Exyvecs, info = eigsolve(PQ, rand(eltype(PQ),size(PQ,2)), NMODES, EigSorter(x -> abs(x-ev)));
  4.482947 seconds (2.36 M allocations: 1.575 GiB, 9.18% gc time, 16.20% compilation time)

julia> info
ConvergenceInfo: no converged values after 100 iterations and 1218 applications of the linear map;
norms of residuals are given by (2.724575091679008e-8, 2.7459221420001996e-7, 2.669637953757969e-5).


julia> @time D2, Exyvecs, info = eigsolve(PQ, rand(eltype(PQ),size(PQ,2)), NMODES, :SR);
  3.172921 seconds (34.65 k allocations: 1.470 GiB, 2.48% gc time, 0.34% compilation time)

julia> info
ConvergenceInfo: no converged values after 100 iterations and 1218 applications of the linear map;
norms of residuals are given by (4.7814323846721965e-9, 5.155249529207612e-8, 5.3271094081317336e-6).

Even though eigsolve reports no convergence, the results are actually accurate enough for my use, and the same three eigensolutions are found using either option. So its speed is competitive with Arpack's eigs, though it allocates a lot more than eigs's 460 MB.

You can get multi-threaded sparse BLAS by installing and using the MKLSparse package, which should get you closer to the Matlab times. (It overrides the simple methods in SparseArrays, so no code changes are needed.)
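
A quick way to sanity-check that the override is active (the matrix size and density here are arbitrary, just for illustration):

using SparseArrays, MKLSparse

A = sprand(10_000, 10_000, 1e-3)   # arbitrary sparse test matrix
x = rand(10_000)
@time A * x                        # with MKLSparse loaded, this dispatches to MKL's threaded sparse kernels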

Have you tried actually using shift-invert with eigsolve as suggested? Otherwise you probably aren't using the same algorithm as in Matlab. For example, pass x -> (PQ - ev*I) \ x as your linear operator. Better yet, precompute the factorization:

F = lu(PQ - ev*I) # sparse LU for shifted matrix
eigsolve(x -> F \ x, ...)

using the default which = :LM (largest-magnitude) selector, which for the shift-inverted operator finds the eigenvalues of PQ closest to ev. Then take the returned eigenvalues μ and convert them back into eigenvalues λ of your original matrix PQ with λ = inv.(μ) .+ ev.


Here are the detailed results for the eigs function in Matlab and the eigs function from Arpack in Julia, both single threaded…

Matlab

Contents of rerun.m:

maxNumCompThreads(1);
PQ = P*Q;
tic; [Exy,D2] = eigs(PQ, NMODES, ev); toc
[Exy,D2] = eigs(PQ, NMODES, ev, Display=true);
format long
disp('D2 = ')
disp(diag(D2))

and the resulting output:

>> rerun
Elapsed time is 0.972256 seconds.

=== Simple eigenvalue problem A*x = lambda*x ===

The eigenvalue problem is real non-symmetric.

Computing 3 eigenvalues closest to -12.25.


Parameters passed to Krylov-Schur method:
  Maximum number of iterations: 300
  Tolerance: 1e-14
  Subspace Dimension: 20

Find eigenvalues of (A - sigma*I)\x = mu*x.
Compute decomposition of (A - sigma*I)...

--- Start of Krylov-Schur method ---
Iteration   1: 0 of 3 eigenvalues converged. Smallest non-converged residual 3.0e-09 (tolerance 1.0e-14).
Iteration   2: 2 of 3 eigenvalues converged. Smallest non-converged residual 1.5e-12 (tolerance 1.0e-14).
Iteration   3: 3 of 3 eigenvalues converged.
D2 = 
 -11.417575235082193
 -11.347643228598985
 -11.175322773263716

I also did a run where I passed in a function handle that performed (PQ - sigma*I)\x and counted the number of times it was called, which turned out to be 53.

Julia

Here is the contents of the file rerun.jl:

Threads.nthreads() == 1 || error("Restart Julia with -t1")
PQ = P*Q
using MKL
using LinearAlgebra
BLAS.set_num_threads(1)
using Arpack: eigs
@time D2, Exy, nconv, niter, nmult, resid = eigs(PQ, nev=NMODES, sigma=ev, tol=1e-14)
@show(D2, nconv, niter, nmult)

and the resulting output:

julia> include("rerun.jl")
  2.642512 seconds (67.75 k allocations: 467.161 MiB, 0.38% gc time, 0.63% compilation time)
D2 = ComplexF64[-11.41757523508308 + 0.0im, -11.347643228598551 + 0.0im, -11.175322773264579 + 0.0im]
nconv = 3
niter = 3
nmult = 52
52

So it looks like Matlab and Julia’s Arpack routine are using the same algorithm and perform the same number of iterations and matrix solves. Matlab’s implementation is several times faster, though.

@stevengj, @rveltz, and @Ralph_Smith: I’m going to try all your other suggestions as well. They are much appreciated!

To have a better comparison, you should set the tolerances and compare the residuals. I'd be surprised if you can beat Arpack with interpreted code.

I will set the tolerances in the Julia codes for my upcoming tests. Note that I did set the tolerance to 1e-14 for Arpack.eigs, which equals the default Matlab tolerance, as shown in the Matlab results.

Re the residuals: it doesn't look like the residual vector is available as an output from Matlab. I did look at the residual vector from Arpack.eigs, but its norm is 0.1394, so I don't understand how to interpret that. I did try to compare the results from Matlab's and Arpack's eigs functions by examining the 2-norm of the difference between the eigenvalue 3-vectors they each returned; the norm of the difference was about 1e-12, satisfyingly small.
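
For a more direct check, the residual norms can be computed from the returned eigenpairs themselves. A small sketch, using PQ, D2, and Exy from the Arpack run above:

using LinearAlgebra

# residual norm ||PQ*v - λ*v|| / ||v|| for each returned eigenpair
for (i, λ) in enumerate(D2)
    v = Exy[:, i]
    println("mode $i: residual = ", norm(PQ * v - λ * v) / norm(v))
end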

I don’t understand this comment since Julia is compiled.

Thanks for the detailed explanation of how to implement the shift-invert operation (which I needed!). My code for implementing it is in the file rerun_KrylovKit shown below:

using KrylovKit: eigsolve
using LinearAlgebra
PQ = P*Q

function rerun_KrylovKit(n=1, PQ=PQ)
    Threads.nthreads() == n || error("Restart Julia with -t$n")
    BLAS.set_num_threads(n)
    println("KrylovKit  $n Threads:")
    @time begin
        LUfactors = lu(PQ - ev*I)
        fun(x) = LUfactors \ x
        D2, Exyvecs, info = eigsolve(fun, rand(eltype(PQ), size(PQ,2)), NMODES, :LM; tol=1e-14)
    end
    info.converged ≥ NMODES || @warn "$(info.converged) converged modes, but $(NMODES) requested"
    D2 = inv.(D2[1:NMODES]) .+ ev
    @show D2
    print(info)
end

Running this on single-threaded Julia produces the following output:

julia> rerun_KrylovKit(1)
KrylovKit  1 Threads:
  2.402129 seconds (92.02 k allocations: 423.060 MiB, 0.77% gc time, 3.03% compilation time)
D2 = ComplexF64[-11.41757523508308 - 0.0im, -11.34764322859855 - 0.0im, -11.175322773264579 - 0.0im]
ConvergenceInfo: 5 converged values after 2 iterations and 42 applications of the linear map;
norms of residuals are given by (4.651212075017882e-28, 1.1311489902029288e-25, 1.1303118924642023e-20, 6.12279637828794e-19, 3.090439836645266e-16).

The results are much improved over my initial try, with convergence achieved and execution time competitive with Arpack. I also tried running it with 8 threads and with MKLSparse:

julia> using MKL, MKLSparse

julia> rerun_KrylovKit(8)
KrylovKit  8 Threads:
  2.364362 seconds (1.49 k allocations: 418.421 MiB, 0.80% gc time)
D2 = ComplexF64[-11.41757523508308 - 0.0im, -11.34764322859856 - 0.0im, -11.175322773264574 - 0.0im]
ConvergenceInfo: 5 converged values after 2 iterations and 42 applications of the linear map;
norms of residuals are given by (4.367435621113895e-28, 1.0519282546009974e-25, 1.0801634314205764e-20, 5.526478130311956e-19, 2.822258996060205e-16).

Almost no difference in run time: the difference is within the variation seen across multiple runs.

I went through a similar exercise with ArnoldiMethod, using the LinearMap wrapper shown in the package’s documentation. The contents of rerun_ArnoldiMethod.jl are shown below:

using ArnoldiMethod, LinearAlgebra, LinearMaps

PQ = P*Q

# Factorizes A - ev*I and builds a linear map that applies its inverse to a vector.
function construct_linear_map(A)
    F = factorize(A - ev*I)
    LinearMap{eltype(A)}((y, x) -> ldiv!(y, F, x), size(A,1), ismutating=true)
end

function rerun_ArnoldiMethod(n=1, PQ=PQ)
    Threads.nthreads() == n || error("Restart Julia with -t$n")
    BLAS.set_num_threads(n)
    println("ArnoldiMethod $n Threads:")
    @time begin
        decomp, history = partialschur(construct_linear_map(PQ), nev=NMODES, tol=1e-14, restarts=300, which=LM())
        D2, Exy = partialeigen(decomp)
    end
    println(history)
    D2 = inv.(D2) .+ ev
    @show D2
end

Running this on a single thread produces the following output:

julia> include("rerun_ArnoldiMethod.jl")
rerun_ArnoldiMethod (generic function with 3 methods)

julia> rerun_ArnoldiMethod(1)
ArnoldiMethod 1 Threads:
  2.216459 seconds (1.69 M allocations: 447.162 MiB, 0.94% gc time, 10.12% compilation time)
Converged: 3 of 3 eigenvalues in 38 matrix-vector products
D2 = [-11.175322773264579, -11.347643228598555, -11.41757523508308]
3-element Vector{Float64}:
 -11.175322773264579
 -11.347643228598555
 -11.41757523508308

and with 8 threads (restart of Julia with 8 threads not shown):

julia> using MKL, MKLSparse

julia> rerun_ArnoldiMethod(8)
ArnoldiMethod 8 Threads:
  1.997144 seconds (240 allocations: 367.015 MiB, 0.76% gc time)
Converged: 3 of 3 eigenvalues in 38 matrix-vector products
D2 = [-11.17532277326458, -11.347643228598553, -11.41757523508308]
3-element Vector{Float64}:
 -11.17532277326458
 -11.347643228598553
 -11.41757523508308

so the 8-threaded run with MKLSparse is about 10% faster than the single-threaded run.

Here is an updated summary of the timings with and without multi-threading. “With” also includes use of MKL and MKLSparse for the Julia results.

Package        Function                     1 Thread Time (s)   8 Threads Time (s)
Matlab         eigs                         1.0                 1.0
Arpack         eigs                         2.6                 2.8
KrylovKit      eigsolve                     2.4                 2.4
ArnoldiMethod  partialschur, partialeigen   2.2                 2.0

I only reported the times to the nearest tenth of a second because they vary by almost that much from run to run (due to random starting vectors?). Only the ArnoldiMethod routines show even a modest reduction in execution time with threading and use of MKLSparse; Arpack actually got slightly worse. So far, the shortest time I can get from Julia is with ArnoldiMethod, which takes about twice as long as eigs in Matlab.

A couple of years ago (the last time I checked), Matlab implemented this entirely in Arpack. So this should really be a comparison of one and the same library doing the same thing in Matlab and Julia. There may be scope for doing things differently before the computation is handed off to the library, though.

Is it possible, then, that the almost 3x speed difference is simply that MathWorks uses a better compiler, say IFORT rather than GCC?

A reasonable suspicion. I don’t have access to any data though.

Actually, Matlab does not use Arpack anymore. Instead, they have implemented their own version of the Krylov-Schur method accessible from the Matlab prompt with edit eigs.m.

I am not sure whether the following tweak will resolve your issue; however, I remember from my own work on a mode solver that SuiteSparse performs iterative refinement of the solutions by default. Disabling this refinement using

SuiteSparse.UMFPACK.umf_ctrl[8] = 0

improved the performance of eigs significantly for my use case.
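
For concreteness, a sketch of how this fits with the earlier Arpack run (PQ, ev, and NMODES as before; the setting must be in effect before eigs factorizes the shifted matrix):

using SuiteSparse
using Arpack: eigs

SuiteSparse.UMFPACK.umf_ctrl[8] = 0   # disable UMFPACK's iterative refinement of each solve
D2, Exy, nconv, niter, nmult, resid = eigs(PQ, nev=NMODES, sigma=ev, tol=1e-14)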


As Jonas indicates, in the shift-invert case Julia is using SuiteSparse factorizations. Those have their own solver methods, which do not benefit from MKLSparse.

Faster sparse BLAS should help when using direct Krylov schemes to get boundary eigenvalues, e.g. which=SR() in ArnoldiMethod. That approach usually takes significantly more operator applications than shift-invert with a good shift, so it's best suited when you have a huge matrix, a large number of cores, and/or no good idea for a shift.
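
For reference, the direct (boundary-eigenvalue) approach looks roughly like this, with PQ and NMODES as before and an arbitrary tolerance:

using ArnoldiMethod

decomp, history = partialschur(PQ, nev=NMODES, tol=1e-10, which=SR())
D2, Exy = partialeigen(decomp)   # eigenvalues with smallest real part; no factorization needed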

Jonas’ solution looks best for your problem class.


Using SuiteSparse.UMFPACK.umf_ctrl[8] = 0 results in

julia> rerun_Arpack(8)
Arpack  8 Threads:
  1.307017 seconds (915 allocations: 360.065 MiB, 9.91% gc time)
D2 = ComplexF64[-11.417575235083103 + 0.0im, -11.347643228598558 + 0.0im, -11.175322773264604 + 0.0im]
nconv = 3
niter = 3
nmult = 52

The returned eigenvalue vector differs from the slower default Arpack solution by only 1e-15 or so! The solution at last! Thanks to all who contributed!

Edit/Update: With this umf_ctrl setting, the time for rerun_Arpack(8) can be consistently reduced to 1.2 seconds or less by preceding the call with GC.gc() (garbage collection).
