# Some eigenpairs from a large, sparse, nonsymmetric matrix: Julia vs Matlab

In the course of working through a book on frequency domain finite difference solutions for electromagnetics, I have been converting the supplied Matlab programs to Julia. For the problem of determining the modes of an optical rib waveguide, it is necessary to find three eigenvalues nearest a given value (or equivalently with most negative real parts) and their corresponding eigenvectors for a sparse, nonsymmetric, Float64 matrix of order 65274 with 1103319 stored entries. This constitutes the major computing task for analyzing this geometry. Comparing several Julia packages to Matlab 2022a on the same machine gave the following timings:

Package Function Time (s)
Matlab `eigs` 1.02
KrylovKit `eigsolve` 7.61
ArnoldiMethod `partialschur` 4.72
Arpack `eigs` 3.03

These results are on an 8-core i7â€“9700 CPU @ 3.00GHz with 32 GBytes RAM. Julia 1.8.5 runs were preceded by `using MKL`, and Julia was started with the -t8 option to enable multithreading with all cores. Reported times are for the second run to eliminate compilation time. All methods produced accurate solutions.

The fastest Julia solution I found (Arpack) was about 3x slower than the Matlab `eigs` function. Is there anything I can do (or other packages to try) to achieve a faster solution?

Have you checked whether the evaluation x->A*x is threaded in matlab when A is sparse? In julia it is not unless you use a package

Are you doing shift/invert? If so, the time probably mainly depends on how you do the matrix factorization? (Finding the eigenvalues nearest a given value, i.e. in the interior of the spectrum, is quite different than finding the eigenvalues with the most negative real parts.)

(You can check this by printing an iteration count, or incrementing a counter during each call to your linear operator, to see if the numbers of matrixâ€“vector multiplications are similar. Make sure you also set the convergence tolerances to be the same, since defaults may vary.)

As @rveltz commented, you should also check whether the difference is due to parallelization. The simplest way to remove this variable is to run all of your comparisons single-threaded (using the `--use-single-comp-thread` flag in Matlab).

It looks to me like both are multi-threading, with Matlab perhaps doing so more efficiently. I base this on looking at the Windows 11 Resource Monitor. Here is a snapshot while calling Matlab `eigs` in a loop 10 times:

and here is a snapshot while calling the Julia Arpack `eigs` function in a loop 4 times (to require approximately the same total time):

Julia is only using multithreading for dense operations like dot products, not for the sparse products/solves. You can turn this off with

``````julia> using LinearAlgebra

``````

Again, when tracking this sort of thing down, itâ€™s a good idea to disable multi-threading completely (in both Julia and Matlab). Once you understand the serial performance, then you can look into parallel performance.

Apparently, in Matlab one now uses `maxNumCompThreads(1)` at the prompt. After doing this (and verifying that it returns 1 when subsequently called), the timing for calling `eigs` with a `tol` value of 1e-14 was 1.01 seconds. For Julia -t1, and setting `BLAS.set_num_threads(1)` the time needed for the same tolerance setting was 2.76 seconds. So it doesnâ€™t look like threading affects either function very much. Both calls used the `sigma` input variable to request eigenvalues close to a given value, so I assume shift/invert is being done. I havenâ€™t implemented the instrumentation for counting iterations yet. May not get to that till later tonight or tomorrow.

On a side note, in KrylovKit, you have to perform the SI yourself (this is very easy). I guess `eigsolve` did not converge.

I reran KrylovKit a few more times and I canâ€™t reproduce the slow time I initially reported. Maybe I included compilation time? Or perhaps the time strongly depends on the starting random vector? Anyway, what I now get using the `:SR` and `EigSorter` options is as follows:

``````julia> @time D2, Exyvecs, info = eigsolve(PQ, rand(eltype(PQ),size(PQ,2)), NMODES, EigSorter(x -> abs(x-ev)));
4.482947 seconds (2.36 M allocations: 1.575 GiB, 9.18% gc time, 16.20% compilation time)

julia> info
ConvergenceInfo: no converged values after 100 iterations and 1218 applications of the linear map;
norms of residuals are given by (2.724575091679008e-8, 2.7459221420001996e-7, 2.669637953757969e-5).

julia> @time D2, Exyvecs, info = eigsolve(PQ, rand(eltype(PQ),size(PQ,2)), NMODES, :SR);
3.172921 seconds (34.65 k allocations: 1.470 GiB, 2.48% gc time, 0.34% compilation time)

julia> info
ConvergenceInfo: no converged values after 100 iterations and 1218 applications of the linear map;
norms of residuals are given by (4.7814323846721965e-9, 5.155249529207612e-8, 5.3271094081317336e-6).
``````

Even though `eigsolve` reports no convergence, the results are actually accurate enough for my use. And the same three eigensolutions are found using either option. So itâ€™s speed is competitive with Arpackâ€™s `eigs`, though it allocates a lot more than the 460 MB of `eigs`.

You can get multi-threaded sparse BLAS by installing and using the `MKLSparse` package, which should get you closer to the Matlab times. (It overrides the simple methods in `SparseArrays`, so no code changes are needed.)

Have you tried actually using shift-invert with `eigsolve` as suggested? Otherwise you probably arenâ€™t using the same algorithm as you are in Matlab. e.g. pass `x -> (PQ - ev*I) \ x` as your linear operator. Better yet, precompute the factorization:

``````F = lu(PQ - ev*I) # sparse LU for shifted matrix
eigsolve(x -> F \ x, ...)
``````

using the default `which = :LM` (least-magnitude) selector to find eigenvalues closest to `ev` of `PQ`. Then take the returned eigenvalues `Âµ` and convert them back into eigenvalues `Î»` of your original matrix `PQ` with `Î» = inv.(ÎĽ) .+ ev`.

1 Like

Here are the detailed results for the `eigs` function in Matlab and the `eigs` function from Arpack in Julia, both single threadedâ€¦

### Matlab

Contents of rerun.m:

``````maxNumCompThreads(1);
PQ = P*Q;
tic; [Exy,D2] = eigs(PQ, NMODES, ev); toc
[Exy,D2] = eigs(PQ, NMODES, ev, Display=true);
format long
disp('D2 = ')
disp(diag(D2))
``````

and the resulting output:

``````>> rerun
Elapsed time is 0.972256 seconds.

=== Simple eigenvalue problem A*x = lambda*x ===

The eigenvalue problem is real non-symmetric.

Computing 3 eigenvalues closest to -12.25.

Parameters passed to Krylov-Schur method:
Maximum number of iterations: 300
Tolerance: 1e-14
Subspace Dimension: 20

Find eigenvalues of (A - sigma*I)\x = mu*x.
Compute decomposition of (A - sigma*I)...

--- Start of Krylov-Schur method ---
Iteration   1: 0 of 3 eigenvalues converged. Smallest non-converged residual 3.0e-09 (tolerance 1.0e-14).
Iteration   2: 2 of 3 eigenvalues converged. Smallest non-converged residual 1.5e-12 (tolerance 1.0e-14).
Iteration   3: 3 of 3 eigenvalues converged.
D2 =
-11.417575235082193
-11.347643228598985
-11.175322773263716
``````

I also did a run where I passed in a function handle that performed `(PQ - sigma*I)\x` and counted the number of times it was called, which turned out to be 53.

### Julia

Here is the contents of the file rerun.jl:

``````Threads.nthreads() == 1 || error("Restart Julia with -t1")
PQ = P*Q
using MKL
using LinearAlgebra
using Arpack: eigs
@time D2, Exy, nconv, niter, nmult, resid = eigs(PQ, nev=NMODES, sigma=ev, tol=1e-14)
@show(D2, nconv, niter, nmult)
``````

and the resulting output:

``````julia> include("rerun.jl")
2.642512 seconds (67.75 k allocations: 467.161 MiB, 0.38% gc time, 0.63% compilation time)
D2 = ComplexF64[-11.41757523508308 + 0.0im, -11.347643228598551 + 0.0im, -11.175322773264579 + 0.0im]
nconv = 3
niter = 3
nmult = 52
52
``````

So it looks like Matlab and Juliaâ€™s Arpack routine are using the same algorithm and perform the same number of iterations and matrix solves. Matlabâ€™s implementation is several times faster, though.

@stevengj, @rveltz, and @Ralph_Smith: Iâ€™m going to try all your other suggestions as well. They are much appreciated!

To have better comparison, you should set the tolerances and compare the residuals. Id be surprised if you can beat Arpack with interpreted code.

I will set the tolerances on the Julia codes for my upcoming tests. Note that I did set the tolerance to `1-14` for `Arpack.eigs`, which is equal the default Matlab tolerance, as shown in the Matlab results.

Re the residuals: It doesnâ€™t look like the residual vector is available as an output from Matlab. I did look at the residual vector from `Arpack.eigs` but itâ€™s norm is 0.1394, so I donâ€™t understand how to interpret that. I did try to compare the results from Matlabâ€™s and Arpackâ€™s `eigs` functions by examining the 2-norm of the difference between the eigenvalue 3-vectors they each returned. The norm of the difference was 1e-12, satisfyingly small.

I donâ€™t understand this comment since Julia is compiled.

Thanks for the detailed explanation of how to implement the shift-invert operation (which I needed!). My code for implementing this is in the file `rerun_KrylovKit` show below:

``````using KrylovKit: eigsolve
using LinearAlgebra
PQ = P*Q

function rerun_KrylovKit(n=1, PQ=PQ)
Threads.nthreads() == n || error("Restart Julia with -t\$n")
@time begin
LUfactors = lu(PQ - ev*I)
fun(x) = LUfactors \ x
D2, Exyvecs, info = eigsolve(fun, rand(eltype(PQ), size(PQ,2)), NMODES, :LM; tol=1e-14)
end
info.converged â‰Ą NMODES || warning("\$(info.converged) converged modes, but \$(NMODES) requested")
D2 = inv.(D2[1:NMODES]) .+ ev
@show D2
print(info)
end
``````

Running this on single-threaded Julia produces the following output:

``````julia> rerun_KrylovKit(1)
2.402129 seconds (92.02 k allocations: 423.060 MiB, 0.77% gc time, 3.03% compilation time)
D2 = ComplexF64[-11.41757523508308 - 0.0im, -11.34764322859855 - 0.0im, -11.175322773264579 - 0.0im]
ConvergenceInfo: 5 converged values after 2 iterations and 42 applications of the linear map;
norms of residuals are given by (4.651212075017882e-28, 1.1311489902029288e-25, 1.1303118924642023e-20, 6.12279637828794e-19, 3.090439836645266e-16).
``````

The results are much improved over my initial try, with convergence achieved and execution time competitive with Arpack. I also tried running it with 8 threads and with MKLSparse:

``````using MKL, MKLSparse
julia> using MKL, MKLSparse

julia> rerun_KrylovKit(8)
2.364362 seconds (1.49 k allocations: 418.421 MiB, 0.80% gc time)
D2 = ComplexF64[-11.41757523508308 - 0.0im, -11.34764322859856 - 0.0im, -11.175322773264574 - 0.0im]
ConvergenceInfo: 5 converged values after 2 iterations and 42 applications of the linear map;
norms of residuals are given by (4.367435621113895e-28, 1.0519282546009974e-25, 1.0801634314205764e-20, 5.526478130311956e-19, 2.822258996060205e-16).
``````

Almost no difference in run timeâ€“the difference is within the variation seen for multiple runs.

I went through a similar exercise with `ArnoldiMethod`, using the `LinearMap` wrapper shown in the packageâ€™s documentation. The contents of `rerun_ArnoldiMethod.jl` are shown below:

``````using ArnoldiMethod, LinearAlgebra, LinearMaps

PQ = P*Q

# Factorizes A and builds a linear map that applies inv(A) to a vector.
function construct_linear_map(A)
F = factorize(A - ev*I)
LinearMap{eltype(A)}((y, x) -> ldiv!(y, F, x), size(A,1), ismutating=true)
end

function rerun_ArnoldiMethod(n=1, PQ=PQ)
Threads.nthreads() == n || error("Restart Julia with -t\$n")
@time begin
decomp, history = partialschur(construct_linear_map(PQ), nev=NMODES, tol=1e-14, restarts=300, which=LM())
D2, Exy = partialeigen(decomp)
end
println(history)
D2 = inv.(D2) .+ ev
@show D2
end
``````

Running this on a single thread produces the following output:

``````julia> include("rerun_ArnoldiMethod.jl")
rerun_ArnoldiMethod (generic function with 3 methods)

julia> rerun_ArnoldiMethod(1)
2.216459 seconds (1.69 M allocations: 447.162 MiB, 0.94% gc time, 10.12% compilation time)
Converged: 3 of 3 eigenvalues in 38 matrix-vector products
D2 = [-11.175322773264579, -11.347643228598555, -11.41757523508308]
3-element Vector{Float64}:
-11.175322773264579
-11.347643228598555
-11.41757523508308
``````

and with 8 threads (restart of Julia with 8 threads not shown):

``````julia> using MKL, MKLSparse

julia> rerun_ArnoldiMethod(8)
1.997144 seconds (240 allocations: 367.015 MiB, 0.76% gc time)
Converged: 3 of 3 eigenvalues in 38 matrix-vector products
D2 = [-11.17532277326458, -11.347643228598553, -11.41757523508308]
3-element Vector{Float64}:
-11.17532277326458
-11.347643228598553
-11.41757523508308
``````

so the 8-threaded run with `MKLSparse` is about 10% faster than the single-threaded run.

Here is an updated summary of the timings with and without multi-threading. â€śWithâ€ť also includes use of `MKL` and `MKLSparse` for the Julia results.

Package Function 1 Thread Time (s) 8 Threads Time (s)
Matlab eigs 1.0 1.0
Arpack eigs 2.6 2.8
KrylovKit eigsolve 2.4 2.4
ArnoldiMethod partialschur, partialeigen 2.2 2.0

I only reported the times to the nearest tenth of a second because they vary by almost that much from run to run (due to random starting vectors?). Only the `ArnoldiMethod` routines show even a modest reduction in execution time with threading and use of `MKLSparse`. `Arpack` actually got slightly worse. So far, the shortest time I can get from Julia is using `ArnoldiMethod` which takes about twice as long as `eigs` in Matlab.

A couple of years ago (the last time I checked), Matlab implemented this entirely in Arpack.
So, this should really be a comparison of one and the same library doing the same thing in Matlab
and Julia. There may be scope for doing things differently before the computation is handed off to the library, though.

Is it possible then that the almost 3x speed difference is simply that the Mathworks uses a better compiler, say IFORT rather than GCC?

A reasonable suspicion. I donâ€™t have access to any data though.

Actually, Matlab does not use Arpack anymore. Instead, they have implemented their own version of the Krylov-Schur method accessible from the Matlab prompt with `edit eigs.m`.

I am not sure whether the following tweak will resolve your issue, however, I remember from my own work on a modesolver that `SuiteSparse` conducts an iterative refinement of the solutions by default. Disabling this refinement using

``````SuiteSparse.UMFPACK.umf_ctrl[8] = 0
``````

improved the performance of `eigs` significantly for my use case.

4 Likes

As Jonas indicates, in the shift-invert case Julia is using SuiteSparse factorizations. Those have their own solver methods, which do not benefit from `MKLSparse`.

Faster sparse BLAS should help when using direct Krylov schemes to get boundary eigenvalues; e.g. `which=SR()` in `ArnoldiMethod`. That approach usually takes significantly more operator applications than shift-inverse with a good shift, so itâ€™s good if you have a huge matrix, a large number of cores, and/or no good idea for a shift,

Jonasâ€™ solution looks best for your problem class.

1 Like

Using `SuiteSparse.UMFPACK.umf_ctrl[8] = 0` results in

``````julia> rerun_Arpack(8)
The returned eigenvalue vector differs from the slower default `Arpack` solution by only 1e-15 or so! The solution at last! Thanks to all who contributed!
Edit/Update: With this `umf_ctrl` setting, the time for `rerun_Arpack(8)` can be consistently reduced to 1.2 seconds or less by preceding the call with `GC.gc()` (garbage collection).