Some eigenpairs from a large, sparse, nonsymmetric matrix: Julia vs Matlab

Thanks for the detailed explanation of how to implement the shift-invert operation (which I needed!). My code for implementing this is in the file rerun_KrylovKit.jl, shown below:

using KrylovKit: eigsolve
using LinearAlgebra

PQ = P*Q  # P and Q are the sparse matrices defined earlier in the thread

function rerun_KrylovKit(n=1, PQ=PQ)
    Threads.nthreads() == n || error("Restart Julia with -t$n")
    BLAS.set_num_threads(n)
    println("KrylovKit  $n Threads:")
    @time begin
        # Shift-invert: factor (PQ - ev*I) once; each application of the
        # linear map is then just a forward/back solve. The shift ev and
        # the mode count NMODES are globals defined earlier in the thread.
        LUfactors = lu(PQ - ev*I)
        fun(x) = LUfactors \ x
        D2, Exyvecs, info = eigsolve(fun, rand(eltype(PQ), size(PQ,2)), NMODES, :LM; tol=1e-14)
    end
    info.converged ≥ NMODES || @warn("$(info.converged) converged modes, but $(NMODES) requested")
    D2 = inv.(D2[1:NMODES]) .+ ev  # undo the shift-invert transform
    @show D2
    print(info)
end
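
For reference, the back-transformation at the end of rerun_KrylovKit follows from the shift-invert relation: if μ is an eigenvalue of (PQ - ev*I)^-1 with eigenvector v, then λ = 1/μ + ev is the corresponding eigenvalue of PQ with the same eigenvector, which is exactly what inv.(D2) .+ ev computes. Requesting :LM (largest magnitude) on the transformed operator therefore targets the eigenvalues of PQ closest to the shift ev.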

Running this on single-threaded Julia produces the following output:

julia> rerun_KrylovKit(1)
KrylovKit  1 Threads:
  2.402129 seconds (92.02 k allocations: 423.060 MiB, 0.77% gc time, 3.03% compilation time)
D2 = ComplexF64[-11.41757523508308 - 0.0im, -11.34764322859855 - 0.0im, -11.175322773264579 - 0.0im]
ConvergenceInfo: 5 converged values after 2 iterations and 42 applications of the linear map;
norms of residuals are given by (4.651212075017882e-28, 1.1311489902029288e-25, 1.1303118924642023e-20, 6.12279637828794e-19, 3.090439836645266e-16).
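
As a quick sanity check, the back-transformed pairs can be verified against the original matrix directly, since the eigenvectors of the shift-inverted operator are also eigenvectors of PQ. A standalone sketch (reuses PQ, ev, and NMODES from above):

using KrylovKit, LinearAlgebra
F = lu(PQ - ev*I)
μs, vs, _ = eigsolve(x -> F \ x, rand(eltype(PQ), size(PQ, 2)), NMODES, :LM; tol=1e-14)
for (μ, v) in zip(μs[1:NMODES], vs[1:NMODES])
    λ = inv(μ) + ev
    @show norm(PQ*v - λ*v)  # near machine precision for converged pairs
end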

The results are much improved over my initial try, with convergence achieved and execution time competitive with Arpack. I also tried running it with 8 threads and with MKLSparse:

julia> using MKL, MKLSparse

julia> rerun_KrylovKit(8)
KrylovKit  8 Threads:
  2.364362 seconds (1.49 k allocations: 418.421 MiB, 0.80% gc time)
D2 = ComplexF64[-11.41757523508308 - 0.0im, -11.34764322859856 - 0.0im, -11.175322773264574 - 0.0im]
ConvergenceInfo: 5 converged values after 2 iterations and 42 applications of the linear map;
norms of residuals are given by (4.367435621113895e-28, 1.0519282546009974e-25, 1.0801634314205764e-20, 5.526478130311956e-19, 2.822258996060205e-16).

Almost no difference in run time; the difference is within the variation seen across multiple runs.
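
That result is consistent with the code: inside eigsolve the linear map only does triangular solves with the precomputed LU factors and never multiplies by PQ, so there is little sparse matrix-vector work for MKLSparse to accelerate. A minimal sketch for checking where the time goes (assumes PQ and ev as above; BenchmarkTools is an extra dependency):

using BenchmarkTools, LinearAlgebra
F = lu(PQ - ev*I)            # same factorization as in rerun_KrylovKit
b = rand(eltype(PQ), size(PQ, 2))
@btime $F \ $b               # what each Krylov iteration actually spends time on
@btime $PQ * $b              # the sparse mat-vec kernel that MKLSparse targets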

I went through a similar exercise with ArnoldiMethod, using the LinearMap wrapper shown in the package’s documentation. The contents of rerun_ArnoldiMethod.jl are shown below:

using ArnoldiMethod, LinearAlgebra, LinearMaps

PQ = P*Q

# Factorize (A - ev*I) once and build a linear map that applies its
# inverse to a vector, i.e. the shift-invert operator.
function construct_linear_map(A)
    F = factorize(A - ev*I)  # ev is the same shift as above
    LinearMap{eltype(A)}((y, x) -> ldiv!(y, F, x), size(A,1), ismutating=true)
end

function rerun_ArnoldiMethod(n=1, PQ=PQ)
    Threads.nthreads() == n || error("Restart Julia with -t$n")
    BLAS.set_num_threads(n)
    println("ArnoldiMethod $n Threads:")
    @time begin
        decomp, history = partialschur(construct_linear_map(PQ), nev=NMODES, tol=1e-14, restarts=300, which=LM())
        D2, Exy = partialeigen(decomp)
    end
    println(history)
    D2 = inv.(D2) .+ ev
    @show D2
end
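
One difference from the KrylovKit run worth noting: the converged eigenvalues come back in whatever order the partial Schur decomposition produces, not necessarily sorted by distance to the shift (the orderings in the KrylovKit and ArnoldiMethod outputs here are in fact reversed). If a particular order matters, the back-transformed pairs can be sorted explicitly; a small sketch that could go at the end of rerun_ArnoldiMethod, after the back-transformation:

p = sortperm(D2; by = λ -> abs(λ - ev))  # closest to the shift first
D2, Exy = D2[p], Exy[:, p]               # partialeigen returns vectors as columns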

Running this on a single thread produces the following output:

julia> include("rerun_ArnoldiMethod.jl")
rerun_ArnoldiMethod (generic function with 3 methods)

julia> rerun_ArnoldiMethod(1)
ArnoldiMethod 1 Threads:
  2.216459 seconds (1.69 M allocations: 447.162 MiB, 0.94% gc time, 10.12% compilation time)
Converged: 3 of 3 eigenvalues in 38 matrix-vector products
D2 = [-11.175322773264579, -11.347643228598555, -11.41757523508308]
3-element Vector{Float64}:
 -11.175322773264579
 -11.347643228598555
 -11.41757523508308

and with 8 threads (restart of Julia with 8 threads not shown):

julia> using MKL, MKLSparse

julia> rerun_ArnoldiMethod(8)
ArnoldiMethod 8 Threads:
  1.997144 seconds (240 allocations: 367.015 MiB, 0.76% gc time)
Converged: 3 of 3 eigenvalues in 38 matrix-vector products
D2 = [-11.17532277326458, -11.347643228598553, -11.41757523508308]
3-element Vector{Float64}:
 -11.17532277326458
 -11.347643228598553
 -11.41757523508308

So the 8-threaded run with MKLSparse is about 10% faster than the single-threaded one.

Here is an updated summary of the timings with and without multi-threading. “With” also includes use of MKL and MKLSparse for the Julia results.

Package        Function                      1 Thread Time (s)   8 Threads Time (s)
Matlab         eigs                          1.0                 1.0
Arpack         eigs                          2.6                 2.8
KrylovKit      eigsolve                      2.4                 2.4
ArnoldiMethod  partialschur + partialeigen   2.2                 2.0

I only reported the times to the nearest tenth of a second because they vary by almost that much from run to run (possibly because of the random starting vectors?). Only the ArnoldiMethod routines show even a modest reduction in execution time from threading and MKLSparse; Arpack actually got slightly worse. So far the shortest time I can get from Julia is with ArnoldiMethod, which takes about twice as long as eigs in Matlab.
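
If the run-to-run variation really does come from the random starting vectors, fixing the seed (or passing the same explicit starting vector) should make repeated timings comparable; a minimal sketch, assuming PQ, ev, and NMODES as above:

using Random, LinearAlgebra
using KrylovKit: eigsolve
Random.seed!(1234)                  # reproducible starting vector
x0 = rand(eltype(PQ), size(PQ, 2))
F = lu(PQ - ev*I)
@time eigsolve(x -> F \ x, x0, NMODES, :LM; tol=1e-14)

With the start fixed, any remaining timing differences would reflect the machine rather than the iteration count.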