CUDA porting to shift-inverse eigenvalues

The Arpack eigenvalues solver is very fast. With a 100000x100000 sparse matrix, I can get a few eigenvalues with the shift-invert method in 20-30 seconds approximately!

Since I always want the best performances, I wondered if it was possible to calculate a limited number of eigenvalues through the CUDA.jl package.

As an exercise I speeded up the CPU algorithm, using the KrylovKit.jl package. Here the minimal code

using LinearAlgebra, KrylovKit, IncompleteLU, CUDA
CUDA.allowscalar(false)

# H is an Hermitian matrix. It is often positive-definite.

my_eltype = eltype(H)
n = size(H, 1)
sigma_shift = -2 # the shift for the shift-inverse method
posdef_shift = 5 # perhaps a meaningless parameter. 
                 # It should shift the total matrix to be sure it is positive definite

k = 8 # Number of eigenvalues to return
v0 = rand(my_eltype, n)
v0 /= norm(v0)

A_cpu = H
A_shifted_cpu = A_cpu + (posdef_shift - sigma_shift) * I

P_exact_cpu = factorize(A_shifted_cpu)

@time vals, vecs, info = eigsolve(x -> P_exact_cpu \ x, v0, k, ishermitian = true)
vals = (1 .+ (sigma_shift - posdef_shift) * vals) ./ vals
vals[1:k]

or using IterativeSolvers.jl

P_cpu = ilu(A_shifted_cpu, τ = 0.01)
@time vals, vecs, info = eigsolve(x -> cg(A_shifted_cpu, x; Pl = P_cpu, maxiter = 2000), v0, k, ishermitian = true)
vals = (1 .+ (sigma_shift - posdef_shift) * vals) ./ vals
vals[1:k]

Which are a little bit faster than the Arpack.jl method.

Taking advantage of the versatility of the CUDA.jl, IterativeSolvers.jl and KrylovKit.jl packages, I implemented this GPU version

function LinearAlgebra.ldiv!(y::CuArray, P::CUSPARSE.CuSparseMatrixCSC, x::CuArray)
    copyto!(y, x)                        # Variant for CuSparseMatrixCSR
    CUSPARSE.sv2!('T', 'U', 'N', 1.0, P, y, 'O')  # sv2!('N', 'L', 'N', 1.0, P, y, 'O')
    CUSPARSE.sv2!('N', 'U', 'N', 1.0, P, y, 'O')  # sv2!('T', 'L', 'N', 1.0, P, y, 'O')
    return y
end
function LinearAlgebra.ldiv!(P::CUSPARSE.CuSparseMatrixCSC, x::CuArray)
    CUSPARSE.sv2!('T', 'U', 'N', 1.0, P, x, 'O')  # sv2!('N', 'L', 'N', 1.0, P, y, 'O')
    CUSPARSE.sv2!('N', 'U', 'N', 1.0, P, x, 'O')  # sv2!('T', 'L', 'N', 1.0, P, y, 'O')
    return x
end

v0 = CUDA.rand(my_eltype, n)
v0 /= norm(v0)
A_shifted_gpu = CUSPARSE.CuSparseMatrixCSC(A_shifted_cpu)
P_gpu = CUSPARSE.ic02(A_shifted_gpu, 'O')

@time vals, vecs, info = eigsolve(x -> cg(A_shifted_gpu, x; Pl = P_gpu, maxiter = 2000), v0, k, ishermitian = true)
vals = (1 .+ (sigma - posdef_shift) * vals) ./ vals
vals[1:k]

However this method is slower compared to the GPU version, and is expensive in terms of GPU memory.
I think that the bottleneck is in the linear solve method, since

v0 = rand(my_eltype, n)
v0 /= norm(v0)
v0_gpu = CuArray(v0)

@time test1 = P_exact_cpu \ v0
println(test1[1])
println("")

@time test2 = cg(A_shifted_cpu, v0; Pl = P_cpu, maxiter = 2000)
println(test2[1])
println("")

@time test3 = cg(A_shifted_gpu, v0_gpu; Pl = P_gpu, maxiter = 2000)
println(Array(test3)[1])
println("")

returns

"""
0.000764 seconds (12 allocations: 127.094 KiB)
1.966588992514033e-6 + 1.98235638546855e-6im

  0.001486 seconds (18 allocations: 126.375 KiB)
1.9665889918161988e-6 + 1.982356384764342e-6im

  0.089783 seconds (5.69 k allocations: 244.609 KiB)
1.9665889861684783e-6 + 1.9823563706705657e-6im
"""

I think that solving eigenvalues problems for large matrices with GPUs is very useful, and I hope o succeed one day.