 # CUDA eigenvalues of a sparse matrix

I’m trying to optimize the eigenvalue (and eigenvector) calculation for a neutrino oscillation package which is currently under development (https://github.com/KM3NeT/Neurthino.jl).
The problem consists of many small symmetric matrices whose eigenvalues need to be determined, and I thought I might be able to utilize the GPU for that. The idea is to pass everything to the GPU as a single problem, using one sparse matrix that has all the small matrices on its diagonal.
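As an aside, such a block-diagonal sparse matrix can also be assembled with `SparseArrays.blockdiag`. This is just a sketch with small illustrative sizes, not code from the package:

```julia
using LinearAlgebra, SparseArrays

dim, n = 3, 4  # small sizes for illustration
blocks = [sparse(Symmetric(rand(dim, dim))) for _ in 1:n]

# blockdiag stacks the sparse blocks along the diagonal,
# giving a (dim*n) x (dim*n) block-diagonal sparse matrix
A = blockdiag(blocks...)
```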

For testing I created this matrix:

```julia
A_full = Symmetric(rand(3000, 3000))
```

When doing it in the “classic” way I get:

```julia
@benchmark eigs(A_full)
BenchmarkTools.Trial:
memory estimate:  808.38 KiB
allocs estimate:  2571
--------------
minimum time:     519.974 ms (0.00% GC)
median time:      527.320 ms (0.00% GC)
mean time:        528.424 ms (0.00% GC)
maximum time:     546.705 ms (0.00% GC)
--------------
samples:          10
evals/sample:     1
```

I also refined the example with respect to my initial problem description and created this example:

```julia
using LinearAlgebra, SparseArrays

dim = 3
n = 1000
A = spzeros(dim*n, dim*n)
for i in 1:n
    A[(i-1)*dim+1:i*dim, (i-1)*dim+1:i*dim] = Symmetric(rand(dim, dim))
end
```
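Since the blocks are independent, one obvious CPU baseline (my own sketch, not something benchmarked above) is to eigen-decompose each small block separately with `LinearAlgebra.eigen`; for a block-diagonal matrix, the union of the block spectra is exactly the full spectrum:

```julia
using LinearAlgebra

dim = 3
n = 1000
blocks = [Symmetric(rand(dim, dim)) for _ in 1:n]

# Decompose each block independently; the eigenvalues of the full
# block-diagonal matrix are just the eigenvalues of all the blocks.
decomps = [eigen(B) for B in blocks]
vals = reduce(vcat, (d.values for d in decomps))
```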

and when calculating the eigenvalues using `Arpack.jl` I get:

```julia
@benchmark eigs(A)
BenchmarkTools.Trial:
memory estimate:  810.03 KiB
allocs estimate:  1889
--------------
minimum time:     24.614 ms (0.00% GC)
median time:      27.885 ms (0.00% GC)
mean time:        28.766 ms (0.12% GC)
maximum time:     47.153 ms (0.00% GC)
--------------
samples:          174
evals/sample:     1
```

Using the CUDA package I get:

```julia
A_cuda = CuArray{ComplexF64}(A)
@benchmark CUDA.@sync CUDA.CUSOLVER.heevd!('V','U', A_cuda)
BenchmarkTools.Trial:
memory estimate:  1.23 KiB
allocs estimate:  33
--------------
minimum time:     1.768 s (0.00% GC)
median time:      1.769 s (0.00% GC)
mean time:        1.770 s (0.00% GC)
maximum time:     1.773 s (0.00% GC)
--------------
samples:          3
evals/sample:     1
```

As this is already slower than the first example, it leads me to the question of whether I have to change my approach.

If I stick to my approach, I guess I have to switch to `CUDA.CUSPARSE.CuSparseMatrixCSR(A)` and use eigenvalue functionality specific to sparse arrays. I have seen that there is a function `CUDA.CUSOLVER.csreigs`, but I don’t really understand how to use it (with respect to its arguments).

Btw., all the testing was done using an NVIDIA GTX 1080 Ti.

Can you help me out at that point?

Thanks.


I don’t know if it makes a difference, but `A` is `Float64` while `A_cuda` is complex, right?

Yes, indeed it is complex, because `heevd!` has the following signature:

```julia
help?> CUDA.CUSOLVER.heevd!
No documentation found.

CUDA.CUSOLVER.heevd! is a Function.

# 2 methods for generic function "heevd!":
heevd!(jobz::Char, uplo::Char, A::CuArray{Complex{Float64},2}) in CUDA.CUSOLVER at /home/test/.julia/packages/CUDA/dZvbp/lib/cusolver/wrappers.jl:720
heevd!(jobz::Char, uplo::Char, A::CuArray{Complex{Float32},2}) in CUDA.CUSOLVER at /home/test/.julia/packages/CUDA/dZvbp/lib/cusolver/wrappers.jl:720
```

I chose the comparison examples as I did because my problem involves real numbers. Do you know how to do it in CUDA with real numbers? Then I will update the benchmark numbers!

I think the correct CUDA function is `syevd`.

How to call that from Julia, though, I don’t know.

First of all thanks for the help and the hint with `syevd`.
Unfortunately my benchmark gets worse using it:

```julia
A = Symmetric(rand(3000,3000));
A_cuda = CuArray{Float64}(A);
@benchmark CUDA.@sync CUDA.CUSOLVER.syevd!('V','U', A_cuda)
BenchmarkTools.Trial:
memory estimate:  992 bytes
allocs estimate:  20
--------------
minimum time:     15.029 s (0.00% GC)
median time:      15.029 s (0.00% GC)
mean time:        15.029 s (0.00% GC)
maximum time:     15.029 s (0.00% GC)
--------------
samples:          1
evals/sample:     1
```

The full thing takes 800 ms here (also about 500 ms using Arpack), so still slower, but not that big of a difference (while avoiding expensive memory copies).

For the sparse case, isn’t `csreigvsi` the API you need?

```julia
julia> @benchmark CUSOLVER.csreigvsi(dA, rand(T), CUDA.rand(T, 3000), 1e-6, Cint(1000), 'O')
BenchmarkTools.Trial:
memory estimate:  1.47 KiB
allocs estimate:  64
--------------
minimum time:     1.384 ms (0.00% GC)
median time:      2.278 ms (0.00% GC)
mean time:        4.155 ms (0.00% GC)
maximum time:     328.102 ms (0.00% GC)
--------------
samples:          1203
evals/sample:     1
```

(Note that these wrappers are a little rough, and would benefit from a clean-up / higher-level functions.)
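For context, `csreigvsi` wraps cuSOLVER’s shift-inverse iteration: it refines a single eigenpair starting from a shift `μ0` and an initial vector `x0`, so by construction it returns only the eigenvalue closest to the shift. A plain-Julia sketch of the underlying idea (my own illustration, not the actual CUSOLVER implementation):

```julia
using LinearAlgebra

# Shift-inverse iteration: repeatedly solves (A - mu0*I) y = x, which
# amplifies the eigenvector whose eigenvalue is closest to the shift mu0.
# This is why the method produces exactly one eigenpair.
function inverse_iteration(A, mu0, x0; tol=1e-6, maxiter=1000)
    F = lu(Matrix(A - mu0 * I))  # factor once, reuse in every iteration
    x = x0 / norm(x0)
    mu = mu0
    for _ in 1:maxiter
        x = F \ x
        x /= norm(x)
        mu_new = dot(x, A * x)   # Rayleigh-quotient eigenvalue estimate
        abs(mu_new - mu) < tol && return mu_new, x
        mu = mu_new
    end
    return mu, x
end

# Deterministic demo: the eigenvalues are 1, 2, 5 and -3; the one
# closest to the shift 0.0 is 1.0, so that is the pair found.
A = Symmetric(Matrix(Diagonal([1.0, 2.0, 5.0, -3.0])))
mu, x = inverse_iteration(A, 0.0, ones(4))
```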


Hey @maleadt, I think that is what I’m looking for: Thank You!!!


Is there a reason why `CUSOLVER.csreigvsi` returns just one eigenvalue and its associated eigenvector when I use exactly your example, @maleadt? I guess there’s some parameter to adjust that? I ask because, for getting just one eigenvalue, there’s also the possibility of using `KrylovKit` and setting `howmany` to `1`, which gives the following benchmark without GPU usage:

```julia
julia> @benchmark eigsolve(A, 1)
BenchmarkTools.Trial:
memory estimate:  7.92 MiB
allocs estimate:  1432
--------------
minimum time:     15.516 ms (0.00% GC)
median time:      15.844 ms (0.00% GC)
mean time:        16.328 ms (1.43% GC)
maximum time:     21.957 ms (13.21% GC)
--------------
samples:          307
evals/sample:     1
```