CUDA eigenvalues of a sparse matrix

I’m trying to optimize the eigenvalue (and eigenvector) calculation for a neutrino oscillation package which is currently under development (https://github.com/KM3NeT/Neurthino.jl).
The problem consists of a lot of small symmetric matrices where the eigenvalues should be determined and I thought maybe I can utilize the GPU for that problem. In order to pass this as a single problem to the GPU via a sparse matrix having all the small matrices on its trace.

For testing I created this matrix:

A_full = Symmetric(rand(3000, 3000))

When doing it in the “classic” way I get:

@benchmark eigs(A_full)
BenchmarkTools.Trial: 
  memory estimate:  808.38 KiB
  allocs estimate:  2571
  --------------
  minimum time:     519.974 ms (0.00% GC)
  median time:      527.320 ms (0.00% GC)
  mean time:        528.424 ms (0.00% GC)
  maximum time:     546.705 ms (0.00% GC)
  --------------
  samples:          10
  evals/sample:     1

I also refined the example with respect to my initial problem description and created this example:

dim = 3
n = 1000
A = spzeros(dim*n, dim*n)
for i in 1:n
     A[(i-1)*dim+1:i*dim, (i-1)*dim+1:i*dim] = Symmetric(rand(dim, dim))
end

and when calculating the eigenvalues using Arpack.jl I get:

@benchmark eigs(A)
BenchmarkTools.Trial: 
  memory estimate:  810.03 KiB
  allocs estimate:  1889
  --------------
  minimum time:     24.614 ms (0.00% GC)
  median time:      27.885 ms (0.00% GC)
  mean time:        28.766 ms (0.12% GC)
  maximum time:     47.153 ms (0.00% GC)
  --------------
  samples:          174
  evals/sample:     1

Using the CUDA package I get:

A_cuda = CuArray{ComplexF64}(A)
@benchmark CUDA.@sync CUDA.CUSOLVER.heevd!('V','U', A_cuda)
BenchmarkTools.Trial: 
  memory estimate:  1.23 KiB
  allocs estimate:  33
  --------------
  minimum time:     1.768 s (0.00% GC)
  median time:      1.769 s (0.00% GC)
  mean time:        1.770 s (0.00% GC)
  maximum time:     1.773 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

As this is already slower than the first example it leads me to the question whether I have to change my approach?

If I stick to my approach I guess I have to change it to CUDA.CUSPARSE.CuSparseMatrixCSR(A) and use specific eigenvalue functionality for sparse arrays, because I have seen there is a function CUDA.CUSOLVER.csreigs, but I don’t really get how to use it (with respect to its arguments).

Btw. all the testing was done using a NVIDIA GTX 1080 Ti.

Can you help me out at that point?

Thanks.

2 Likes

I don’t know if that makes a difference but A is Float64 while A_cuda is complex right?

Yes indeed it is complex, because heevd has the following signature:

help?> CUDA.CUSOLVER.heevd!
  No documentation found.

  CUDA.CUSOLVER.heevd! is a Function.

  # 2 methods for generic function "heevd!":
  [1] heevd!(jobz::Char, uplo::Char, A::CuArray{Complex{Float64},2}) in CUDA.CUSOLVER at /home/test/.julia/packages/CUDA/dZvbp/lib/cusolver/wrappers.jl:720
  [2] heevd!(jobz::Char, uplo::Char, A::CuArray{Complex{Float32},2}) in CUDA.CUSOLVER at /home/test/.julia/packages/CUDA/dZvbp/lib/cusolver/wrappers.jl:720

I chose the comparing examples as I did, because my problem has real numbers. Do you know how to do it in CUDA with real numbers, then I will update my numbers on the benchmark!?

I think the correct cuda function is syevd

How to call that from julia though I don’t know.

First of all thanks for the help and the hint with syevd.
Unfortunately my benchmark gets worse using it:

A = Symmetric(rand(3000,3000));
A_cuda = CuArray{Float64}(A);
@benchmark CUDA.@sync CUDA.CUSOLVER.syevd!('V','U', A_cuda)
BenchmarkTools.Trial: 
  memory estimate:  992 bytes
  allocs estimate:  20
  --------------
  minimum time:     15.029 s (0.00% GC)
  median time:      15.029 s (0.00% GC)
  mean time:        15.029 s (0.00% GC)
  maximum time:     15.029 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

The full thing takes 800ms here (also about 500ms using Arpack), so still slower but not that big of a different (while avoiding expensive memory copies).

For the sparse case, isn’t csreigvsi the API you need?

julia> @benchmark CUSOLVER.csreigvsi(dA, rand(T), CUDA.rand(T, 3000), 1e-6, Cint(1000), 'O')
BenchmarkTools.Trial: 
  memory estimate:  1.47 KiB
  allocs estimate:  64
  --------------
  minimum time:     1.384 ms (0.00% GC)
  median time:      2.278 ms (0.00% GC)
  mean time:        4.155 ms (0.00% GC)
  maximum time:     328.102 ms (0.00% GC)
  --------------
  samples:          1203
  evals/sample:     1

(Note that these wrappers are a little rough, and would benefit from a clean-up / higher-level functions.)

3 Likes

Hey @maleadt, I think that is what I’m looking for: Thank You!!!

1 Like

Is there a reason why this CUSOLVER.csreigvsi just returns one eigenvalue and its associated eigenvector, when I use it exactly your example, @maleadt . I guess there’s some parameter to adjust it? I guess so, because for just getting one eigenvalue there’s also the possibility to use KrylovKit and set howmany to 1, which has the following benchmark without GPU usage:

julia> @benchmark eigsolve(A, 1)
BenchmarkTools.Trial: 
  memory estimate:  7.92 MiB
  allocs estimate:  1432
  --------------
  minimum time:     15.516 ms (0.00% GC)
  median time:      15.844 ms (0.00% GC)
  mean time:        16.328 ms (1.43% GC)
  maximum time:     21.957 ms (13.21% GC)
  --------------
  samples:          307
  evals/sample:     1