Eigenvalues for lots of small matrices, GPU batched vs CPU eigen

arxiv · November 26, 2020, 2:48am

I need to diagonalize many small matrices of the same size, for example 50000 matrices of size 50x50, and I wonder whether using a GPU to diagonalize them in parallel would help.

So I set up a small test:

using LinearAlgebra
using CUDA
using BenchmarkTools  # provides @btime, used below

N = 50;
matNum = 1000;

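# Build the test matrices on the CPU (Float64), then pack them into one N x N x matNum array.
matReLst = [ Symmetric( rand(N,N) ) for it = 1 : matNum ];
matReArr = zeros( N, N, matNum );
for n = 1 : matNum
	matReArr[:,:,n] = matReLst[n];
end
# Note: cu() converts Float64 data to Float32 when uploading to the GPU.
matReArrCu = cu(matReArr);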

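function testEigenRe()
	# CPU: one eigen() call per matrix, spread over the Julia threads; results are discarded.
	Threads.@threads for ii in eachindex(matReLst)
		eigen(matReLst[ii]);
	end
end

function testCuSolRe()
	# GPU: in-place batched Jacobi solver; overwrites matReArrCu with the eigenvectors.
	sols = CUDA.CUSOLVER.syevjBatched!('V','U',matReArrCu);
end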

This creates 1000 symmetric matrices of size 50x50 and solves them either with eigen() on the CPU or with the batched solver syevjBatched! provided by CUSOLVER.
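For reference, here is a minimal sketch of a CPU-side variant that keeps the results and pins BLAS/LAPACK to one thread, on the assumption that MKL's own threading could compete with the 8 Julia threads. The helper name eigen_all and the variable mats are hypothetical and not part of the test above; whether this changes the timing on this machine is untested.

```julia
using LinearAlgebra

# Assumption: one BLAS/LAPACK thread per call, so the library's threads
# do not compete with the Julia threads started via `julia -t 8`.
BLAS.set_num_threads(1)

N = 50
matNum = 1000
mats = [Symmetric(rand(N, N)) for _ in 1:matNum]

# Hypothetical helper: diagonalize every matrix and keep the factorizations.
function eigen_all(mats)
    # Preallocate with the concrete result type of eigen() for these inputs.
    results = Vector{typeof(eigen(mats[1]))}(undef, length(mats))
    Threads.@threads for i in eachindex(mats)
        results[i] = eigen(mats[i])
    end
    return results
end

res = eigen_all(mats)
```

Passing the matrices as an argument instead of reading a non-constant global also avoids type instability inside the loop, which is a general Julia performance recommendation.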

Then I run

@btime testEigenRe()

and

@btime testCuSolRe()

and I get about 100ms and 980ms respectively for 50x50 matrices, so the GPU is almost 10x slower.

For 10x10 matrices the times are 5.6ms and 4.1ms respectively, so the GPU is a bit faster.

Is this to be expected? Does it mean that the GPU is not suitable for this kind of problem?
I have seen an argument on the web that when multiple diagonalizations run in parallel, each thread requires some extra memory, so the number of threads that can run is limited by the memory of the GPU. When the matrices get a little bigger, fewer threads can run on the GPU, which brings down performance. Is this a plausible argument?
The argument came from here:

I also wonder if these numbers, by themselves, are reasonable. Hopefully I didn’t do anything to inadvertently hurt performance.
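As a rough check on the memory side of that argument, the following sketch estimates how much storage the full batch of 50000 matrices of size 50x50 would need. It counts only the dense input array; the solver's per-matrix workspace is not included because its size is not known here. The helper names batch_bytes and gib are hypothetical.

```julia
# Hypothetical helper: bytes needed for `count` dense n-by-n matrices of element type T.
batch_bytes(T, n, count) = sizeof(T) * n^2 * count

# Convert bytes to GiB.
gib(b) = b / 2^30

for T in (Float32, Float64)
    println(T, ": ", round(gib(batch_bytes(T, 50, 50_000)), digits=3), " GiB")
end
```

This comes to roughly 0.47 GiB in Float32 and 0.93 GiB in Float64, which is a large share of the 1.545 GiB reported as available on the device below, before any workspace is allocated.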

My versioninfo():

Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

CUDA.versioninfo():

CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: missing
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.5.2
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device:
  0: GeForce MX150 (sm_61, 1.545 GiB / 2.000 GiB available)

Julia was started with julia -t 8, so with 8 threads, and the BLAS library was replaced with Intel MKL via the MKL.jl package.

I would appreciate any response.
