Eigenvalues for lots of small matrices, GPU batched vs CPU eigen

I need to diagonalize many small matrices of the same size, say 50000 matrices of size 50x50, and I wonder whether using the GPU to diagonalize them in parallel will help.

So I set up a small test:

using LinearAlgebra
using CUDA

N = 50;
matNum = 1000;

matReLst = [ Symmetric( rand(N,N) ) for it = 1 : matNum ];
matReArr = zeros( N, N, matNum );
for n = 1 : matNum
	matReArr[:,:,n] = matReLst[n];
end
matReArrCu = cu(matReArr);  # note: cu() converts to Float32 by default

function testEigenRe()
	Threads.@threads for ii in eachindex(matReLst)
		eigen(matReLst[ii]);
	end
end

function testCuSolRe()
	# syevjBatched! overwrites its input (with eigenvectors in 'V' mode), so run
	# it on a copy; CUDA.@sync ensures the asynchronous GPU work is fully timed.
	CUDA.@sync CUDA.CUSOLVER.syevjBatched!('V','U',copy(matReArrCu));
end

This creates 1000 symmetric 50x50 matrices and solves them either with eigen() on the CPU or with the batched solver syevjBatched! provided by cuSOLVER.
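One detail that may quietly skew this comparison (an assumption on my part about the setup above): CUDA.jl's cu() converts to Float32 by default, while eigen() on the CPU runs in Float64. A minimal sketch of a matched-precision CPU baseline:

```julia
using LinearAlgebra

# cu() gives a Float32 CuArray by default, so for an apples-to-apples
# comparison the CPU side should also work in Float32:
N = 50
A32 = Symmetric(rand(Float32, N, N))
vals, vecs = eigen(A32)          # single-precision LAPACK path
# sanity check: the decomposition reconstructs the matrix
@assert A32 ≈ vecs * Diagonal(vals) * vecs'
```

(Alternatively, CuArray(matReArr) instead of cu(matReArr) keeps the data in Float64 on the GPU.)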

Then I run

@btime testEigenRe()


@btime testCuSolRe()

and I get about 100 ms and 980 ms respectively for the 50x50 matrices, so the GPU is a lot slower.

For 10x10 matrices the times are 5.6 ms and 4.1 ms respectively, so the GPU is slightly faster.

Is this to be expected? Does it mean the GPU is not suitable for this kind of problem?
I have seen a claim on the web that when diagonalizing many matrices in parallel, each thread needs some extra working memory, so the number of threads that can run concurrently is limited by the GPU's memory. As the matrix size grows a little, fewer threads fit on the GPU, which drags down performance. Is this a plausible argument?
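As a sanity check on that argument with the numbers from the full problem, here is a rough estimate of my own (the Jacobi solver's internal workspace is ignored, so the true footprint is larger):

```julia
# Back-of-envelope device-memory estimate for the full problem:
# 50_000 matrices of size 50x50 in Float64.
N, batch = 50, 50_000
bytes_mats = N * N * batch * sizeof(Float64)  # input, overwritten by eigenvectors in 'V' mode
bytes_vals = N * batch * sizeof(Float64)      # eigenvalues
total_GiB  = (bytes_mats + bytes_vals) / 2^30
# ≈ 0.95 GiB before any solver workspace -- already a large fraction of
# the MX150's ~1.5 GiB of free memory.
```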
The argument came from here:

I also wonder if these numbers, by themselves, are reasonable. Hopefully I didn’t do anything to inadvertently hurt performance.

My versioninfo():

Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)


CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0

- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: missing
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

- Julia: 1.5.2
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device:
  0: GeForce MX150 (sm_61, 1.545 GiB / 2.000 GiB available)

Julia is started with julia -t 8, so 8 threads, and the BLAS library was replaced with Intel MKL via the MKL.jl package.

Appreciate anyone’s response.