The full thing takes 800ms here (compared to about 500ms using Arpack), so it is still slower, but not that big of a difference (while avoiding expensive memory copies).
For the sparse case, isn’t csreigvsi the API you need?
julia> @benchmark CUSOLVER.csreigvsi(dA, rand(T), CUDA.rand(T, 3000), 1e-6, Cint(1000), 'O')
BenchmarkTools.Trial:
memory estimate: 1.47 KiB
allocs estimate: 64
--------------
minimum time: 1.384 ms (0.00% GC)
median time: 2.278 ms (0.00% GC)
mean time: 4.155 ms (0.00% GC)
maximum time: 328.102 ms (0.00% GC)
--------------
samples: 1203
evals/sample: 1
(Note that these wrappers are a little rough, and would benefit from a clean-up / higher-level functions.)
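For context on what that call computes: as far as I understand the cuSOLVER docs, csreigvsi uses a shift-inverse method, so the arguments in the call above read as the shift, the initial vector, the tolerance, the maximum number of iterations, and the index base. Below is a minimal CPU-only sketch of that technique in plain Julia, to illustrate the idea rather than what cuSOLVER does internally. The function name `shift_inverse` and its keyword arguments are made up for this example and are not part of CUDA.jl.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical helper (not a CUDA.jl function): shift-inverse iteration.
# Converges towards the eigenpair of A whose eigenvalue is closest to μ0.
function shift_inverse(A::SparseMatrixCSC, μ0, x0; tol=1e-6, maxiter=1000)
    F = lu(A - μ0 * I)            # factorize the shifted matrix once
    x = x0 / norm(x0)
    μ = μ0
    for _ in 1:maxiter
        y = F \ x                 # one inverse-iteration step
        x = y / norm(y)
        μ = dot(x, A * x)         # Rayleigh-quotient estimate of the eigenvalue
        norm(A * x - μ * x) < tol && break
    end
    return μ, x
end

# Small example: the eigenvalues are the diagonal entries.
A = spdiagm(0 => [1.0, 2.0, 5.0, 10.0])
μ, x = shift_inverse(A, 4.6, ones(4))
```

With a shift of 4.6 this should settle on the eigenvalue nearest to it. Note that the benchmark above passes `rand(T)` as the shift, so each sample may converge to a different eigenpair (and in a different number of iterations), which could be part of the spread in the timings.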