Batched Matrix solve in CUDA.jl

Is there a way to perform batched matrix inversion/ldiv in CUDA.jl, i.e.:

Yi = Ri \ Xi

where Ri is NxN, and Yi and Xi are NxM, with N and M small (~12). I need to solve a large number of these systems, e.g. 1 ≤ i ≤ 50k. Obviously, making one GPU call per system in a loop will be highly inefficient for such small N and M, and CUDA provides several relevant batched routines, e.g. cusolverDn<t>potrsBatched() or cublas<t>getrsBatched().

I don’t believe the current CUDA.jl implementations of inv or ldiv dispatch to these batched routines. Is there a workaround, or a roadmap for implementing this in CUDA.jl?
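For concreteness, a minimal (hypothetical) setup of what I am doing now, with the one-solve-per-system loop I would like to avoid:

    using CUDA, LinearAlgebra

    N, M, nsys = 12, 12, 50_000
    # Illustrative data only: adding N*I keeps the random matrices well-conditioned
    Rs = [CuArray(rand(Float32, N, N) + N * I) for _ in 1:nsys]
    Xs = [CuArray(rand(Float32, N, M)) for _ in 1:nsys]

    # One tiny GPU solve per system: tens of thousands of separate kernel
    # launches, so launch overhead dominates and the GPU sits mostly idle
    Ys = [R \ X for (R, X) in zip(Rs, Xs)]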


Is https://github.com/JuliaGPU/CUDA.jl/blob/master/test/cublas.jl#L1646 what you are looking for?
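Those tests exercise the batched CUBLAS wrappers directly. Here is an untested sketch of how they could be combined for the Yi = Ri \ Xi problem above — the exact wrapper names and return values can differ between CUDA.jl versions, and pivoting is disabled here, so this assumes the Ri are safe to factorize without pivoting (e.g. diagonally dominant):

    using CUDA, LinearAlgebra
    using CUDA.CUBLAS

    N, M, nsys = 12, 12, 50_000
    Rs = [CuArray(rand(Float32, N, N) + N * I) for _ in 1:nsys]  # diagonally dominant test data
    Ys = [CuArray(rand(Float32, N, M)) for _ in 1:nsys]          # holds Xi; overwritten with Yi

    # Batched LU without pivoting: each Rs[i] is overwritten by its L and U factors
    CUBLAS.getrf_batched!(Rs, false)

    # Yi = Ui \ (Li \ Xi) via two batched triangular solves (L has a unit diagonal)
    CUBLAS.trsm_batched!('L', 'L', 'N', 'U', 1f0, Rs, Ys)
    CUBLAS.trsm_batched!('L', 'U', 'N', 'N', 1f0, Rs, Ys)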

Take a look at this discussion: Accelerate solving many matrix problems - #9 by clinton

Sorry to bring up an old post, but I came across this thread while trying to solve a similar problem.

I wanted to post a quick update on that solution, because the line:

    cuuplo = CUDA.CUBLAS.cublasfill('U')

in the hyperreg() function threw an UndefVarError (“not defined”) for me. I replaced it with:

    cuuplo = CUDA.CUBLAS.CUBLAS_FILL_MODE_UPPER

to get it working again.
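For what it’s worth, I believe recent CUDA.jl versions also expose high-level wrappers around the batched Cholesky routines that accept a plain 'U'/'L' character, so the fill-mode enum never has to be constructed by hand. An untested sketch, assuming SPD systems (note that the underlying cusolverDn<t>potrsBatched() routine only supports a single right-hand side per system):

    using CUDA, LinearAlgebra
    using CUDA.CUSOLVER

    N, nsys = 12, 50_000
    As = map(1:nsys) do _
        R = rand(Float32, N, N)
        CuArray(R * R' + N * I)  # symmetric positive definite test matrix
    end
    bs = [CuArray(rand(Float32, N)) for _ in 1:nsys]  # one RHS vector per system

    CUSOLVER.potrfBatched!('U', As)      # batched Cholesky factorization, in place
    CUSOLVER.potrsBatched!('U', As, bs)  # batched solve; overwrites bs with the solutions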

I also wanted to ask: is this still the go-to solution for batched matrix solves on the GPU?