Is there a way to perform batched matrix inversion/ldiv in CUDA.jl, i.e.:
Yi = Ri \ Xi
where Ri is NxN, and Yi and Xi are NxM, with N and M small (~12). I need to solve a large number of these systems, e.g. i = 1, …, 50k. Obviously making a separate GPU call per system in a loop will be highly inefficient for small N and M, and CUDA provides several relevant batched routines, e.g. cusolverDn<t>potrsBatched() or cublas<t>getrsBatched().
I don’t believe the current CUDA.jl implementations of inv or ldiv (\) dispatch to these batched calls. Is there a workaround, or a roadmap for implementing this in CUDA.jl?
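For context, here is the kind of workaround I have been considering: CUDA.jl's low-level CUBLAS wrapper layer exposes some batched routines (the exact names and signatures below are my assumption from the wrapper layer and may differ across CUDA.jl versions, so treat this as a sketch rather than working code). The idea is to batch the Ri and Xi as vectors of small CuMatrix, LU-factorize and invert all Ri in one call, then apply a batched GEMM:

```julia
using CUDA, LinearAlgebra

N, M, B = 12, 12, 50_000

# Batch as Vector{CuMatrix}, the layout the batched wrappers expect.
# Shift the diagonal so each Ri is comfortably invertible in this demo.
Rs = [CUDA.rand(Float32, N, N) + 2f0 * N * CuArray{Float32}(I, N, N) for _ in 1:B]
Xs = [CUDA.rand(Float32, N, M) for _ in 1:B]

# Hypothetical wrapper calls (check your CUDA.jl version for the real API):
# one batched LU factorization, one batched inversion, one batched multiply.
pivots, _ = CUDA.CUBLAS.getrf_batched!(Rs, true)        # LU of every Ri in one launch
Rinvs = CUDA.CUBLAS.getri_batched(Rs, pivots)           # inv(Ri) for every i
Ys = CUDA.CUBLAS.gemm_batched('N', 'N', Rinvs, Xs)      # Yi = inv(Ri) * Xi
```

Explicit inversion is less numerically stable than a triangular solve, so a getrs-style batched solve would be preferable if the wrapper exists; for N ≈ 12 it may be acceptable, but I would rather have a proper batched ldiv.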