I’m trying to perform various operations such as multiplication, inversion, solving, and Cholesky decomposition in batches with CUDA.jl.
As far as I’m aware, there is no high-level API for this, unlike TensorFlow/PyTorch, where e.g. tf.linalg.inv natively supports batched operations.
Instead, I have been trying to use the low-level CUBLAS wrappers. I would have expected these to accept multi-dimensional arrays […, N, N], but instead most of them seem to accept vectors of CuArrays.
Here is an example using gemv_batched!.
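Roughly, the call looks like this (a minimal sketch; the sizes are arbitrary and the exact argument order of gemv_batched! is worth double-checking against the CUBLAS wrapper source):

using CUDA

# batch of 10 independent 4×4 matrices and length-4 vectors
As = [CUDA.rand(Float32, 4, 4) for _ in 1:10]
xs = [CUDA.rand(Float32, 4) for _ in 1:10]
ys = [CUDA.zeros(Float32, 4) for _ in 1:10]

# ys[i] = 1f0 * As[i] * xs[i] + 0f0 * ys[i] for every i in the batch
CUDA.CUBLAS.gemv_batched!('N', 1f0, As, xs, 0f0, ys)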
I wanted to confirm two things:
- Is this the correct way to be performing batched operations with CUDA.jl?
- Are these operations actually parallelised over the batch or just within each single operation (it’s the vector of CuArrays that is making me doubt this)?
As a follow-up question, are there any plans to implement a TensorFlow/PyTorch-style high-level interface for batched operations? If so, I would be happy to help where I can, as my research would benefit greatly from such an implementation.
1 Like
The strided_batched methods are the ones that accept multi-dimensional arrays (following NVIDIA’s naming), and they are supposed to be faster than the ones taking vectors of GPU arrays: https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
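For example, a single strided-batched GEMM can look roughly like this (a sketch, with arbitrary dimensions):

using CUDA

# A is M×K×batch, B is K×N×batch, C is M×N×batch
A = CUDA.rand(Float32, 4, 3, 16)
B = CUDA.rand(Float32, 3, 5, 16)
C = CUDA.zeros(Float32, 4, 5, 16)

# C[:, :, i] = A[:, :, i] * B[:, :, i] for every i, in one CUBLAS call
CUDA.CUBLAS.gemm_strided_batched!('N', 'N', 1f0, A, B, 0f0, C)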
Other than that, I’m not terribly familiar with the use or design of batched APIs, so help is always appreciated. There is some existing work though, like batched_mul! in NNlib.jl, Batched.jl, BatchedBLAS.jl, etc.
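For instance, NNlib’s batched_mul takes 3-d arrays and, as far as I know, dispatches to the strided-batched GEMM when given CuArrays (a sketch):

using CUDA, NNlib

A = CUDA.rand(Float32, 4, 3, 16)   # M×K×batch
B = CUDA.rand(Float32, 3, 5, 16)   # K×N×batch
C = batched_mul(A, B)              # M×N×batch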
2 Likes
We have a few low-level wrappers for batched operations in CUBLAS and CUSOLVER.
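For example, a batched Cholesky factorization can be done with the CUSOLVER wrapper roughly like this (a sketch; check the potrfBatched! docstring for the exact argument order in your CUDA.jl version):

using CUDA

# batch of 8 small symmetric positive-definite matrices
As = map(1:8) do _
    X = CUDA.rand(Float32, 4, 4)
    X * X'   # (almost surely) positive definite
end

# in-place batched Cholesky; the 'L' factors overwrite the lower triangles
CUDA.CUSOLVER.potrfBatched!('L', As)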
I confirm that these operations are parallelized over the batch.
I am reluctant to add a high-level dispatch (mul!, cholesky, etc.) because we don’t have a “batch” version for the CPU.
2 Likes
Thank you both for clarifying!
1 Like
This is a very fair point, and I can see how this could lead to confusion. How would you feel about adding a high-level interface specifically for the non-strided batch case?
I believe all of these are already supported by Julia through broadcasting, e.g.
As = [rand(2, 2) for i in 1:10]
Bs = [rand(2, 3) for i in 1:10]
As .* Bs  # broadcasts `*` over the two vectors, multiplying each pair of matrices
We could just intercept the broadcasting and replace it with a call to gemm_batched.
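Roughly what I have in mind, as a simplified sketch (a real implementation would need to integrate properly with the broadcast machinery, and I’m assuming gemm_batched! follows the usual (transA, transB, alpha, A, B, beta, C) order):

using CUDA

# Sketch: lower `As .* Bs` on vectors of CuMatrix to a single batched GEMM.
function Base.Broadcast.broadcasted(::typeof(*),
                                    As::Vector{<:CuMatrix{T}},
                                    Bs::Vector{<:CuMatrix{T}}) where {T}
    Cs = [CUDA.zeros(T, size(A, 1), size(B, 2)) for (A, B) in zip(As, Bs)]
    CUDA.CUBLAS.gemm_batched!('N', 'N', one(T), As, Bs, zero(T), Cs)
    return Cs
end

As = [CUDA.rand(Float32, 2, 2) for _ in 1:10]
Bs = [CUDA.rand(Float32, 2, 3) for _ in 1:10]
Cs = As .* Bs   # now one gemm_batched call instead of 10 separate GEMMs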
It’s an excellent idea @TimHargreaves: if we can intercept the .*, it makes sense to dispatch to gemm_batched.
We just need to check that a recent CUDA release (12.4) is installed, because before that the matrices in As and Bs needed to have the same shape.
That also explains why we never thought of it: the CUDA routine was too restrictive on As and Bs.