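The strided_batched methods (the name follows NVIDIA's cuBLAS terminology) are the ones that accept multi-dimensional arrays, with the batch stored along the last dimension. They are supposed to be faster than the variants that take vectors of GPU arrays, since each matrix sits at a fixed offset from the previous one and no separate array of pointers has to be built and uploaded: https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/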
Other than that, I’m not terribly familiar with the use or design of batched APIs, so help is always appreciated. There is some existing work though, like batched_mul! in NNlib.jl, Batched.jl, BatchedBLAS.jl, etc.
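To make the difference between the two layouts concrete, here is a small CPU-only sketch using nothing but Base and LinearAlgebra. It does not call CUDA.jl or cuBLAS at all. The sizes and variable names are made up for illustration, and the plain loop is only a reference for what a batched multiply computes, not how any of the packages above implement it.

```julia
using LinearAlgebra

# Hypothetical sizes, chosen only for illustration.
m, k, n, batch = 2, 3, 4, 5

# Strided layout: one contiguous 3-D array per operand,
# with the batch stored along the last dimension.
A = rand(Float32, m, k, batch)
B = rand(Float32, k, n, batch)
C = zeros(Float32, m, n, batch)

# Matrix i starts a fixed number of elements after matrix i-1.
# For a dense Array that distance is m * k.
@show stride(A, 3)

# Reference loop over the batch: multiply one slice at a time, in place.
for i in 1:batch
    mul!(view(C, :, :, i), view(A, :, :, i), view(B, :, :, i))
end

# "Vector of arrays" layout: the same data as separately allocated matrices.
As = [A[:, :, i] for i in 1:batch]
Bs = [B[:, :, i] for i in 1:batch]
Cs = [As[i] * Bs[i] for i in 1:batch]

# Both layouts produce the same products.
@assert all(C[:, :, i] ≈ Cs[i] for i in 1:batch)
```

In the strided case the whole batch is described by three base arrays plus one stride each, whereas the vector-of-arrays case needs one pointer per matrix. As far as I understand it, that per-matrix bookkeeping is the overhead the strided variant avoids.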