How to broadcast or batch multiply a "batch" of matrices with another matrix on the GPU?

I have a 3-dimensional tensor, where each "slice" of the tensor is a matrix. I want to multiply each slice by another matrix and store the result in a 3D tensor/array.

How do I do that in the most efficient way using the GPU?

E.g. I have

using CUDA
CUDA.allowscalar(false)

tensor = rand(4, 4, 1000) |> cu
matrix = rand(4,4) |> cu

result = mapslices(slice -> slice * matrix, tensor, dims=(1, 2))

This fails because scalar indexing is not allowed on GPU arrays.

Do I need to write a kernel myself?

No, you need:

julia> using NNlib, NNlibCUDA

julia> result ≈ batched_mul(tensor, matrix)
true
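For reference, the equivalence is easy to check on the CPU, where `mapslices` does work. A minimal sketch (NNlib alone is enough here, no GPU required):

```julia
using NNlib

# batched_mul with a 3-array and a plain matrix multiplies
# every slice of the 3-array by that matrix.
tensor = rand(4, 4, 1000)
matrix = rand(4, 4)

expected = mapslices(slice -> slice * matrix, tensor, dims=(1, 2))
@assert batched_mul(tensor, matrix) ≈ expected
```

The same call then runs on the GPU once `tensor` and `matrix` are moved over with `cu`, dispatching to batched CUBLAS routines instead of scalar indexing.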

Note that if you wanted the other way around (`matrix * slice`), then you could also do it by reshaping:

julia> using TensorCore

julia> @btime CUDA.@sync $matrix ⊡ $tensor;
  33.002 μs (34 allocations: 784 bytes)

julia> @btime CUDA.@sync $matrix ⊠ $tensor;
  27.840 μs (12 allocations: 368 bytes)

julia> matrix ⊠ tensor ≈ matrix ⊡ tensor ≈ mapslices(slice-> matrix * slice, tensor, dims=(1,2))
true  # on CPU
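The reshape trick behind `⊡` can also be written out by hand: multiplying `matrix * slice` for every slice is one ordinary matrix multiplication once the trailing batch dimension is flattened into extra columns. A sketch on the CPU, reusing the names above:

```julia
tensor = rand(4, 4, 1000)
matrix = rand(4, 4)

# Flatten the batch dimension: 4×4×1000 becomes 4×4000,
# so matrix * flat computes matrix * slice for all slices at once.
flat   = reshape(tensor, size(tensor, 1), :)
result = reshape(matrix * flat, size(matrix, 1), size(tensor, 2), size(tensor, 3))

@assert result ≈ mapslices(slice -> matrix * slice, tensor, dims=(1, 2))
```

Note this only works for left multiplication (`matrix * slice`); the columns of each slice stay contiguous under the reshape. For the original orientation (`slice * matrix`) there is no such single reshape, which is why `batched_mul` is the tool there.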