I have a 3-dimensional tensor, where each “slice” of the tensor is a matrix. I want to multiply each slice by another matrix and store the result in a 3D tensor/array.
How do I do that in the most efficient way using the GPU?
E.g. I have
using CUDA
CUDA.allowscalar(false)
tensor = rand(4, 4, 1000) |> cu
matrix = rand(4,4) |> cu
result = mapslices(slice -> slice * matrix, tensor, dims=(1,2))
This fails due to scalar indexing not allowed.
Do I need to write a kernel myself?
No, you need ⊠:
julia> using NNlib, NNlibCUDA
julia> result ≈ batched_mul(tensor, matrix)
true
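For the original orientation (each slice multiplied on the right by `matrix`), a minimal end-to-end sketch could look like the following; it assumes, as in the answer above, that `batched_mul` broadcasts a plain matrix across the batch dimension, with ⊠ as its infix form:

using CUDA, NNlib, NNlibCUDA
CUDA.allowscalar(false)

tensor = cu(rand(Float32, 4, 4, 1000))   # stack of 1000 4×4 slices
matrix = cu(rand(Float32, 4, 4))

# batched_mul treats `matrix` as a batch of size 1 and broadcasts it,
# so every slice is multiplied on the right in a single call:
result = tensor ⊠ matrix                 # same as batched_mul(tensor, matrix)
@assert size(result) == (4, 4, 1000)

On the GPU this should hit one batched GEMM rather than a loop over slices, which is why it needs no hand-written kernel.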
Note that if you wanted it the other way around (matrix * slice instead of slice * matrix), you could also do it by reshaping:
julia> using TensorCore
julia> @btime CUDA.@sync $matrix ⊡ $tensor;
  33.002 μs (34 allocations: 784 bytes)
julia> @btime CUDA.@sync $matrix ⊠ $tensor;
  27.840 μs (12 allocations: 368 bytes)
julia> matrix ⊠ tensor ≈ matrix ⊡ tensor ≈ mapslices(slice-> matrix * slice, tensor, dims=(1,2))
true  # on CPU
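For reference, the reshape trick that ⊡ essentially relies on in this left-multiply case can also be written out by hand; the sketch below fuses all slices into one ordinary matmul (the names m, n, b, flat, result_left are just for illustration), and it only works when the shared matrix is on the left:

# Left-multiplying every slice by the same matrix is a single matmul
# after merging the slice and batch dimensions:
m, n, b = size(tensor)                      # here (4, 4, 1000)
flat = reshape(tensor, m, n * b)            # 4 × 4000
out = matrix * flat                         # one 4×4000 matmul on the GPU
result_left = reshape(out, size(matrix, 1), n, b)   # back to 4 × 4 × 1000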