No, you need ⊠:
julia> using NNlib, NNlibCUDA
julia> result ≈ batched_mul(tensor, matrix)
true
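(Here `tensor`, `matrix`, and `result` are the arrays from the question. For a self-contained check, a setup along these lines works; the sizes are just an assumption, and the loop stands in for however `result` was originally computed:)

using CUDA, NNlib, NNlibCUDA

# Hypothetical sizes, purely for illustration: 16 slices of size 10×5.
tensor = CUDA.rand(Float32, 10, 5, 16)
matrix = CUDA.rand(Float32, 5, 7)

# Slice-by-slice reference, standing in for the question's `result`:
result = similar(tensor, 10, 7, 16)
for b in axes(tensor, 3)
    result[:, :, b] = @view(tensor[:, :, b]) * matrix
end

result ≈ batched_mul(tensor, matrix)   # true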
Note that if you want it the other way around (the matrix on the left of each slice), you can also do it by reshaping, which is what TensorCore's ⊡ does (a hand-written version of that reshape is sketched below, after the benchmarks):
julia> using TensorCore
julia> @btime CUDA.@sync $matrix ⊡ $tensor;
33.002 μs (34 allocations: 784 bytes)
julia> @btime CUDA.@sync $matrix ⊠ $tensor;
27.840 μs (12 allocations: 368 bytes)
julia> matrix ⊠ tensor ≈ matrix ⊡ tensor ≈ mapslices(slice -> matrix * slice, tensor, dims=(1,2))
true # on CPU
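The reshape trick that ⊡ relies on can be written out by hand, roughly like this (a sketch; the helper name is mine, and it assumes `matrix` is M×K and `tensor` is K×N×B):

# Flatten the trailing dims of `tensor`, do one ordinary matrix multiply, reshape back.
function left_mul_by_reshape(matrix, tensor)
    K, N, B = size(tensor)
    flat = reshape(tensor, K, N * B)              # K × (N*B)
    out  = matrix * flat                          # a single GEMM
    return reshape(out, size(matrix, 1), N, B)    # M × N × B
end

left_mul_by_reshape(matrix, tensor) ≈ matrix ⊠ tensor   # true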