One reason that some combined operations get their own functions is that they can be more efficient than making slices. You can see this with your batched_mul
example, where the fused one saves allocations, and (IIRC) will also parallelise better on larger cases, and call special CUDA routines.
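(For reference, the benchmarks below use the A and B from your example. Any batch of small matrices will do; something like this, with sizes that are purely my guess, makes the snippets runnable:)
julia> A = randn(4, 4, 3); B = randn(4, 4, 3); # batch of 3 small matrices; sizes assumed, not from the original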
julia> using NNlib, BenchmarkTools
julia> C = @btime $A ⊠ $B; # special batched matrix multiplication
309.363 ns (1 allocation: 400 bytes) # fused, 1 allocation of final Array
julia> using JJ
julia> C ≈ @btime rank"2 * 2"($A, $B) # just rank the standard one!
803.831 ns (7 allocations: 768 bytes) # allocates slices, returns lazy JuliennedArrays.Align
true
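To see what batched_mul fuses away: it behaves roughly like the loop below, which makes one up-front allocation and writes each product in place. This is just a sketch of the idea, not NNlib's actual implementation:
julia> using LinearAlgebra
julia> function batched_mul_naive(A, B)
           C = similar(A, size(A, 1), size(B, 2), size(A, 3)) # one allocation, for the result
           for k in axes(A, 3)
               @views mul!(C[:, :, k], A[:, :, k], B[:, :, k]) # in-place product, no slice copies
           end
           C
       end;
julia> C ≈ batched_mul_naive(A, B)
true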
Or a simpler example:
julia> M = transpose(rand(10^3, 10^3));
julia> V = @btime vec(sum($M, dims=1));
158.417 μs (3 allocations: 8.02 KiB)
julia> V ≈ @btime rank"sum 1"($M) # this is less cache-friendly than sum
908.500 μs (1 allocation: 7.94 KiB)
true
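The gap here is memory order, not the macro: because of the transpose, each column of M is a strided view into the parent matrix, so summing one slice at a time jumps through memory, while Base's sum can accumulate all 10^3 results in a single cache-friendly sweep. The same slow pattern, without any packages:
julia> V ≈ [sum(c) for c in eachcol(M)] # one strided slice at a time, roughly what rank"sum 1" does
true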
You might like this earlier vmap discussion about trying to make such transformations automatically.
Such concerns aside, here are some other ways to handle slices, besides SplitApplyCombine & JuliennedArrays (mentioned above). They are certainly more verbose than rank:
julia> C ≈ @btime stack(*, eachslice($A, dims=3), eachslice($B, dims=3)) # should work with PR43334
872.121 ns (14 allocations: 1.27 KiB) # allocates slices, returns Array
true
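(Without stack, the same map-over-slices call just gives a Vector of matrices, which you can glue back together by hand; a sketch:)
julia> slices = map(*, eachslice(A; dims=3), eachslice(B; dims=3)); # Vector of matrices
julia> C ≈ cat(slices...; dims=3) # reassemble along the 3rd dimension; stack does this in one step
true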
julia> using TensorCast # (my package)
julia> C ≈ @cast C2[i,j,n] := (A[:,:,n] * B[:,:,n])[i,j]
true
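Under the hood, that @cast line is roughly sugar for slicing, multiplying each pair, and gluing the results back together; a hand-written sketch of the equivalent, not the exact macro expansion:
julia> C3 = cat((@views(A[:, :, n] * B[:, :, n]) for n in axes(A, 3))...; dims=3);
julia> C ≈ C3
true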