ANN: JJ.jl -- library for J-like verb rank in Julia

One reason some combined operations exist as their own functions is that they can be more efficient than making slices. You can see this with your batched_mul example, where the fused one saves allocations, and (IIRC) will also parallelise better on larger cases and call special CUDA routines.

julia> using NNlib, BenchmarkTools

julia> C = @btime $A ⊠ $B;  # special batched matrix multiplication
  309.363 ns (1 allocation: 400 bytes)  # fused, 1 allocation of final Array

julia> using JJ

julia> C ≈ @btime rank"2 * 2"($A, $B)  # just rank the standard one!
  803.831 ns (7 allocations: 768 bytes)  # allocates slices, returns lazy JuliennedArrays.Align
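To make the allocation difference concrete, here is a rough sketch using only Base and LinearAlgebra. It is not how NNlib or JJ implement anything: `batched_loop` is a hypothetical helper, the array sizes are made up for illustration, and `stack` assumes Julia 1.9 or later.

```julia
using LinearAlgebra

# Illustrative sizes only (not the ones benchmarked above)
A = rand(2, 3, 10)
B = rand(3, 4, 10)

# Slice-and-combine: each product allocates its own matrix, then stack copies them
C_slices = stack(map(*, eachslice(A, dims=3), eachslice(B, dims=3)))

# Fused-style loop (hypothetical helper): one output array, written in place
function batched_loop(A, B)
    T = promote_type(eltype(A), eltype(B))
    C = similar(A, T, (size(A, 1), size(B, 2), size(A, 3)))
    for n in axes(A, 3)
        # views avoid copying the slices; mul! writes into the view of C
        @views mul!(C[:, :, n], A[:, :, n], B[:, :, n])
    end
    return C
end

C_loop = batched_loop(A, B)
C_slices ≈ C_loop
```

The loop version only allocates the final array, which is the same idea as the single allocation reported for ⊠.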

Or a simpler example:

julia> M = transpose(rand(10^3, 10^3));

julia> V = @btime vec(sum($M, dims=1));
  158.417 μs (3 allocations: 8.02 KiB)

julia> V ≈ @btime rank"sum 1"($M)  # this is less cache-friendly than sum
  908.500 μs (1 allocation: 7.94 KiB)
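A Base-only sketch of the same comparison, assuming that ranking sum at 1 here amounts to summing each column separately (this is a guess at the behaviour, not JJ's actual code path):

```julia
# Lazy transpose: each column of M is a row of the column-major parent,
# so consecutive elements of a column are 1000 entries apart in memory
M = transpose(rand(10^3, 10^3))

V1 = vec(sum(M, dims=1))      # a single reduction over the whole array
V2 = map(sum, eachcol(M))     # one separate sum per strided column view

V1 ≈ V2
```

Both give the same vector; the per-column version has to walk memory with a large stride for every slice.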

You might like this earlier vmap discussion about trying to make such transformations automatically.

Such concerns aside, there are other ways to handle slices besides SplitApplyCombine & JuliennedArrays (mentioned above), such as the two below. They are certainly more verbose than rank:

julia> C ≈ @btime stack(*, eachslice($A, dims=3), eachslice($B, dims=3))  # should work with PR43334
  872.121 ns (14 allocations: 1.27 KiB)  # allocates slices, returns Array

julia> using TensorCast  # (my package)

julia> C ≈ @cast C2[i,j,n] := (A[:,:,n] * B[:,:,n])[i,j]