ANN: JJ.jl -- library for J-like verb rank in Julia

Recently, I started to learn some APL and J. In particular, I really liked the idea of verb rank. This package JJ.jl attempts to enable something similar in Julia.

In a nutshell, verb rank is related to the split-apply-combine strategy and allows a function (called a verb in APL/J) to be applied to sub-arrays by specifying the desired number of dimensions the function should act on. As an example, consider a 2×3×4 array

arr = reshape(1:24, 2, 3, 4)

and a function providing some information about a value:

info(x) = "Some scalar $x"
info(x::AbstractArray) = "Array of shape $(size(x))"

Then, we can obtain information about each (scalar) element of an array via broadcasting, i.e.,

julia> info.(arr)
2×3×4 Array{String, 3}:
[:, :, 1] =
 "Some scalar 1"  "Some scalar 3"  "Some scalar 5"
 "Some scalar 2"  "Some scalar 4"  "Some scalar 6"

[:, :, 2] =
 "Some scalar 7"  "Some scalar 9"   "Some scalar 11"
 "Some scalar 8"  "Some scalar 10"  "Some scalar 12"

[:, :, 3] =
 "Some scalar 13"  "Some scalar 15"  "Some scalar 17"
 "Some scalar 14"  "Some scalar 16"  "Some scalar 18"

[:, :, 4] =
 "Some scalar 19"  "Some scalar 21"  "Some scalar 23"
 "Some scalar 20"  "Some scalar 22"  "Some scalar 24"

Alternatively, the info function can be “ranked” and applied to sub-arrays in a systematic fashion:

  • rank"info 0"(arr) is the same as info.(arr) and applies the function to each scalar element of rank 0.

  • rank"info 2"(arr) now applies the function to all sub-arrays of rank 2, i.e., with ndims = 2 in Julia

julia> rank"info 2"(arr)
4-element Vector{String}:
 "Array of shape (2, 3)"
 "Array of shape (2, 3)"
 "Array of shape (2, 3)"
 "Array of shape (2, 3)"

Using this technique, many algorithms on arrays can be expressed quite easily. As a teaser, batched matrix multiplication is straightforward with this approach:

using Flux

A = randn(2, 3, 5)  # a batch of 2x3 matrices
B = randn(3, 4, 5)  # another batch

A ⊠ B  # special batched matrix multiplication

rank"2 * 2"(A, B)  # just rank the standard one!

As slightly bigger examples, I have implemented K-means clustering and a very basic Transformer layer.

The library tries to be fast (building on JuliennedArrays.jl), but it has some overhead, and the syntax could still be improved. I hope you find it interesting or useful nonetheless!


Cool idea! I’ve only dabbled in J, but used kdb+ for many years and often miss the ease with which one could express complex array operations.

This is an interesting idea. Is there a strong reason for this to be a macro? Why not spell it rank(*, 2, 2, A, B)? (Also, I'm not convinced by the name rank.)


No strong reason, and there is indeed a function version, i.e., ranked(*, Val(2), Val(2))(A, B) would also work. It is maybe a little less nice due to the Val types.
The reason that rank"2 * 2" constructs a callable struct is that the higher-order functions insert and table can dispatch on it to respect the rank of their function argument as well:

A = reshape(1:6, 2, 3)
B = reshape(1:12, 3, 4)
dot(x, y) = sum(x .* y)
# Now, construct a table, i.e., outer product, taking the dot product between each row of A and column of B
table(rank"1 dot 1", A', B)  # here table needs to know the rank of its argument function

Currently, all names are based on the ones from J, i.e., rank, insert and table. I don't have a strong preference for these either. What would you call rank?

The examples in the first post can easily be done with existing packages, and (imho) in a cleaner way than with specialized macros.

julia> using SplitApplyCombine

julia> info.(splitdims(arr, 3))
4-element Vector{String}:
 "Array of shape (2, 3)"
 "Array of shape (2, 3)"
 "Array of shape (2, 3)"
 "Array of shape (2, 3)"

julia> splitdims(A, 3) .* splitdims(B, 3)
5-element Vector{Matrix{Float64}}:
...

I have never used J or APL, but it seems like Julia already has all the necessary facilities to make such operations easy.
It also follows the principle of orthogonal design: splitting arrays by dimension(s) is separated from applying a function.

The SplitApplyCombine solution also makes it easy to apply a function over another dimension, or over multiple dims:

julia> info.(splitdims(arr, 2))
3-element Vector{String}:
 "Array of shape (2, 4)"
 "Array of shape (2, 4)"
 "Array of shape (2, 4)"

julia> info.(splitdims(arr, (1, 3)))
2×4 Matrix{String}:
 "Array of shape (3,)"  "Array of shape (3,)"  "Array of shape (3,)"  "Array of shape (3,)"
 "Array of shape (3,)"  "Array of shape (3,)"  "Array of shape (3,)"  "Array of shape (3,)"

Right, SplitApplyCombine as well as JuliennedArrays provide similar functionality, and my implementation is actually pretty much as you suggest (using JuliennedArrays). The analog for the batched multiplication would be combinedims(splitdims(A, 3) .* splitdims(B, 3)), though.

Overall, I view these libraries as slightly lower level (and more general, for that matter). The verb rank consistently splits off the starting dimensions of an array. To me, this consistency appears advantageous in larger examples, and things get more interesting with functions of two arguments, i.e., using different left and right ranks (see the small sketch below), and with higher-order functions respecting these ranks as well (currently only insert and table are implemented here). Also, combining the results automatically again makes successive operations on arrays more composable. But feel free to disagree if you prefer a more explicit approach.
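As a small sketch of different left and right ranks (the exact call is my illustration; shapes follow the leading-dimension splitting rule):

M = reshape(1.0:6.0, 2, 3)  # rank-1 cells: the three length-2 columns
s = [1.0, 2.0, 4.0]         # rank-0 cells: one scalar per column
rank"1 / 0"(M, s)           # pairs each column with its scalar, like M ./ s'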

but there are things which are tricky in Julia, for example treating “arrays of arrays as arrays” in broadcast isn't straightforward, and ranked verbs can help exactly with that.

I'm not sure if/how it would help with nested arrays. Instead, it allows non-nested arrays to be treated as if they were nested, on an as-needed basis. Results are then recombined into a non-nested array, so that you never need to work with nested arrays explicitly.

One reason that some combined operations are their own functions is that these can be more efficient than making slices. You can see this with your batched_mul example, where the fused one saves allocations, and (IIRC) will also parallelise better on larger cases, and call special CUDA routines.

julia> using NNlib, BenchmarkTools

julia> C = @btime $A ⊠ $B;  # special batched matrix multiplication
  309.363 ns (1 allocation: 400 bytes)  # fused, 1 allocation of final Array

julia> using JJ

julia> C ≈ @btime rank"2 * 2"($A, $B)  # just rank the standard one!
  803.831 ns (7 allocations: 768 bytes)  # allocates slices, returns lazy JuliennedArrays.Align
true

Or a simpler example:

julia> M = transpose(rand(10^3, 10^3));

julia> V = @btime vec(sum($M, dims=1));
  158.417 μs (3 allocations: 8.02 KiB)

julia> V ≈ @btime rank"sum 1"($M)  # this is less cache-friendly than sum
  908.500 μs (1 allocation: 7.94 KiB)
true
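The access pattern becomes clearer when the ranked sum is spelled out: it makes one pass per column of M, i.e., per row of the parent array, striding through memory (this loop illustrates the pattern, it is not JJ's actual implementation):

[sum(@view M[:, j]) for j in axes(M, 2)]  # each slice walks the parent with stride 10^3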

You might like this earlier vmap discussion about trying to make such transformations automatically.

Such concerns aside, some other ways to handle slices besides SplitApplyCombine & JuliennedArrays (mentioned above) include the following. They are certainly more verbose than rank:

julia> C ≈ @btime stack(*, eachslice($A, dims=3), eachslice($B, dims=3))  # should work with PR43334
  872.121 ns (14 allocations: 1.27 KiB)  # allocates slices, returns Array
true

julia> using TensorCast  # (my package)

julia> C ≈ @cast C2[i,j,n] := (A[:,:,n] * B[:,:,n])[i,j]
true

Many thanks for this detailed analysis. I had seen the vmap discussion before, and it appears that many optimizations require a more global analysis of the computation, i.e., when to push loops down into sub-functions and when to copy data in order to enable cache-friendly processing. Your second example is especially interesting, as I see no difference when M is not transposed. Indeed, there is a special method on Base.mapreducedim! which seems to be hit by sum, as far as I can tell from reading the code (is there any tool that would show the actually executed methods?).
Overall, I guess rank is most useful if you have a natural and consistent ordering of your dimensions and build up functions following that order, e.g., as in a transformer with dimensions embedding × seq_position × head × batch and operations acting on all batches, within all heads, or at each sequence position. When more elaborate reshapings are required, other options are probably more useful. I will certainly have a look at TensorCast (I had found Tullio, but hadn't seen yours before).
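PS: one partial answer to my own tooling question above: InteractiveUtils.@which shows the method that dispatch selects at the top level, though not the internal call chain (for stepping into callees, something like Cthulhu.jl's @descend should work, but I haven't tried it on this example):

using InteractiveUtils  # loaded by default in the REPL
@which sum(M, dims=1)   # prints which sum method handles this call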