NDReducibles.jl: multi-dimensional array computations without indexing (a proof-of-concept)

(This is kind of a follow-up discussion of my JuliaCon talk. So the context probably is not super clear if you haven’t watched it.)

Writing Transducers.jl (see also the latest release announcement) made me think that fold is a great approach for writing a large class of loops. So, it naturally made me want to try stretching the boundary of fold-based computation. That’s why I tried to tackle Julep: solving tricky iteration problems. You can read the Julep in GitHub (I also posted a different version of this post there) but a short summary is: if/how we can support multi-dimensional iteration like the one in matrix-matrix multiplication in the iterator API. As you have multiple arrays with different set of indices, this is a hairy problem to solve. You have to find an index that corresponds to the fastest axis for all arrays with that index. Once you find the best or good enough order of indices to iterate, another problem is how to execute such loops.

Actually, it’s even more challenging for fold because these kinds of problem often require mutating multi-dimensional arrays. Fold is really good at reduction, like summing things up. But how can we use it for mutation? It turned out there is a simple trick we can use. The idea is to create an array of references to the elements in the original array to be mutated.

julia> ys = zeros(3);

julia> ys_ref = [Ref(ys, i) for i in eachindex(ys)];

julia> y1 = ys_ref[1];
       y1[] = 123

julia> ys
3-element Array{Float64,1}:

julia> foreach(zip(ys_ref, ones(3))) do (y, x)
           y[] = x

Of course, above code allocates a new array ys_ref and it’s very inefficient to do so. But it can be fixed by a very simple lazy array type. So I created a package called Referenceables.jl to do it. (You can install it by pkg> add Referenceables.)

Using this trick, we “only” need a custom fold that runs over the indices. It is not so difficult to automatically schedule the loop in such a way that the inner-most loop uses the fastest axis of the all arrays involved in the loop (if such choice exists). Putting these things together, I came up with the following API and implemented in NDReducibles.jl:

using NDReducibles: ndreducible
using Referenceables: referenceable

        referenceable(C) => (:i, :j),  # mark that we want to mutate C
        A => (:i, :k),
        B => (:k, :j),
) do (c, a, b)
    c[] += a * b

foreach is defined in terms of foldl [1] so above code internally calls foldl for ndreducible. This foldl does the aforementioned loop scheduling.

I find this API very interesting (even though I wrote it). There is so little foreah users can do: specifying the indices and loop body. Everything else, including even indexing (getindex), is automatically handled (hence the title). It gives a huge opportunity for fold implementers. They can implement various optimizations. Re-ordering the loop for improving locality is a pretty simple one. But it is also in principle possible to do automatic parallelization, cache oblivious access pattern, etc.

NDReducibles.jl is not registered yet, as at this point I don’t think it has a big advantage over already established packages like TensorOperations.jl. To make it more practical, I think something like BroadcastStyle is required for flexibly negotiating indexing order when, e.g., some non-dense arrays are involved.

[1] foldl is formally defined as left-to-right application of a binary function so it is not very clean to use it for shuffling loop order. But since Transducers.jl’s machinery is build around foldl, it’s very handy to use it here. For more serious implementations, it is probably a better idea to have a separate “commutative fold” function.


What’s the advantages of using Referencables versus @view, for example?

referenceable(x) is not view(x, :, ..., :) because the element type of referenceable(x) is <: Ref{eltype(x)} while the element type of view(x, :, ..., :) is eltype(x). You could use a view instead of Ref but I want to make it as simple as possible and also keep inbounds information in the type parameter.

1 Like

Please elaborate, that sounds very interesting but I still don’t completely understand why it’s better for you.

referenceable is nice because I can combine it with, say, foreach and zip to implement copy!:

ys = zeros(3)
foreach(zip(referenceable(ys), ones(3))) do (y, x)
    y[] = x

I can’t use the same trick with view:

ys = zeros(3)
foreach(zip(view(ys), ones(3))) do (y, x)
    y :: Float64
    x :: Float64
    # There's nothing I can do here.

Note that zip and foreach do not have to know which array is mutable or not. They can just focus on the shape of the array and how to iterate over the items.