(This is kind of a follow-up discussion of my JuliaCon talk. So the context probably is not super clear if you haven’t watched it.)
Writing Transducers.jl (see also the latest release announcement) made me think that fold is a great approach for writing a large class of loops. So, it naturally made me want to try stretching the boundary of fold-based computation. That’s why I tried to tackle Julep: solving tricky iteration problems. You can read the Julep on GitHub (I also posted a different version of this post there) but a short summary is: can we, and if so how do we, support multi-dimensional iteration like the one in matrix-matrix multiplication in the iterator API? As you have multiple arrays with different sets of indices, this is a hairy problem to solve. You have to find an index that corresponds to the fastest axis of every array using that index. Once you find the best (or good enough) order of indices to iterate, another problem is how to execute such loops.
Actually, it’s even more challenging for fold because these kinds of problems often require mutating multi-dimensional arrays. Fold is really good at reduction, like summing things up. But how can we use it for mutation? It turns out there is a simple trick we can use. The idea is to create an array of references to the elements of the original array to be mutated.
julia> ys = zeros(3);
julia> ys_ref = [Ref(ys, i) for i in eachindex(ys)];
julia> y1 = ys_ref[1];
julia> y1[] = 123
123
julia> ys
3-element Array{Float64,1}:
123.0
0.0
0.0
julia> foreach(zip(ys_ref, ones(3))) do (y, x)
           y[] = x
       end
Of course, the above code allocates a new array ys_ref, which is very inefficient. But this can be fixed by a very simple lazy array type, so I created a package called Referenceables.jl to do exactly that. (You can install it with pkg> add Referenceables.)
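To illustrate the idea, here is a minimal sketch of such a lazy wrapper (this is not Referenceables.jl's actual implementation; the type name is made up for illustration):

```julia
# A lazy "referenceable" array: indexing it returns a Ref into the parent
# array instead of the element itself, so no intermediate array of
# references has to be allocated.
struct Referenceable{T,N,A<:AbstractArray{T,N}} <: AbstractArray{Base.RefArray{T,A,Nothing},N}
    parent::A
end

Base.size(r::Referenceable) = size(r.parent)
Base.IndexStyle(::Type{<:Referenceable}) = IndexLinear()
Base.getindex(r::Referenceable, i::Int) = Ref(r.parent, i)  # a Ref, created on demand

ys = zeros(3)
for (y, x) in zip(Referenceable(ys), ones(3))
    y[] = x  # writes through to `ys`; nothing else is allocated
end
```

After the loop, every element of ys is 1.0, just like in the eager ys_ref version above.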
Using this trick, we “only” need a custom fold that runs over the indices. It is not so difficult to automatically schedule the loop in such a way that the inner-most loop uses the fastest axis of all the arrays involved in the loop (if such a choice exists). Putting these pieces together, I came up with the following API and implemented it in NDReducibles.jl:
using NDReducibles: ndreducible
using Referenceables: referenceable
foreach(
    ndreducible(
        referenceable(C) => (:i, :j),  # mark that we want to mutate C
        A => (:i, :k),
        B => (:k, :j),
    ),
) do (c, a, b)
    c[] += a * b
end
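For dense column-major arrays, this effectively runs the familiar j-k-i loop order. The following hand-written equivalent is my sketch of what the scheduled fold does (it is not code taken from NDReducibles.jl, and I am assuming :i ends up as the inner-most index):

```julia
# Hand-written equivalent of the scheduled loop for dense column-major
# arrays: i is the first (fastest) axis of both C and A, so it goes innermost.
function matmul_jki!(C, A, B)
    for j in axes(B, 2), k in axes(A, 2), i in axes(A, 1)
        C[i, j] += A[i, k] * B[k, j]
    end
    return C
end

A = rand(3, 4); B = rand(4, 2); C = zeros(3, 2)
matmul_jki!(C, A, B)
C ≈ A * B  # true
```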
foreach is defined in terms of foldl [1], so the above code internally calls foldl on the ndreducible. This foldl does the aforementioned loop scheduling.
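The scheduling heuristic can be sketched in a few lines. This toy function is my illustration of the idea, not NDReducibles.jl's actual algorithm, and its name is made up:

```julia
# Toy sketch of loop scheduling: pick as the inner-most loop index one that
# is the first (fastest, in column-major order) axis of every array using it.
function innermost_index(specs::Pair...)
    # specs are `array => (index names...)` pairs, as in the API above
    candidates = union((idxs for (_, idxs) in specs)...)
    for i in candidates
        # `i` qualifies if every array mentioning it has it on its first axis
        if all(first(idxs) == i for (_, idxs) in specs if i in idxs)
            return i
        end
    end
    return nothing  # no index is fastest for all arrays that use it
end

# :i is the first axis of both C and A, and B does not use it:
innermost_index(:C => (:i, :j), :A => (:i, :k), :B => (:k, :j))  # returns :i
```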
I find this API very interesting (even though I wrote it myself). There is very little the foreach user can do: specify the indices and the loop body. Everything else, even indexing (getindex), is handled automatically (hence the title). This gives fold implementers a huge opportunity to implement various optimizations. Re-ordering the loop to improve locality is a pretty simple one, but in principle automatic parallelization, cache-oblivious access patterns, etc. are also possible.
NDReducibles.jl is not registered yet, as at this point I don’t think it has a big advantage over already established packages like TensorOperations.jl. To make it more practical, I think something like BroadcastStyle is required for flexibly negotiating the indexing order when, e.g., some non-dense arrays are involved.
[1] foldl is formally defined as the left-to-right application of a binary function, so it is not very clean to use it for shuffling loop order. But since Transducers.jl’s machinery is built around foldl, it is very handy to use here. For a more serious implementation, it is probably a better idea to have a separate “commutative fold” function.