Handling functions that have "constant" sub-expressions when looping over arguments

Very commonly, I want to define some function mainfunc and then evaluate it many times in a loop where some of the arguments are constant during looping but others vary with each iteration.

Here is an example of what I mean.

using BenchmarkTools
using StaticArrays
using LinearAlgebra: (⋅)

mainfunc(r, a, b, c) =
    (r⋅a)*(r⋅b)*(r⋅c)/3 + (a⋅b)*(r⋅c) - (a⋅c)*(r⋅b) - (b⋅c)*(r⋅a)

fabricate_vectors(n) = [rand(SVector{3,Float64}) for _ ∈ 1:n]
const r = rand(SVector{3,Float64})
const as = fabricate_vectors(100)
const bs = fabricate_vectors(100)
const cs = fabricate_vectors(100)
const buffer = Array{Float64}(undef, (length(as), length(bs), length(cs)));

function loopfunc1!(out)
    for (k,c) ∈ enumerate(cs)
        for (j,b) ∈ enumerate(bs)
            for (i,a) ∈ enumerate(as)
                out[i,j,k] = mainfunc(r, a, b, c)
            end
        end
    end
    return out
end;

In general, all (sub)expressions within mainfunc need to be evaluated anew for every innermost loop iteration since the output of mainfunc might depend on and/or mutate global state whenever it is called. However, I am interested in the case where mainfunc is pure in that it does not do these things and its output depends only on its input values. I have tried to ensure this is the case for mainfunc, since I don’t expect “mathematical” functions such as +, *, or to have any side effects for the kinds of inputs (simple vectors of numbers) I am interested in.

Then, it seems that an efficient strategy might be as follows, at least in concept.

function loopfunc2!(out)
    for (k,c) ∈ enumerate(cs)
        f1(a,b) = mainfunc(r,a,b,c)
        for (j,b) ∈ enumerate(bs)
            f2(a) = f1(a,b)
            for (i,a) ∈ enumerate(as)
                out[i,j,k] = f2(a)
            end
        end
    end
    return out
end;

Here, I hoped that the compiler would recognize that in defining f1, sub-expressions within the wrapped mainfunc that contain only r and c become constant and don’t need to be evaluated again and again in the inner loops that follow. Thus, to save work, these sub-expressions could be evaluated just once per outermost loop right when f1 is defined. Similarly, in defining f2, only sub-expressions involving a of the (doubly) wrapped mainfunc are non-constant in the innermost loop.

Unfortunately, my attempt to get the compiler to infer what I want by using wrapper functions doesn’t make a difference.

julia> @btime loopfunc1!(buffer);
      8.531 ms (0 allocations: 0 bytes)

julia> @btime loopfunc2!(buffer);
      8.536 ms (0 allocations: 0 bytes)

Therefore, I am wondering if there is a way to get the Julia compiler to make such optimizations when they are possible.

A manual solution would be to have mainfunc take vectors of vectors as arguments, move the loops into mainfunc, and manually pre-compute the subexpressions at each loop level.

function mainfunc_explicit_loops!(out, r, as, bs, cs)
    # (r⋅a)*(r⋅b)*(r⋅c)/3 + (a⋅b)*(r⋅c) - (a⋅c)*(r⋅b) - (b⋅c)(r⋅a)
    for (k,c) ∈ enumerate(cs)
        rc = r⋅c
        rc_div_3 = rc/3
        for (j,b) ∈ enumerate(bs)
            rb = r⋅b
            bc = b⋅c
            for (i,a) ∈ enumerate(as)
                ra = r⋅a
                ab = a⋅b
                ac = a⋅c
                out[i,j,k] = ra*rb*rc_div_3 + ab*rc - ac*rb - bc*ra
            end
        end
    end
    return out
end
julia> @btime mainfunc_explicit_loops!(buffer, r, as, bs, cs);
  4.788 ms (0 allocations: 0 bytes)

We get a 2x speedup doing things by hand.

However, there are some disadvantages. For one, being able to define something like the non-explicit mainfunc is very convenient since the code looks exactly like the mathematical expression it represents, and looping patterns over combinations of input values, preallocation of result etc. are kept out of the definition of mainfunc itself, allowing greater orthogonalization of design.

I am also interested in solutions that don’t depend on the compiler but are maybe cleaner than my mainfunc_explicit_loops! solution. Thanks!

Maybe loopfunc2 combined with Memoize.jl helps. But probably no automatic alternative will beat the manual splitting. (Or, thinking a little bit more, maybe this won’t help at all).

On the other side your manual inner loop seems to be loop-vectorizable.

Here are some tweaks that get me roughly a factor of 2, for both the initial version and the hand-arranged one.

I agree there ought to be some automated way to do what you wrote, I tried using CommonSubexpressions.jl but I think it’s reluctant to move around anything involving a variable defined within what it sees. Although surely a similar macro could be written.

LoopVectorization does many smart things like this, but won’t currently handle arrays of SVector. It’s possible you could make it work on reinterpret(reshape, Float64, as) etc, but I haven’t tried.

julia> function loopfunc1i!(out)
           @inbounds for k ∈ eachindex(cs)
               c = cs[k]
               for j ∈ eachindex(bs)
                   b = bs[j]
                   for i ∈ eachindex(as)
                       a = as[i]
                       out[i,j,k] = mainfunc(r, a, b, c)
                   end
               end
           end
           return out
       end;

julia> @btime loopfunc1!(buffer);  # original
  3.040 ms (0 allocations: 0 bytes)

julia> @btime loopfunc1i!(buffer);  # with indexing + @inbounds
  2.566 ms (0 allocations: 0 bytes)

julia> @inline mainfunc(r, a, b, c) =
           (r⋅a)*(r⋅b)*(r⋅c)/3 + (a⋅b)*(r⋅c) - (a⋅c)*(r⋅b) - (b⋅c)*(r⋅a)
mainfunc (generic function with 1 method)

julia> @btime loopfunc1i!(buffer);  # with indexing + @inbounds + @inline
  1.230 ms (0 allocations: 0 bytes)

julia> @btime mainfunc_explicit_loops!(buffer, r, as, bs, cs);  # as above
  1.355 ms (0 allocations: 0 bytes)

julia> @btime mainfunc_explicit_loops_i!(buffer, r, as, bs, cs);  # with indexing + @inbounds
  810.584 μs (0 allocations: 0 bytes)

julia> using CommonSubexpressions

julia> @macroexpand @cse for k ∈ eachindex(cs)
                      c = cs[k]
                      for j ∈ eachindex(bs)
                          b = bs[j]
                          for i ∈ eachindex(as)
                              a = as[i]
                              out[i,j,k] =
           (r⋅a)*(r⋅b)*(r⋅c)/3 + (a⋅b)*(r⋅c) - (a⋅c)*(r⋅b) - (b⋅c)*(r⋅a)
           end
                      end
                  end
quote
    var"##2132" = eachindex(cs)
    var"##2133" = eachindex(bs)
    var"##2134" = eachindex(as)
    for k = var"##2132"
        #= REPL[362]:2 =#
        c = cs[k]
        #= REPL[362]:3 =#
        for j = var"##2133"
            #= REPL[362]:4 =#
            b = bs[j]
            #= REPL[362]:5 =#
            for i = var"##2134"
                #= REPL[362]:6 =#
                a = as[i]
                #= REPL[362]:7 =#
                out[i, j, k] = ((((r ⋅ a) * (r ⋅ b) * (r ⋅ c)) / 3 + (a ⋅ b) * (r ⋅ c)) - (a ⋅ c) * (r ⋅ b)) - (b ⋅ c) * (r ⋅ a)
            end
        end
    end
end

Thanks for the tips. I didn’t know about LoopVectorization.jl so I’ll have to try out your suggestion when I get a minute to familiarize myself with it.

For now, it looks like my best bet is to is to accept that code maximizing performance usually does not usually make for pretty code. Looks like the cleanest thing to do is to have mainfunc_explicit_loops!, which knows how to efficiently handle combinations of arguments. Then maybe define mainfunc(r, a, b, c) = mainfunc_explicit_loops(r, (a,), (b,), (c,)) or something, and hope the compiler might know to eliminate the now redundant loops over 1-element Tuples. Either that or I am stuck writing two functions, or more likely simply accepting the inefficiency (a factor of two isn’t so bad at least in this case).

It’s tempting to write a macro that would take mainfunc and turn it into something like mainfunc_explicit_loops (at the macro users’ risk), though that might take me some time to figure out.