# Handling functions that have "constant" sub-expressions when looping over arguments

Very commonly, I want to define some function `mainfunc` and then evaluate it many times in a loop where some of the arguments are constant during looping but others vary with each iteration.

Here is an example of what I mean.

``````using BenchmarkTools
using StaticArrays
using LinearAlgebra: (⋅)

mainfunc(r, a, b, c) =
(r⋅a)*(r⋅b)*(r⋅c)/3 + (a⋅b)*(r⋅c) - (a⋅c)*(r⋅b) - (b⋅c)*(r⋅a)

fabricate_vectors(n) = [rand(SVector{3,Float64}) for _ ∈ 1:n]
const r = rand(SVector{3,Float64})
const as = fabricate_vectors(100)
const bs = fabricate_vectors(100)
const cs = fabricate_vectors(100)
const buffer = Array{Float64}(undef, (length(as), length(bs), length(cs)));

function loopfunc1!(out)
for (k,c) ∈ enumerate(cs)
for (j,b) ∈ enumerate(bs)
for (i,a) ∈ enumerate(as)
out[i,j,k] = mainfunc(r, a, b, c)
end
end
end
return out
end;
``````

In general, all (sub)expressions within `mainfunc` need to be evaluated anew for every innermost loop iteration since the output of `mainfunc` might depend on and/or mutate global state whenever it is called. However, I am interested in the case where `mainfunc` is pure in that it does not do these things and its output depends only on its input values. I have tried to ensure this is the case for `mainfunc`, since I don’t expect “mathematical” functions such as `+`, `*`, or `⋅` to have any side effects for the kinds of inputs (simple vectors of numbers) I am interested in.

Then, it seems that an efficient strategy might be as follows, at least in concept.

``````function loopfunc2!(out)
for (k,c) ∈ enumerate(cs)
f1(a,b) = mainfunc(r,a,b,c)
for (j,b) ∈ enumerate(bs)
f2(a) = f1(a,b)
for (i,a) ∈ enumerate(as)
out[i,j,k] = f2(a)
end
end
end
return out
end;
``````

Here, I hoped that the compiler would recognize that in defining `f1`, sub-expressions within the wrapped `mainfunc` that contain only `r` and `c` become constant and don’t need to be evaluated again and again in the inner loops that follow. Thus, to save work, these sub-expressions could be evaluated just once per outermost loop right when `f1` is defined. Similarly, in defining `f2`, only sub-expressions involving `a` of the (doubly) wrapped `mainfunc` are non-constant in the innermost loop.

Unfortunately, my attempt to get the compiler to infer what I want by using wrapper functions doesn’t make a difference.

``````julia> @btime loopfunc1!(buffer);
8.531 ms (0 allocations: 0 bytes)

julia> @btime loopfunc2!(buffer);
8.536 ms (0 allocations: 0 bytes)
``````

Therefore, I am wondering if there is a way to get the Julia compiler to make such optimizations when they are possible.

A manual solution would be to have `mainfunc` take vectors of vectors as arguments, move the loops into `mainfunc`, and manually pre-compute the subexpressions at each loop level.

``````function mainfunc_explicit_loops!(out, r, as, bs, cs)
# (r⋅a)*(r⋅b)*(r⋅c)/3 + (a⋅b)*(r⋅c) - (a⋅c)*(r⋅b) - (b⋅c)(r⋅a)
for (k,c) ∈ enumerate(cs)
rc = r⋅c
rc_div_3 = rc/3
for (j,b) ∈ enumerate(bs)
rb = r⋅b
bc = b⋅c
for (i,a) ∈ enumerate(as)
ra = r⋅a
ab = a⋅b
ac = a⋅c
out[i,j,k] = ra*rb*rc_div_3 + ab*rc - ac*rb - bc*ra
end
end
end
return out
end
``````
``````julia> @btime mainfunc_explicit_loops!(buffer, r, as, bs, cs);
``````
``````  4.788 ms (0 allocations: 0 bytes)
``````

We get a 2x speedup doing things by hand.

However, there are some disadvantages. For one, being able to define something like the non-explicit `mainfunc` is very convenient since the code looks exactly like the mathematical expression it represents, and looping patterns over combinations of input values, preallocation of `result` etc. are kept out of the definition of `mainfunc` itself, allowing greater orthogonalization of design.

I am also interested in solutions that don’t depend on the compiler but are maybe cleaner than my `mainfunc_explicit_loops!` solution. Thanks!

Maybe `loopfunc2` combined with Memoize.jl helps. But probably no automatic alternative will beat the manual splitting. (Or, thinking a little bit more, maybe this won’t help at all).

On the other side your manual inner loop seems to be loop-vectorizable.

Here are some tweaks that get me roughly a factor of 2, for both the initial version and the hand-arranged one.

I agree there ought to be some automated way to do what you wrote, I tried using CommonSubexpressions.jl but I think it’s reluctant to move around anything involving a variable defined within what it sees. Although surely a similar macro could be written.

LoopVectorization does many smart things like this, but won’t currently handle arrays of SVector. It’s possible you could make it work on `reinterpret(reshape, Float64, as)` etc, but I haven’t tried.

``````julia> function loopfunc1i!(out)
@inbounds for k ∈ eachindex(cs)
c = cs[k]
for j ∈ eachindex(bs)
b = bs[j]
for i ∈ eachindex(as)
a = as[i]
out[i,j,k] = mainfunc(r, a, b, c)
end
end
end
return out
end;

julia> @btime loopfunc1!(buffer);  # original
3.040 ms (0 allocations: 0 bytes)

julia> @btime loopfunc1i!(buffer);  # with indexing + @inbounds
2.566 ms (0 allocations: 0 bytes)

julia> @inline mainfunc(r, a, b, c) =
(r⋅a)*(r⋅b)*(r⋅c)/3 + (a⋅b)*(r⋅c) - (a⋅c)*(r⋅b) - (b⋅c)*(r⋅a)
mainfunc (generic function with 1 method)

julia> @btime loopfunc1i!(buffer);  # with indexing + @inbounds + @inline
1.230 ms (0 allocations: 0 bytes)

julia> @btime mainfunc_explicit_loops!(buffer, r, as, bs, cs);  # as above
1.355 ms (0 allocations: 0 bytes)

julia> @btime mainfunc_explicit_loops_i!(buffer, r, as, bs, cs);  # with indexing + @inbounds
810.584 μs (0 allocations: 0 bytes)

julia> using CommonSubexpressions

julia> @macroexpand @cse for k ∈ eachindex(cs)
c = cs[k]
for j ∈ eachindex(bs)
b = bs[j]
for i ∈ eachindex(as)
a = as[i]
out[i,j,k] =
(r⋅a)*(r⋅b)*(r⋅c)/3 + (a⋅b)*(r⋅c) - (a⋅c)*(r⋅b) - (b⋅c)*(r⋅a)
end
end
end
quote
var"##2132" = eachindex(cs)
var"##2133" = eachindex(bs)
var"##2134" = eachindex(as)
for k = var"##2132"
#= REPL:2 =#
c = cs[k]
#= REPL:3 =#
for j = var"##2133"
#= REPL:4 =#
b = bs[j]
#= REPL:5 =#
for i = var"##2134"
#= REPL:6 =#
a = as[i]
#= REPL:7 =#
out[i, j, k] = ((((r ⋅ a) * (r ⋅ b) * (r ⋅ c)) / 3 + (a ⋅ b) * (r ⋅ c)) - (a ⋅ c) * (r ⋅ b)) - (b ⋅ c) * (r ⋅ a)
end
end
end
end
``````

Thanks for the tips. I didn’t know about `LoopVectorization.jl` so I’ll have to try out your suggestion when I get a minute to familiarize myself with it.

For now, it looks like my best bet is to is to accept that code maximizing performance usually does not usually make for pretty code. Looks like the cleanest thing to do is to have `mainfunc_explicit_loops!`, which knows how to efficiently handle combinations of arguments. Then maybe define `mainfunc(r, a, b, c) = mainfunc_explicit_loops(r, (a,), (b,), (c,))` or something, and hope the compiler might know to eliminate the now redundant loops over 1-element `Tuple`s. Either that or I am stuck writing two functions, or more likely simply accepting the inefficiency (a factor of two isn’t so bad at least in this case).

It’s tempting to write a macro that would take `mainfunc` and turn it into something like `mainfunc_explicit_loops` (at the macro users’ risk), though that might take me some time to figure out.