Very commonly, I want to define some function mainfunc and then evaluate it many times in a loop where some of the arguments are constant during looping but others vary with each iteration.
Here is an example of what I mean.
using BenchmarkTools
using StaticArrays
using LinearAlgebra: (⋅)
mainfunc(r, a, b, c) =
    (r⋅a)*(r⋅b)*(r⋅c)/3 + (a⋅b)*(r⋅c) - (a⋅c)*(r⋅b) - (b⋅c)*(r⋅a)
fabricate_vectors(n) = [rand(SVector{3,Float64}) for _ ∈ 1:n]
const r = rand(SVector{3,Float64})
const as = fabricate_vectors(100)
const bs = fabricate_vectors(100)
const cs = fabricate_vectors(100)
const buffer = Array{Float64}(undef, (length(as), length(bs), length(cs)));
function loopfunc1!(out)
    for (k,c) ∈ enumerate(cs)
        for (j,b) ∈ enumerate(bs)
            for (i,a) ∈ enumerate(as)
                out[i,j,k] = mainfunc(r, a, b, c)
            end
        end
    end
    return out
end;
In general, every sub-expression within mainfunc must be evaluated anew on each innermost loop iteration, since mainfunc might depend on or mutate global state whenever it is called. However, I am interested in the case where mainfunc is pure: it has no side effects, and its output depends only on its input values. I have tried to ensure this is the case for mainfunc, since I don’t expect “mathematical” functions such as +, *, or ⋅ to have side effects for the kinds of inputs (simple vectors of numbers) I am interested in.
Then, it seems that an efficient strategy might be as follows, at least in concept.
function loopfunc2!(out)
    for (k,c) ∈ enumerate(cs)
        f1(a,b) = mainfunc(r,a,b,c)
        for (j,b) ∈ enumerate(bs)
            f2(a) = f1(a,b)
            for (i,a) ∈ enumerate(as)
                out[i,j,k] = f2(a)
            end
        end
    end
    return out
end;
Here, I hoped that the compiler would recognize that, once f1 is defined, the sub-expressions within the wrapped mainfunc that involve only r and c become constant and don’t need to be re-evaluated in the inner loops that follow. To save work, these sub-expressions could be evaluated just once per outermost iteration, right when f1 is defined. Similarly, once f2 is defined, only the sub-expressions involving a remain non-constant in the innermost loop.
Unfortunately, my attempt to get the compiler to infer what I want by using wrapper functions doesn’t make a difference.
julia> @btime loopfunc1!(buffer);
8.531 ms (0 allocations: 0 bytes)
julia> @btime loopfunc2!(buffer);
8.536 ms (0 allocations: 0 bytes)
Therefore, I am wondering if there is a way to get the Julia compiler to make such optimizations when they are possible.
A manual solution would be to have mainfunc take vectors of vectors as arguments, move the loops into mainfunc, and manually pre-compute the subexpressions at each loop level.
function mainfunc_explicit_loops!(out, r, as, bs, cs)
    # (r⋅a)*(r⋅b)*(r⋅c)/3 + (a⋅b)*(r⋅c) - (a⋅c)*(r⋅b) - (b⋅c)*(r⋅a)
    for (k,c) ∈ enumerate(cs)
        rc = r⋅c
        rc_div_3 = rc/3
        for (j,b) ∈ enumerate(bs)
            rb = r⋅b
            bc = b⋅c
            for (i,a) ∈ enumerate(as)
                ra = r⋅a
                ab = a⋅b
                ac = a⋅c
                out[i,j,k] = ra*rb*rc_div_3 + ab*rc - ac*rb - bc*ra
            end
        end
    end
    return out
end
julia> @btime mainfunc_explicit_loops!(buffer, r, as, bs, cs);
4.788 ms (0 allocations: 0 bytes)
We get a roughly 1.8x speedup by doing things by hand.
However, there are some disadvantages. For one, being able to define something like the non-explicit mainfunc is very convenient: the code looks exactly like the mathematical expression it represents, and the looping patterns over combinations of input values, preallocation of the result, etc., are kept out of the definition of mainfunc itself, allowing a more orthogonal design.
I am also interested in solutions that don’t depend on the compiler but are maybe cleaner than my mainfunc_explicit_loops! solution. Thanks!
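For example, one direction that seems cleaner in some respects (a sketch, not benchmarked; mainfunc_tabulated! is a made-up name) is to tabulate every dot product into plain arrays up front, so the triple loop reduces to scalar arithmetic with no repeated ⋅ calls:

```julia
using LinearAlgebra: ⋅

function mainfunc_tabulated!(out, r, as, bs, cs)
    # One-dimensional tables: O(n) dot products each.
    ras = [r⋅a for a in as]
    rbs = [r⋅b for b in bs]
    rcs = [r⋅c for c in cs]
    # Pairwise tables: O(n^2) dot products and storage each.
    abt = [a⋅b for a in as, b in bs]   # size (length(as), length(bs))
    act = [a⋅c for a in as, c in cs]
    bct = [b⋅c for b in bs, c in cs]
    for k in eachindex(cs)
        rc  = rcs[k]
        rc3 = rc/3
        for j in eachindex(bs)
            rb = rbs[j]
            bc = bct[j, k]
            for i in eachindex(as)
                out[i,j,k] = ras[i]*rb*rc3 + abt[i,j]*rc - act[i,k]*rb - bc*ras[i]
            end
        end
    end
    return out
end
```

Unlike the versions above, the pairwise tables allocate, so whether this wins presumably depends on problem size and on how expensive the inner expression is relative to the dot products.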