All excellent points and I’d probably do the same in production code.
The identity
trick is manual loop “unfusing” and forces the allocation of a temporary array, exactly like making the temporary explicitly on a separate line. Deciding on which parts of a broadcast would better be materialized early, stored and reused in a separate loop is an optimization which depends on pretty high level knowledge of the cost of memory allocation, memory bandwidth vs the cost of redoing some computation in the inner loop. And these things also completely depend on the size of the arrays involved. I don’t expect the julia compiler to do this any time soon and it also seems at odds with the simple definition of broadcasting which we have now as a fully fused operation.
If I understand LICM correctly it’s a much more local optimization which given a loop structure decides which parts may be hoisted out as loop invariants. It would definitely help here (if it’s not done already), as it allows the following transformation
for i=1:n
for j=1:n
for k=1:n
out[i,j,k] = exp(x[i]) * exp(y[j]) * exp(z[k])
end
end
end
to
for i=1:n
ex = exp(x[i])
for j=1:n
ey = exp(y[j])
for k=1:n
out[i,j,k] = ex * ey * exp(z[k])
end
end
end
But even so, you’ve still got O(n^3)
invocations of exp
, whereas allocating temporary storage and splitting the loops brings you down to O(n)
.