It is common to reduce (e.g. sum) the result of an array expression without needing the whole resulting array itself. In this case Julia will still create a temporary array to pass to the function, even if you managed to avoid temporary arrays before by taking advantage of loop fusion. Is there some way to skip that for reduction somehow? Perhaps some macro or specialized function?

If the reduced expression has only one array argument, one can use mapreduce(), but since it only accepts one iterator, for an expression with several arrays zip() is required which degrades the performance (it also looks somewhat ugly without tuple destructuring). As an example:

function test_vectorized(x, y)
sqrt(sum((x .- y) .^ 2))
end
function test_devectorized(x, y)
s = zero(eltype(x))
@inbounds @simd for i in 1:length(x)
s += (x[i] - y[i])^2
end
sqrt(s)
end
function test_mapreduce(x, y)
sqrt(mapreduce(p -> (p[1]-p[2])^2, +, zip(x, y)))
end
x = rand(100000000)
y = rand(100000000)
r1 = test_vectorized(x, y)
@time test_vectorized(x, y)
r2 = test_devectorized(x, y)
@time test_devectorized(x, y)
r3 = test_mapreduce(x, y)
@time test_mapreduce(x, y)
@assert isapprox(r1, r2)
@assert isapprox(r1, r3)

You can do sqrt(sum((a - b)^2 for (a, b) in zip(x, y))), but itâ€™s quite verbose and Iâ€™m not sure how it will perform. A reduction syntax which would automatically be combined with dot broadcasting operations has been discussed, but no solution exists yet. See for example this comment.

Another way to do this is: norm(x-y). It is the most character economical but allocates another array. Perhaps an iterator/stream/generator version of norm can be defined.

IMHO the mapreduce version is fast enough but the indexing is annoying aesthetically, so I defined:

It is actually as slow as test_vectorized(), because norm() is still a fusion boundary. Plus, the norm is just an example, I have various expressions I need to sum.

Yes, thatâ€™s what I ended up with as well. Still 50% slower than devectorized code, but at least not as slow as just sum().

Just stumbled on this old discussion - and I was wondering, since a lot has changed since then, esp. with broadcasting, is there work going on regarding elision of temporary arrays?

It just so â€¦ imperfect â€¦ that sum(log.(A)) allocates a temporary array (mapreduce(log, +, A) wonâ€™t, of course). Nowadays, we could use

sum(Base.broadcasted(log, A))

It works, but itâ€™s kinda verbose, and doesnâ€™t support dot-notation. If there was a way to say â€śdonâ€™t materialize my broadcast hereâ€ť â€¦ is there something going on in that direction? Iâ€™d be surprised if people havenâ€™t thought about it already (some probably in depth).

It might also be help if Base.broadcasted were a (read-only) AbstractArray, e.g. to allow multi-threaded operation. Or alternatively, if there was a standard option to materialize it as a some kind of mapped-array type. But maybe this has been considered too already (and possibly been rejected for good reasons)?

Thanks for the links, @tkf! I was pretty sure that there would be a discussion about this going on somewhere, just didnâ€™t manage to find it for some reason.