Avoid temporary array creation for the argument of `reduce`


It is common to reduce (e.g. sum) the result of an array expression without needing the whole resulting array itself. In this case Julia will still create a temporary array to pass to the function, even if you managed to avoid temporary arrays earlier by taking advantage of loop fusion. Is there some way to skip that temporary for reductions, perhaps a macro or a specialized function?

If the reduced expression has only one array argument, one can use mapreduce(), but since it accepts only one iterator, an expression with several arrays requires zip(), which degrades performance (it also looks somewhat ugly without tuple destructuring). As an example:

function test_vectorized(x, y)
    sqrt(sum((x .- y) .^ 2))
end

function test_devectorized(x, y)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += (x[i] - y[i])^2
    end
    sqrt(s)
end

function test_mapreduce(x, y)
    sqrt(mapreduce(p -> (p[1] - p[2])^2, +, zip(x, y)))
end

x = rand(100000000)
y = rand(100000000)

r1 = test_vectorized(x, y)
@time test_vectorized(x, y)

r2 = test_devectorized(x, y)
@time test_devectorized(x, y)

r3 = test_mapreduce(x, y)
@time test_mapreduce(x, y)

@assert isapprox(r1, r2)
@assert isapprox(r1, r3)

The result:

  0.575121 seconds (87 allocations: 762.946 MiB, 18.08% gc time)
  0.094775 seconds (5 allocations: 176 bytes)
  0.140150 seconds (9 allocations: 272 bytes)

Is there a better way to do this kind of calculation without resorting to devectorization?


You can do sqrt(sum((a - b)^2 for (a, b) in zip(x, y))), but it’s quite verbose and I’m not sure how it will perform. A reduction syntax that would automatically fuse with dot-broadcast operations has been discussed, but no solution exists yet. See for example this comment.
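Wrapped in a function for comparison with the versions above (the function name is mine), that generator-based variant looks like this; sum() over a Generator iterates lazily, so no temporary array is materialized:

```julia
# Generator version: the (a - b)^2 values are produced lazily,
# so sum() never sees a full temporary array.
function test_generator(x, y)
    sqrt(sum((a - b)^2 for (a, b) in zip(x, y)))
end
```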


Another way to do this is norm(x - y). It is the most economical in characters, but it allocates another array. Perhaps an iterator/stream/generator version of norm could be defined.
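For reference, a sketch of that variant (on Julia 0.7+ norm lives in the LinearAlgebra stdlib; the function name is mine):

```julia
using LinearAlgebra  # provides norm on Julia 0.7+

# Most concise, but x - y allocates a full temporary array:
test_norm(x, y) = norm(x - y)
```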

IMHO the mapreduce version is fast enough, but the tuple indexing is aesthetically annoying, so I defined:

import Base: mapreduce
mapreduce(f, op, v0, itr1, itr2) = mapreduce(t -> f(t[1], t[2]), op, v0, zip(itr1, itr2))

and then the mapreduce flows better:

sqrt(mapreduce((a, b) -> (a - b)^2, +, zero(eltype(x)), x, y))

And runs quite decently fast.


Better than test_vectorized(), but worse than test_mapreduce(), unfortunately.

Thanks for the link, glad to see I’m not the only one who needs this functionality.


It is actually as slow as test_vectorized(), because norm() is still a fusion boundary. Plus, the norm is just an example, I have various expressions I need to sum.

Yes, that’s what I ended up with as well. Still 50% slower than devectorized code, but at least not as slow as just sum().
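For what it’s worth, since Julia 1.2 mapreduce itself accepts multiple iterators, so neither zip() nor a custom method is needed; a sketch (the function name is mine):

```julia
# On Julia 1.2+, mapreduce maps f element-wise over several iterators:
function test_mapreduce_multi(x, y)
    sqrt(mapreduce((a, b) -> (a - b)^2, +, x, y))
end
```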