Is there something like broadcast_mapreduce?



Assume I have 2 arrays x and y that can be broadcast to a common size. Is there any way to get a (zipped) iterator over the elements of both arrays as they would result from a broadcast operation? For example:

x = rand(2,5,3)
y = rand(1,5,1)

Is there any way to obtain the sum without allocating the intermediate array by getting an iterator over the elements and calling mapreduce on it?
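For reference, later Julia versions (1.x) expose the lazy broadcast machinery directly, which answers this without custom code. This sketch assumes Julia ≥ 1.0 and uses the unexported `Base.Broadcast` API:

```julia
x = rand(2, 5, 3)
y = rand(1, 5, 1)

# Build the fused broadcast expression lazily, without materializing x .* y
bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(*, x, y))

# Reducing over the lazy object never allocates the intermediate 2x5x3 array
s = sum(bc)
s ≈ sum(x .* y)  # true
```

Note that `broadcasted`/`instantiate` did not exist at the time of this thread, so the question below is answered with the pre-1.0 toolbox.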


Not necessarily; the MWE runs fine, since the length-1 axes are just repeated.


I should have made this clearer. As @Tamas_Papp mentioned, I want this to work on arrays of different sizes with singleton dimensions repeated, exactly the way broadcast is defined…



x = rand(2,5,3)
y = rand(1,5,1)

What @fabiangans needs is expansion of singleton dimensions, then broadcast proper, then the sum. I do not see a way to do it with the current setup, other than reusing the expansion framework in some clever way with the generator approach you suggested above.
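The expansion-then-reduce idea can be sketched without touching broadcast internals: iterate the combined broadcast shape and clamp each index to the array's size, so singleton dimensions are repeated. `broadcast_sum` and its use of the unexported `Base.Broadcast.combine_axes` are illustrative (Julia 1.x), not code from this thread:

```julia
# Sketch: reduce over a virtual broadcast without allocating it.
# Assumes the inputs are broadcast-compatible (every dim equal or 1).
function broadcast_sum(f, xs...)
    shape = Base.Broadcast.combine_axes(xs...)  # combined broadcast axes
    s = 0.0
    for I in CartesianIndices(shape)
        # Clamp each coordinate to the array's size: singleton dims stay at 1
        vals = map(x -> x[CartesianIndex(min.(Tuple(I), size(x)))], xs)
        s += f(vals...)
    end
    return s
end

x = rand(2, 5, 3)
y = rand(1, 5, 1)
broadcast_sum(*, x, y) ≈ sum(x .* y)  # true
```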


I think this is up to you, but would probably make sense. I had a look at the broadcast source code and if no one else steps up with a solution I will try to tap into that system and adjust it to my needs.

Thanks anyway,


Thanks for the hint, I tried to adapt the expansion framework. I tried two approaches:

  1. An iterator similar to zip, but based on the broadcast expansion to repeat singleton dimensions
  2. A broadcast_reduce function, similar to mapreduce
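The `BroadIter` implementation is not shown in this thread, but a zip-like iterator over the broadcast shape can be approximated with a generator. `broad_iter` below is a hypothetical stand-in (Julia 1.x, index-clamping trick), not the actual code benchmarked here:

```julia
# Hypothetical stand-in for BroadIter: yields tuples of elements as a
# broadcast over xs would pair them, repeating singleton dimensions.
broad_iter(xs...) = (map(x -> x[CartesianIndex(min.(Tuple(I), size(x)))], xs)
                     for I in CartesianIndices(Base.Broadcast.combine_axes(xs...)))

x1 = rand(2, 5, 3)
x2 = rand(1, 5, 1)
mapreduce(t -> t[1] * t[2], +, broad_iter(x1, x2)) ≈ sum(x1 .* x2)  # true
```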

Here are the results of some simple benchmarks:

using BenchmarkTools

x1 = rand(200,5000,30)
x2 = rand(1,5000,1)
x3 = rand(1,1,1)

# The allocating version
@btime sum(x1.*x2.*x3)
# 60.590 ms (57 allocations: 228.88 MiB)

# Iterator-based mapreduce
it = BroadIter(x1,x2,x3)
@btime mapreduce(x->x[1]*x[2]*x[3],+,it)
# 117.096 ms (3 allocations: 64 bytes)

# Iterator-based loop
function sum_iter(x1,x2,x3)
    it = BroadIter(x1,x2,x3)
    s = 0.0
    for (a1,a2,a3) in it
        s += a1*a2*a3
    end
    return s
end
@btime sum_iter(x1,x2,x3)
# 219.865 ms (4 allocations: 240 bytes)

# broadcast_reduce
@btime broadcast_reduce(*,+,0.0,x1,x2,x3)
# 134.958 ms (79 allocations: 1.92 KiB)

So the code is allocation-free and works correctly for my use case, but it is still slower than the allocating version. I would guess this is due to SIMD optimizations, which I don't think are possible for the iterator approach. Any hints on how to improve either approach would be welcome. The code that I adapted from broadcast.jl is here:
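On the SIMD question: one way to test that hypothesis is a hand-written loop specialized to this benchmark's shapes (`x2` varies only along dimension 2, `x3` is effectively a scalar), annotated with `@inbounds` and `@simd` so the compiler can vectorize the contiguous inner dimension. This is only a sketch for these specific singleton patterns, not a general solution:

```julia
# Specialized to x1 :: (n1,n2,n3), x2 :: (1,n2,1), x3 :: (1,1,1).
function sum_loop(x1, x2, x3)
    s = 0.0
    c3 = x3[1, 1, 1]
    @inbounds for k in axes(x1, 3), j in axes(x1, 2)
        c = x2[1, j, 1] * c3          # hoist the broadcast factors
        @simd for i in axes(x1, 1)    # contiguous stride-1 inner loop
            s += x1[i, j, k] * c
        end
    end
    return s
end
```

If this hand-rolled loop beats the allocating version, the gap in the iterator approach is very likely lost vectorization rather than iteration overhead.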


Just in case someone stumbles upon this thread, to answer my own question: today I thought about the problem again and found a nice and fast way to do this by using the mutating broadcast! on a custom AbstractArray which performs the reduction when setindex! is called. Here is my final implementation of broadcast_reduce, which has no allocations and is faster than the allocating version:

# Define an AbstractArray which consumes values on setindex!
type ArrayReducer{T,N,F} <: AbstractArray{T,N}
    size::NTuple{N,Int}
    v::T
    op::F
end
Base.size(x::ArrayReducer) = x.size
Base.setindex!(x::ArrayReducer,v,i...) = x.v=x.op(x.v,v)
get_stop(r) = last(r)  # last index of an axis range

# Define the function
function broadcast_reduce(f,op,v0,A,Bs...)
    shape = Base.Broadcast.broadcast_indices(A,Bs...)
    reducer = ArrayReducer(get_stop.(shape),v0,op)
    broadcast!(f,reducer,A,Bs...)
    return reducer.v
end

# First check if both give the same result
x1 = rand(200,5000,30)
x2 = rand(1,5000,1)
x3 = rand(1,1,1)
broadcast_reduce(*,+,0.0,x1,x2,x3) ≈ sum(x1.*x2.*x3) # true


#Then benchmark
using BenchmarkTools
@btime sum(x1.*x2.*x3)
#  109.582 ms (57 allocations: 228.88 MiB)

@btime broadcast_reduce(*,+,0.0,x1,x2,x3)
# 35.359 ms (46 allocations: 1.81 KiB)


That seems really useful, maybe it could be added to a package? (Or become part of a new package?)