Sum, mapreduce and broadcasted

I was trying to sum a broadcasted array without materialising it. Why is this slower?

using BenchmarkTools

v = rand(100);

@btime sum($v .* $v') ## 9.906 μs (2 allocations: 78.20 KiB)

@btime mapreduce(identity, +, Broadcast.broadcasted(*, $v, $v')) ## 21.593 μs (3 allocations: 64 bytes)

When I look at @code_warntype mapreduce(identity, +, Broadcast.broadcasted(*, v, v')), the return type is inferred, but the body contains ::Union{Nothing, Tuple{Float64,Tuple{CartesianIndices{2,..., which looks like it’s allowing for empty sums.

But adding init=0.0 to the call makes this much slower. And z = rand(0); mapreduce(identity, +, z) has no such problem, and is as fast as sum(z).
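For reference, that Union is just the standard iteration protocol showing through: iterate returns nothing for an empty (or exhausted) iterator and a (value, state) tuple otherwise. A minimal check on the lazy Broadcasted (a sketch using the same names as above):

```julia
v = rand(100)
bc = Broadcast.broadcasted(*, v, v')  # lazy, nothing materialized yet

# iterate returns nothing when the iterator is empty, otherwise a
# (value, state) tuple -- hence the Union{Nothing, Tuple{Float64, ...}}:
st = iterate(bc)
st isa Tuple  # non-empty here, so we get a (Float64, state) pair
```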

Adding materialize helps, but the benchmarks are extremely noisy.

julia> using BenchmarkTools

julia> v = rand(100);

julia> f(v) = sum(v .* v')
f (generic function with 1 method)

julia> g(v) = mapreduce(identity, +, Broadcast.broadcasted(*, v, v'))
g (generic function with 1 method)

julia> h(v) = mapreduce(identity, +, Base.Broadcast.materialize(Broadcast.broadcasted(*, v, v')))
h (generic function with 1 method)

julia> @benchmark f($v)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     8.536 μs (0.00% GC)
  median time:      11.241 μs (0.00% GC)
  mean time:        19.604 μs (36.83% GC)
  maximum time:     43.407 ms (99.92% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark g($v)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  3
  --------------
  minimum time:     13.595 μs (0.00% GC)
  median time:      14.367 μs (0.00% GC)
  mean time:        14.242 μs (0.00% GC)
  maximum time:     73.228 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark h($v)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     8.106 μs (0.00% GC)
  median time:      11.221 μs (0.00% GC)
  mean time:        19.606 μs (37.20% GC)
  maximum time:     44.118 ms (99.93% GC)
  --------------
  samples:          10000
  evals/sample:     1

Note that you can copy and paste the above straight into the REPL: the julia> prompts and the printed results are automatically stripped.

How does that work? Can you elaborate? My REPL doesn’t seem to support this:

julia> julia> using BenchmarkTools
ERROR: UndefVarError: julia not defined
Stacktrace:
 [1] top-level scope at none:0

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

Weird. I can copy and paste it into a REPL just fine:

julia> using BenchmarkTools

julia> v = rand(100);

julia> f(v) = sum(v .* v')
f (generic function with 1 method)

julia> g(v) = mapreduce(identity, +, Broadcast.broadcasted(*, v, v'))
g (generic function with 1 method)

julia> h(v) = mapreduce(identity, +, Base.Broadcast.materialize(Broadcast.broadcasted(*, v, v')))
h (generic function with 1 method)

julia> @benchmark f($v)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     7.319 μs (0.00% GC)
  median time:      9.342 μs (0.00% GC)
  mean time:        12.233 μs (18.48% GC)
  maximum time:     8.865 ms (99.77% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark g($v)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  3
  --------------
  minimum time:     13.595 μs (0.00% GC)
  median time:      14.507 μs (0.00% GC)
  mean time:        15.796 μs (0.00% GC)
  maximum time:     47.980 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark h($v)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     6.870 μs (0.00% GC)
  median time:      9.014 μs (0.00% GC)
  mean time:        11.420 μs (19.74% GC)
  maximum time:     8.638 ms (99.73% GC)
  --------------
  samples:          10000
  evals/sample:     4

It strips the julia>s, as well as the old output (replacing it with fresh output). I’d been taking advantage of that to make results easy to share, and to keep them reproducible by others long after the fact.
However, I wasn’t aware until this thread that many people were unfamiliar with that feature.

If it isn’t working for you, that sounds like a bug.


You have to paste the code for this to work: the REPL can tell whether you are typing or pasting, and it strips the prompts only in the latter case.

It doesn’t work in, e.g., the default Windows terminal.

REPL cleverness aside, yes I agree that materialize is quicker, but it allocates the entire array just like sum(), which is what I’m trying to avoid. It amounts to this, roughly:

@btime mapreduce(identity, +, copy(Broadcast.broadcasted(*, $v, $v'))) ## 9.745 μs (2 allocations: 78.20 KiB)

But what I’d like to understand is how this copy method is able to get all the elements much quicker than mapreduce (or various naive loops I tried).
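To be concrete, one of those naive loops looks roughly like this (a sketch only: Broadcast.instantiate attaches the axes so the lazy Broadcasted can be indexed, and the Float64 accumulator is an assumption for this example):

```julia
# A naive indexed loop over the lazy Broadcasted (sketch;
# instantiate attaches axes so that getindex works):
function naive_sum(bc)
    bc′ = Broadcast.instantiate(bc)
    s = 0.0  # assumes a Float64-valued broadcast
    @inbounds for I in eachindex(bc′)
        s += bc′[I]
    end
    return s
end

v = rand(100)
naive_sum(Broadcast.broadcasted(*, v, v'))  # same value as sum(v .* v')
```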

BroadcastArray in LazyArrays.jl (which is equivalent to Broadcasted but <: AbstractArray) gets rid of the type-inference issue:

julia> g̃(v) = mapreduce(identity, +, BroadcastArray(*, v, v'))
g̃ (generic function with 1 method)

julia> @code_warntype g̃(v)
Body::Float64
1 1 ── %1  = %new(Adjoint{Float64,Array{Float64,1}}, v)::Adjoint{Float64,Array{Float64,1}}
  │    %2  = (LazyArrays.tuple)(v, %1)::Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}
  │    %3  = (Base.arraysize)(v, 1)::Int64  │││╻╷╷╷         instantiate
  │    %4  = (Base.slt_int)(%3, 0)::Bool    ││││╻╷╷╷         combine_axes
  │    %5  = (Base.ifelse)(%4, 0, %3)::Int64│││││┃│││││       broadcast_axes
  │    %6  = %new(Base.OneTo{Int64}, %5)::Base.OneTo{Int64}    axes
  │          (Base.ifelse)(false, 0, 1)     ││││││╻╷╷╷╷        broadcast_axes
  │    %8  = %new(Base.OneTo{Int64}, 1)::Base.OneTo{Int64}      axes
  │    %9  = (Base.arraysize)(v, 1)::Int64  ││││││││╻            axes
  │    %10 = (Base.slt_int)(%9, 0)::Bool    │││││││││╻╷╷╷         map
  │    %11 = (Base.ifelse)(%10, 0, %9)::Int64│││││││││┃││          Type
  │    %12 = %new(Base.OneTo{Int64}, %11)::Base.OneTo{Int64}        Type
  │    %13 = (1 === %5)::Bool               ││││││╻╷╷╷╷        _bcs
  │    %14 = (Base.and_int)(true, %13)::Bool│││││││╻            _bcs1
  └───       goto #3 if not %14             ││││││││┃            _bcsm
  2 ──       goto #4                        │││││││││    
  3 ── %17 = (%5 === 1)::Bool               │││││││││╻            ==
  └───       goto #4                        │││││││││    
  4 ┄─ %19 = φ (#2 => %14, #3 => %17)::Bool ││││││││     
  └───       goto #6 if not %19             ││││││││     
  5 ──       goto #12                       ││││││││     
  6 ── %22 = (%5 === 1)::Bool               │││││││││╻╷           ==
  │    %23 = (Base.and_int)(true, %22)::Bool││││││││││╻            &
  └───       goto #8 if not %23             │││││││││    
  7 ──       goto #9                        │││││││││    
  8 ──       goto #9                        │││││││││    
  9 ┄─ %27 = φ (#7 => %23, #8 => true)::Bool││││││││     
  └───       goto #11 if not %27            ││││││││     
  10 ─       goto #12                       ││││││││     
  11 ─ %30 = %new(Base.DimensionMismatch, "arrays could not be broadcast to a common size")::DimensionMismatch
  │          (Base.Broadcast.throw)(%30)    ││││││││     
  └───       $(Expr(:unreachable))          ││││││││     
  12 ┄ %33 = φ (#5 => %8, #10 => %6)::Base.OneTo{Int64}  
  │    %34 = (Core.tuple)(%33, %12)::Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}
  └───       goto #13                       │││││││      
  13 ─       goto #14                       ││││││       
  14 ─       goto #15                       │││││        
  15 ─ %38 = %new(Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}}, *, %2, %34)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}}
  └───       goto #16                       ││││         
  16 ─ %40 = %new(BroadcastArray{Float64,2,Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}}}, %38)::BroadcastArray{Float64,2,Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}}}
  └───       goto #17                       │││          
  17 ─       goto #18                       ││           
  18 ─ %43 = Main.identity::Core.Compiler.Const(identity, false)
  │    %44 = Main.:+::Core.Compiler.Const(+, false)      
  │    %45 = invoke Base.mapfoldl_impl(%43::Function, %44::Function, $(QuoteNode(NamedTuple()))::NamedTuple{(),Tuple{}}, %40::BroadcastArray{Float64,2,Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}}})::Float64
  └───       return %45                     │            

Though speedwise it’s no better:

julia> @btime g($v)
  14.375 μs (3 allocations: 64 bytes)
2690.6330912377566

julia> @btime g̃($v)
  16.576 μs (12 allocations: 256 bytes)
2690.6330912377566

Thanks, I didn’t know about LazyArrays.jl. Does this rule out type-stability as the problem?

Copying this lazy BroadcastArray is similarly slow, compared to the magic built-in one:

julia> @btime copy(BroadcastArray(*, $v, $v'));
  24.958 μs (6 allocations: 78.30 KiB)

julia> @btime copy(Broadcast.broadcasted(*, $v, $v'));
  7.233 μs (2 allocations: 78.20 KiB)

julia> @btime copy($(rand(100,100)));
  8.049 μs (2 allocations: 78.20 KiB)

It wraps a Broadcasted instance, so if it’s not as fast, it’s a bug.

Indeed. Surely it could just call the same copy() method as materialize does.

Still seems strange if copy can get the elements so much faster than mapreduce can!

mapreduce handles Broadcast.broadcasted(*, v, v') as a plain iterator, whereas materialize inlines and optimizes a tall tree of code. The generic iterator protocol then carries considerable overhead, compared with simply looping over the array that materialize produces. Also, the mapreduce method for iterators does not currently generate SIMD loops here, whereas the array version does. The surprising thing is that the overhead is so small. Keno is amazing.
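A rough way to see the cost of the generic iterator path in isolation (just an illustration, not this exact code path): summing an array directly hits the specialized, SIMD-friendly method, while routing the same elements through a generator forces the iterate-based reduction:

```julia
A = rand(1000)

s_array = sum(A)             # specialized array method: indexed, SIMD loop
s_iter  = sum(x for x in A)  # generic iterator method: iterate-based

# Same result (up to rounding), very different code paths.
```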


Not sure if this is useful for you, but here’s a much faster devectorized solution:

julia> v = rand(100);

julia> f(v) = sum(v .* v');

julia> @btime f(v)
  5.490 μs (3 allocations: 78.22 KiB)
3.9186906586328103

julia> g(v) = (s = zero(v[1]); @inbounds @fastmath for x=v, y=v; s+=x*y; end; s);

julia> @btime g(v)
  1.853 μs (1 allocation: 16 bytes)
3.918690658632818

On my machine, the @fastmath version ends up using the vfmadd231pd instruction (256-bit SIMD fused multiply add), unrolled by a factor 4.
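Incidentally, for this particular toy reduction there is also an O(n), allocation-free algebraic shortcut, since the outer-product sum factorizes (though this only applies to this specific example, not to general sum(broadcast...) patterns):

```julia
v = rand(100)

# sum_{i,j} v[i]*v[j] == (sum(v))^2, so no n^2 work or
# allocation is needed at all for this toy case:
sum(v)^2 ≈ sum(v .* v')  # true, up to floating-point rounding
```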

Thanks, that’s neat. It may even be the first time I’ve seen @fastmath change something significantly!

Of course really this is a toy example, my code has lots of sum(broadcast…) patterns which look wasteful, although easy to read.

Indeed, maybe that is the right conclusion.

I will have another go at untangling this “tall tree of code”; my attempts (not shown) at adapting the eachindex(bc) loop to compute the sum were not quick.