Sum, mapreduce and broadcasted

I was trying to sum a broadcasted array without materialising it. Why is this slower?

using BenchmarkTools

v = rand(100);

@btime sum($v .* $v') ## 9.906 μs (2 allocations: 78.20 KiB)

@btime mapreduce(identity, +, Broadcast.broadcasted(*, $v, $v')) ## 21.593 μs (3 allocations: 64 bytes)

When I look at @code_warntype mapreduce(identity, +, Broadcast.broadcasted(*, v, v')), the return type is inferred, but the body contains ::Union{Nothing, Tuple{Float64,Tuple{CartesianIndices{2,..., which looks like it’s allowing for empty sums.

But adding init=0.0 to the call makes this much slower. And z = rand(0); mapreduce(identity, +, z) has no such problem, and is as fast as sum(z).
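For reference, that Union is just the standard iteration protocol showing through: iterate returns nothing for an empty (or exhausted) iterator and a (value, state) tuple otherwise. A minimal check on the lazy Broadcasted (a sketch using the same names as above):

```julia
v = rand(100)
bc = Broadcast.broadcasted(*, v, v')  # lazy, nothing materialized yet

# iterate returns nothing when the iterator is empty, otherwise a
# (value, state) tuple -- hence the Union{Nothing, Tuple{Float64, ...}}:
st = iterate(bc)
st isa Tuple  # non-empty here, so we get a (Float64, state) pair
```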

Adding materialize helps, but the benchmarks are extremely noisy.

julia> using BenchmarkTools

julia> v = rand(100);

julia> f(v) = sum(v .* v')
f (generic function with 1 method)

julia> g(v) = mapreduce(identity, +, Broadcast.broadcasted(*, v, v'))
g (generic function with 1 method)

julia> h(v) = mapreduce(identity, +, Base.Broadcast.materialize(Broadcast.broadcasted(*, v, v')))
h (generic function with 1 method)

julia> @benchmark f($v)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     8.536 μs (0.00% GC)
  median time:      11.241 μs (0.00% GC)
  mean time:        19.604 μs (36.83% GC)
  maximum time:     43.407 ms (99.92% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark g($v)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  3
  --------------
  minimum time:     13.595 μs (0.00% GC)
  median time:      14.367 μs (0.00% GC)
  mean time:        14.242 μs (0.00% GC)
  maximum time:     73.228 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark h($v)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     8.106 μs (0.00% GC)
  median time:      11.221 μs (0.00% GC)
  mean time:        19.606 μs (37.20% GC)
  maximum time:     44.118 ms (99.93% GC)
  --------------
  samples:          10000
  evals/sample:     1

Note that you can copy and paste the above straight into the REPL: the julia> prompts and the printed results are automatically stripped.

How does that work? Can you elaborate? My REPL doesn’t seem to support this:

julia> julia> using BenchmarkTools
ERROR: UndefVarError: julia not defined
Stacktrace:
 [1] top-level scope at none:0

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

Weird. I can copy and paste it into a REPL just fine:

julia> using BenchmarkTools

julia> v = rand(100);

julia> f(v) = sum(v .* v')
f (generic function with 1 method)

julia> g(v) = mapreduce(identity, +, Broadcast.broadcasted(*, v, v'))
g (generic function with 1 method)

julia> h(v) = mapreduce(identity, +, Base.Broadcast.materialize(Broadcast.broadcasted(*, v, v')))
h (generic function with 1 method)

julia> @benchmark f($v)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     7.319 μs (0.00% GC)
  median time:      9.342 μs (0.00% GC)
  mean time:        12.233 μs (18.48% GC)
  maximum time:     8.865 ms (99.77% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark g($v)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  3
  --------------
  minimum time:     13.595 μs (0.00% GC)
  median time:      14.507 μs (0.00% GC)
  mean time:        15.796 μs (0.00% GC)
  maximum time:     47.980 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark h($v)
BenchmarkTools.Trial: 
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     6.870 μs (0.00% GC)
  median time:      9.014 μs (0.00% GC)
  mean time:        11.420 μs (19.74% GC)
  maximum time:     8.638 ms (99.73% GC)
  --------------
  samples:          10000
  evals/sample:     4

It strips the julia>s, as well as the old output (replacing it with fresh output). I’d been taking advantage of that to make results easy to share, and to keep them reproducible by others long after the fact.
However, I wasn’t aware until this thread that many people were unfamiliar with that feature.

If it isn’t working for you, that sounds like a bug.


You have to paste the code for this to work: the REPL can tell whether you are typing or pasting, and it strips the prompts only in the latter case.

It doesn’t work in, e.g., the default Windows terminal.

REPL cleverness aside, yes I agree that materialize is quicker, but it allocates the entire array just like sum(), which is what I’m trying to avoid. It amounts to this, roughly:

@btime mapreduce(identity, +, copy(Broadcast.broadcasted(*, $v, $v'))) ## 9.745 μs (2 allocations: 78.20 KiB)

But what I’d like to understand is how this copy method is able to get all the elements much quicker than mapreduce (or various naive loops I tried).
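To be concrete, one of those naive loops looks roughly like this (a sketch only: Broadcast.instantiate attaches the axes so the lazy Broadcasted can be indexed, and the Float64 accumulator is an assumption for this example):

```julia
# A naive indexed loop over the lazy Broadcasted (sketch;
# instantiate attaches axes so that getindex works):
function naive_sum(bc)
    bc′ = Broadcast.instantiate(bc)
    s = 0.0  # assumes a Float64-valued broadcast
    @inbounds for I in eachindex(bc′)
        s += bc′[I]
    end
    return s
end

v = rand(100)
naive_sum(Broadcast.broadcasted(*, v, v'))  # same value as sum(v .* v')
```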

BroadcastArray in LazyArrays.jl (which is equivalent to Broadcasted but <: AbstractArray) gets rid of the type-inference issue:

julia> g̃(v) = mapreduce(identity, +, BroadcastArray(*, v, v'))
g̃ (generic function with 1 method)

julia> @code_warntype g̃(v)
Body::Float64
1 1 ── %1  = %new(Adjoint{Float64,Array{Float64,1}}, v)::Adjoint{Float64,Array{Float64,1}}
  │    %2  = (LazyArrays.tuple)(v, %1)::Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}
  │    %3  = (Base.arraysize)(v, 1)::Int64  │││╻╷╷╷         instantiate
  │    %4  = (Base.slt_int)(%3, 0)::Bool    ││││╻╷╷╷         combine_axes
  │    %5  = (Base.ifelse)(%4, 0, %3)::Int64│││││┃│││││       broadcast_axes
  │    %6  = %new(Base.OneTo{Int64}, %5)::Base.OneTo{Int64}    axes
  │          (Base.ifelse)(false, 0, 1)     ││││││╻╷╷╷╷        broadcast_axes
  │    %8  = %new(Base.OneTo{Int64}, 1)::Base.OneTo{Int64}      axes
  │    %9  = (Base.arraysize)(v, 1)::Int64  ││││││││╻            axes
  │    %10 = (Base.slt_int)(%9, 0)::Bool    │││││││││╻╷╷╷         map
  │    %11 = (Base.ifelse)(%10, 0, %9)::Int64│││││││││┃││          Type
  │    %12 = %new(Base.OneTo{Int64}, %11)::Base.OneTo{Int64}        Type
  │    %13 = (1 === %5)::Bool               ││││││╻╷╷╷╷        _bcs
  │    %14 = (Base.and_int)(true, %13)::Bool│││││││╻            _bcs1
  └───       goto #3 if not %14             ││││││││┃            _bcsm
  2 ──       goto #4                        │││││││││    
  3 ── %17 = (%5 === 1)::Bool               │││││││││╻            ==
  └───       goto #4                        │││││││││    
  4 ┄─ %19 = φ (#2 => %14, #3 => %17)::Bool ││││││││     
  └───       goto #6 if not %19             ││││││││     
  5 ──       goto #12                       ││││││││     
  6 ── %22 = (%5 === 1)::Bool               │││││││││╻╷           ==
  │    %23 = (Base.and_int)(true, %22)::Bool││││││││││╻            &
  └───       goto #8 if not %23             │││││││││    
  7 ──       goto #9                        │││││││││    
  8 ──       goto #9                        │││││││││    
  9 ┄─ %27 = φ (#7 => %23, #8 => true)::Bool││││││││     
  └───       goto #11 if not %27            ││││││││     
  10 ─       goto #12                       ││││││││     
  11 ─ %30 = %new(Base.DimensionMismatch, "arrays could not be broadcast to a common size")::DimensionMismatch
  │          (Base.Broadcast.throw)(%30)    ││││││││     
  └───       $(Expr(:unreachable))          ││││││││     
  12 ┄ %33 = φ (#5 => %8, #10 => %6)::Base.OneTo{Int64}  
  │    %34 = (Core.tuple)(%33, %12)::Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}
  └───       goto #13                       │││││││      
  13 ─       goto #14                       ││││││       
  14 ─       goto #15                       │││││        
  15 ─ %38 = %new(Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}}, *, %2, %34)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}}
  └───       goto #16                       ││││         
  16 ─ %40 = %new(BroadcastArray{Float64,2,Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}}}, %38)::BroadcastArray{Float64,2,Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}}}
  └───       goto #17                       │││          
  17 ─       goto #18                       ││           
  18 ─ %43 = Main.identity::Core.Compiler.Const(identity, false)
  │    %44 = Main.:+::Core.Compiler.Const(+, false)      
  │    %45 = invoke Base.mapfoldl_impl(%43::Function, %44::Function, $(QuoteNode(NamedTuple()))::NamedTuple{(),Tuple{}}, %40::BroadcastArray{Float64,2,Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(*),Tuple{Array{Float64,1},Adjoint{Float64,Array{Float64,1}}}}})::Float64
  └───       return %45                     │            

Though speedwise it’s no better:

julia> @btime g($v)
  14.375 μs (3 allocations: 64 bytes)
2690.6330912377566

julia> @btime g̃($v)
  16.576 μs (12 allocations: 256 bytes)
2690.6330912377566

Thanks, I didn’t know about LazyArrays.jl. Does this rule out type-stability as the problem?

Copying this lazy BroadcastArray is similarly slow, compared to the magic built-in one:

julia> @btime copy(BroadcastArray(*, $v, $v'));
  24.958 μs (6 allocations: 78.30 KiB)

julia> @btime copy(Broadcast.broadcasted(*, $v, $v'));
  7.233 μs (2 allocations: 78.20 KiB)

julia> @btime copy($(rand(100,100)));
  8.049 μs (2 allocations: 78.20 KiB)

It wraps a Broadcasted instance, so if it’s not as fast, it’s a bug.

Indeed. Surely it could just call the same copy() method as materialize does.

Still seems strange if copy can get the elements so much faster than mapreduce can!

mapreduce handles Broadcast.broadcasted(*, v, v') as a plain iterator, whereas materialize inlines and optimizes a tall tree of code. The generic iterator protocol then carries considerable overhead, compared with simply looping over the array that materialize produces. Also, the mapreduce method for iterators does not currently generate SIMD loops here, whereas the array version does. The surprising thing is that the overhead is so small. Keno is amazing.
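A rough way to see the cost of the generic iterator path in isolation (just an illustration, not this exact code path): summing an array directly hits the specialized, SIMD-friendly method, while routing the same elements through a generator forces the iterate-based reduction:

```julia
A = rand(1000)

s_array = sum(A)             # specialized array method: indexed, SIMD loop
s_iter  = sum(x for x in A)  # generic iterator method: iterate-based

# Same result (up to rounding), very different code paths.
```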


Not sure if this is useful for you, but here’s a much faster devectorized solution:

julia> v = rand(100);

julia> f(v) = sum(v .* v');

julia> @btime f(v)
  5.490 μs (3 allocations: 78.22 KiB)
3.9186906586328103

julia> g(v) = (s = zero(v[1]); @inbounds @fastmath for x=v, y=v; s+=x*y; end; s);

julia> @btime g(v)
  1.853 μs (1 allocation: 16 bytes)
3.918690658632818

On my machine, the @fastmath version ends up using the vfmadd231pd instruction (256-bit SIMD fused multiply add), unrolled by a factor 4.
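Incidentally, for this particular toy reduction there is also an O(n), allocation-free algebraic shortcut, since the outer-product sum factorizes (though this only applies to this specific example, not to general sum(broadcast...) patterns):

```julia
v = rand(100)

# sum_{i,j} v[i]*v[j] == (sum(v))^2, so no n^2 work or
# allocation is needed at all for this toy case:
sum(v)^2 ≈ sum(v .* v')  # true, up to floating-point rounding
```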

Thanks, that’s neat. It may even be the first time I’ve seen @fastmath change something significantly!

Of course really this is a toy example, my code has lots of sum(broadcast…) patterns which look wasteful, although easy to read.

Indeed, maybe that is the right conclusion.

I will have another go at untangling this “tall tree of code”; my attempts (not shown) at adapting the eachindex(bc) loop to compute the sum were not quick.