Unknown allocation

I’m trying to do lazy evaluations as follows:

struct V
    data::Vector{Int}
end
f(v, x) = Base.broadcasted(+, x, v.data)
g(v, x) = Base.broadcasted(*, x, v.data)

const x = [1, 2]
const a = V([3, 4])
const b = V([5, 6])

julia> (x .+ a.data) .* b.data
2-element Array{Int64,1}:
 20
 36
julia> Base.materialize(g(b, f(a, x) ) )
2-element Array{Int64,1}:
 20
 36

julia> @btime ($x .+ $a.data) .* $b.data;
  37.233 ns (1 allocation: 96 bytes)
julia> @btime Base.materialize(g($b, f($a, $x) ) );
  37.323 ns (1 allocation: 96 bytes)

So far so good.

However, once I add two more operations (materialization and matrix multiplication), the code slows down and unexplained allocations appear:

struct M
    data::Matrix{Int}
end
h(m, x::Broadcast.Broadcasted) = h(m, Base.materialize(x) )
h(m, x) = m.data * x

const m = M([1 2; 3 4])

fun1() = m.data * ((x .+ a.data) .* b.data)
fun2() = h(m, g(b, f(a, x) ) )

julia> fun1()
2-element Array{Int64,1}:
  92
 204
julia> fun2()
2-element Array{Int64,1}:
  92
 204

julia> @btime fun1();
  83.895 ns (2 allocations: 192 bytes)
julia> @btime fun2();
  105.800 ns (6 allocations: 288 bytes)

Now fun2() is slower and has 4 mysterious additional allocations.

As far as I understand, fun1() and fun2() should compile to identical code:

julia> @code_lowered fun1()
CodeInfo(
1 ─ %1 = Base.getproperty(Main.m, :data)
│   %2 = Base.getproperty(Main.a, :data)
│   %3 = Base.broadcasted(Main.:+, Main.x, %2)
│   %4 = Base.getproperty(Main.b, :data)
│   %5 = Base.broadcasted(Main.:*, %3, %4)
│   %6 = Base.materialize(%5)
│   %7 = %1 * %6
└──      return %7
)

### compared to:
julia> @code_lowered fun2()
CodeInfo(
1 ─ %1 = Main.f(Main.a, Main.x)
│   %2 = Main.g(Main.b, %1)
│   %3 = Main.h(Main.m, %2)
└──      return %3
)

# which can be decomposed as:
julia> @code_lowered f(a, x)
CodeInfo(
1 ─ %1 = Base.broadcasted
│   %2 = Base.getproperty(v, :data)
│   %3 = (%1)(Main.:+, x, %2)
└──      return %3
)

julia> @code_lowered g(b, f(a, x) )
CodeInfo(
1 ─ %1 = Base.broadcasted
│   %2 = Base.getproperty(v, :data)
│   %3 = (%1)(Main.:*, x, %2)
└──      return %3
)

julia> @code_lowered h(m, g(b, f(a, x) ) )
CodeInfo(
1 ─ %1 = Base.materialize
│   %2 = (%1)(x)
│   %3 = Main.h(m, %2)
└──      return %3
)

julia> @code_lowered h(m, Base.materialize(g(b, f(a, x) ) ) )
CodeInfo(
1 ─ %1 = Base.getproperty(m, :data)
│   %2 = %1 * x
└──      return %2
)

Help please! Why is fun2() slower than fun1()? Where do those 4 additional allocations come from? Thanks. :sob:

julia> @benchmark fun1()
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     53.315 ns (0.00% GC)
  median time:      60.710 ns (0.00% GC)
  mean time:        66.382 ns (6.41% GC)
  maximum time:     1.407 μs (93.16% GC)
  --------------
  samples:          10000
  evals/sample:     984

julia> @benchmark fun2()
BenchmarkTools.Trial: 
  memory estimate:  288 bytes
  allocs estimate:  6
  --------------
  minimum time:     67.775 ns (0.00% GC)
  median time:      70.972 ns (0.00% GC)
  mean time:        90.697 ns (17.67% GC)
  maximum time:     2.995 μs (95.11% GC)
  --------------
  samples:          10000
  evals/sample:     977

julia> @inline h(m, x) = m.data * x
h (generic function with 2 methods)

julia> @inline h(m, x::Broadcast.Broadcasted) = h(m, Base.materialize(x) )
h (generic function with 2 methods)

julia> @benchmark fun2()
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     51.397 ns (0.00% GC)
  median time:      59.338 ns (0.00% GC)
  mean time:        65.833 ns (8.42% GC)
  maximum time:     1.678 μs (93.88% GC)
  --------------
  samples:          10000
  evals/sample:     987

EDIT:
@code_typed (with optimize=true, the default) shows you optimized Julia code, after the inliner ran. It’s also very readable (IMO).
What I got with your original version was:

julia> @code_typed fun2()
CodeInfo(
1 ─ %1  = Main.a::Core.Compiler.Const(V([3, 4]), false)
│   %2  = Main.x::Core.Compiler.Const([1, 2], false)
│   %3  = Base.getfield(%1, :data)::Array{Int64,1}
│   %4  = Core.tuple(%2, %3)::Tuple{Array{Int64,1},Array{Int64,1}}
│   %5  = %new(Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}}, +, %4, nothing)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}}
│   %6  = Main.b::Core.Compiler.Const(V([5, 6]), false)
│   %7  = Base.getfield(%6, :data)::Array{Int64,1}
│   %8  = Core.tuple(%5, %7)::Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}},Array{Int64,1}}
│   %9  = %new(Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}},Array{Int64,1}}}, *, %8, nothing)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}},Array{Int64,1}}}
│   %10 = invoke Main.h(Main.m::M, %9::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}},Array{Int64,1}}})::Array{Int64,1}
└──       return %10
) => Array{Int64,1}

There are a couple %news and a Main.h, showing that h didn’t inline. Figured I’d try inlining it to see what happens, and that was all it took to get the same performance and the same (or at least, extremely similar) @code_typed for both versions.
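For anyone who wants to repeat this kind of check outside the REPL, here is a minimal sketch. The helpers slow_sum, fast_sum, call_slow and call_fast are made up for illustration and are not from the code above. In the printed output for call_slow you should typically see a call to slow_sum left in place (an invoke), while call_fast should show the body of the sum directly, though the exact printout differs between Julia versions.

```julia
using InteractiveUtils  # provides code_typed when running as a script

# Hypothetical helpers (not from the thread): one is forced to stay a
# separate call, the other is marked for inlining.
@noinline slow_sum(v) = sum(v)
@inline fast_sum(v) = sum(v)

call_slow(v) = slow_sum(v) + 1
call_fast(v) = fast_sum(v) + 1

# code_typed returns a vector of `CodeInfo => return type` pairs,
# one per matching method; we take the first (and only) one.
for fn in (call_slow, call_fast)
    ci, rettype = first(code_typed(fn, (Vector{Int},)))
    println(fn, " returns ", rettype)
    println(ci)  # look for a remaining call to slow_sum / fast_sum here
end
```

This is the same information that @code_typed shows interactively; the function form is just easier to use in a loop.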

thanks! :+1:

With the original h() (the version without @inline):

julia> const y = g(b, f(a, x) )
Base.Broadcast.Broadcasted(*, (Base.Broadcast.Broadcasted(+, ([1, 2], [3, 4])), [5, 6]))

julia> @btime h($m, $y);
  91.841 ns (2 allocations: 192 bytes)

Here there are 2 allocations: one from materialization, another from the matrix multiplication. No problem.

So, 6 (total) - 2 (from h()) = 4. That means there are still four unexplained allocations. I guess they may come from:

and it’s still mysterious to me! It seems that without @inline, the compiler allocates memory for a Broadcasted object?!

*** edited ***
Ah yes, according to base\broadcast.jl, lines 1233 to 1239:

@inline function broadcasted(f, arg1, arg2, args...)
    xxxx
    broadcasted(combine_styles(arg1′, arg2′, args′...), f, arg1′, arg2′, args′...)
end
@inline broadcasted(::S, f, args...) where S<:BroadcastStyle = Broadcasted{S}(f, args)

That means a function that calls broadcasted() and is not itself inlined constructs a Broadcasted object, which causes allocation (although I don’t understand why calling the constructor involves an allocation).

Just like views get heap allocated when escaping (e.g., getting passed to a non-inlined function), so do Broadcasted objects. Both are structs holding gc-managed heap-allocated objects.

Here is a (view-focused) issue you can follow.

Sorry… what is “escaping”? Could we avoid it? Thanks.

By “escaping” I mean that the object leaves the function it was created in, which normally means either being returned or being passed as an argument to a function that is not inlined.
You can avoid it by not returning the object, and by inlining any functions you pass it to as an argument.
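To make that concrete, here is a small sketch that reuses the types from the question and compares a non-inlined consumer with an inlined one. The names h_noinline, h_inline, fun_noinline and fun_inline are mine, just for illustration. I use @allocated from Base rather than BenchmarkTools so it runs without extra packages. Treat the numbers as version dependent: newer Julia releases (1.5 and later, as far as I know) can often keep such structs off the heap even without inlining, so both calls may report the same number of bytes.

```julia
struct V
    data::Vector{Int}
end
struct M
    data::Matrix{Int}
end

# Lazy building blocks, as in the question.
f(v, x) = Base.broadcasted(+, x, v.data)
g(v, x) = Base.broadcasted(*, x, v.data)

# Hypothetical variants of h: the Broadcasted argument "escapes" into
# h_noinline, because the call is kept as a real function call.
@noinline h_noinline(m, x::Broadcast.Broadcasted) = m.data * Base.materialize(x)
@inline h_inline(m, x::Broadcast.Broadcasted) = m.data * Base.materialize(x)

const x = [1, 2]
const a = V([3, 4])
const b = V([5, 6])
const m = M([1 2; 3 4])

fun_noinline() = h_noinline(m, g(b, f(a, x)))
fun_inline() = h_inline(m, g(b, f(a, x)))

# Run once first so compilation is not counted.
fun_noinline(); fun_inline()

println("noinline: ", @allocated(fun_noinline()), " bytes")
println("inline:   ", @allocated(fun_inline()), " bytes")
```

Both functions return the same vector; only the bytes allocated may differ, depending on whether the compiler has to put the Broadcasted wrappers on the heap.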
