Unknown allocation

tomtom · February 9, 2020, 2:43am

I’m trying to do lazy evaluations as follows:

struct V
    data::Vector{Int}
end
f(v, x) = Base.broadcasted(+, x, v.data)
g(v, x) = Base.broadcasted(*, x, v.data)

const x = [1, 2]
const a = V([3, 4])
const b = V([5, 6])

julia> (x .+ a.data) .* b.data
2-element Array{Int64,1}:
 20
 36
julia> Base.materialize(g(b, f(a, x) ) )
2-element Array{Int64,1}:
 20
 36

julia> @btime ($x .+ $a.data) .* $b.data;
  37.233 ns (1 allocation: 96 bytes)
julia> @btime Base.materialize(g($b, f($a, $x) ) );
  37.323 ns (1 allocation: 96 bytes)

so far so good.
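For readers following along, here is a minimal sketch of what the lazy pipeline above does. It assumes only `Base.broadcasted`, `Base.materialize` and `Base.materialize!`, the functions that dot syntax lowers to; the names `lazy`, `result` and `out` are just for illustration.

```julia
x = [1, 2]
a = [3, 4]
b = [5, 6]

# Build the expression tree lazily; no element-wise work has happened yet.
lazy = Base.broadcasted(*, Base.broadcasted(+, x, a), b)

# materialize runs the fused loop once and allocates the result array.
result = Base.materialize(lazy)

# materialize! writes into an existing array instead of allocating a new one.
out = similar(x)
Base.materialize!(out, lazy)
```

Both `result` and `out` end up holding the same values as `(x .+ a) .* b`.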

However, once I add two more operations (materialization and matrix multiplication), it gets slower and shows allocations I can’t account for:

struct M
    data::Matrix{Int}
end
h(m, x::Broadcast.Broadcasted) = h(m, Base.materialize(x) )
h(m, x) = m.data * x

const m = M([1 2; 3 4])

fun1() = m.data * ((x .+ a.data) .* b.data)
fun2() = h(m, g(b, f(a, x) ) )

julia> fun1()
2-element Array{Int64,1}:
  92
 204
julia> fun2()
2-element Array{Int64,1}:
  92
 204

julia> @btime fun1();
  83.895 ns (2 allocations: 192 bytes)
julia> @btime fun2();
  105.800 ns (6 allocations: 288 bytes)

Now fun2() is slower and has 4 mysterious additional allocations.

As far as I understand, the code of fun1() and fun2() should be identical:

julia> @code_lowered fun1()
CodeInfo(
1 ─ %1 = Base.getproperty(Main.m, :data)
│   %2 = Base.getproperty(Main.a, :data)
│   %3 = Base.broadcasted(Main.:+, Main.x, %2)
│   %4 = Base.getproperty(Main.b, :data)
│   %5 = Base.broadcasted(Main.:*, %3, %4)
│   %6 = Base.materialize(%5)
│   %7 = %1 * %6
└──      return %7
)

### compared to:
julia> @code_lowered fun2()
CodeInfo(
1 ─ %1 = Main.f(Main.a, Main.x)
│   %2 = Main.g(Main.b, %1)
│   %3 = Main.h(Main.m, %2)
└──      return %3
)

# which can be decomposed as:
julia> @code_lowered f(a, x)
CodeInfo(
1 ─ %1 = Base.broadcasted
│   %2 = Base.getproperty(v, :data)
│   %3 = (%1)(Main.:+, x, %2)
└──      return %3
)

julia> @code_lowered g(b, f(a, x) )
CodeInfo(
1 ─ %1 = Base.broadcasted
│   %2 = Base.getproperty(v, :data)
│   %3 = (%1)(Main.:*, x, %2)
└──      return %3
)

julia> @code_lowered h(m, g(b, f(a, x) ) )
CodeInfo(
1 ─ %1 = Base.materialize
│   %2 = (%1)(x)
│   %3 = Main.h(m, %2)
└──      return %3
)

julia> @code_lowered h(m, Base.materialize(g(b, f(a, x) ) ) )
CodeInfo(
1 ─ %1 = Base.getproperty(m, :data)
│   %2 = %1 * x
└──      return %2
)

Help please! Why is fun2() slower than fun1()? Where do those 4 additional allocations come from? Thanks.

Elrod · February 9, 2020, 3:14am

julia> @benchmark fun1()
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     53.315 ns (0.00% GC)
  median time:      60.710 ns (0.00% GC)
  mean time:        66.382 ns (6.41% GC)
  maximum time:     1.407 μs (93.16% GC)
  --------------
  samples:          10000
  evals/sample:     984

julia> @benchmark fun2()
BenchmarkTools.Trial: 
  memory estimate:  288 bytes
  allocs estimate:  6
  --------------
  minimum time:     67.775 ns (0.00% GC)
  median time:      70.972 ns (0.00% GC)
  mean time:        90.697 ns (17.67% GC)
  maximum time:     2.995 μs (95.11% GC)
  --------------
  samples:          10000
  evals/sample:     977

julia> @inline h(m, x) = m.data * x
h (generic function with 2 methods)

julia> @inline h(m, x::Broadcast.Broadcasted) = h(m, Base.materialize(x) )
h (generic function with 2 methods)

julia> @benchmark fun2()
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     51.397 ns (0.00% GC)
  median time:      59.338 ns (0.00% GC)
  mean time:        65.833 ns (8.42% GC)
  maximum time:     1.678 μs (93.88% GC)
  --------------
  samples:          10000
  evals/sample:     987

EDIT:
@code_typed (with optimize=true, the default) shows you optimized Julia code, after the inliner ran. It’s also very readable (IMO).
What I got with your original version was:

julia> @code_typed fun2()
CodeInfo(
1 ─ %1  = Main.a::Core.Compiler.Const(V([3, 4]), false)
│   %2  = Main.x::Core.Compiler.Const([1, 2], false)
│   %3  = Base.getfield(%1, :data)::Array{Int64,1}
│   %4  = Core.tuple(%2, %3)::Tuple{Array{Int64,1},Array{Int64,1}}
│   %5  = %new(Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}}, +, %4, nothing)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}}
│   %6  = Main.b::Core.Compiler.Const(V([5, 6]), false)
│   %7  = Base.getfield(%6, :data)::Array{Int64,1}
│   %8  = Core.tuple(%5, %7)::Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}},Array{Int64,1}}
│   %9  = %new(Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}},Array{Int64,1}}}, *, %8, nothing)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}},Array{Int64,1}}}
│   %10 = invoke Main.h(Main.m::M, %9::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Int64,1},Array{Int64,1}}},Array{Int64,1}}})::Array{Int64,1}
└──       return %10
) => Array{Int64,1}

There are a couple %news and a Main.h, showing that h didn’t inline. Figured I’d try inlining it to see what happens, and that was all it took to get the same performance and the same (or at least, extremely similar) @code_typed for both versions.
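The same check can be scripted outside the REPL. The following is only a sketch: `h_plain`, `h_hinted`, `run_plain` and `run_hinted` are hypothetical names, the functions take their inputs as arguments instead of using the thread's `const` globals, and whether the un-hinted method gets inlined depends on the Julia version.

```julia
using InteractiveUtils  # for @code_typed (loaded automatically in the REPL)

struct V
    data::Vector{Int}
end
struct M
    data::Matrix{Int}
end

f(v, x) = Base.broadcasted(+, x, v.data)
g(v, x) = Base.broadcasted(*, x, v.data)

# Hypothetical names: the same method body with and without an inlining hint.
h_plain(m, x::Base.Broadcast.Broadcasted) = m.data * Base.materialize(x)
@inline h_hinted(m, x::Base.Broadcast.Broadcasted) = m.data * Base.materialize(x)

run_plain(m, a, b, x) = h_plain(m, g(b, f(a, x)))
run_hinted(m, a, b, x) = h_hinted(m, g(b, f(a, x)))

m, a, b, x = M([1 2; 3 4]), V([3, 4]), V([5, 6]), [1, 2]

# @code_typed returns a `CodeInfo => return type` pair.
ci_plain, rt_plain = @code_typed run_plain(m, a, b, x)
ci_hinted, rt_hinted = @code_typed run_hinted(m, a, b, x)

# Print both listings to compare them by eye.
println(ci_plain)
println(ci_hinted)
```

In the printed listings, a line starting with `invoke` that names the `h_...` method means the call was kept as a real function call; if no such line appears, the body was inlined into the caller.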

tomtom · February 9, 2020, 9:32am

Thanks!

With the original h() (the version without @inline):

julia> const y = g(b, f(a, x) )
Base.Broadcast.Broadcasted(*, (Base.Broadcast.Broadcasted(+, ([1, 2], [3, 4])), [5, 6]))

julia> @btime h($m, $y);
  91.841 ns (2 allocations: 192 bytes)

Here there are 2 allocations: one from materialization and one from the matrix multiplication. No problem.

So, 6 (total) - 2 (from h()) = 4. That means there are still four unexplained allocations. I guess they may come from:

and it’s still mysterious to me! It seems that without @inline, the compiler allocates a Broadcasted object?!

*** edited ***
Ah yes, according to base\broadcast.jl, lines 1233 to 1239:

@inline function broadcasted(f, arg1, arg2, args...)
    xxxx
    broadcasted(combine_styles(arg1′, arg2′, args′...), f, arg1′, arg2′, args′...)
end
@inline broadcasted(::S, f, args...) where S<:BroadcastStyle = Broadcasted{S}(f, args)

That means a function calling broadcasted() without @inline would construct a Broadcasted object, and that construction causes the allocation (although I don’t understand why calling the constructor involves an allocation).

Elrod · February 9, 2020, 3:40pm

Just like views get heap allocated when escaping (e.g., when passed to a non-inlined function), so do Broadcasted objects. Both are structs holding references to GC-managed, heap-allocated objects.

Here is a (view-focused) issue you can follow.
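A small sketch of what "a struct holding heap-allocated objects" means for a Broadcasted. It assumes the `f` and `args` fields of `Base.Broadcast.Broadcasted`, which are internal details rather than documented API; the variable names are illustrative.

```julia
x = [1, 2]
y = [3, 4]
bc = Base.broadcasted(+, x, y)

# The wrapper stores the function and the original arrays, not copies.
same_function = bc.f === +
same_arrays = bc.args[1] === x && bc.args[2] === y

# Because it holds references to heap-allocated arrays, it is not a plain-bits type.
plain_bits = isbitstype(typeof(bc))

println((same_function, same_arrays, plain_bits))
```

The wrapper itself is tiny, but since it carries references the garbage collector has to track, the compiler may place it on the heap when it crosses a call that is not inlined.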

tomtom · February 10, 2020, 2:28pm

Sorry… what is “escaping”? Could we avoid it? Thanks.

Elrod · February 10, 2020, 2:53pm

I meant if it leaves the function, which normally either means being returned or passed as an argument to a function that is not inlined.
You can avoid it by not returning it, and by inlining any functions you pass it to as an argument.
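A rough sketch of those options, under stated assumptions: `h_barrier`, `h_inlined`, `mul` and the `via_*` wrappers are hypothetical names, `@noinline` is used only to force the call boundary, and the byte counts printed by `@allocated` are coarse and vary between Julia versions (the `@btime` approach used earlier in the thread is more reliable).

```julia
struct V
    data::Vector{Int}
end
struct M
    data::Matrix{Int}
end

f(v, x) = Base.broadcasted(+, x, v.data)
g(v, x) = Base.broadcasted(*, x, v.data)

# The Broadcasted escapes: it is passed to a call that is forced not to inline.
@noinline h_barrier(m, x::Base.Broadcast.Broadcasted) = m.data * Base.materialize(x)

# Option 1: inline the callee, so the Broadcasted never leaves the caller.
@inline h_inlined(m, x::Base.Broadcast.Broadcasted) = m.data * Base.materialize(x)

# Option 2: materialize in the caller, so only a plain Vector crosses the boundary.
@noinline mul(m, v::Vector{Int}) = m.data * v

via_barrier(m, a, b, x) = h_barrier(m, g(b, f(a, x)))
via_inlined(m, a, b, x) = h_inlined(m, g(b, f(a, x)))
via_eager(m, a, b, x) = mul(m, Base.materialize(g(b, f(a, x))))

m, a, b, x = M([1 2; 3 4]), V([3, 4]), V([5, 6]), [1, 2]

for fn in (via_barrier, via_inlined, via_eager)
    fn(m, a, b, x)  # warm up so compilation is not measured
    bytes = @allocated fn(m, a, b, x)
    println(fn, ": ", bytes, " bytes")
end
```

All three variants compute the same product; only how the intermediate Broadcasted is handled differs.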
