Debugging strange allocations

Motivation

When optimizing Julia code, my first step is usually to make sure the compiler infers everything. There are great tools like @code_warntype and Cthulhu that let you diagnose inference problems in a systematic and efficient way.

Often this step is enough to get decent performance. Sometimes, however, there are still lots of allocations with no obvious reason; see for instance here. Lately this has been consuming a significant amount of my dev time, so I would like to better understand what's going on and be more systematic about solving it.

Simple allocation model

Here is the simple allocation model I use to reason about my code:

  • Constructing a bitstype does not allocate.
  • Constructing a non-bitstype does allocate.
  • Running code that @code_warntype complains about does allocate.

Any allocation that this model cannot explain, I will call strange.

Examples of allocations explained by the model

obvious0(x) = x
obvious0(1)  # expected compilation allocation
obvious1(x) = [x]
obvious1(42) # warmup
obvious1(42) # expected allocation due to non-bitstype construction
struct MyBox; content; end
obvious2(x::MyBox) = x.content
x = MyBox(5)
obvious2(x) # warmup
obvious2(x) # expected allocation due to type instability
Example of strange allocation: keyword arguments
struct TypeStableBox{T}; inner::T; end
unwrap(b::TypeStableBox) = b.inner
@noinline kw(;a,b) = unwrap(a) + unwrap(b)
function strange_kw(a,b)
    ret = unwrap(a)
    for _ in 1:1000
        ret += kw(a=a,b=b)
    end
    ret
end
a = TypeStableBox(1)
b = TypeStableBox(2)
strange_kw(a,b)  # expected allocation due to compilation
@time strange_kw(a,b) # strange allocation
# 0.000019 seconds (2.00 k allocations: 31.422 KiB)
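For contrast, here is a minimal sketch of the same loop with positional arguments (kwfree and not_strange_kw are my names, not part of the original example); if the keyword call is what allocates, this variant should report zero allocations after warmup:

@noinline kwfree(a,b) = unwrap(a) + unwrap(b)  # hypothetical positional twin of kw
function not_strange_kw(a,b)
    ret = unwrap(a)
    for _ in 1:1000
        ret += kwfree(a,b)  # positional call, no keyword sorter involved
    end
    ret
end
not_strange_kw(a,b)        # warmup
@time not_strange_kw(a,b)  # expect no allocations if the keyword call was the cause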
Example of strange allocation: argument splatting
struct TypeStableBox{T}; inner::T; end
unwrap(b::TypeStableBox) = b.inner
@noinline argsplat(ab...) = _argsplat(ab...)
@noinline _argsplat(a,b) = unwrap(a) + unwrap(b)
function strange_argsplat(a,b)
    ret = unwrap(a)
    for _ in 1:1000
        ret += argsplat(a,b)
    end
    ret
end
a = TypeStableBox(1)
b = TypeStableBox(2)
strange_argsplat(a,b)  # expected allocation due to compilation
@time strange_argsplat(a,b) # strange allocation
#   0.000044 seconds (2.00 k allocations: 31.422 KiB)
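Again for contrast, a minimal sketch (not_strange_argsplat is my name) that calls _argsplat directly, skipping the vararg intermediary; if splatting is the cause, this should not allocate after warmup:

function not_strange_argsplat(a,b)
    ret = unwrap(a)
    for _ in 1:1000
        ret += _argsplat(a,b)  # direct call, no splatting
    end
    ret
end
not_strange_argsplat(a,b)        # warmup
@time not_strange_argsplat(a,b)  # expect no allocations if the splat was the cause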

So here are a couple of questions:

  • What would be a better allocation model?
  • Given code that has strange allocations, how can I diagnose where they occur? (one approach is sketched after this list)
  • Can I see symptoms/warning signs of strange allocations in @code_warntype?
  • How do other people debug these?
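For the diagnosis question, one approach I know of is Julia's built-in allocation tracking; a sketch of the workflow (the per-line numbers in the .mem output will of course vary):

# Start Julia with allocation tracking enabled:
#   julia --track-allocation=user
using Profile
strange_kw(a,b)              # run once so compilation allocations happen now
Profile.clear_malloc_data()  # discard everything tracked so far
strange_kw(a,b)              # the run whose allocations we want to attribute
# After exiting Julia, each source file gains a companion .mem file that
# annotates every line with the number of bytes allocated there.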

I cannot reproduce.

julia> struct TypeStableBox{T}; inner::T; end

julia> unwrap(b::TypeStableBox) = b.inner
unwrap (generic function with 1 method)

julia> @noinline kw(;a,b) = unwrap(a) + unwrap(b)
kw (generic function with 1 method)

julia> function strange_kw(a,b)
           ret = unwrap(a)
           for _ in 1:1000
               ret += kw(a=a,b=b)
           end
           ret
       end
strange_kw (generic function with 1 method)

julia> a = TypeStableBox(1)
TypeStableBox{Int64}(1)

julia> b = TypeStableBox(2)
TypeStableBox{Int64}(2)

julia> strange_kw(a,b)  # expected allocation due to compilation
3001

julia> @time strange_kw(a,b)
  0.000004 seconds (5 allocations: 176 bytes)
3001

julia> using BenchmarkTools

julia> @benchmark strange_kw($a,$b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.119 μs (0.00% GC)
  median time:      1.123 μs (0.00% GC)
  mean time:        1.177 μs (0.00% GC)
  maximum time:     2.970 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

In general, you will see allocations when dynamic dispatch happens.


Dynamic dispatch is covered by the "@code_warntype complains" rule. Interesting to know that you can’t reproduce. I am currently on Julia 1.1.0:

versioninfo()

Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)
Environment:
  JULIA_NUM_THREADS = 8

I was referring to:

@time strange_kw(a,b)

in the global scope as being the dynamic dispatch.
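To illustrate (a sketch; ca and cb are my names): a and b are non-const globals, so their types are unknown at the top-level call site and the call dispatches dynamically, whereas const globals (or $-interpolation in BenchmarkTools) give the call site concrete types:

@time strange_kw(a,b)    # non-const globals: dynamic dispatch at top level
const ca = TypeStableBox(1)
const cb = TypeStableBox(2)
@time strange_kw(ca,cb)  # types are now known at the call site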
I’m on:

julia> versioninfo()
Julia Version 1.3.0-DEV.428
Commit 863e03fbb1* (2019-06-19 15:07 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.0 (ORCJIT, skylake)

So there’s a chance things are different in 1.3.
Can you post the result when you use BenchmarkTools like I did?


It is weird: I get no allocations when running @benchmark the second time. I was pretty sure that @benchmark does a warmup call, so that no compilation overhead gets reported?

julia> struct TypeStableBox{T}; inner::T; end

julia> unwrap(b::TypeStableBox) = b.inner
unwrap (generic function with 1 method)

julia> @noinline kw(;a,b) = unwrap(a) + unwrap(b)
kw (generic function with 1 method)

julia> function strange_kw(a,b)
           ret = unwrap(a)
           for _ in 1:1000
               ret += kw(a=a,b=b)
           end
           ret
       end
strange_kw (generic function with 1 method)

julia> a = TypeStableBox(1)
TypeStableBox{Int64}(1)

julia> b = TypeStableBox(2)
TypeStableBox{Int64}(2)

julia> using BenchmarkTools

julia> @benchmark strange_kw($a,$b)
BenchmarkTools.Trial: 
  memory estimate:  31.25 KiB
  allocs estimate:  2000
  --------------
  minimum time:     15.132 μs (0.00% GC)
  median time:      16.334 μs (0.00% GC)
  mean time:        23.670 μs (24.53% GC)
  maximum time:     41.522 ms (99.88% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark strange_kw($a,$b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.777 μs (0.00% GC)
  median time:      1.777 μs (0.00% GC)
  mean time:        1.824 μs (0.00% GC)
  maximum time:     3.853 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

That is weird. I don’t know why it would still allocate after warmup.
I’m also fairly sure that BenchmarkTools does a warmup call. Your first @benchmark was 10_000 samples anyway, of which the minimum was 15 microseconds.
Your second benchmark was also 10_000 samples, this time with 10 evals/sample, and the max was < 3.9 microseconds, with a min and median of 1.77 microseconds.
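If you want to separate warmup and tuning from the measurement explicitly, BenchmarkTools supports that; a sketch:

using BenchmarkTools
bench = @benchmarkable strange_kw($a,$b)
tune!(bench)  # picks evals/sample and warms the benchmark up
run(bench)    # only the measured runs, after tuning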