Why does using a mutable struct type argument to create instances cause a 50x slowdown?

I created this MWE based on some package code, and I don’t understand why the two functions have such different performance:

using BenchmarkTools

mutable struct A
    q_0::Int
    q_1::Int
    q_2::Int
end

@noinline function g_1()
    for n in 1:1000
        a = A(rand(1:2), 1, 1)
        f(a)
    end
end

@noinline function g_2()
    for n in 1:1000
        f(A, 1, 1)
    end
end

@noinline function f(a::A)
    return a
end

@noinline function f(a::Type{A}, properties...)
    return a(rand(1:2), properties...)
end

@benchmark g_1()
@benchmark g_2()

which on Julia 1.9.1 returns

julia> @benchmark g_1()
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.383 μs …  15.704 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.384 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.465 μs ± 504.824 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                    ▁                                      ▁
  █▄▄▁▅▄▃▁▄▃▁▁▁▃▃▃▄▅▅▆▆█▅▅▅▄▅▅▄▄▃▄▄▃▄▇▄▃▄▄▃▄▃▅▄▄▄▃▁▄▄▄▁▁▃▃▄▁▆ █
  4.38 μs      Histogram: log(frequency) by time      7.46 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark g_2()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  213.065 μs …  3.937 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     219.227 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   226.327 μs ± 70.989 μs  ┊ GC (mean ± σ):  0.79% ± 3.33%

  ▂▆█▇▇▅▄▄▃▂▂▁▃▃▂▂                                             ▂
  ████████████████████▇▇▇▆▆▇▆▆▅▆▅▆▆▆▇▇▆▆▆▆▅▅▆▄▄▅▄▆▃▄▄▄▃▃▄▄▃▁▄▃▃ █
  213 μs        Histogram: log(frequency) by time       318 μs <

 Memory estimate: 78.12 KiB, allocs estimate: 3000.

As you can see, there is a 50x slowdown in the second method, which is surprising to me. What’s the reason for that, and how can it be fixed?

The easiest fix is to use @inline instead of @noinline.

FWIW, the compiler is inlining your @noinline f(a) = a anyway.

julia> @code_typed g_1()
CodeInfo(
1 ─       goto #7 if not true
2 ┄ %2  = φ (#1 => 1, #6 => %9)::Int64
│         invoke Random.rand($(QuoteNode(Random.TaskLocalRNG()))::Random.TaskLocalRNG, $(QuoteNode(Random.SamplerRangeNDL{UInt64, Int64}(1, 0x0000000000000002)))::Random.SamplerRangeNDL{UInt64, Int64})::Int64
│   %4  = (%2 === 1000)::Bool
└──       goto #4 if not %4
3 ─       goto #5
4 ─ %7  = Base.add_int(%2, 1)::Int64
└──       goto #5
5 ┄ %9  = φ (#4 => %7)::Int64
│   %10 = φ (#3 => true, #4 => false)::Bool
│   %11 = Base.not_int(%10)::Bool
└──       goto #7 if not %11
6 ─       goto #2
7 ┄       return nothing
) => Nothing

But I did modify it to get it to not inline, and it still performed much better. I don’t know why you’re getting 3 allocations/call for f(::Type{A}, args...); I’d think there should be only one.

Still, marking that one @inline instead will of course result in the two matching in performance.


I see a couple factors.

g_1 runs with no allocations despite constructing mutable instances because the compiler recognized that those instances have a fixed size and never make it out of that for-loop, so it moved them to the stack. If you made a local a outside the loop and returned it at the end, g_1 would do one allocation for that. g_2 can’t leverage this optimization because the construction happens in a separate @noinline function that the instance escapes, so the compiler must allocate it on the heap for the two functions to share.
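To illustrate the escape point, here is a minimal sketch of that variant (g_1_escaping is a hypothetical name, not part of the original code):

@noinline function g_1_escaping()
    local a
    for n in 1:1000
        a = A(rand(1:2), 1, 1)  # same construction as in g_1
        f(a)
    end
    return a  # returning `a` makes it escape; per the explanation above, this costs one heap allocation
end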

The other thing is that Julia doesn’t automatically specialize on Function, Type, or Vararg arguments that are only passed along to other function calls. Replace , properties... with , i1, i2 and the allocations drop from 3000 to the expected 1000. As you said, this is only a problem with @noinline; replacing it with @inline makes all the allocations go away.
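For concreteness, the fixed-arity variant looks like this (f_fixed and g_2_fixed are hypothetical names for illustration):

@noinline function f_fixed(a::Type{A}, i1, i2)
    # fixed arguments instead of properties..., so Julia specializes this method
    return a(rand(1:2), i1, i2)
end

@noinline function g_2_fixed()
    for n in 1:1000
        f_fixed(A, 1, 1)  # expect 1000 allocations: one heap-allocated A per call
    end
end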


Thanks @Elrod, but I used the @noinline trick to reproduce the difference I see in the actual code, where there is neither @inline nor @noinline (though I did notice that in the actual code using @inline makes a big difference in performance).

Thanks @benny, but I think I actually need properties..., since the number of properties is unknown in the real code.

Given all of this, do you think using @inline in the actual code would fix the problem even with different numbers (and types) of properties? And why does using @inline make such a big difference? You can see the “actual code” here if interested: Agents.jl/issues/820

When a function call isn’t inlined, you jump from one function to another. That second function is compiled in isolation, and the compiled code is reused when the function is called anywhere else it’s not inlined.

When a function call is inlined, its code is pasted into the caller function’s code, so it’s compiled along with the caller function. That compiled inlined code cannot be reused anywhere else, so it is customized for the caller function. For example, when inlining f(A, 1, 1), the compiler likely took the body a(rand(1:2), properties...) and put in the arguments, making A(rand(1:2), 1, 1). This is identical to g_1 and can use the same optimizations.
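Conceptually, the inlined g_2 then behaves like the sketch below (g_2_inlined_view is a hypothetical name; this illustrates the effect, it is not the compiler’s literal output):

function g_2_inlined_view()
    for n in 1:1000
        # f's body pasted in, with `a = A` and `properties = (1, 1)` substituted:
        A(rand(1:2), 1, 1)
    end
end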

Inlining doesn’t always improve things. Compile times and code size often increase, and inlining overly large functions can cause instruction cache misses. You can leave most inlining decisions to Julia’s compiler and use @inline or @noinline to make suggestions (not guarantees) on a case-by-case basis.

One thing that could help is adding at least one type parameter to your method to force specialization; there is an example in the “Be aware of when Julia avoids specializing” section of the Performance Tips. That is, do this if you want the method to be compiled separately for f(A, 1, 1) vs f(A, 1, 1, 1).


Thanks @benny, indeed using

@noinline function f(a::Type{A}, properties::Vararg{Any, N}) where {N}
    return a(rand(1:2), properties...)
end

for the second method I get

@benchmark g_2()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.100 μs …  1.198 ms  ┊ GC (min … max): 0.00% … 98.58%
 Time  (median):     15.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.373 μs ± 29.317 μs  ┊ GC (mean ± σ):  4.00% ±  2.41%

  ▂▆▄█▇▆▆▃▂                            ▁▁▂▂▂▂▁▂▂▂▁            ▂
  ██████████▇▇▅▇▆▅▇▆▆▆▅▄▁▁▁▃▁▅▁▄▃▄▅▅▆▇▇██████████████▆▇▇▇▆▄▆▅ █
  14.1 μs      Histogram: log(frequency) by time      29.5 μs <

 Memory estimate: 31.25 KiB, allocs estimate: 1000.

which is much better; the remaining 1000 allocations are the heap-allocated instances escaping the @noinline call, as explained above. Adding @inline makes it almost the same speed as the first method:

@inline function f(a::Type{A}, properties::Vararg{Any, N}) where {N}
    return a(rand(1:2), properties...)
end

@benchmark g_2()
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.433 μs …  11.567 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.712 μs ± 466.232 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █
  ▂▂█▃▂▄▆▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▂▂▂▁▁▂▂▁▁▂▂▂▂▂▂ ▂
  5.43 μs         Histogram: frequency by time        8.98 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The reason the Julia compiler doesn’t automatically specialize this case is that with arbitrarily many combinations of Vararg, compiling each of them separately may not be worth it; it’s only worth it with fewer unique call signatures and more repeated uses of each. Carefully consider how the actual code is intended to be used: a benchmark repeating one call does not account for the performance lost to compiling many call signatures with little reuse.

This reminds me: so far we’ve only been talking about changing properties of the method, whether annotating it with @inline/@noinline or adding a method type parameter to force specialization. But @inline/@noinline can also be put at a specific call site, where it overrides the annotation on the method itself. As we’ve pointed out, successful inlining into g_2 specializes the call and eliminates the allocations just like in g_1, with no need to add a method parameter:

@noinline function g_2()
    for n in 1:1000
        @inline f(A, 1, 1) # overrides @noinline of f
    end
end

@noinline function f(a::Type{A}, properties...)
    return a(rand(1:2), properties...)
end

@time g_2() # no allocations

I see equivalent performance with the inlined parametric method, so I can’t explain why it’s not the same on your machine. Benchmark timings can vary when your machine is multitasking with other processes.


You are right, they have the same performance (I mistakenly used two different Julia versions, which is why it differed). Much appreciated, both of the more thorough explanations on Vararg and @inline!