Why does using a mutable struct type argument to create instances cause a 50x slowdown?

Tortar · June 23, 2023, 10:45pm

I created this MWE based on some package code, and I don’t understand why the two functions don’t have similar performance:

using BenchmarkTools

mutable struct A
    q_0::Int
    q_1::Int
    q_2::Int
end

@noinline function g_1()
    for n in 1:1000
        a = A(rand(1:2), 1, 1)
        f(a)
    end
end

@noinline function g_2()
    for n in 1:1000
        f(A, 1, 1)
    end
end

@noinline function f(a::A)
    return a
end

@noinline function f(a::Type{A}, properties...)
    return a(rand(1:2), properties...)
end

@benchmark g_1()
@benchmark g_2()

which using Julia 1.9.1 returns

julia> @benchmark g_1()
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.383 μs …  15.704 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.384 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.465 μs ± 504.824 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                    ▁                                      ▁
  █▄▄▁▅▄▃▁▄▃▁▁▁▃▃▃▄▅▅▆▆█▅▅▅▄▅▅▄▄▃▄▄▃▄▇▄▃▄▄▃▄▃▅▄▄▄▃▁▄▄▄▁▁▃▃▄▁▆ █
  4.38 μs      Histogram: log(frequency) by time      7.46 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark g_2()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  213.065 μs …  3.937 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     219.227 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   226.327 μs ± 70.989 μs  ┊ GC (mean ± σ):  0.79% ± 3.33%

  ▂▆█▇▇▅▄▄▃▂▂▁▃▃▂▂                                             ▂
  ███████████████████▇▇▇▆▆▇▆▆▅▆▅▆▆▆▇▇▆▆▆▆▅▅▆▄▄▅▄▆▃▄▄▄▃▃▄▄▃▁▄▃▃ █
  213 μs        Histogram: log(frequency) by time       318 μs <

 Memory estimate: 78.12 KiB, allocs estimate: 3000.

As you can see, the second method is 50x slower, which I find surprising. What’s the reason for that, and how can it be fixed?

Elrod · June 23, 2023, 10:54pm

The easiest way is by using @inline instead of @noinline.

FWIW, the compiler is inlining your @noinline f(a) = a anyway.

julia> @code_typed g_1()
CodeInfo(
1 ─       goto #7 if not true
2 ┄ %2  = φ (#1 => 1, #6 => %9)::Int64
│         invoke Random.rand($(QuoteNode(Random.TaskLocalRNG()))::Random.TaskLocalRNG, $(QuoteNode(Random.SamplerRangeNDL{UInt64, Int64}(1, 0x0000000000000002)))::Random.SamplerRangeNDL{UInt64, Int64})::Int64
│   %4  = (%2 === 1000)::Bool
└──       goto #4 if not %4
3 ─       goto #5
4 ─ %7  = Base.add_int(%2, 1)::Int64
└──       goto #5
5 ┄ %9  = φ (#4 => %7)::Int64
│   %10 = φ (#3 => true, #4 => false)::Bool
│   %11 = Base.not_int(%10)::Bool
└──       goto #7 if not %11
6 ─       goto #2
7 ┄       return nothing
) => Nothing

I did modify it so that it would not inline, and it still performed much better.

I don’t know why you’re getting 3 allocations per call for f(::Type{A}, args...); I’d expect only one.
Still, marking that one @inline instead will of course make the two match in performance.

Benny · June 23, 2023, 11:03pm

I see a couple factors.

g_1 runs with no allocations despite constructing mutable instances because the compiler recognized that those instances have a fixed size and never make it out of the for-loop, so it moved them to the stack. If you made a local a outside the loop and returned it at the end, g_1 would do 1 allocation for that. g_2 can’t leverage this optimization because the construction happens in a separate @noinline function from which the instance escapes, so the compiler must allocate it on the heap for the two functions to share.
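A minimal sketch of the escape behaviour described above, under some assumptions: the struct P and the functions local_only and escaping are hypothetical names, not part of the original MWE, and the byte counts reported by @allocated depend on the Julia version and on what the optimizer decides to do.

```julia
# Hypothetical struct, standing in for `A` from the question.
mutable struct P
    x::Int
    y::Int
end

# Every instance stays inside the loop body, so the compiler is free
# to avoid heap-allocating it.
function local_only(n)
    s = 0
    for i in 1:n
        p = P(i, 1)
        s += p.x + p.y
    end
    return s
end

# The last instance is returned to the caller, so it escapes and
# has to live on the heap.
function escaping(n)
    p = P(0, 1)
    for i in 1:n
        p = P(i, 1)
    end
    return p
end

# Call once first so that compilation is not counted.
local_only(10); escaping(10)

@show @allocated local_only(1000)
@show @allocated escaping(1000)
```

Comparing the two reported numbers on your own machine is more reliable than trusting any figure quoted from another setup.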

The other thing is that Julia doesn’t automatically specialize on Function, Type, or Vararg arguments that are only passed through to other function calls. Replace , properties... with , i1, i2 and the 3000 allocations drop to the expected 1000. As you said, this is only a problem with @noinline; replacing it with @inline makes all the allocations go away.
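A sketch of the fixed-arity variant, assuming the struct A from the original post. The names f_fixed and g_fixed are hypothetical, chosen only so they don’t clash with the methods in the question, and the allocation figure may differ between Julia versions.

```julia
mutable struct A
    q_0::Int
    q_1::Int
    q_2::Int
end

# Two explicit positional arguments instead of `properties...`.
@noinline function f_fixed(a::Type{A}, i1, i2)
    return a(rand(1:2), i1, i2)
end

@noinline function g_fixed()
    for n in 1:1000
        f_fixed(A, 1, 1)
    end
end

g_fixed()                      # run once so compilation is not measured
println(@allocated g_fixed())  # bytes allocated by the 1000 calls
```

This only helps when the number of properties is known in advance, which is the trade-off against keeping the Vararg signature.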

Tortar · June 23, 2023, 11:16pm

Thanks @Elrod, but I actually used the @noinline trick to reproduce the difference I see in the actual code, where there is neither @inline nor @noinline (although I did notice that adding @inline to the actual code makes a big difference in performance).

Thanks @benny, but I think I actually need properties..., since the number of properties is unknown in the real code.

Given all of this, do you think using @inline in the actual code should fix the problem even with a different number (and different types) of properties? And why does using @inline make such a big difference? You can see the “actual code” here if interested: Agents.jl/issues/820

Benny · June 24, 2023, 12:00am

When a function call isn’t inlined, you jump from one function to another. That second function is compiled in isolation, and the compiled code is reused when the function is called anywhere else it’s not inlined.

When a function call is inlined, its code is pasted into the caller function’s code, so it’s compiled along with the caller function. That compiled inlined code cannot be reused anywhere else, so it is customized for the caller function. For example, when inlining f(A, 1, 1), the compiler likely took the body a(rand(1:2), properties...) and put in the arguments, making A(rand(1:2), 1, 1). This is identical to g_1 and can use the same optimizations.
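One way to look at this yourself is to compare the typed code of two callers. This is only a sketch: add_one_noinline, add_one_inline, caller_noinline and caller_inline are hypothetical names, and the exact printout varies between Julia versions. I’d expect the first caller to keep a separate invoke of the callee and the second to show the callee’s arithmetic pasted directly into the caller.

```julia
using InteractiveUtils  # provides @code_typed outside the REPL

# Same body, different inlining annotations.
@noinline add_one_noinline(x) = x + 1
@inline add_one_inline(x) = x + 1

caller_noinline(x) = add_one_noinline(x) * 2
caller_inline(x) = add_one_inline(x) * 2

# Print the optimized, typed code of each caller for an Int argument.
println(@code_typed caller_noinline(1))
println(@code_typed caller_inline(1))
```

Both callers return the same value; only the generated code differs.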

Inlining doesn’t always improve things. Compile times and code size often increase, and inlining functions that are too large can cause instruction cache misses. You can leave most of the inlining decisions to Julia’s compiler and use @inline or @noinline to make suggestions (not guarantees) on a case-by-case basis.

One thing that could help is adding at least 1 type parameter to your method to force specialization; there is an example in the Performance Tips section of the manual. That is, if you want the method to be compiled separately for f(A, 1, 1) vs f(A, 1, 1, 1).

Tortar · June 24, 2023, 1:40pm

thanks @benny, indeed using

@noinline function f(a::Type{A}, properties::Vararg{Any, N}) where {N}
    return a(rand(1:2), properties...)
end

for the second method I get

@benchmark g_2()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.100 μs …  1.198 ms  ┊ GC (min … max): 0.00% … 98.58%
 Time  (median):     15.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.373 μs ± 29.317 μs  ┊ GC (mean ± σ):  4.00% ±  2.41%

  ▂▆▄█▇▆▆▃▂                            ▁▁▂▂▂▂▁▂▂▂▁            ▂
  ██████████▇▇▅▇▆▅▇▆▆▆▅▄▁▁▁▃▁▅▁▄▃▄▅▅▆▇▇██████████████▆▇▇▇▆▄▆▅ █
  14.1 μs      Histogram: log(frequency) by time      29.5 μs <

 Memory estimate: 31.25 KiB, allocs estimate: 1000.

which is much better. Adding @inline makes it almost the same speed as the first method:

@inline function f(a::Type{A}, properties::Vararg{Any, N}) where {N}
    return a(rand(1:2), properties...)
end

@benchmark g_2()
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.433 μs …  11.567 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.712 μs ± 466.232 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █
  ▂▂█▃▂▄▆▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▂▂▂▁▁▂▂▁▁▂▂▂▂▂▂ ▂
  5.43 μs         Histogram: frequency by time        8.98 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Benny · June 24, 2023, 10:34pm

The Julia compiler doesn’t automatically specialize this case because, when there are arbitrarily many combinations of Vararg, compiling each of them separately may not be worth it; it only pays off with fewer unique call signatures and more repeated uses of each of them. Carefully consider how the actual code is intended to be used: a benchmark repeating 1 call does not take into account the performance lost to compiling too many call signatures with little reuse.

This reminds me: so far we’ve only been talking about changing properties of the method, whether by annotating it with @inline/@noinline or by adding a method type parameter to force specialization. But @inline/@noinline can also be put at a specific function call, and it will override the annotation on the method itself. As we’ve pointed out, successful inlining into g_2 specializes the call and eliminates the allocations just like g_1, with no need to add a method parameter:

@noinline function g_2()
    for n in 1:1000
        @inline f(A, 1, 1) # overrides @noinline of f
    end
end

@noinline function f(a::Type{A}, properties...)
    return a(rand(1:2), properties...)
end

@time g_2() # no allocations

I see equivalent performance with the inlined parametric method, so I can’t explain why it’s not the same on your machine. Benchmark timings can vary when your machine is multitasking with other processes.

Tortar · June 25, 2023, 12:54am

You are right, they have the same performance (I mistakenly used two different Julia versions, that’s why it was different). The more thorough explanations of Vararg and @inline are much appreciated anyway!
