Keyword arguments reduce performance

I’m running a simulation where I’m collecting data from a graph represented by a sparse matrix. I have a pretty simple function that suffers performance loss just by making an Int32 typed variable a keyword argument instead of a normal argument. Why is this?

function specific_ham(args, newstate; j::Int32)
    (;params) = args
    adj = args.gadj
    state = args.gstate
    cumsum = zero(Float32)
    for ptr in nzrange(adj, j)
        i = adj.rowval[ptr]
        wij = adj.nzval[ptr]
        cumsum += wij * state[i]
    end
    return (state[j]-newstate) * cumsum + (state[j]^2-newstate^2)*params.self[j] + (state[j]-newstate)*params.b[j]
end
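
For reference, the positional-argument variant I'm comparing against looks roughly like this (the name specific_ham_positional is just for this post; it's the same body with j moved from a keyword to a regular argument):

# Same computation as above, but with j as a positional argument.
function specific_ham_positional(args, newstate, j::Int32)
    (;params) = args
    adj = args.gadj
    state = args.gstate
    cumsum = zero(Float32)
    for ptr in nzrange(adj, j)
        i = adj.rowval[ptr]
        wij = adj.nzval[ptr]
        cumsum += wij * state[i]
    end
    return (state[j]-newstate) * cumsum + (state[j]^2-newstate^2)*params.self[j] + (state[j]-newstate)*params.b[j]
end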

Making j a normal argument increases performance in my case by about 8 percent over my total loop (which also does a few other things). Doing that in this function isn't a problem, but this is a prototype to mimic part of my other code (which has some metaprogramming, where the keyword arguments do come in handy), so I'm trying to understand why I'm getting worse performance here.

However, doing something stupid like this:

specific_hamkw(@specialize(args); @specialize(kwargs...))  = @inline specific_ham(args, (;kwargs...))

function specific_ham(args, kwargs)
  (;j, newstate) = kwargs
  ...
end

and then using specific_hamkw doesn't cause a performance hit. Smells like a bug?

This is on Julia 1.11.4, by the way.

It would be helpful if you provided a full reproducer. That is, something I can paste into my REPL that contains:

  • fast version
  • slow version
  • benchmarking code showing the difference

Okay, it's actually quite weird: I cannot reproduce the problem with straight code, but it somehow happens when I redefine functions (which I was also doing in my own code). Here's the code I'm running, though I'm not sure anymore which exact part is essential to the problem:

using SparseArrays, BenchmarkTools

const state = rand(1000)
const sparse_matrix = sprand(1000, 1000, 0.001)

function loop(func, args)
    loopidx=1
    ti = time_ns()
    for _ in 1:10000000
        @inline choose_j(func, args)
        loopidx += 1
    end
    tf = time_ns()
    println("The normal loop took $(tf-ti) ns")
    println("Updates per sec: $(loopidx / (tf-ti) * 1e9)")
end

function loopkw(func, args)
    loopidx=1
    ti = time_ns()
    for _ in 1:10000000
        @inline choose_j_kw(func, args)
        loopidx += 1
    end
    tf = time_ns()
    println("The kw loop took $(tf-ti) ns")
    println("Updates per sec: $(loopidx / (tf-ti) * 1e9)")
end

function collectargs(args, j)
    (;state, sparse_matrix) = args
    cumsum = zero(Float64)
    for ptr in nzrange(sparse_matrix, j)
        smij = sparse_matrix.nzval[ptr]
        i = sparse_matrix.rowval[ptr]
        cumsum += state[i] * smij
    end
    return cumsum*(2*state[j])
end

function collectargskw(args; j)
    (;state, sparse_matrix) = args
    cumsum = zero(Float64)
    for ptr in nzrange(sparse_matrix, j)
        smij = sparse_matrix.nzval[ptr]
        i = sparse_matrix.rowval[ptr]
        cumsum += state[i] * smij
    end
    return cumsum*(2*state[j])
end

function indirection(@specialize(func), args, j)
    @inline func(args, j)
end

function indirection_kw(@specialize(func), args; j)
    @inline func(args; j)
end

function choose_j(@specialize(func), args)
    j = rand(1:1000)
    @inline func(args, j)
    # @inline indirection(func, args, j)
end

function choose_j_kw(@specialize(func), args)
    j = rand(1:1000)
    @inline func(args; j)
    # @inline indirection_kw(func, args; j)
end

loop(collectargs, (;state, sparse_matrix))
loopkw(collectargskw, (;state, sparse_matrix))

Straight up running the code gives the following output:

The normal loop took 469143584 ns
Updates per sec: 2.1315438047214136e7

The kw loop took 394801708 ns
Updates per sec: 2.5329173601244908e7

On first run, the one with keyword arguments is actually always faster (why??).

Then, commenting out the direct func calls in choose_j and choose_j_kw and uncommenting the indirection calls gives:

choose_j (generic function with 1 method)

choose_j_kw (generic function with 1 method)

The normal loop took 434693500 ns
Updates per sec: 2.300471711677308e7

The kw loop took 754985083 ns
Updates per sec: 1.3245296132559482e7

Then, reverting to the original code:

choose_j (generic function with 1 method)

choose_j_kw (generic function with 1 method)

The normal loop took 583411750 ns
Updates per sec: 1.7140554676864155e7

The kw loop took 398809166 ns
Updates per sec: 2.5074651869962286e7

What is going on here? The relative performance differences stay virtually constant as long as I don’t redefine the functions.

This is actually not the exact problem I had in my code, but I’m guessing it has a similar origin.

You load BenchmarkTools.jl here but then don’t use it.

I used it at first to try to benchmark the performance differences, but I found that the approach I'm using now better represents what I'm seeing in my own simulations. I forgot to take that line out when copying; it shouldn't affect anything regarding the problem I'm facing, though.

TBH I’d suggest deleting this topic and starting over with more clearly stated questions.

  • with self-contained reproducers, like @danielwe suggested
    • don’t make the reader comment out code or otherwise edit the reproducer(s)
  • if your question is about run-time performance (as opposed to compilation latency, loading latency, etc.), relying on BenchmarkTools.jl or Chairmarks.jl should help your question be taken more seriously

The code I supplied can be run as-is. Also, this doesn't include any compile-time cost: everything is inlined and the timing is done after compilation. I asked people to comment out the code because that is exactly where I'm seeing the anomalous behavior. Also, already on the first run there is a weird performance discrepancy that I cannot explain and would like to understand.

This use case reflects my real usage: I'm running continuous simulations, and I have found real runtime performance discrepancies between different functions I was trying to optimize. I think my question is clear: why are the two functions timed so differently (which is causing real performance issues), and why does that change when the functions are redefined?

I was using BenchmarkTools before, but sometimes it isolates the code in a way that I don't understand and that doesn't actually relate to the performance differences I'm seeing in my actual simulations and experiments.

The issue I showed here is not exactly the issue I was facing, I think, but I haven't been able to reproduce exactly what I saw yet; I'm trying a lot of things to see where the problem is.

In any case, I hoped that getting some insight into the timing differences I posted above would help me pinpoint the problem; I don't see the problem with asking about that.

Performance changing dramatically due to function redefinition: that smells like "the unreliable approximation of `Core.Compiler.return_type`" (Issue #35800 · JuliaLang/julia · GitHub) to me. See the following recent thread for another example exposing this: Type instability in nested quadgk calls.

I haven't run your code yet, but will hopefully get around to that later. In any case, you should note that @specialize(arg) in a function signature doesn't do what you think it does; in fact, it doesn't do anything at all (see Essentials · The Julia Language for details). The way to force specialization is to add a type parameter to the function signature, like function foo(x::T, y) where {T}.
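
For instance, here's a minimal sketch (the name choose_j_forced is just for illustration) of how your reproducer's choose_j could force specialization on func with a type parameter instead of @specialize:

# The type parameter F forces the compiler to specialize on the passed function;
# the body mirrors the original choose_j.
function choose_j_forced(func::F, args) where {F}
    j = rand(1:1000)
    @inline func(args, j)
end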

Thanks for the information! Very helpful. I was aware that ::T where T forces specialization, but I remember reading somewhere that @specialize had the same effect? So you're saying that it doesn't really work?

Also, do you have any idea about the initial performance difference in the code I showed? The keyword-argument version is actually consistently faster, whereas I assumed everything should essentially compile down to the same code.

Not possible when there’s already been discussion.

Manually subtracting time_ns outputs across a single loop and dividing by the iteration count, as if each iteration took exactly the same time, is not an improvement, though you're at least trying to get around timer-resolution issues there. Benchmarking is more difficult than that, and while @time, BenchmarkTools, or whatever benchmarking library all have their limitations, they'll give you a lot more information. BenchmarkTools in particular gives you statistics over many runs, and timings can vary a lot depending on external state you can't really control (CPU cache state, OS scheduling, battery level). 8% discrepancies are well within that noise in practice.
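
For example, a minimal sketch of how the two functions from your reproducer could be compared with BenchmarkTools (assuming the definitions above are already loaded; a fresh random j is drawn in each sample's setup phase):

using BenchmarkTools

args = (; state, sparse_matrix)

# Interpolating args with $ makes the benchmark measure the call itself
# rather than access to a non-constant global.
@benchmark collectargs($args, j) setup=(j=rand(1:1000))
@benchmark collectargskw($args; j=j) setup=(j=rand(1:1000))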

FWIW, trying your procedure gives me these results:

The normal loop took 531823300 ns
Updates per sec: 1.8803239722667284e7
The kw loop took 531271100 ns
Updates per sec: 1.8822783697438087e7

after making the edit that uncomments the indirection calls (which you indeed shouldn't make people do in general, but that's the effect you're trying to measure, so):

The normal loop took 534545100 ns
Updates per sec: 1.8707497271979485e7
The kw loop took 559403900 ns
Updates per sec: 1.7876173190783974e7

after the change back:

The normal loop took 654211100 ns
Updates per sec: 1.5285587480860535e7
The kw loop took 679958300 ns
Updates per sec: 1.4706785695534565e7

So I'm seeing negligible differences. Also note how running the same code a second time took a noticeably different amount of time; typical noise.

Well, I'm not actually assuming every iteration takes the same amount of time, and I'm not even that interested in single-call performance. Timing like this has proven useful for actual experiments I'm running and gives me the actual time somebody has to wait in front of their computer for a full experiment to complete, so unless I'm making some mental misstep somewhere, this is a real performance metric I care about.

Also, the 8% I mentioned was incredibly stable: I can run many experiments that all exhibit this 8 percent on average. Moreover, it was 8 percent of total simulation time while I was only changing a part of the code, so I would say that's significant.

Also, what platform are you running on? I should've mentioned I'm on an M1 Mac, running through VS Code, and I have found these performance discrepancies to be way beyond noise on my system.

I'm still uncertain; I've seen a similar kind of stability before, only to find that it vanishes when I change something I thought was insignificant, such as the order in which I run things. The CPU does a lot of weird stuff.

That could make a difference, though I have no idea how in this case. versioninfo gives you most of the relevant information.

julia> versioninfo()
Julia Version 1.11.4
Commit 8561cc3d68 (2025-03-10 11:36 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, icelake-client)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

Well, I did enough testing in my original simulation to be quite certain that it was stable, but I completely understand your skepticism; I've had similar problems vanish before. Also, I've since changed my original code and I think I've circumvented the problem anyway (not completely sure what did the trick though, which is frustrating, but I needed to get something done, so…)

When I have time I will keep running tests and try to isolate the behavior I was seeing, if possible.

Confirmed that this is still true, even though such a use of @specialize would actually be intuitive to me as well.

julia> foo(f, a, b) = f(a, b) # uses f
foo (generic function with 1 method)

julia> foo(+, 1, 0)
1

julia> methods(foo)[1].specializations
MethodInstance for foo(::typeof(+), ::Int64, ::Int64)

julia> bar(f, a, b) = foo(f, a, b) # passes f
bar (generic function with 1 method)

julia> bar(+, 1, 0)
1

julia> methods(bar)[1].specializations
MethodInstance for bar(::Function, ::Int64, ::Int64)

julia> baz(@specialize(f), a, b) = foo(f, a, b) # passes f
baz (generic function with 1 method)

julia> baz(+, 1, 0)
1

julia> methods(baz)[1].specializations
MethodInstance for baz(::Function, ::Int64, ::Int64)

julia> paz(f::F, a, b) where F = foo(f, a, b) # passes f
paz (generic function with 1 method)

julia> paz(+, 1, 0)
1

julia> methods(paz)[1].specializations
MethodInstance for paz(::typeof(+), ::Int64, ::Int64)

It's worth mentioning that aggressive, possibly automatic, inlining of callees also gets around the Function/Type/Vararg non-specialization heuristic, without contributing call-chain method-specialization bloat (e.g. map). While the runnable example liberally uses @inline, @inline is only a hint and can be declined or impossible to honor; for example, if the compiler doesn't specialize on func and statically dispatch its call, it doesn't know which method body to inline.

Another thing (still haven't had a chance to run the code): adding @inline at call sites (as opposed to method definitions) can be a footgun because it forces inlining even when the method instance cannot be statically resolved, as long as the appropriate method body can be determined (for example, if the function only has a single method; the general idea here is called world splitting). The consequence is that the inlining can break function barriers and lead to more poorly inferred code than the non-inlined alternative. You're potentially replacing a single dynamic dispatch with an entire non-inferred method body.
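
Here's a toy sketch of that failure mode (hypothetical names, not from your code): f has a single method, so a call-site @inline can pull its body into the caller even though the element type of data isn't inferred, whereas the non-inlined call acts as a function barrier.

# data is a Vector{Any}, so data[1] is inferred as Any.
f(x) = x + 1

g_barrier(data) = f(data[1])          # one dynamic dispatch, then f runs specialized code
g_inlined(data) = @inline f(data[1])  # f's body is inlined with x::Any, so x + 1 dispatches dynamically

data = Any[1, 2.0, 3]

In this toy example the difference is negligible, but for a larger method body the inlined, non-inferred version can be much slower than paying for a single dynamic dispatch.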

The takeaway is, unless you really know what you’re doing, avoid @inline at call sites. Add the @inline annotation to the method definition instead. That is, avoid

f() = # ...
g() = @inline f()

and prefer

@inline f() = # ...
g() = f()

Regarding @specialize, it just overrides @nospecialize, so yeah, if there was no @nospecialize it doesn't do anything. That's how I understand the docstring, at least.

I was doing that before actually, only putting @inline at the definition, but I found significant performance improvements from putting it at the call sites in my use case. I'm trying to do flexible simulations where as much as possible is compiled before the loops start, so I'm trading some latency (which is okay since experiments are started relatively infrequently) for runtime speed. I annotated most of my functions with @inline but found that, in practice, many of them were not inlined. After adding @inline at the call sites I saw huge performance gains (I don't remember the exact numbers, but I think that all in all, after inlining everything, they were close to some integer multiple).