Keyword arguments reduce performance

I’m running a simulation where I’m collecting data from a graph represented by a sparse matrix. I have a pretty simple function that suffers performance loss just by making an Int32 typed variable a keyword argument instead of a normal argument. Why is this?

function specific_ham(args, newstate; j::Int32)
    (;params) = args
    adj = args.gadj
    state = args.gstate
    cumsum = zero(Float32)
    for ptr in nzrange(adj, j)
        i = adj.rowval[ptr]
        wij = adj.nzval[ptr]
        cumsum += wij * state[i]
    end
    return (state[j]-newstate) * cumsum + (state[j]^2-newstate^2)*params.self[j] + (state[j]-newstate)*params.b[j]
end
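
For reference, the positional-argument variant I'm comparing against looks roughly like this (the name specific_ham_positional is just for this post; it's the same body with j moved from a keyword to a regular argument):

# Same computation as above, but with j as a positional argument.
function specific_ham_positional(args, newstate, j::Int32)
    (;params) = args
    adj = args.gadj
    state = args.gstate
    cumsum = zero(Float32)
    for ptr in nzrange(adj, j)
        i = adj.rowval[ptr]
        wij = adj.nzval[ptr]
        cumsum += wij * state[i]
    end
    return (state[j]-newstate) * cumsum + (state[j]^2-newstate^2)*params.self[j] + (state[j]-newstate)*params.b[j]
end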

Making j a normal argument increases performance in my case by about 8 percent over my total loop (which also does a few other things). Doing that in this function isn't a problem, but this is a prototype to mimic part of my other code (which has some metaprogramming, where the keyword arguments do come in handy), so I'm trying to understand why I'm getting worse performance here.

However, doing something stupid like this:

specific_hamkw(@specialize(args); @specialize(kwargs...))  = @inline specific_ham(args, (;kwargs...))

function specific_ham(args, kwargs)
  (;j, newstate) = kwargs
  ...
end

and then using specific_hamkw doesn't cause a performance hit. Smells like a bug?

This is on Julia 1.11.4, by the way.

It would be helpful if you provided a full reproducer. That is, something I can paste into my REPL that contains:

  • fast version
  • slow version
  • benchmarking code showing the difference

Okay, it's actually quite weird: I cannot reproduce the problem with straight code, but it somehow happens when I redefine functions (which I was also doing in my own code). Here's the code I'm running, though I'm not sure anymore which exact part is essential to the problem:

using SparseArrays, BenchmarkTools

const state = rand(1000)
const sparse_matrix = sprand(1000, 1000, 0.001)

function loop(func, args)
    loopidx=1
    ti = time_ns()
    for _ in 1:10000000
        @inline choose_j(func, args)
        loopidx += 1
    end
    tf = time_ns()
    println("The normal loop took $(tf-ti) ns")
    println("Updates per sec: $(loopidx / (tf-ti) * 1e9)")
end

function loopkw(func, args)
    loopidx=1
    ti = time_ns()
    for _ in 1:10000000
        @inline choose_j_kw(func, args)
        loopidx += 1
    end
    tf = time_ns()
    println("The kw loop took $(tf-ti) ns")
    println("Updates per sec: $(loopidx / (tf-ti) * 1e9)")
end

function collectargs(args, j)
    (;state, sparse_matrix) = args
    cumsum = zero(Float64)
    for ptr in nzrange(sparse_matrix, j)
        smij = sparse_matrix.nzval[ptr]
        i = sparse_matrix.rowval[ptr]
        cumsum += state[i] * smij
    end
    return cumsum*(2*state[j])
end

function collectargskw(args; j)
    (;state, sparse_matrix) = args
    cumsum = zero(Float64)
    for ptr in nzrange(sparse_matrix, j)
        smij = sparse_matrix.nzval[ptr]
        i = sparse_matrix.rowval[ptr]
        cumsum += state[i] * smij
    end
    return cumsum*(2*state[j])
end

function indirection(@specialize(func), args, j)
    @inline func(args, j)
end

function indirection_kw(@specialize(func), args; j)
    @inline func(args; j)
end

function choose_j(@specialize(func), args)
    j = rand(1:1000)
    @inline func(args, j)
    # @inline indirection(func, args, j)
end

function choose_j_kw(@specialize(func), args)
    j = rand(1:1000)
    @inline func(args; j)
    # @inline indirection_kw(func, args; j)
end

loop(collectargs, (;state, sparse_matrix))
loopkw(collectargskw, (;state, sparse_matrix))

Straight up running the code gives the following output:

The normal loop took 469143584 ns
Updates per sec: 2.1315438047214136e7

The kw loop took 394801708 ns
Updates per sec: 2.5329173601244908e7

On first run, the one with keyword arguments is actually always faster (why??).

Then, commenting out the direct func calls in choose_j and choose_j_kw and uncommenting the indirection calls gives:

choose_j (generic function with 1 method)

choose_j_kw (generic function with 1 method)

The normal loop took 434693500 ns
Updates per sec: 2.300471711677308e7

The kw loop took 754985083 ns
Updates per sec: 1.3245296132559482e7

Then, reverting to the original code:

choose_j (generic function with 1 method)

choose_j_kw (generic function with 1 method)

The normal loop took 583411750 ns
Updates per sec: 1.7140554676864155e7

The kw loop took 398809166 ns
Updates per sec: 2.5074651869962286e7

What is going on here? The relative performance differences stay virtually constant as long as I don’t redefine the functions.

This is actually not the exact problem I had in my code, but I’m guessing it has a similar origin.

You load BenchmarkTools.jl here but then don’t use it.

I used it at first to try to benchmark the performance differences, but I found that the approach I'm using now better represents what I'm seeing in my own simulations. I forgot to take that line out when copying; it shouldn't affect anything regarding the problem I'm facing, though.

TBH I’d suggest deleting this topic and starting over with more clearly stated questions.

  • with self-contained reproducers, like @danielwe suggested
    • don’t make the reader comment out code or otherwise edit the reproducer(s)
  • if your question is about run-time performance (as opposed to compilation latency, loading latency, etc.), relying on BenchmarkTools.jl or Chairmarks.jl should help your question be taken more seriously

The code I supplied can be run as-is. Also, this doesn't include any compile-time cost: everything is inlined and the timing is done after compilation. I asked people to comment out the code because that is exactly where I'm seeing the anomalous behavior. Also, already on the first run there is a weird performance discrepancy that I cannot explain and would like to understand.

This use case reflects my real usage: I'm running continuous simulations, and I have found real runtime performance discrepancies between different functions I was trying to optimize. I think my question is clear: why are the two functions timed so differently (which is causing real performance issues), and why does that change when the functions are redefined?

I was using BenchmarkTools before, but sometimes it isolates the code in a way that I don't understand and that doesn't actually relate to the performance differences I'm seeing in my actual simulations and experiments.

The issue I showed here is not exactly the issue I was facing, I think, but I haven't been able to reproduce exactly what I saw yet; I'm trying a lot of things to see where the problem is.

In any case, I hoped that getting some insight into the timing differences I posted above would help me pinpoint the problem; I don't see the problem with asking about that.

Performance changing dramatically due to function redefinition: that smells like "the unreliable approximation of `Core.Compiler.return_type`" (Issue #35800 · JuliaLang/julia · GitHub) to me. See the following recent thread for another example exposing this: Type instability in nested quadgk calls.

I haven't run your code yet, but will hopefully get around to that later. In any case, you should note that @specialize(arg) in a function signature doesn't do what you think it does; in fact, it doesn't do anything at all (see Essentials · The Julia Language for details). The way to force specialization is to add a type parameter to the function signature, like function foo(x::T, y) where {T}.
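
For instance, here's a minimal sketch (the name choose_j_forced is just for illustration) of how your reproducer's choose_j could force specialization on func with a type parameter instead of @specialize:

# The type parameter F forces the compiler to specialize on the passed function;
# the body mirrors the original choose_j.
function choose_j_forced(func::F, args) where {F}
    j = rand(1:1000)
    @inline func(args, j)
end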

Thanks for the information! Very helpful. I was aware that ::T where T forces specialization, but I remember reading somewhere that @specialize had the same effect? So you're saying that it doesn't really work?

Also, do you have any idea about the initial performance difference in the code I showed? The keyword-argument version is actually consistently faster, whereas I assumed everything should essentially compile down to the same code.

Not possible when there’s already been discussion.

Manually subtracting time_ns outputs across a single loop and dividing by the iteration count, as if each iteration took exactly the same time, is not an improvement, though you're at least trying to get around timer-resolution issues there. Benchmarking is more difficult than that, and while @time, BenchmarkTools, or whatever benchmarking library all have their limitations, they'll give you a lot more information. BenchmarkTools in particular gives you statistics over many runs, and timings can vary a lot depending on external state you can't really control (CPU cache state, OS scheduling, battery level). 8% discrepancies are well within that noise in practice.
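
For example, a minimal sketch of how the two functions from your reproducer could be compared with BenchmarkTools (assuming the definitions above are already loaded; a fresh random j is drawn in each sample's setup phase):

using BenchmarkTools

args = (; state, sparse_matrix)

# Interpolating args with $ makes the benchmark measure the call itself
# rather than access to a non-constant global.
@benchmark collectargs($args, j) setup=(j=rand(1:1000))
@benchmark collectargskw($args; j=j) setup=(j=rand(1:1000))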

FWIW, trying your procedure gives me these results:

The normal loop took 531823300 ns
Updates per sec: 1.8803239722667284e7
The kw loop took 531271100 ns
Updates per sec: 1.8822783697438087e7

after making the edit that uncomments the indirection calls (which you indeed shouldn't make people do in general, but that's the effect you're trying to measure, so):

The normal loop took 534545100 ns
Updates per sec: 1.8707497271979485e7
The kw loop took 559403900 ns
Updates per sec: 1.7876173190783974e7

after the change back:

The normal loop took 654211100 ns
Updates per sec: 1.5285587480860535e7
The kw loop took 679958300 ns
Updates per sec: 1.4706785695534565e7

So I'm seeing negligible differences. Also note how running the same code a second time took a noticeably different amount of time; typical noise.

Well, I'm not actually assuming every iteration takes the same amount of time, and I'm not even that interested in single-call performance. Timing like this has proven useful for actual experiments I'm running and gives me the actual time somebody has to wait in front of their computer for a full experiment to complete, so unless I'm making some mental misstep somewhere, this is a real performance metric I care about.

Also, the 8% I mentioned was incredibly stable: I can run many experiments that all exhibit this 8 percent on average. Moreover, it was 8 percent of total simulation time while I was only changing a part of the code, so I would say that's significant.

Also, what platform are you running on? I should've mentioned I'm on an M1 Mac, running through VS Code, and I have found these performance discrepancies to be way beyond noise on my system.

I'm still uncertain; I've seen a similar kind of stability before, only to find that it vanishes when I change something I thought was insignificant, such as the order in which I run things. The CPU does a lot of weird stuff.

That could make a difference, though I have no idea how in this case. versioninfo gives you most of the relevant information.

julia> versioninfo()
Julia Version 1.11.4
Commit 8561cc3d68 (2025-03-10 11:36 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, icelake-client)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

Well, I did enough testing in my original simulation to be quite certain that it was stable, but I completely understand your skepticism; I've had similar problems vanish before. Also, I've since changed my original code and I think I've circumvented the problem anyway (not completely sure what did the trick though, which is frustrating, but I needed to get something done, so…)

When I have time I will keep running tests and try to isolate the behavior I was seeing, if possible.

Confirmed that this is still true, even though such a use of @specialize would actually be intuitive to me as well.

julia> foo(f, a, b) = f(a, b) # uses f
foo (generic function with 1 method)

julia> foo(+, 1, 0)
1

julia> methods(foo)[1].specializations
MethodInstance for foo(::typeof(+), ::Int64, ::Int64)

julia> bar(f, a, b) = foo(f, a, b) # passes f
bar (generic function with 1 method)

julia> bar(+, 1, 0)
1

julia> methods(bar)[1].specializations
MethodInstance for bar(::Function, ::Int64, ::Int64)

julia> baz(@specialize(f), a, b) = foo(f, a, b) # passes f
baz (generic function with 1 method)

julia> baz(+, 1, 0)
1

julia> methods(baz)[1].specializations
MethodInstance for baz(::Function, ::Int64, ::Int64)

julia> paz(f::F, a, b) where F = foo(f, a, b) # passes f
paz (generic function with 1 method)

julia> paz(+, 1, 0)
1

julia> methods(paz)[1].specializations
MethodInstance for paz(::typeof(+), ::Int64, ::Int64)

It's worth mentioning that aggressive, possibly automatic, inlining of callees also gets around the Function/Type/Vararg non-specialization heuristic, without contributing call-chain method-specialization bloat (e.g. map). While the runnable example liberally uses @inline, @inline is only a hint and can be declined or impossible to honor; for example, if the compiler doesn't specialize on func and statically dispatch its call, it doesn't know which method body to inline.

Another thing (still haven't had a chance to run the code): adding @inline at call sites (as opposed to method definitions) can be a footgun because it forces inlining even when the method instance cannot be statically resolved, as long as the appropriate method body can be determined (for example, if the function only has a single method; the general idea here is called world splitting). The consequence is that the inlining can break function barriers and lead to more poorly inferred code than the non-inlined alternative. You're potentially replacing a single dynamic dispatch with an entire non-inferred method body.
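
Here's a toy sketch of that failure mode (hypothetical names, not from your code): f has a single method, so a call-site @inline can pull its body into the caller even though the element type of data isn't inferred, whereas the non-inlined call acts as a function barrier.

# data is a Vector{Any}, so data[1] is inferred as Any.
f(x) = x + 1

g_barrier(data) = f(data[1])          # one dynamic dispatch, then f runs specialized code
g_inlined(data) = @inline f(data[1])  # f's body is inlined with x::Any, so x + 1 dispatches dynamically

data = Any[1, 2.0, 3]

In this toy example the difference is negligible, but for a larger method body the inlined, non-inferred version can be much slower than paying for a single dynamic dispatch.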

The takeaway is, unless you really know what you’re doing, avoid @inline at call sites. Add the @inline annotation to the method definition instead. That is, avoid

f() = # ...
g() = @inline f()

and prefer

@inline f() = # ...
g() = f()

Regarding @specialize, it just overrides @nospecialize, so yeah, if there was no @nospecialize it doesn't do anything. That's how I understand the docstring, at least.

I was doing that before actually, only putting @inline at the definition, but I found significant performance improvements from putting it at the call sites in my use case. I'm trying to do flexible simulations where as much as possible is compiled before the loops start, so I'm trading some latency (which is okay since experiments are started relatively infrequently) for runtime speed. I annotated most of my functions with @inline but found that, in practice, many of them were not inlined. After adding @inline at the call sites I saw huge performance gains (I don't remember the exact numbers, but I think that all in all, after inlining everything, they were close to some integer multiple).