Inconsistent allocation with threads and function arguments in 1.12

In a large code suite, I’ve noticed substantial performance degradation using v1.12. I’m seeing large numbers of allocations in parts of the code that are threaded and involve passing functions as arguments to other functions (perhaps related to this Performance Tip). I tried varying how the passed functions are defined and how the type of the function-as-argument is declared, and the allocations can change dramatically. This is not true of the same code on v1.11, nor is it true if I remove the Threads.@threads in this section of the code. In the threaded v1.12, I also find that small changes to the function definitions can greatly affect these results (reducing allocations in one call but increasing them in another).

Here’s a minimal example demonstrating the problem. I’ll show the results of demo() for v1.11 and v1.12, with and without the @threads in define_Y!(). The main conclusions:

  • v1.12 threaded shows high variability between function definitions
  • v1.11 threaded shows no variability between function definitions
  • Both versions unthreaded show the same results for all function definitions

MWE

import Pkg; Pkg.activate(; temp=true)
Pkg.add(["BandedMatrices", "FillArrays", "BenchmarkTools"])

using BandedMatrices
using FillArrays
using LinearAlgebra
using BenchmarkTools
using Base.Threads

# --- Problem scaffold -----

# toy "physics" — just something smooth and nontrivial
g(x, X::Vector, c)  = exp(-c*x) * (1 + x) + sum(X) * 1e-6
h(x, X::Vector, c)  = c * exp(-c*x) + x

# even/odd "basis" and their X-derivatives (analytic for the toy basis)
sinm(m::Int, X::Real) = sin(m * π * X)
cosm(m::Int, X::Real) = cos(m * π * X)

D_sinm(m::Int, X::Real) =  m * π * cos(m * π * X)
D_cosm(m::Int, X::Real) = -m * π * sin(m * π * X)

function inner_product(op1, m, f1, op2, f2, _basis, k, X::AbstractVector, _order)
    @inbounds begin
        acc = 0.0
        @fastmath for i in 1:length(X)-1
            x₁ = X[i]; x₂ = X[i+1]
            # integrand: (op1 on mode m * f1) * (op2 on mode k * f2)
            g(x) = (op1(m, x) * f1(x)) * (op2(k, x) * f2(x))
            acc += 0.5 * (x₂ - x₁) * (g(x₁) + g(x₂))  # trapezoid
        end
        return acc
    end
end

# --- Three front-ends that differ only in how they define the callables -------

# (A) Named local functions (your current form)
function define_Y_named(X::Vector, c; order::Union{Nothing,Integer}=5)
    gg(x)  = g(x, X, c)
    hh(x)  = h(x, X, c)

    Y = BandedMatrix(Zeros(2length(X), 2length(X)), (3, 3))
    define_Y!(Y, gg, hh, X, order)

    return Y
end

# (B) Arrow closures capturing X, c
function define_Y_arrow(X::Vector, c; order::Union{Nothing,Integer}=5)
    gg  = x -> g(x, X, c)
    hh  = x -> h(x, X, c)

    Y = BandedMatrix(Zeros(2length(X), 2length(X)), (3, 3))
    define_Y!(Y, gg, hh, X, order)

    return Y
end

# (C) Concrete callable structs (forces concrete callee types)
struct G{X,T};  X::X; c::T; end
struct H{X,T};  X::X; c::T; end
(gg::G)(x)   = g(x,  gg.X,  gg.c)
(hh::H)(x) = h(x, hh.X, hh.c)

function define_Y_functor(X::Vector, c; order::Union{Nothing,Integer}=5)
    gg, hh = G(X, c), H(X, c)

    Y = BandedMatrix(Zeros(2length(X), 2length(X)), (3, 3))
    define_Y!(Y, gg, hh, X, order)

    return Y
end

# --- Threaded fill (parametric on the callables to encourage specialization) ---

function define_Y!(Y::BandedMatrix, gg::F1, hh::F2,
                   X::AbstractVector{<:Real}, order::Union{Nothing,Integer}) where {F1,F2}
    Threads.@threads for m in eachindex(X)
        Me = 2m
        Mo = Me - 1
        Y[Mo, Mo] = inner_product(D_cosm, m, gg, D_cosm, hh, cosm, m, X, order)
        Y[Me, Me] = inner_product(D_sinm, m, gg, D_sinm, hh, sinm, m, X, order)
    end
    return Y
end

# --- Driver / benchmark -------------------------------------------------------

function demo(; N=400)
    X = collect(range(0.0, 1.0; length = N))
    c = 0.7

    define_Y_named(X, c);  define_Y_arrow(X, c);  define_Y_functor(X, c)

    println("\nAllocations / time (Named local functions)")
    @btime define_Y_named($X, $c);

    println("\nAllocations / time (Arrow closures)")
    @btime define_Y_arrow($X, $c);

    println("\nAllocations / time (Callable structs)")
    @btime define_Y_functor($X, $c);
end

v1.12, threaded

Allocations / time (Named local functions)
  20.153 ms (61 allocations: 69.08 KiB)

Allocations / time (Arrow closures)
  20.195 ms (10827 allocations: 348.36 KiB)

Allocations / time (Callable structs)
  20.166 ms (61 allocations: 69.08 KiB)

Installation Information
==========================

Julia Version 1.12.0
Commit b907bd0600f (2025-10-07 15:42 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 12 × Apple M4 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m4)
  GC: Built with stock GC
Threads: 8 default, 1 interactive, 8 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
Status `/private/var/folders/nt/ct_lf2n94_1c21908mmbx55w0000gq/T/jl_HXcSaX/Project.toml`
  [aae01518] BandedMatrices v1.9.5
  [6e4b80f9] BenchmarkTools v1.6.0
  [1a297f60] FillArrays v1.14.0
Status `/private/var/folders/nt/ct_lf2n94_1c21908mmbx55w0000gq/T/jl_HXcSaX/Manifest.toml`
  [4c555306] ArrayLayouts v1.12.0
  [aae01518] BandedMatrices v1.9.5
  [6e4b80f9] BenchmarkTools v1.6.0
  [34da2185] Compat v4.18.1
  [1a297f60] FillArrays v1.14.0
⌅ [682c06a0] JSON v0.21.4
  [69de0a69] Parsers v2.8.3
  [aea7be01] PrecompileTools v1.3.3
  [21216c6a] Preferences v1.5.0
  [90137ffa] StaticArrays v1.9.15
  [1e83bf80] StaticArraysCore v1.4.3
  [10745b16] Statistics v1.11.1
  [56f22d72] Artifacts v1.11.0
  [ade2ca70] Dates v1.11.0
  [8f399da3] Libdl v1.11.0
  [37e2e46d] LinearAlgebra v1.12.0
  [56ddb016] Logging v1.11.0
  [a63ad114] Mmap v1.11.0
  [de0858da] Printf v1.11.0
  [9abbd945] Profile v1.11.0
  [9a3f8284] Random v1.11.0
  [ea8e919c] SHA v0.7.0
  [f489334b] StyledStrings v1.11.0
  [fa267f1f] TOML v1.0.3
  [cf7118a7] UUIDs v1.11.0
  [4ec0a83e] Unicode v1.11.0
  [e66e0078] CompilerSupportLibraries_jll v1.3.0+1
  [4536629a] OpenBLAS_jll v0.3.29+0
  [8e850b90] libblastrampoline_jll v5.13.1+1
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated -m`

v1.12, unthreaded

Allocations / time (Named local functions)
  149.662 ms (3 allocations: 64.08 KiB)

Allocations / time (Arrow closures)
  149.204 ms (3 allocations: 64.08 KiB)

Allocations / time (Callable structs)
  147.845 ms (3 allocations: 64.08 KiB)

Installation Information: identical to the v1.12 threaded run above (same Julia build and package versions; only the temporary environment path differs).

v1.11, threaded

Allocations / time (Named local functions)
  20.179 ms (45 allocations: 68.83 KiB)

Allocations / time (Arrow closures)
  20.220 ms (45 allocations: 68.83 KiB)

Allocations / time (Callable structs)
  20.010 ms (45 allocations: 68.83 KiB)

Installation Information
==========================

Julia Version 1.11.7
Commit f2b3dbda30a (2025-09-08 12:10 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 12 × Apple M4 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto
Status `/private/var/folders/nt/ct_lf2n94_1c21908mmbx55w0000gq/T/jl_M6vZl9/Project.toml`
  [aae01518] BandedMatrices v1.9.5
  [6e4b80f9] BenchmarkTools v1.6.0
  [1a297f60] FillArrays v1.14.0
Status `/private/var/folders/nt/ct_lf2n94_1c21908mmbx55w0000gq/T/jl_M6vZl9/Manifest.toml`
  [4c555306] ArrayLayouts v1.12.0
  [aae01518] BandedMatrices v1.9.5
  [6e4b80f9] BenchmarkTools v1.6.0
  [34da2185] Compat v4.18.1
  [1a297f60] FillArrays v1.14.0
⌅ [682c06a0] JSON v0.21.4
  [69de0a69] Parsers v2.8.3
⌅ [aea7be01] PrecompileTools v1.2.1
  [21216c6a] Preferences v1.5.0
  [90137ffa] StaticArrays v1.9.15
  [1e83bf80] StaticArraysCore v1.4.3
  [10745b16] Statistics v1.11.1
  [56f22d72] Artifacts v1.11.0
  [ade2ca70] Dates v1.11.0
  [8f399da3] Libdl v1.11.0
  [37e2e46d] LinearAlgebra v1.11.0
  [56ddb016] Logging v1.11.0
  [a63ad114] Mmap v1.11.0
  [de0858da] Printf v1.11.0
  [9abbd945] Profile v1.11.0
  [9a3f8284] Random v1.11.0
  [ea8e919c] SHA v0.7.0
  [fa267f1f] TOML v1.0.3
  [cf7118a7] UUIDs v1.11.0
  [4ec0a83e] Unicode v1.11.0
  [e66e0078] CompilerSupportLibraries_jll v1.1.1+0
  [4536629a] OpenBLAS_jll v0.3.27+1
  [8e850b90] libblastrampoline_jll v5.11.0+0
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated -m`

v1.11, unthreaded

Allocations / time (Named local functions)
  145.687 ms (3 allocations: 64.08 KiB)

Allocations / time (Arrow closures)
  146.928 ms (3 allocations: 64.08 KiB)

Allocations / time (Callable structs)
  147.664 ms (3 allocations: 64.08 KiB)

Installation Information: identical to the v1.11 threaded run above (same Julia build and package versions; only the temporary environment path differs).

This is interesting. I can’t reproduce this with 16 threads, only when I run with 10 or fewer threads. I modified the driver slightly to run the benchmarks in random order:

using Random
function demo(; N=400)
    X = collect(range(0.0, 1.0; length = N))
    c = 0.7

    for bm in shuffle!([1,2,3])
        if bm == 1
            println("\nAllocations / time (Named local functions)")
            @btime define_Y_named($X, $c)
        elseif bm == 2
            println("\nAllocations / time (Arrow closures)")
            @btime define_Y_arrow($X, $c)
        else
            println("\nAllocations / time (Callable structs)")
            @btime define_Y_functor($X, $c)
        end
    end

    return nothing
end

Then I run it repeatedly in new sessions. I get different results:

Allocations / time (Named local functions)
6.717 ms (61 allocations: 48.84 KiB)

Allocations / time (Callable structs)
12.357 ms (957645 allocations: 29.27 MiB)

Allocations / time (Arrow closures)
6.733 ms (45 allocations: 47.34 KiB)

Allocations / time (Arrow closures)
7.028 ms (61 allocations: 48.84 KiB)

Allocations / time (Callable structs)
7.067 ms (45 allocations: 47.34 KiB)

Allocations / time (Named local functions)
11.402 ms (957645 allocations: 29.27 MiB)

Allocations / time (Arrow closures)
11.594 ms (957645 allocations: 29.27 MiB)

Allocations / time (Named local functions)
6.614 ms (45 allocations: 47.34 KiB)

Allocations / time (Callable structs)
6.775 ms (45 allocations: 47.34 KiB)

Julia Version 1.12.0
Commit b907bd0600f (2025-10-07 15:42 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 2700X Eight-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver1)
  GC: Built with stock GC
Threads: 8 default, 1 interactive, 8 GC (on 16 virtual cores)

Moreover, if I run demo() repeatedly in the same session, the order changes, but the same variants allocate every time (it can be more than one, or none).
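One way to check this per-session behaviour without BenchmarkTools is to warm up a call and then read `@allocated` directly. A minimal sketch, assuming a hypothetical helper `allocs_after_warmup` that is not part of the thread's code:

```julia
# Hypothetical helper, not from the thread: bytes allocated by one call to
# `thunk()` after a warm-up call has already triggered compilation.
function allocs_after_warmup(thunk::F) where {F}
    thunk()                      # warm-up: compile before measuring
    return @allocated thunk()    # bytes allocated by the second call
end

X = collect(range(0.0, 1.0; length = 400))
bytes = allocs_after_warmup(() -> sum(X))
println("sum(X) allocated ", bytes, " bytes after warm-up")
```

With the MWE loaded, something like `allocs_after_warmup(() -> define_Y_named(X, c))` for each of the three variants, repeated a few times in one session, should show whether the same variants keep allocating.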

In short, this isn’t about how the functions are defined (named, anonymous, or struct), but about compilation, which depends on something I can’t identify. I saw something similar a week ago; I’ll try to find it.

I saw you have a GitHub issue; I think this should continue there. It’s something internal.


Maybe this is related and a workaround:

If you use type parameters for the function arguments of inner_product, the allocations are reduced:

julia> demo() # before
Allocations / time (Named local functions)
5.107 ms (958830 allocations: 14.69 MiB)
Allocations / time (Arrow closures)
4.409 ms (40 allocations: 46.98 KiB)
Allocations / time (Callable structs)
5.273 ms (958830 allocations: 29.30 MiB)
julia> demo() #after
Allocations / time (Named local functions)
4.422 ms (1230 allocations: 77.29 KiB)
Allocations / time (Arrow closures)
4.421 ms (40 allocations: 46.98 KiB)
Allocations / time (Callable structs)
4.410 ms (40 allocations: 46.98 KiB)

where the change was:

function inner_product(
    op1::F1, m, f1::F2, op2::F3, f2::F4, _basis, k, X::AbstractVector, _order
) where {F1,F2,F3,F4}
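For context, the Julia manual notes that the compiler may skip specializing a method on a function-typed argument that is only passed through to another call, and that adding a type parameter forces specialization. The sketch below shows that pattern on a stand-alone trapezoid rule. The names `trapz_product` and `integrate_pair` are hypothetical and are not code from this thread.

```julia
# Hypothetical stand-alone sketch; not the thread's inner_product.
# Kernel: trapezoid rule for the integral of f1(x) * f2(x) over the grid X.
function trapz_product(f1::F1, f2::F2, X::AbstractVector) where {F1,F2}
    acc = 0.0
    for i in 1:length(X)-1
        x₁ = X[i]; x₂ = X[i+1]
        acc += 0.5 * (x₂ - x₁) * (f1(x₁) * f2(x₁) + f1(x₂) * f2(x₂))
    end
    return acc
end

# Wrapper that only passes the callables through. The `where {F1,F2}`
# annotation asks Julia to specialize on their concrete types anyway.
integrate_pair(f1::F1, f2::F2, X) where {F1,F2} = trapz_product(f1, f2, X)

X = collect(range(0.0, 1.0; length = 101))
val = integrate_pair(x -> 1.0, x -> x, X)   # approximates the integral of x on [0, 1]
println(val)
```

Whether this is the mechanism behind the allocations in the MWE is not established here; the sketch only shows what the signature change does in general.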

Then, if we interpolate the function names into the threaded construct, using @spawn, the number of allocations becomes similar for all runs:

julia> demo()
Allocations / time (Named local functions)
4.438 ms (2416 allocations: 232.92 KiB)
Allocations / time (Arrow closures)
4.433 ms (2416 allocations: 232.92 KiB)
Allocations / time (Callable structs)
4.433 ms (2416 allocations: 232.92 KiB)
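The code for this intermediate step is not shown in the post. As a rough sketch of the pattern (one task per iteration, with the callable interpolated via `$` so that its value is copied into the task's closure), here is a hypothetical `spawn_map!` rather than the thread's `define_Y!`:

```julia
using Base.Threads

# Hypothetical example: apply `f` to every element of X, one task per index.
function spawn_map!(out::AbstractVector, f::F, X::AbstractVector) where {F}
    @sync for i in eachindex(out, X)
        # `$f` interpolates the current value of `f` into the spawned closure
        Threads.@spawn out[i] = $f(X[i])
    end
    return out
end

X = collect(range(0.0, 1.0; length = 8))
out = spawn_map!(similar(X), x -> x^2, X)
```

Each spawned task has its own allocation cost, so with this layout the allocation count grows with the number of iterations.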

The number of allocations is 2416 because of the many tasks being spawned. Reducing that by chunking, we get:

julia> demo()
Allocations / time (Named local functions)
4.420 ms (37 allocations: 46.52 KiB)
Allocations / time (Arrow closures)
4.423 ms (37 allocations: 46.52 KiB)
Allocations / time (Callable structs)
4.423 ms (37 allocations: 46.52 KiB)

with:

using ChunkSplitters: index_chunks
function define_Y!(Y::BandedMatrix, gg::F1, hh::F2,
                   X::AbstractVector{<:Real}, order::Union{Nothing,Integer}) where {F1,F2}
    Threads.@sync for mchunk in index_chunks(X; n=Threads.nthreads())
        Threads.@spawn for m in mchunk
            Me = 2m
            Mo = Me - 1
            Y[Mo, Mo] = inner_product($D_cosm, m, $gg, $D_cosm, $hh, cosm, m, X, order)
            Y[Me, Me] = inner_product($D_sinm, m, $gg, $D_sinm, $hh, sinm, m, X, order)
        end
    end
    return Y
end
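If adding ChunkSplitters is not an option, similar chunking can be approximated with `Iterators.partition` from Base. This is a sketch under that assumption, using a hypothetical `chunked_map!`; it is not the code that produced the timings above.

```julia
using Base.Threads

# Hypothetical example: apply `f` to every element of X, one task per chunk.
function chunked_map!(out::AbstractVector, f::F, X::AbstractVector) where {F}
    # Roughly one chunk per thread; at least one index per chunk.
    chunk_len = max(1, cld(length(X), Threads.nthreads()))
    @sync for chunk in Iterators.partition(eachindex(out, X), chunk_len)
        Threads.@spawn for i in chunk
            # `$f` copies the callable into the task's closure
            out[i] = $f(X[i])
        end
    end
    return out
end

X = collect(range(0.0, 1.0; length = 400))
out = chunked_map!(similar(X), x -> 2x + 1, X)
```

The number of tasks is then bounded by roughly the thread count instead of the loop length.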

Thus, I think the issues are both some non-specialization associated with passing functions as arguments and the captured-variable issue in the threads construct.

I think the non-deterministic behaviour in the original example points to something more sinister that deserves to be investigated. Let’s see if some compiler-internals people on GitHub find something, though it may take time.

Thanks @lmiq, this is useful information. In my full code, I had suspicions it was related to a lack of specialization when passing functions-as-arguments through to other functions, but this runs deeply into my code base and the nondeterminism made it difficult to debug. Looks like interpolation solves that aspect. Still, as @sgaure says, I think this points to something deeper in the compiler logic.

I agree we should move this to the GitHub Issue so the conversation is all in one place. I’ll copy @lmiq’s comments there.