Where is this one allocation coming from?

I have to run functions like this many times. I’m interested in reducing allocations and don’t see where that one allocation is coming from.

using BenchmarkTools
using ComponentArrays
using UnPack

α0 = rand(2, 2, 5)
α = rand(3)

w = rand(2, 9, 2, 2, 5, 10)
P_1 = rand(5, 10)
eb = rand(5, 10)

data = ComponentArray{Float64}(; w, P_1, eb)
θ = ComponentArray{Float64}(; α0, α)

@views function U(g, a, e, k, t; data=data, θ=θ)
    # params
    @unpack α0, α = θ

    # data
    @unpack w, P_1, eb = data

    w1 = w[g, a, e, 1, k, t]

    α0[g, e, k] + α[1] * w1 + α[2] * (e - eb[k, t])^2 + α[3] * P_1[k, t]
end
julia> @benchmark U(1, 5, 2, 2, 1; data=$data, θ=$θ)
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  22.744 ns …  1.173 μs  ┊ GC (min … max): 0.00% … 95.75%
 Time  (median):     25.942 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.687 ns ± 20.010 ns  ┊ GC (mean ± σ):  1.15% ±  1.66%

  ▃▃▇▇█▇▅▂▁   ▁  ▁▁                                           ▂
  █████████████████▇▇▇▅▅▇▇▆▇▇▇▇▆▇▇▅▅▅▄▅▄▄▃▅▆▆▅▅▅▄▄▄▅▄▄▄▄▂▅▄▅▄ █
  22.7 ns      Histogram: log(frequency) by time      64.3 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

Is the ComponentArray type parameterized by the types of its contents? If not, there will be a type instability when you access the components.

Turns out that it is parameterized, so that’s not the issue!
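For anyone following along, the same principle can be illustrated with a plain NamedTuple (a Base-only sketch, not the thread's actual types): when a container's field names and types are part of its type, field access is inferable at compile time.

```julia
# Base-only analogy: a NamedTuple encodes its field names and types in
# its own type, so getproperty can be resolved at compile time. A fully
# parameterized ComponentArray is inferable for the same reason.
nt = (α0 = rand(2, 2, 5), α = rand(3))
get_α(x) = x.α                      # field lookup, resolvable from typeof(x)

@assert isconcretetype(typeof(nt))
@assert Base.return_types(get_α, (typeof(nt),))[1] == Vector{Float64}
```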


Yeah, I edited the post above and still see the allocation.

I get no allocations; are you on Julia v1.7?

julia> @benchmark U(1, 5, 2, 2, 1; data=$data, θ=$θ)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  11.752 ns … 17.496 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.763 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.778 ns ±  0.209 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▇█▁                                                     
  ▃█▆▃▄███▄▃▂▂▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  11.8 ns         Histogram: frequency by time        11.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

What do you get from

@code_warntype U(1, 5, 2, 2, 1; data=data, θ=θ)

?
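(General background in case the output looks noisy: a keyword-argument call like this goes through a hidden kwarg-sorting wrapper, so @code_warntype shows haskey/getindex plumbing before the body. A quick Base-only way to confirm that a kwarg function's body is still inferable, sketched with a made-up h:)

```julia
h(x; scale = 2.0) = scale * x[1]   # made-up example with a keyword argument

# Inference sees through the kwarg sorter: if a concrete return type
# comes back, the kwarg plumbing is not a type-stability problem.
@assert Base.return_types(h, (Vector{Float64},))[1] == Float64
```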

I am on 1.7:

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

julia> @benchmark U(1, 5, 2, 2, 1; data=$data, θ=$θ)
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  23.635 ns … 680.684 ns  ┊ GC (min … max): 0.00% … 96.10%
 Time  (median):     25.550 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.063 ns ±  12.358 ns  ┊ GC (mean ± σ):  0.69% ±  1.66%

    █▇▆▄▂                                                      ▁
  ▇███████▇▆▅▅▅▄▂▃▄▃▃▃▂▃▂▂▄▃██▇▅▃▇▇▆▆█▇▆▇▆▆▄▃▄▄▃▂▃▄▄▃▄▃▃▄▃▂▃▅▆ █
  23.6 ns       Histogram: log(frequency) by time      58.7 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

Here is the @code_warntype output:

julia> @code_warntype U(1, 5, 2, 2, 1; data=data, θ=θ)
MethodInstance for (::var"#U##kw")(::NamedTuple{(:data, :θ), Tuple{ComponentVector{Float64, Vector{Float64}, Tuple{Axis{(w = ViewAxis(1:3600, ShapedAxis((2, 9, 2, 2, 5, 10), NamedTuple())), P_1 = ViewAxis(3601:3650, ShapedAxis((5, 10), NamedTuple())), eb = ViewAxis(3651:3700, ShapedAxis((5, 10), NamedTuple())))}}}, ComponentVector{Float64, Vector{Float64}, Tuple{Axis{(α0 = ViewAxis(1:20, ShapedAxis((2, 2, 5), NamedTuple())), α = 21:23)}}}}}, ::typeof(U), ::Int64, ::Int64, ::Int64, ::Int64, ::Int64)
  from (::var"#U##kw")(::Any, ::typeof(U), g, a, e, k, t) in Main at REPL[11]:1
Arguments
  _::Core.Const(var"#U##kw"())
  @_2::NamedTuple{(:data, :θ), Tuple{ComponentVector{Float64, Vector{Float64}, Tuple{Axis{(w = ViewAxis(1:3600, ShapedAxis((2, 9, 2, 2, 5, 10), NamedTuple())), P_1 = ViewAxis(3601:3650, ShapedAxis((5, 10), NamedTuple())), eb = ViewAxis(3651:3700, ShapedAxis((5, 10), NamedTuple())))}}}, ComponentVector{Float64, Vector{Float64}, Tuple{Axis{(α0 = ViewAxis(1:20, ShapedAxis((2, 2, 5), NamedTuple())), α = 21:23)}}}}}
  @_3::Core.Const(U)
  g::Int64
  a::Int64
  e::Int64
  k::Int64
  t::Int64
Locals
  data::ComponentVector{Float64, Vector{Float64}, Tuple{Axis{(w = ViewAxis(1:3600, ShapedAxis((2, 9, 2, 2, 5, 10), NamedTuple())), P_1 = ViewAxis(3601:3650, ShapedAxis((5, 10), NamedTuple())), eb = ViewAxis(3651:3700, ShapedAxis((5, 10), NamedTuple())))}}}
  θ::ComponentVector{Float64, Vector{Float64}, Tuple{Axis{(α0 = ViewAxis(1:20, ShapedAxis((2, 2, 5), NamedTuple())), α = 21:23)}}}
  @_11::ComponentVector{Float64, Vector{Float64}, Tuple{Axis{(w = ViewAxis(1:3600, ShapedAxis((2, 9, 2, 2, 5, 10), NamedTuple())), P_1 = ViewAxis(3601:3650, ShapedAxis((5, 10), NamedTuple())), eb = ViewAxis(3651:3700, ShapedAxis((5, 10), NamedTuple())))}}}
  @_12::ComponentVector{Float64, Vector{Float64}, Tuple{Axis{(α0 = ViewAxis(1:20, ShapedAxis((2, 2, 5), NamedTuple())), α = 21:23)}}}
Body::Float64
1 ─ %1  = Base.haskey(@_2, :data)::Core.Const(true)
│         Core.typeassert(%1, Core.Bool)
│         (@_11 = Base.getindex(@_2, :data))
└──       goto #3
2 ─       Core.Const(:(@_11 = Main.data))
3 ┄ %6  = @_11::ComponentVector{Float64, Vector{Float64}, Tuple{Axis{(w = ViewAxis(1:3600, ShapedAxis((2, 9, 2, 2, 5, 10), NamedTuple())), P_1 = ViewAxis(3601:3650, ShapedAxis((5, 10), NamedTuple())), eb = ViewAxis(3651:3700, ShapedAxis((5, 10), NamedTuple())))}}}
│         (data = %6)
│   %8  = Base.haskey(@_2, :θ)::Core.Const(true)
│         Core.typeassert(%8, Core.Bool)
│         (@_12 = Base.getindex(@_2, :θ))
└──       goto #5
4 ─       Core.Const(:(@_12 = Main.θ))
5 ┄ %13 = @_12::ComponentVector{Float64, Vector{Float64}, Tuple{Axis{(α0 = ViewAxis(1:20, ShapedAxis((2, 2, 5), NamedTuple())), α = 21:23)}}}
│         (θ = %13)
│   %15 = (:data, :θ)::Core.Const((:data, :θ))
│   %16 = Core.apply_type(Core.NamedTuple, %15)::Core.Const(NamedTuple{(:data, :θ)})
│   %17 = Base.structdiff(@_2, %16)::Core.Const(NamedTuple())
│   %18 = Base.pairs(%17)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
│   %19 = Base.isempty(%18)::Core.Const(true)
│         Core.typeassert(%19, Core.Bool)
└──       goto #7
6 ─       Core.Const(:(Base.kwerr(@_2, @_3, g, a, e, k, t)))
7 ┄ %23 = Main.:(var"#U#1")(data, θ, @_3, g, a, e, k, t)::Float64
└──       return %23

That looks pretty much identical to mine; not sure what differs, then. You could try wrapping the call to U in another function to see if it’s a benchmarking artifact or something.

Wrapping U is precisely how I noticed the problem:

function sumU(g, a, e, t; data=data, θ=θ)
    sum(U(g, a, e, k, t; data=data, θ=θ) for k in 1:5)
end
julia> @benchmark sumU(1, 5, 2, 1; data=$data, θ=$θ)
BenchmarkTools.Trial: 10000 samples with 825 evaluations.
 Range (min … max):  134.985 ns …   4.770 μs  ┊ GC (min … max):  0.00% … 95.69%
 Time  (median):     150.855 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   172.783 ns ± 286.681 ns  ┊ GC (mean ± σ):  12.38% ±  7.18%

           ▅█▁                                                   
  ▁▃█▇▃▂▂▂▃███▄▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  135 ns           Histogram: frequency by time          231 ns <

 Memory estimate: 320 bytes, allocs estimate: 7.

Could it be something really bad, like a problem with my processor? I don’t immediately have access to another computer to try it on.
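As an aside on sumU itself (my speculation, not something established in the thread): the generator passed to sum closes over the keyword arguments, and on 1.7 such closures can sometimes defeat inference, whereas an explicit loop never builds the intermediate object. A self-contained sketch with a made-up stand-in f:

```julia
f(k; c = 2.0) = c * k                 # hypothetical stand-in for U

# Generator version: sum over a Generator wrapping a closure.
sum_gen(; c = 2.0) = sum(f(k; c = c) for k in 1:5)

# Loop version: plain accumulation, no intermediate closure object.
function sum_loop(; c = 2.0)
    s = 0.0
    for k in 1:5
        s += f(k; c = c)
    end
    return s
end

@assert sum_gen() == sum_loop() == 30.0
```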

This is strange. If I declare data and θ as constants and don’t interpolate them, the allocation goes away:

using BenchmarkTools
using ComponentArrays
using UnPack

α0 = rand(2, 2, 5)
α = rand(3)

w = rand(2, 9, 2, 2, 5, 10)
P_1 = rand(5, 10)
eb = rand(5, 10)

const data = ComponentArray{Float64}(; w, P_1, eb)
const θ = ComponentArray{Float64}(; α0, α)

@views function U(g, a, e, k, t; data=data, θ=θ)
    # params
    @unpack α0, α = θ

    # data
    @unpack w, P_1, eb = data

    w1 = w[g, a, e, 1, k, t]

    α0[g, e, k] + α[1] * w1 + α[2] * (e - eb[k, t])^2 + α[3] * P_1[k, t]
end
julia> @benchmark U(1, 5, 2, 2, 1; data=$data, θ=$θ)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  22.362 ns … 809.574 ns  ┊ GC (min … max): 0.00% … 94.48%
 Time  (median):     24.205 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.862 ns ±  14.257 ns  ┊ GC (mean ± σ):  0.85% ±  1.62%

  ▄▇██▇▆▆▆▅▄▂▁▁▁      ▁▁▁▁▁▁                                   ▃
  ████████████████▇▇█▇███████▇▇▇▇▇▇▇▅▅▆▆▆▆▇▇▇▇▅▆▅▆▅▃▁▅▃▃▄▃▁▁▆▅ █
  22.4 ns       Histogram: log(frequency) by time      52.5 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

but

julia> @benchmark U(1, 5, 2, 2, 1; data=data, θ=θ)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  10.298 ns … 76.475 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.642 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.481 ns ±  3.589 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▂▂▁▁▂▁▁▁▂▃ ▃▂                                             ▂
  ████████████▇███▅▆▇▆▆▆▆▆▄▅▅▅▄▅▄▄▄▄▄▁▅▁▄▅▄▄▄▇█▆▄▃▄▄▄▄▅▁▄▄▁▄▅ █
  10.3 ns      Histogram: log(frequency) by time      29.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
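The reason const matters, sketched in isolation (these names are made up, not from the snippet above): a const global has a fixed type the compiler can rely on, while a plain global may be rebound to anything and is inferred as Any on 1.7.

```julia
const CW = rand(3)    # const global: its type can never change
W = rand(3)           # plain global: could be rebound to any type

read_const() = CW[1]  # inferable: CW is known to be a Vector{Float64}
read_plain() = W[1]   # on 1.7, W is typed Any inside the function body

@assert Base.return_types(read_const)[1] == Float64
@assert read_plain() isa Float64   # correct at runtime, just not inferable
```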

I don’t think so; the number of allocations is determined by the compiler, not by the processor. The compiler could emit different instructions based on the processor model, though; I have no idea if that’s what’s happening here.

I would try to benchmark the function as it appears in your real application, the problem might still be a benchmarking artifact.

Is this a bug in BenchmarkTools?

Just so you don’t think it is machine-specific: I noticed this

  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz

Here’s mine

Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

and I get the same allocation

julia> @benchmark U(1, 5, 2, 2, 1; data=$data, θ=$θ)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  28.153 ns …  2.754 μs  ┊ GC (min … max): 0.00% … 97.95%
 Time  (median):     31.552 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.399 ns ± 46.441 ns  ┊ GC (mean ± σ):  2.46% ±  1.70%

       █▆                            ▁
  ▃▅▄▄███▇▄▃▄▄▄▃▃▃▃▃▃▃▃▃▄▄▄▄▅▆▇▅▄▃▃▃▅██▅▃▂▃▃▄▄▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂ ▃
  28.2 ns         Histogram: frequency by time        38.1 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.


I don’t get any allocation, with:

Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

julia> @benchmark U(1, 5, 2, 2, 1; data=$data, θ=$θ)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):   8.541 ns … 44.842 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.331 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.246 ns ±  2.191 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ▂▁                                                       
  █▁▁███▃▁▇▃▅▂▄▂▁▂▂▁▃▁▁▅▂▁▃▁▁▂▃▁▂▂▁▂▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  8.54 ns         Histogram: frequency by time        17.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.


Optimization level maybe?

If I make all globals const, allocations go to 0 for me.

What happens if instead of using ComponentArrays you just use named tuples? I.e.:

julia> θ = (α0 = α0, α = α);

julia> data = (w = w, P_1 = P_1, eb = eb);

julia> @benchmark U(1, 5, 2, 2, 1; data=$data, θ=$θ)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  6.535 ns … 40.737 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.278 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.945 ns ±  1.801 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      █  ▅                                                    
  ▄▁▃▁█▄▂█▂▅▃▁▅▂▁▅▁▁▃▁▁▃▂▁▂▄▁▁▃▃▁▂▅▄▁▂▂▆▂▂▁▃▂▁▂▁▃▁▁▁▁▃▃▁▁▁▁▃ ▃
  6.53 ns        Histogram: frequency by time          11 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

(I get 0 allocations in all cases, so it’s hard to test anything here.)

Well, I can answer my own question: on another machine I have (also Linux, also 1.7.0), I get the allocation, with both ComponentArrays and tuples.


Minimal example I could come up with:

julia> a=[1.0]; b=(1.0,);

julia> U(x, y) = x[1];

julia> @btime U($a, $b)
  14.108 ns (1 allocation: 16 bytes)
1.0

I have tested on 4 different machines; only on one of them, and only on 1.7.0 (not 1.6), does the allocation occur.

I filed an issue here: strange allocation in 1.7.0 (machine dependent) · Issue #43362 · JuliaLang/julia · GitHub
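A Base-only cross-check that takes BenchmarkTools out of the picture entirely (a sketch using the same toy definitions as above):

```julia
a = [1.0]; b = (1.0,)
U(x, y) = x[1]

# @allocated measured from inside an already-compiled function skips
# any per-sample machinery a benchmarking harness might add.
check(x, y) = @allocated U(x, y)
check(a, b)               # first call compiles check (and U)
@assert check(a, b) == 0  # steady state: the call itself allocates nothing
```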


Is this also related?

using BenchmarkTools

const szAge = 64 - 19 + 1
const szE = 2
const β = 0.92
const δ = 0.10

maxT(am, af; Z=realage(szAge)) = Z - max(am, af) + 1

realage(x) = x + 18
realageinv(x) = x - 18

function insideproduct_num(am, em, af, ef; μ0m=μ0m, μ0f=μ0f)
    exponent = 0.5*(β*(1 - δ))
    power = 1.0
    p = one(promote_type(eltype(μ0m), eltype(μ0f)))
    for k in 0:maxT(realage(am), realage(af))-1
        p *= abs(μ0m[am + k, em] * μ0f[af + k, ef])^power
        power *= exponent
    end
    return p
end

function insideproduct_den(am, em, af, ef; Sm=Sm, Sf=Sf)
    exponent = 0.5*(β*(1 - δ))
    power = 1.0
    p = one(promote_type(eltype(Sm), eltype(Sf)))
    for k in 0:maxT(realage(am), realage(af))-1
        p *= abs(Sm[am + k, em] * Sf[af + k, ef])^power
        power *= exponent
    end
    return p
end

function fillauxmat!(auxmat; Π=Π, Sm=Sm, Sf=Sf, μ0m=μ0m, μ0f=μ0f)
    for ef in 1:szE, af in 1:szAge, em in 1:szE, am in 1:szAge
        auxmat[am, em, af, ef] = Π[am, em, af, ef] * sqrt(Sm[am, em]*Sf[af, ef]) /
                insideproduct_den(am, em, af, ef; Sm=Sm, Sf=Sf)
    end
    return auxmat
end

function MMsysM!(res, μ0m; auxmat=auxmat, Sm=Sm, μ0f=μ0f)
    for em in 1:szE, am in 1:szAge
        s = zero(eltype(μ0m))
        for ef in 1:szE, af in 1:szAge
            s += auxmat[am, em, af, ef] * insideproduct_num(am, em, af, ef; μ0m=μ0m, μ0f=μ0f)
        end
        res[am, em] = Sm[am, em] - μ0m[am, em] - s
    end
    return res
end

function MMsysF!(res, μ0f; auxmat=auxmat, Sf=Sf, μ0m=μ0m)
    for ef in 1:szE, af in 1:szAge
        s = zero(eltype(μ0f))
        for em in 1:szE, am in 1:szAge
            s += auxmat[am, em, af, ef] * insideproduct_num(am, em, af, ef; μ0m=μ0m, μ0f=μ0f)
        end
        res[af, ef] = Sf[af, ef] - μ0f[af, ef] - s
    end
    return res
end

function fullsys!(resm, resf; auxmat=auxmat, Sm=Sm, Sf=Sf, μ0m=μ0m, μ0f=μ0f)
    fillauxmat!(auxmat; Π=Π, Sm=Sm, Sf=Sf, μ0m=μ0m, μ0f=μ0f)

    MMsysM!(resm, μ0m; auxmat=auxmat, Sm=Sm, μ0f=μ0f)
    MMsysF!(resf, μ0f; auxmat=auxmat, Sf=Sf, μ0m=μ0m)
    return resm, resf
end

Π = 1000 .* rand(szAge, szE, szAge, szE)
auxmat = similar(Π)

Sm = 1000 .* rand(szAge, szE)
Sf = 1000 .* rand(szAge, szE)

μ0m = rand(szAge, szE)
μ0f = rand(szAge, szE)

resm = similar(μ0m)
resf = similar(μ0f)

If I benchmark fillauxmat!, MMsysM!, and MMsysF! individually, I get no allocations, but benchmarking fullsys! yields 2 allocations:

julia> @benchmark fillauxmat!($auxmat; Π=$Π, Sm=$Sm, Sf=$Sf, μ0m=$μ0m, μ0f=$μ0f)
BenchmarkTools.Trial: 1870 samples with 1 evaluation.
 Range (min … max):  2.307 ms …   3.706 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.631 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.670 ms ± 164.151 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▁     ▄▁ ▁▁ █▃▅▄▅▅▄▅▄▄▄▃▄▃▂▃▃▂▁▂▁▁▁▁▁▁ ▁                ▁
  ▅▁▄▄██▄█▆███████████████████████████████████▇▇▇▇▇▇▅▆▆▅▄▆▁▆▆ █
  2.31 ms      Histogram: log(frequency) by time      3.21 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark MMsysF!($resf, $μ0f; auxmat=$auxmat, Sf=$Sf, μ0m=$μ0m)
BenchmarkTools.Trial: 1898 samples with 1 evaluation.
 Range (min … max):  2.271 ms …   3.571 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.595 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.631 ms ± 155.933 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                 █  ▁ ▁                                        
  ▂▁▂▁▂▃▂▂▂▃▆▃▃▃▅█▇▄█▇█▇▇▇▅▆▅▅▄▅▄▃▄▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  2.27 ms         Histogram: frequency by time         3.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark MMsysM!($resm, $μ0m; auxmat=$auxmat, Sm=$Sm, μ0f=$μ0f)
BenchmarkTools.Trial: 1843 samples with 1 evaluation.
 Range (min … max):  2.276 ms …   5.752 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.641 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.708 ms ± 222.042 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▅▆ ▄██▄▄▁▃                                       
  ▂▂▁▁▃▂▂▂▄▃▃▄██▆████████▇▆▆▅▅▄▅▅▃▃▃▄▄▄▃▄▄▄▄▄▄▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂ ▄
  2.28 ms         Histogram: frequency by time        3.39 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark fullsys!($resm, $resf; auxmat=$auxmat, Sm=$Sm, Sf=$Sf, μ0m=$μ0m, μ0f=$μ0f)
BenchmarkTools.Trial: 631 samples with 1 evaluation.
 Range (min … max):  7.043 ms …   9.923 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.879 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.919 ms ± 333.778 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▁ ▄▇▃▅█▄▅▆▄▃▄▃▁▂                             
  ▃▁▁▁▂▃▂▁▅▄▃▃▄▄▇▇█▇██████████████▇▆▇▇▅▅▇▅▃▃▄▃▂▃▄▂▂▁▃▁▂▃▁▂▃▂▂ ▄
  7.04 ms         Histogram: frequency by time        9.02 ms <

 Memory estimate: 96 bytes, allocs estimate: 2.

I am on this system

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 6

(apologies for the maximal working example)
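One thing that might explain the 2 allocations (a guess on my part, not confirmed anywhere above): inside fullsys!, the call fillauxmat!(auxmat; Π=Π, ...) reads Π from global scope, since fullsys! has no Π keyword of its own and Π is not const. Reading a non-const global inside a function is untyped on 1.7. A minimal stand-in for the pattern:

```julia
Πg = rand(2, 2)                      # stand-in for the non-const global Π
consume(A) = sum(A)

full_like(x) = x + consume(Πg)       # reads the untyped global (Any on 1.7)
full_fixed(x, Π) = x + consume(Π)    # Π arrives with a concrete type

@assert Base.return_types(full_fixed, (Float64, Matrix{Float64}))[1] == Float64
@assert full_like(1.0) isa Float64   # still correct at runtime, just slower
```

If that is the cause, threading Π through fullsys! as a keyword argument (and interpolating it in the benchmark) should remove the remaining allocations.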