ForwardDiff.hessian! with StaticArrays, unexpected allocations and performance

I am struggling with unexpected allocations and poor performance when computing Hessians with ForwardDiff.jl. Here’s an MWE in a clean REPL on Julia 1.10.3:

julia> using DiffResults, ForwardDiff, StaticArrays, BenchmarkTools

julia> g = r -> (r[1]^2 - 3) * (r[2]^2 - 2);

julia> x = SA_F32[0.5, 2.7];

julia> hres = DiffResults.HessianResult(x);

julia> @btime ForwardDiff.hessian!($hres, $g, $x)
  68.948 ns (1 allocation: 80 bytes)
ImmutableDiffResult(-14.547502, (Float32[5.2900004, -14.85], Float32[10.580001 5.4; 5.4 -5.5]))

The allocation is unexpected. It seems to happen inside hess = extract_jacobian(T, partials(T, fd2), x), where hess is returned as a Matrix instead of an SMatrix and only afterwards converted to an SMatrix. It looks like partials(T, fd2) is not statically sized, so the similar call inside extract_jacobian is forced to allocate a Matrix.
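One possible way to sidestep hessian! while investigating is to collect the value, gradient and Hessian from three separate out-of-place calls. The following is only a sketch: value_gradient_hessian is a hypothetical helper, not part of ForwardDiff or DiffResults, and it evaluates the function redundantly. Whether it is allocation-free on a given setup should be checked with @btime, since the timings in this thread vary between sessions.

```julia
using ForwardDiff, StaticArrays

# Hypothetical helper (not a ForwardDiff API): gather value, gradient and
# Hessian from separate out-of-place calls instead of one hessian! call.
function value_gradient_hessian(f, x::StaticVector)
    val  = f(x)                        # plain function evaluation
    grad = ForwardDiff.gradient(f, x)  # out-of-place gradient
    hess = ForwardDiff.hessian(f, x)   # out-of-place Hessian (an SMatrix in the outputs above)
    return val, grad, hess
end

g = r -> (r[1]^2 - 3) * (r[2]^2 - 2)
x = SA_F32[0.5, 2.7]

val, grad, hess = value_gradient_hessian(g, x)
```

The trade-off is extra work per call in exchange for never going through the DiffResults container.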

I also tried the regular, non-mutating hessian. Curiously, at first I got zero allocations and a runtime of about 5 ns, but for reasons I cannot explain it now takes several hundred times longer and allocates significantly:

julia> @btime ForwardDiff.hessian($g, $x)
  2.878 μs (9 allocations: 208 bytes)
2×2 SMatrix{2, 2, Float32, 4} with indices SOneTo(2)×SOneTo(2):
 10.58   5.4
  5.4   -5.5

Compare this with

julia> @btime ForwardDiff.gradient($g, $x)
  3.100 ns (0 allocations: 0 bytes)

This is on Windows with Julia 1.10.3, but it also happened on 1.10.2. I am particularly confused about why ForwardDiff.hessian suddenly jumped from 5 ns to about 3 μs, but my actual use case is ForwardDiff.hessian!.
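A minimal sketch of how the slow and fast sessions could be compared, assuming the same g and x as in the MWE. It prints which hessian method is dispatched, what return type inference reports, and how many bytes one call allocates when measured inside a function. count_bytes is a hypothetical helper introduced only for this illustration.

```julia
using ForwardDiff, StaticArrays, InteractiveUtils

g = r -> (r[1]^2 - 3) * (r[2]^2 - 2)
x = SA_F32[0.5, 2.7]

# Which method of hessian handles a static input in this session?
m = @which ForwardDiff.hessian(g, x)
println(m)

# Inferred return type(s); a non-concrete type here would hint at a type instability.
rts = Base.return_types(ForwardDiff.hessian, (typeof(g), typeof(x)))
println(rts)

# Hypothetical helper: bytes allocated by one call, measured inside a function.
count_bytes(f, x) = @allocated ForwardDiff.hessian(f, x)
count_bytes(g, x)            # warm-up call so compilation is not measured
println(count_bytes(g, x))
```

Running the same script in both REPLs and diffing the printed output may show whether the two sessions dispatch or infer differently.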

Possibly relevant: Allocation on ForwardDiff + DiffResults + StaticArrays

Update: Now I am running two different REPLs with different environments, and I get this:
REPL1:

julia> @btime ForwardDiff.hessian($g, $x)
  7.900 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float32, 4} with indices SOneTo(2)×SOneTo(2):
 10.58   5.4
  5.4   -5.5

REPL2:

julia> @btime ForwardDiff.hessian($g, $x)
  2.167 μs (9 allocations: 208 bytes)
2×2 SMatrix{2, 2, Float32, 4} with indices SOneTo(2)×SOneTo(2):
 10.58   5.4
  5.4   -5.5

REPL1:

(jl_L5sXDq) pkg> st
Status `C:\Users\DNF\AppData\Local\Temp\jl_L5sXDq\Project.toml`
  [f68482b8] Cthulhu v2.12.5
  [163ba53b] DiffResults v1.1.0
  [f6369f11] ForwardDiff v0.10.36
  [90137ffa] StaticArrays v1.9.3

REPL2:

(MyProject) pkg> st
Project MyProject v1.0.0-DEV
Status `C:\Users\DNF\.julia\dev\myproject.jl\Project.toml`
  [26cce99e] BasicInterpolators v0.7.1
  [13f3f980] CairoMakie v0.12.0
  [ae650224] ChunkSplitters v2.4.2
  [f68482b8] Cthulhu v2.12.5
  [717857b8] DSP v0.7.9
  [163ba53b] DiffResults v1.1.0   # <= same as REPL1
  [31c24e10] Distributions v0.25.108
  [7a1cc6ca] FFTW v1.8.0
  [442a2c76] FastGaussQuadrature v1.0.2
  [1a297f60] FillArrays v1.11.0
  [f6369f11] ForwardDiff v0.10.36   # <= same as REPL1
  [e9467ef8] GLMakie v0.10.0
  [98e50ef6] JuliaFormatter v1.0.56
  [bdcacae8] LoopVectorization v0.12.170
  [23992714] MAT v0.10.6
  [ee78f7c6] Makie v0.21.0
  [429524aa] Optim v1.9.4
  [9b87118b] PackageCompiler v2.1.17
  [85a6dd25] PositiveFactorizations v0.2.4
  [92933f4c] ProgressMeter v1.10.0
  [295af30f] Revise v3.5.14
  [fdea26ae] SIMD v3.5.0
  [90137ffa] StaticArrays v1.9.3    # <= same as REPL1
  [09ab397b] StructArrays v0.6.18
  [20346346] TriangularIndices v0.1.0
  [37e2e46d] LinearAlgebra

Right now I guess this is both an autodiff question and a Pkg question, unfortunately. But the main question is still about ForwardDiff.hessian! and its single small allocation.
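On the Pkg side, the st output only lists direct dependencies, so indirect dependencies may still resolve to different versions even though the three packages above match. A hedged sketch for dumping the full resolved set in each environment so the two can be diffed (the variable names are made up for this example):

```julia
using Pkg

# Show every resolved package, including indirect dependencies.
Pkg.status(; mode = Pkg.PKGMODE_MANIFEST)

# Collect name => version pairs in sorted order, convenient for diffing
# the output of two environments. Some entries may have no version recorded.
deps = Pkg.dependencies()
versions = sort!([info.name => info.version for info in values(deps)]; by = first)
foreach(p -> println(p.first, " ", p.second), versions)
```

This only compares what the environments resolve to. It does not by itself explain the timing difference.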

I’ll just bump this once. Perhaps I should rather open an issue at ForwardDiff.jl.

That is very surprising. Are they both clean REPLs with only ForwardDiff and StaticArrays loaded?

ForwardDiff, DiffResults, StaticArrays and BenchmarkTools. Both clean.

Today, however, both REPLs are slow, with a runtime of 2 μs for hessian (no allocations) and 50 ns for hessian! (1 allocation, 80 bytes).