Allocations with StaticArrays and `+=` in Julia `v1.11`

Hello everyone,
I am the main author of the package Peridynamics.jl and recently switched to Julia v1.11. With the new Julia version, simulation time increases significantly when using a certain material model (NOSBMaterial, used in the benchmark below).

Peridynamics.jl Benchmark

Please see the following benchmark, which is also described in more detail in the tutorials (Tutorial mode I fracture).

using Peridynamics

function mode_i(mat, npyz)
    l, Δx, δ, a = 1.0, 1/npyz, 3.015/npyz, 0.5
    pos, vol = uniform_box(l, l, 0.1l, Δx)
    ids = sortperm(pos[2,:])
    body = Body(mat, pos[:, ids], vol[ids])
    material!(body; horizon=3.015Δx, E=2.1e5, nu=0.25, rho=8e-6, Gc=2.7)
    point_set!(p -> p[1] ≤ -l/2+a && 0 ≤ p[2] ≤ 2δ, body, :set_a)
    point_set!(p -> p[1] ≤ -l/2+a && -2δ ≤ p[2] < 0, body, :set_b)
    precrack!(body, :set_a, :set_b)
    point_set!(p -> p[2] > l/2-Δx, body, :set_top)
    point_set!(p -> p[2] < -l/2+Δx, body, :set_bottom)
    velocity_bc!(t -> -30, body, :set_bottom, :y)
    velocity_bc!(t -> 30, body, :set_top, :y)
    vv = VelocityVerlet(steps=2000)
    job = Job(body, vv; path=joinpath(@__DIR__, "results", "mode_i"))
    @time submit(job)
    return nothing
end

mode_i(NOSBMaterial(), 25) # compilation
mode_i(NOSBMaterial(), 50) # work
❯ julia +1.10.5 --project -t 6 mode_i_benchmark.jl
  0.719477 seconds (1.39 M allocations: 450.082 MiB, 2.77% gc time, 74.16% compilation time)
  6.090774 seconds (1.28 M allocations: 1.118 GiB, 1.06% gc time, 2.17% compilation time)

❯ julia +1.11.1 --project -t 6 mode_i_benchmark.jl
  9.574071 seconds (2.15 G allocations: 82.415 GiB, 15.59% gc time, 6.09% compilation time)
177.261652 seconds (40.34 G allocations: 1.501 TiB, 23.15% gc time)

The same simulation takes roughly 29x longer on v1.11.1 than on v1.10.5, while allocating 1.5 TiB instead of 1.1 GiB.

When investigating the problem, I narrowed it down to the calculation of the deformation gradient.
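
Schematically, that calculation is an accumulation of weighted outer products of bond vectors. The following is a heavily simplified sketch of the correspondence-formulation sums, not the actual package code, and all names are made up for illustration (ω collects influence function and bond volume, ΔX and ΔY hold the undeformed and deformed bond vectors as SVector{3}):

using StaticArrays, LinearAlgebra

# Sketch only: K = Σᵢ ωᵢ ΔXᵢ ΔXᵢ' (shape tensor), F = (Σᵢ ωᵢ ΔYᵢ ΔXᵢ') K⁻¹
function defgrad_sketch(ω, ΔX, ΔY)
    K = zeros(SMatrix{3,3})  # shape tensor accumulator
    A = zeros(SMatrix{3,3})  # first moment of the deformed bonds
    for i in eachindex(ω)
        # the same `scalar * SVector * Adjoint` pattern as in the MWE below
        K += ω[i] * ΔX[i] * ΔX[i]'
        A += ω[i] * ΔY[i] * ΔX[i]'
    end
    return A * inv(K)        # deformation gradient F
end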

MWE

The same behavior can be reproduced with a small MWE:

using LinearAlgebra, StaticArrays, BenchmarkTools

function mwe1(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K += k * X * X'
    end
    return K
end

@btime mwe1(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);
❯ julia +1.10.5 --project -t 6 mwe.jl 
  1.070 ms (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl 
  129.890 ms (7000000 allocations: 289.92 MiB)

I think the problem is related to allocations when using `+=` with an `SMatrix` or `SVector`.

As a quick fix, writing the calculation out by hand removes the allocations and even runs slightly faster on v1.11 than on v1.10:

function mwe2(a, X, n)
    K11, K12, K13 = 0.0, 0.0, 0.0
    K21, K22, K23 = 0.0, 0.0, 0.0
    K31, K32, K33 = 0.0, 0.0, 0.0
    X1, X2, X3 = X[1], X[2], X[3]
    for i in 1:n
        k = a * i
        K11 += k * X1 * X1
        K12 += k * X1 * X2
        K13 += k * X1 * X3
        K21 += k * X2 * X1
        K22 += k * X2 * X2
        K23 += k * X2 * X3
        K31 += k * X3 * X1
        K32 += k * X3 * X2
        K33 += k * X3 * X3
    end
    K = SMatrix{3,3}(K11, K21, K31, K12, K22, K32, K13, K23, K33)
    return K
end

@btime mwe2(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);
❯ julia +1.10.5 --project -t 6 mwe.jl 
  1.211 ms (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl 
  1.072 ms (0 allocations: 0 bytes)

I already opened an issue on the StaticArrays.jl repo: “Allocations with StaticArrays and `+=` in Julia v1.11” (JuliaArrays/StaticArrays.jl#1282).

Please let me know if there is anything else I can do to help get this regression fixed.

9 Likes

I’m not entirely sure how to interpret this, but if I change `K += k * X * X'` in mwe1 to either `K += k * (X * X')` or `K += (k * X) * X'`, the allocations disappear while the `+=` stays intact.
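
Written into mwe1, the second variant would be (mwe1b is just a throwaway name):

function mwe1b(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        # the parentheses keep both products binary, so StaticArrays handles them
        K += (k * X) * X'
    end
    return K
end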

6 Likes

julia> X = SVector(1.0, 1.0, 1.0);

julia> @which *(1.0, X, X')
*(α::Number, u::AbstractVector, tv::Union{Adjoint{T, var"#s4975"}, Transpose{T, var"#s4975"}} where {T, var"#s4975"<:(AbstractVector)})
     @ LinearAlgebra ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:1125

The dispatch is falling back to a generic matmul routine, rather than using a StaticArrays version. Adding parentheses makes this parse as a sequence of binary * (rather than a single ternary *), which is why that fixes it.
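
The n-ary parse is easy to see in the raw expression tree (just a parse check, which as far as I know looks the same on both versions):

julia> Meta.show_sexpr(:(a * b * c))
(:call, :*, :a, :b, :c)

julia> Meta.show_sexpr(:(a * (b * c)))
(:call, :*, :a, (:call, :*, :b, :c))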

Personally, I continue to find the n-ary parsing of infix multiplication to be more trouble than it’s worth.

EDIT: First, my original post mistakenly used `X::SMatrix` instead of `X::SVector`. I fixed that. Second, it seems that this should be falling back to `broadcast(*, k, X, X')`, which should still be performant. So I’m not precisely sure where the issue comes from.
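
One way to narrow this down would be to time the broadcast fallback directly next to the 3-arg `*` itself (just a suggestion, not run here):

using BenchmarkTools, StaticArrays
X = SVector(1.0, 1.0, 1.0)
@btime *($1e-5, $X, $(X'));             # the 3-arg method in question
@btime broadcast(*, $1e-5, $X, $(X'));  # the fallback it should reduce to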

3 Likes

Thank you @eldee, this is very interesting:

function mwe3(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K += k * (X * X')
    end
    return K
end
@btime mwe3(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);

This is faster than the hand-written version, and on v1.11 it is even a bit faster than on v1.10:

❯ julia +1.10.5 --project -t 6 mwe.jl 
  805.458 μs (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl
  708.208 μs (0 allocations: 0 bytes)
3 Likes

Aside: to ensure that K stays the same type during the loop, you might want to initialize it based on the types of a and X, e.g.:

K = zero(SMatrix{3,3,typeof(a*1*zero(eltype(X))),9})
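
Applied to mwe3, that would look something like this (mwe4 is just a name picked here):

function mwe4(a, X, n)
    T = typeof(a * 1 * zero(eltype(X)))  # element type of k * (X * X') with k = a * i
    K = zero(SMatrix{3,3,T,9})
    for i in 1:n
        k = a * i
        K += k * (X * X')
    end
    return K
end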
2 Likes

I think the MWE can be trimmed even further, down to just existing methods, which rules out anything in the custom function.

v1.10.5 (note that I’m running this on an unplugged laptop vulnerable to throttling, so the comparative timings here should be ignored; it’s the allocations that matter)

julia> using BenchmarkTools, StaticArrays; x = SVector{3}(1.0, 1.0, 1.0);

julia> @btime $1e-5 * $x * ($x)';
  10.110 ns (0 allocations: 0 bytes)

julia> @btime $1e-5 * ($x * ($x)');
  4.104 ns (0 allocations: 0 bytes)

v1.11.1

julia> using BenchmarkTools, StaticArrays; x = SVector{3}(1.0, 1.0, 1.0);

julia> @btime $1e-5 * $x * ($x)';
  330.159 ns (9 allocations: 368 bytes)

julia> @btime $1e-5 * ($x * ($x)');
  11.111 ns (0 allocations: 0 bytes)

The rationale is that replacing the mwe1 line with `K += k * ones(SMatrix{3,3})` results in 0 allocations, so the ternary `*` seems to be key. In this case, the ternary `*` call appears to dispatch to the LinearAlgebra method `*(α::Number, u::AbstractVector, tv::AdjOrTransAbsVec) = broadcast(*, α, u, tv)` in both versions.
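
For reference, the variant described there would be (throwaway name):

function mwe1_ones(a, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        # binary Number * SMatrix product, no ternary * involved
        K += k * ones(SMatrix{3,3})
    end
    return K
end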

The lowering did seem to change, but not in a way that changes what happens with `*`; it's just harder to read now.

v1.10.5

julia> Meta.@lower a * b * c
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = a * b * c
└──      return %1
))))

julia> Meta.@lower a * (b * c)
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = a
│   %2 = b * c
│   %3 = %1 * %2
└──      return %3
))))

v1.11.1

julia> Meta.@lower a * b * c
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = a
│   %2 = b
│   %3 = c
│   %4 = %1 * %2 * %3
└──      return %4
))))

julia> Meta.@lower a * (b * c)
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = *
│   %2 = a
│   %3 = b
│   %4 = c
│   %5 = %3 * %4
│   %6 = (%1)(%2, %5)
└──      return %6
))))
4 Likes