Allocations with StaticArrays and `+=` in Julia `v1.11`

Hello everyone,
I am the main author of the package Peridynamics.jl and recently switched to Julia v1.11. With the new Julia version, simulation time increases significantly when using a certain material model (NOSBMaterial, used in the benchmark below).

Peridynamics.jl Benchmark

Please see the following benchmark, which is also described in more detail in the tutorials (Tutorial mode I fracture).

using Peridynamics

function mode_i(mat, npyz)
    l, Δx, δ, a = 1.0, 1/npyz, 3.015/npyz, 0.5
    pos, vol = uniform_box(l, l, 0.1l, Δx)
    ids = sortperm(pos[2,:])
    body = Body(mat, pos[:, ids], vol[ids])
    material!(body; horizon=3.015Δx, E=2.1e5, nu=0.25, rho=8e-6, Gc=2.7)
    point_set!(p -> p[1] ≤ -l/2+a && 0 ≤ p[2] ≤ 2δ, body, :set_a)
    point_set!(p -> p[1] ≤ -l/2+a && -2δ ≤ p[2] < 0, body, :set_b)
    precrack!(body, :set_a, :set_b)
    point_set!(p -> p[2] > l/2-Δx, body, :set_top)
    point_set!(p -> p[2] < -l/2+Δx, body, :set_bottom)
    velocity_bc!(t -> -30, body, :set_bottom, :y)
    velocity_bc!(t -> 30, body, :set_top, :y)
    vv = VelocityVerlet(steps=2000)
    job = Job(body, vv; path=joinpath(@__DIR__, "results", "mode_i"))
    @time submit(job)
    return nothing
end

mode_i(NOSBMaterial(), 25) # compilation
mode_i(NOSBMaterial(), 50) # work
❯ julia +1.10.5 --project -t 6 mode_i_benchmark.jl
  0.719477 seconds (1.39 M allocations: 450.082 MiB, 2.77% gc time, 74.16% compilation time)
  6.090774 seconds (1.28 M allocations: 1.118 GiB, 1.06% gc time, 2.17% compilation time)

❯ julia +1.11.1 --project -t 6 mode_i_benchmark.jl
  9.574071 seconds (2.15 G allocations: 82.415 GiB, 15.59% gc time, 6.09% compilation time)
177.261652 seconds (40.34 G allocations: 1.501 TiB, 23.15% gc time)

The same simulation takes roughly 29x longer on v1.11.1 than on v1.10.5, while allocating 1.5 TiB instead of 1.1 GiB.

When investigating the problem, I narrowed it down to the calculation of the deformation gradient.
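
Schematically, that calculation is an accumulation of weighted outer products of bond vectors. The following is a heavily simplified sketch of the correspondence-formulation sums, not the actual package code, and all names are made up for illustration (ω collects influence function and bond volume, ΔX and ΔY hold the undeformed and deformed bond vectors as SVector{3}):

using StaticArrays, LinearAlgebra

# Sketch only: K = Σᵢ ωᵢ ΔXᵢ ΔXᵢ' (shape tensor), F = (Σᵢ ωᵢ ΔYᵢ ΔXᵢ') K⁻¹
function defgrad_sketch(ω, ΔX, ΔY)
    K = zeros(SMatrix{3,3})  # shape tensor accumulator
    A = zeros(SMatrix{3,3})  # first moment of the deformed bonds
    for i in eachindex(ω)
        # the same `scalar * SVector * Adjoint` pattern as in the MWE below
        K += ω[i] * ΔX[i] * ΔX[i]'
        A += ω[i] * ΔY[i] * ΔX[i]'
    end
    return A * inv(K)        # deformation gradient F
end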

MWE

The same behavior can be reproduced with a small MWE:

using LinearAlgebra, StaticArrays, BenchmarkTools

function mwe1(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K += k * X * X'
    end
    return K
end

@btime mwe1(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);
❯ julia +1.10.5 --project -t 6 mwe.jl 
  1.070 ms (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl 
  129.890 ms (7000000 allocations: 289.92 MiB)

I think the problem is related to allocations when using `+=` with an `SMatrix` or `SVector`.

As a quick fix, writing the calculation out by hand removes the allocations and even runs slightly faster on v1.11 than on v1.10:

function mwe2(a, X, n)
    K11, K12, K13 = 0.0, 0.0, 0.0
    K21, K22, K23 = 0.0, 0.0, 0.0
    K31, K32, K33 = 0.0, 0.0, 0.0
    X1, X2, X3 = X[1], X[2], X[3]
    for i in 1:n
        k = a * i
        K11 += k * X1 * X1
        K12 += k * X1 * X2
        K13 += k * X1 * X3
        K21 += k * X2 * X1
        K22 += k * X2 * X2
        K23 += k * X2 * X3
        K31 += k * X3 * X1
        K32 += k * X3 * X2
        K33 += k * X3 * X3
    end
    K = SMatrix{3,3}(K11, K21, K31, K12, K22, K32, K13, K23, K33)
    return K
end

@btime mwe2(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);
❯ julia +1.10.5 --project -t 6 mwe.jl 
  1.211 ms (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl 
  1.072 ms (0 allocations: 0 bytes)

I already opened an issue on the StaticArrays.jl repo: “Allocations with StaticArrays and `+=` in Julia v1.11” (JuliaArrays/StaticArrays.jl#1282).

Please let me know if there is anything else I can do to help get this regression fixed.

9 Likes

I’m not entirely sure how to interpret this, but if I change `K += k * X * X'` in mwe1 to either `K += k * (X * X')` or `K += (k * X) * X'`, the allocations disappear while the `+=` stays intact.
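
Written into mwe1, the second variant would be (mwe1b is just a throwaway name):

function mwe1b(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        # the parentheses keep both products binary, so StaticArrays handles them
        K += (k * X) * X'
    end
    return K
end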

6 Likes

julia> X = SVector(1.0, 1.0, 1.0);

julia> @which *(1.0, X, X')
*(α::Number, u::AbstractVector, tv::Union{Adjoint{T, var"#s4975"}, Transpose{T, var"#s4975"}} where {T, var"#s4975"<:(AbstractVector)})
     @ LinearAlgebra ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:1125

The dispatch is falling back to a generic matmul routine, rather than using a StaticArrays version. Adding parentheses makes this parse as a sequence of binary * (rather than a single ternary *), which is why that fixes it.
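
The n-ary parse is easy to see in the raw expression tree (just a parse check, which as far as I know looks the same on both versions):

julia> Meta.show_sexpr(:(a * b * c))
(:call, :*, :a, :b, :c)

julia> Meta.show_sexpr(:(a * (b * c)))
(:call, :*, :a, (:call, :*, :b, :c))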

Personally, I continue to find the n-ary parsing of infix multiplication to be more trouble than it’s worth.

EDIT: First, my original post mistakenly used `X::SMatrix` instead of `X::SVector`. I fixed that. Second, it seems that this should be falling back to `broadcast(*, k, X, X')`, which should still be performant. So I’m not precisely sure where the issue comes from.
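
One way to narrow this down would be to time the broadcast fallback directly next to the 3-arg `*` itself (just a suggestion, not run here):

using BenchmarkTools, StaticArrays
X = SVector(1.0, 1.0, 1.0)
@btime *($1e-5, $X, $(X'));             # the 3-arg method in question
@btime broadcast(*, $1e-5, $X, $(X'));  # the fallback it should reduce to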

3 Likes

Thank you @eldee, this is very interesting:

function mwe3(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K += k * (X * X')
    end
    return K
end
@btime mwe3(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000);

This is faster than the hand-written version, and on v1.11 it is even a bit faster than on v1.10:

❯ julia +1.10.5 --project -t 6 mwe.jl 
  805.458 μs (0 allocations: 0 bytes)

❯ julia +1.11.1 --project -t 6 mwe.jl
  708.208 μs (0 allocations: 0 bytes)
3 Likes

Aside: to ensure that K stays the same type during the loop, you might want to initialize it based on the types of a and X, e.g.:

K = zero(SMatrix{3,3,typeof(a*1*zero(eltype(X))),9})
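
Applied to mwe3, that would look something like this (mwe4 is just a name picked here):

function mwe4(a, X, n)
    T = typeof(a * 1 * zero(eltype(X)))  # element type of k * (X * X') with k = a * i
    K = zero(SMatrix{3,3,T,9})
    for i in 1:n
        k = a * i
        K += k * (X * X')
    end
    return K
end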
2 Likes

I think the MWE can be trimmed even further, down to just existing methods, which rules out anything in the custom function.

v1.10.5 (note that I’m running this on an unplugged laptop vulnerable to throttling, so the comparative timings here should be ignored; it’s the allocations that matter)

julia> using BenchmarkTools, StaticArrays; x = SVector{3}(1.0, 1.0, 1.0);

julia> @btime $1e-5 * $x * ($x)';
  10.110 ns (0 allocations: 0 bytes)

julia> @btime $1e-5 * ($x * ($x)');
  4.104 ns (0 allocations: 0 bytes)

v1.11.1

julia> using BenchmarkTools, StaticArrays; x = SVector{3}(1.0, 1.0, 1.0);

julia> @btime $1e-5 * $x * ($x)';
  330.159 ns (9 allocations: 368 bytes)

julia> @btime $1e-5 * ($x * ($x)');
  11.111 ns (0 allocations: 0 bytes)

The rationale is that replacing the mwe1 line with `K += k * ones(SMatrix{3,3})` results in 0 allocations, so the ternary `*` seems to be key. In this case, the ternary `*` call appears to dispatch to the LinearAlgebra method `*(α::Number, u::AbstractVector, tv::AdjOrTransAbsVec) = broadcast(*, α, u, tv)` in both versions.
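
For reference, the variant described there would be (throwaway name):

function mwe1_ones(a, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        # binary Number * SMatrix product, no ternary * involved
        K += k * ones(SMatrix{3,3})
    end
    return K
end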

The lowering did seem to change, but not in a way that changes what happens with `*`; it's just harder to read now.

v1.10.5

julia> Meta.@lower a * b * c
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = a * b * c
└──      return %1
))))

julia> Meta.@lower a * (b * c)
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = a
│   %2 = b * c
│   %3 = %1 * %2
└──      return %3
))))

v1.11.1

julia> Meta.@lower a * b * c
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = a
│   %2 = b
│   %3 = c
│   %4 = %1 * %2 * %3
└──      return %4
))))

julia> Meta.@lower a * (b * c)
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = *
│   %2 = a
│   %3 = b
│   %4 = c
│   %5 = %3 * %4
│   %6 = (%1)(%2, %5)
└──      return %6
))))
4 Likes