This is a followup from: Accelerating an N^2 nested loop - #14 by lmiq

I have the following code, which tries to compute some sort of acceleration for particles. There is an implementation based on standard 1D arrays to represent the coordinates in each dimension of the space (X,Y), and other using StaticArrays. For some reason the version using StaticArrays is somewhat slower, but this is (perhaps) the subject for another thread.

The issue here is that that StaticArrays version has a significant performance regression in 1.9.0-beta3 relative to older versions (1.6.7 and 1.8.5). I get:

```
julia> main()
65.380 μs (0 allocations: 0 bytes)
nothing
102.683 μs (0 allocations: 0 bytes)
nothing
julia> versioninfo()
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 11th Gen Intel(R) Core(TM) i5-11500H @ 2.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
```

```
julia> main()
66.436 μs (0 allocations: 0 bytes)
nothing
102.095 μs (0 allocations: 0 bytes)
nothing
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
```

```
julia> main()
64.640 μs (0 allocations: 0 bytes)
nothing
159.491 μs (0 allocations: 0 bytes)
nothing
julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a73447 (2023-01-18 07:20 UTC)
```

Note that in 1.9 the second code is ~60% slower than in 1.6 and 1.8.

The code to be run is (just copy/paste):

```
import Pkg
Pkg.activate(;temp=true)
Pkg.add("StaticArrays")
Pkg.add("BenchmarkTools")
using Test
using StaticArrays
# With simple 1D arrays
function XYDOT2!(X,Y,X_DOT,Y_DOT,EPS)
fill!(X_DOT,0.0); fill!(Y_DOT,0.0)
@inbounds for i in eachindex(X,Y,X_DOT,Y_DOT)
x_i, y_i = X[i], Y[i]
acc_x, acc_y = 0.0, 0.0
@fastmath for j=(i+1):lastindex(X)
x_j, y_j = X[j], Y[j]
diff_x = x_i - x_j; diff_y = y_i - y_j
inv_d = inv(diff_x^2 + diff_y^2 + EPS)
dxdt = diff_x * inv_d; dydt = diff_y * inv_d
acc_x += dxdt; acc_y += dydt
X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
end
X_DOT[i] += acc_x; Y_DOT[i] += acc_y
end
return nothing
end
# with vectors of SVectors
function XYDOT_SV!(P::Vector{T}, XYDOT::Vector{T}, EPS) where T<:SVector
fill!(XYDOT, zero(T))
@inbounds for i in eachindex(P,XYDOT)
p_i = P[i]
acc = zero(T)
@fastmath for j=(i+1):lastindex(P)
diff = p_i - P[j]
invd = inv(diff[1]^2 + diff[2]^2 + EPS)
dpdt = diff * invd
acc += dpdt
XYDOT[j] -= dpdt
end
XYDOT[i] += acc
end
return nothing
end
function main()
N = 512
EPS = 0.1
P = rand(SVector{2,Float64},N)
XYDOT = similar(P)
X = [ x[1] for x in P ];
Y = [ x[2] for x in P ];
X_DOT = similar(X)
Y_DOT = similar(Y)
XYDOT2!(X,Y,X_DOT,Y_DOT,EPS)
XYDOT_SV!(P,XYDOT,EPS)
@test all(isapprox.([SVector(x,y) for (x,y) in zip(X_DOT,Y_DOT)], XYDOT))
display(@btime XYDOT2!($X,$Y,$X_DOT,$Y_DOT,$EPS))
display(@btime XYDOT_SV!($P,$XYDOT,$EPS))
return nothing
end
main()
```