Possible performance regression in 1.9 / StaticArrays

This is a followup from: Accelerating an N^2 nested loop - #14 by lmiq

I have the following code, which computes some sort of acceleration for particles. There is one implementation based on standard 1D arrays to represent the coordinates in each dimension of the space (X, Y), and another using StaticArrays. For some reason the version using StaticArrays is somewhat slower, but that is (perhaps) a subject for another thread.

The issue here is that the StaticArrays version has a significant performance regression in 1.9.0-beta3 relative to older versions (1.6.7 and 1.8.5). I get:

julia> main()
  65.380 μs (0 allocations: 0 bytes)
nothing
  102.683 μs (0 allocations: 0 bytes)
nothing

julia> versioninfo()
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i5-11500H @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
julia> main()
  66.436 μs (0 allocations: 0 bytes)
nothing
  102.095 μs (0 allocations: 0 bytes)
nothing

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
julia> main()
  64.640 μs (0 allocations: 0 bytes)
nothing
  159.491 μs (0 allocations: 0 bytes)
nothing

julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a73447 (2023-01-18 07:20 UTC)

Note that on 1.9 the second benchmark (the StaticArrays version) is ~60% slower than on 1.6 and 1.8.

The code to be run is (just copy/paste):

import Pkg
Pkg.activate(;temp=true)
Pkg.add("StaticArrays")
Pkg.add("BenchmarkTools")
using Test
using StaticArrays
using BenchmarkTools

# With simple 1D arrays
function XYDOT2!(X,Y,X_DOT,Y_DOT,EPS)
    fill!(X_DOT,0.0); fill!(Y_DOT,0.0)
    @inbounds for i in eachindex(X,Y,X_DOT,Y_DOT)
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        @fastmath for j=(i+1):lastindex(X)
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_i - y_j
            inv_d = inv(diff_x^2 + diff_y^2 + EPS)
            dxdt = diff_x * inv_d; dydt = diff_y * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
    return nothing
end

# with vectors of SVectors
function XYDOT_SV!(P::Vector{T}, XYDOT::Vector{T}, EPS) where T<:SVector
    fill!(XYDOT, zero(T))
    @inbounds for i in eachindex(P,XYDOT)
        p_i = P[i] 
        acc = zero(T)
        @fastmath for j=(i+1):lastindex(P)
            diff = p_i - P[j]
            invd = inv(diff[1]^2 + diff[2]^2 + EPS)
            dpdt = diff * invd
            acc += dpdt
            XYDOT[j] -= dpdt
        end
        XYDOT[i] += acc
    end
    return nothing
end

function main()
    N = 512
    EPS = 0.1
    P = rand(SVector{2,Float64},N)
    XYDOT = similar(P)
    X = [ x[1] for x in P ];
    Y = [ x[2] for x in P ];
    X_DOT = similar(X)
    Y_DOT = similar(Y)

    XYDOT2!(X,Y,X_DOT,Y_DOT,EPS)   
    XYDOT_SV!(P,XYDOT,EPS)
    @test all(isapprox.([SVector(x,y) for (x,y) in zip(X_DOT,Y_DOT)], XYDOT))
     
    display(@btime XYDOT2!($X,$Y,$X_DOT,$Y_DOT,$EPS))
    display(@btime XYDOT_SV!($P,$XYDOT,$EPS))

    return nothing
end

main()

More info:

It seems to be something related to @fastmath and how it handles some simple operations there. I can reproduce the regression with this code:

import Pkg
Pkg.activate(;temp=true)
Pkg.add("StaticArrays")
Pkg.add("BenchmarkTools")
using Test
using StaticArrays
using BenchmarkTools

# with vectors of SVectors
function XYDOT_SV!(P::Vector{T},XYDOT::Vector{T}) where T<:SVector
    fill!(XYDOT,zero(T))
    @inbounds for i in eachindex(P,XYDOT)
        p_i = P[i] 
        acc = zero(T)
        @fastmath for j=(i+1):lastindex(P)
            diff = p_i - P[j]
            acc += diff
            XYDOT[j] += diff
        end
        XYDOT[i] += acc
    end
    return nothing
end

function main()
    N = 512
    P = rand(SVector{2,Float64},N)
    XYDOT = similar(P)
    display(@btime XYDOT_SV!($P,$XYDOT))
    return nothing
end

main()

But if I remove the @fastmath macro, the times on the older Julia versions become similar to those on the new ones.

Thus, it seems that @fastmath is not doing something it was doing before.
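One way to dig further (my own suggestion, not from the thread; sub_plain and sub_fm are names made up for this probe): @fastmath only rewrites operators to their Base.FastMath counterparts, so comparing the LLVM IR of a plain and a @fastmath version of the inner-loop operation shows whether code generation actually changed:

using StaticArrays
using InteractiveUtils: @code_llvm

# @fastmath rewrites a - b to Base.FastMath.sub_fast(a, b); comparing the
# IR of the two variants shows if the loop body is generated differently.
sub_plain(a::SVector, b::SVector) = a - b
sub_fm(a::SVector, b::SVector) = @fastmath a - b

a, b = rand(SVector{2,Float64}), rand(SVector{2,Float64})
@code_llvm debuginfo=:none sub_plain(a, b)
@code_llvm debuginfo=:none sub_fm(a, b)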

Whatever it is, it is solved on the nightly build:

julia> main()
  68.436 μs (0 allocations: 0 bytes)
nothing
  103.894 μs (0 allocations: 0 bytes)
nothing

julia> versioninfo()
Julia Version 1.10.0-DEV.423
Commit 134f3e7dfaa (2023-01-23 15:06 UTC)

It is something between 1.8.5 and 1.9.0-alpha1, but I’m not sure how to bisect it further.


Quick aside: you don't need to display the @btime results. @btime already prints the timing, and these functions return nothing, so display just echoes nothing.
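That is, in main() the last two lines can simply be:

@btime XYDOT2!($X,$Y,$X_DOT,$Y_DOT,$EPS)
@btime XYDOT_SV!($P,$XYDOT,$EPS)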


By using git bisect, for example.


Yeah, I found that. Does anyone have a script to automate the process, given a test script?

https://git-scm.com/docs/git-bisect#_bisect_run

I’m playing with that. Is it normal that many (if not most) commits fail to compile?

I’m having to use git bisect skip many times, which makes the bisection much less effective.

Basically I’m compiling with

make clean
make

for each commit selected by the bisection mechanism. I’ll try make cleanall now…

No. You'd have to be more specific about what kind of errors you get. make cleanall is usually a more effective cleanup step.

Seems that cleanall solved most of the problems.

So, the first commit that presents the performance issue appears to be this one, if I understand the output of the bisection correctly:

I have filed an issue here: Performance regression using static arrays on 1.9.0 · Issue #48398 · JuliaLang/julia · GitHub

I’m trying to find the commit where the regression was fixed, but for now the bisection failed.

For the record, to bisect the problem one does the following:

  1. Create a script that exits with code 0 if the commit is good (the problem is absent) and with a nonzero code such as 1 if the commit is bad (the problem is present); this is the convention git bisect run expects. This can be done in Julia by ending the script with exit(0) or exit(1). Since the process involves compiling Julia from source at every commit, this must be mediated by a bash script, which is simply:
#!/bin/bash
make cleanall
make
./julia mytest.jl
exit $?

where $? is the exit code of the previous command, in this case the Julia script. Let us call the bash script test.sh, and the Julia script mytest.jl.
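As an aside (not part of the original script): git bisect run treats exit code 125 as “skip this commit”, so a variant of test.sh can skip unbuildable commits automatically, instead of requiring manual git bisect skip:

#!/bin/bash
# Sketch of a test.sh variant: exit code 125 tells git bisect run to
# skip commits that cannot be tested, e.g. those that fail to build.
make cleanall
make -j"$(nproc)" || exit 125
./julia mytest.jl
exit $?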

  2. Clone the Julia repository, and put the two scripts in the main directory.

  3. Start the git bisect routine, with:

git bisect start
git bisect good v1.8.5
git bisect bad v1.9.0-beta3

where here I knew that between these two versions some commit created the problem. One can also use a commit hash in place of the versions.
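For example, using the commit hashes from the versioninfo outputs above:

git bisect start
git bisect good 17cfb8e65ea   # v1.8.5
git bisect bad 24204a73447    # v1.9.0-beta3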

  4. Run the bisection with:
git bisect run ./test.sh

This will run until it finds the first commit where the problem appears. It takes quite a while, because it has to recompile Julia from source at every step. At the end it will report something like:

Then you can see the commit on the GitHub page, for example by searching for its hash in the repository (and clicking on the commits entry in the search results at the left):

In this specific case, the Julia script that was run was:
import Pkg
Pkg.activate(;temp=true)
Pkg.add("StaticArrays")
Pkg.add("BenchmarkTools")
using Test
using StaticArrays
using BenchmarkTools
using InteractiveUtils: versioninfo

# With simple 1D arrays
function XYDOT2!(X,Y,X_DOT,Y_DOT,EPS)
    fill!(X_DOT,0.0); fill!(Y_DOT,0.0)
    @inbounds for i in eachindex(X,Y,X_DOT,Y_DOT)
        x_i, y_i = X[i], Y[i]
        acc_x, acc_y = 0.0, 0.0
        @fastmath for j=(i+1):lastindex(X)
            x_j, y_j = X[j], Y[j]
            diff_x = x_i - x_j; diff_y = y_i - y_j
            inv_d = inv(diff_x^2 + diff_y^2 + EPS)
            dxdt = diff_x * inv_d; dydt = diff_y * inv_d
            acc_x += dxdt; acc_y += dydt
            X_DOT[j] -= dxdt; Y_DOT[j] -= dydt
        end
        X_DOT[i] += acc_x; Y_DOT[i] += acc_y
    end
    return nothing
end

# with vectors of SVectors
function XYDOT_SV!(P::Vector{T}, XYDOT::Vector{T}, EPS) where T<:SVector
    fill!(XYDOT, zero(T))
    @inbounds for i in eachindex(P,XYDOT)
        p_i = P[i] 
        acc = zero(T)
        @fastmath for j=(i+1):lastindex(P)
            diff = p_i - P[j]
            invd = inv(diff[1]^2 + diff[2]^2 + EPS)
            dpdt = diff * invd
            acc += dpdt
            XYDOT[j] -= dpdt
        end
        XYDOT[i] += acc
    end
    return nothing
end

function main()
    N = 512
    EPS = 0.1
    P = rand(SVector{2,Float64},N)
    XYDOT = similar(P)
    X = [ x[1] for x in P ];
    Y = [ x[2] for x in P ];
    X_DOT = similar(X)
    Y_DOT = similar(Y)

    XYDOT2!(X,Y,X_DOT,Y_DOT,EPS)   
    XYDOT_SV!(P,XYDOT,EPS)
    @test all(isapprox.([SVector(x,y) for (x,y) in zip(X_DOT,Y_DOT)], XYDOT))
     
    t1 = @belapsed XYDOT2!($X,$Y,$X_DOT,$Y_DOT,$EPS)
    t2 = @belapsed XYDOT_SV!($P,$XYDOT,$EPS)

    return t1, t2
end

t1, t2 = main()

versioninfo()
println("t1 = $t1")
println("t2 = $t2")

if t2/t1 > 2.0
    println("performance regression detected!")
    exit(1)
else
    println("performance ok")
    exit(0)
end

StaticArrays’ excessive code generation was one of the motivations for the linked PR.
We’re still missing a few compiler optimizations, but eventually the compiler should be able to delete the memcpy in many cases, by simply copying pointers instead, yielding better performance than what we had initially.
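For instance, one way to look for the memcpy in question (my own probe, not from the PR; the sizes are guesses) is to compare the LLVM IR for loading a small and a large SVector out of a Vector:

using StaticArrays
using InteractiveUtils: @code_llvm

# Probe (hypothetical sizes): small SVectors typically lower to plain
# loads, while sufficiently large ones may show an llvm.memcpy in the IR.
small = rand(SVector{2,Float64}, 4)
big   = rand(SVector{32,Float64}, 4)
@code_llvm debuginfo=:none getindex(small, 1)
@code_llvm debuginfo=:none getindex(big, 1)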
