Poor performance of SIMD vectorization in the latest version of Julia (v1.11.2)

I have observed poor performance of our SIMD-vectorized numerical integration method when using the latest version of Julia (1.11.2).

To show this, I am attaching two precision diagrams obtained by running this Jupyter document:

Thanks

I don't know about others, but I have little clue what I'm looking at or how to infer the poor performance. What is log10(CPU) on the x-axis? Is that time?

1 Like

Looks like the x-axis is logarithmic time; red and blue are non-SIMD and SIMD, respectively. On 1.10.7 the SIMD version is an order of magnitude faster than the non-SIMD one, while on 1.11.2 there is little difference, and it is the SIMD version that has become slower.

I'm not super eager to dig into Jupyter docs; is it possible to extract the relevant code parts and paste them here, @Mikel_Antonana?

2 Likes

Extra strange, because at first glance the package uses SIMD.jl, suggesting explicit vectorization rather than the automatic vectorization that's undergoing a large ecosystem shift.

Yes, our implementation applies explicit SIMD vectorization. The experiment consists of running many integrations with different step sizes, and for each of them we record the error in energy and the CPU time.

The code where we obtain the CPU time is the following:

        # warm-up solve, also used to check that the integration succeeded
        solx = solve(prob, alg, dt=dt0, adaptive=adaptiveB, save_everystep=false)
        if solx.retcode == ReturnCode.Success
           # average the elapsed time of nruns repeated solves
           cpus[i] = 0.
           for k in 1:nruns
               cpus[i] += @elapsed solve(prob, alg, dt=dt0, adaptive=adaptiveB, save_everystep=false)
           end
           cpus[i] = cpus[i]/nruns
        else
           cpus[i] = Inf   # failed integrations are marked with Inf
        end

Sorry, that's not helpful.

It is unclear what you mean by "explicit SIMD vectorization" – people use this term to mean anything from "sprinkle @simd over the code" to "oh yeah, that's hand-written assembly / llvmcall".

The very first thing to look at is which package versions you have on old and new Julia. You do that by looking into your Project.toml and Manifest.toml. (I'm not very firm with Pkg – you can find out the path via Pkg.status(), but there should be a REPL command to dump your environment.) Then you diff them – maybe some package versions changed?
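
For example, a rough sketch of locating the environment files (the paths below are the default per-version environments and may differ on your machine):

import Pkg
Pkg.status()                    # summary of the active environment
println(Base.active_project())  # path to its Project.toml; Manifest.toml lives next to it

# then, from a shell, diff the stored manifests of the two Julia versions, e.g.
# diff ~/.julia/environments/v1.10/Manifest.toml ~/.julia/environments/v1.11/Manifest.toml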

An especially important candidate is LoopVectorization.jl, i.e. people who use the @turbo macro. That could explain everything, since that isn't really working anymore on 1.12.

I did some of your work – NbodyIRKGL16.jl/Manifest.toml at 65a2ee801939e89315e3a97b48477bc3d51a5900 · mikelehu/NbodyIRKGL16.jl · GitHub

So that gives us a good guess.

Next, you need to figure out where the code is slower. For that you profile your code and look at a flamegraph. Most probably, there is one obvious function that takes all the time in both old and new Julia versions. If that code is using @turbo, then you have found the culprit!
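
For instance, a minimal profiling sketch (reusing prob, alg, dt0, and adaptiveB from the timing snippet above; ProfileView.jl is just one way to render a flame graph):

using Profile

Profile.clear()
@profile for _ in 1:100
    solve(prob, alg, dt=dt0, adaptive=adaptiveB, save_everystep=false)
end
Profile.print(format=:flat, sortedby=:count)   # flat text summary of sample counts

# or, interactively:
# using ProfileView
# @profview solve(prob, alg, dt=dt0, adaptive=adaptiveB, save_everystep=false)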

Then, you should link that code here. Somebody will need to look at the code and improve it, i.e. do some of the transformations from the old @turbo by hand.

4 Likes

The slowdown is on 1.11 vs 1.10, not on 1.12. As far as I know, @turbo does work on 1.11.

3 Likes

LV isn't in their Project.toml, and from a cursory look their code just uses SIMD.jl directly. Maybe another direct dependency uses LV? It's not obvious to me whether their code or an imported routine is taking up runtime.

Definitely worth narrowing down and benchmarking the hot spots, whatever the cause.

1 Like

Yes, the vectorization in our implementation is based only on the SIMD.jl package. I verified that all package versions installed in Julia 1.11.2 are identical to those in Julia 1.10:

Status ~/.julia/environments/v1.11/Project.toml
[2b5f629d] DiffEqBase v6.161.0
[7034ab61] FastBroadcast v0.3.5
[7073ff75] IJulia v1.26.0
[1dea7af3] OrdinaryDiffEq v6.90.1
[d96e819e] Parameters v0.12.3
[91a5bcdd] Plots v1.40.9
[731186ca] RecursiveArrayTools v3.27.4
[189a3867] Reexport v1.2.2
[fdea26ae] SIMD v3.7.0
[0bca4576] SciMLBase v2.70.0

foobar_lv2's concern is that the code you benchmarked may primarily use an imported method that does use @turbo, even indirectly, and LoopVectorization does show up in the stored Manifest.toml. But it's too soon to say whether that's the case before more profiling. It was also too soon to say the difference is in SIMD; its performance becoming closer to the non-SIMD(?) version could be caused by a regression in a completely independent aspect. The fact that you use manual vectorization actually leans away from that hypothesis, because it's much more stable than autovectorization, but it hardly disproves it either, and it's just as plausible that v1.11's switch to LLVM 16 changed something about the compiled SIMD code.

1 Like

If using SIMD.jl regressed, the first thing I'd do is profile, like others suggested.

Assuming the two code versions are almost the same, including uses of external packages, differing only in use of SIMD.jl vs not within your own code, then at least we can be reasonably confident that the changes are with respect to your own code, limiting the need to search other packages.

Once you've profiled to find the guilty functions that slowed down, I'd look at them.
You can define _a = Ref{Any}() in your REPL, and then use Revise.jl to add _a[] = (args, to, the, function....) to get all the args of even deeply nested function calls in your REPL.
Then, remove that line after running the code to get the args, and inspect/microbenchmark from your REPL.
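
A hypothetical sketch of the capture step (inner_step! and its arguments are made-up stand-ins for whatever function you want to inspect):

_a = Ref{Any}()   # defined once in the REPL

function inner_step!(cache, u, p, t)
    _a[] = (cache, u, p, t)   # temporary line added via Revise; remove it afterwards
    # ... original body of the function ...
end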

julia> args = _a[];

julia> @code_typed YourPackage.your_functions(args...)

julia> Cthulhu.@descend YourPackage.your_functions(args...)

Of course, you could also Cthulhu.@descend directly if everything is type stable from your starting point.

One thing you can look for is if things stopped inlining that used to inline before. SIMD.jl is vulnerable to causing small functions not to inline, and that could cause major regressions if there was a slight change in heuristics.
If you find some culprits, you can use @inline to fix it.

The problem is likely to be something else entirely, of course, but you'd have to look to find it. Not much use for us to speculate blindly.

8 Likes

I have prepared the following code, which might help us identify the problem.

When I run @btime NbodyODE!(dU, U, Gm, 0.) on my computer, the CPU times are as follows:

  • Version 1.10.7: 180.753 ns (0 allocations: 0 bytes)
  • Version 1.11.2: 1.290 μs (0 allocations: 0 bytes)

This means version 1.11.2 is approximately 7 times slower (1.290e-6 / 1.80753e-7 ≈ 7.1).

I greatly appreciate your interest and assistance.
Thank you!

using BenchmarkTools
using SIMD


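# `data` holds the `s` SIMD lanes contiguously as the first array dimension, so the
# getindex/setindex! methods below load and store a whole Vec{s,T} per logical element.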
struct VecArray{s_,T,dim}
    data::Array{T,dim}
end


Base.eltype(::Type{VecArray{s,T,dim}}) where {s,T,dim} = T


@inline function Base.getindex(v::VecArray{s,T,dim},k...) where {s,T,dim}
    Vec{s,T}(NTuple{s,T}(@inbounds v.data[is,k...] for is=1:s))
end


@inline function Base.setindex!(v::VecArray{s,T,dim},vi::Vec{s,T},k...) where {s,T,dim}
    @inbounds for is in 1:s
         v.data[is,k...] = vi[is]
    end
    return nothing
end


@inline function Base.setindex!(v::VecArray{s,T,dim},vk::T2,k...) where {s,T,T2,dim}
    vk_ = convert(T,vk)
    @inbounds for is in 1:s
         v.data[is,k...] = vk_
    end
    return nothing
end



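# N-body right-hand side: accelerations are accumulated into F[:, :, 2] and the
# velocities stored in u[:, :, 2] are copied into F[:, :, 1]; with VecArray
# arguments, every read and write below operates on a whole Vec of s values at once.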
function NbodyODE!(F,u,Gm,t)
     N = length(Gm)
     for i in 1:N
        for k in 1:3
            F[k, i, 2] = 0
        end
     end
     for i in 1:N
        xi = u[1,i,1]
        yi = u[2,i,1]
        zi = u[3,i,1]
        Gmi = Gm[i]
        for j in i+1:N
            xij = xi - u[1,j,1]
            yij = yi - u[2,j,1]
            zij = zi - u[3,j,1]
            Gmj = Gm[j]
            dotij = (xij*xij+yij*yij+zij*zij)
            auxij = 1/(sqrt(dotij)*dotij)
            Gmjauxij = Gmj*auxij
            F[1,i,2] -= Gmjauxij*xij
            F[2,i,2] -= Gmjauxij*yij
            F[3,i,2] -= Gmjauxij*zij
            Gmiauxij = Gmi*auxij
            F[1,j,2] += Gmiauxij*xij
            F[2,j,2] += Gmiauxij*yij
            F[3,j,2] += Gmiauxij*zij
        end
     end
     for i in 1:3, j in 1:N
        F[i,j,1] = u[i,j,2]
     end
    return nothing
end


s=8
N=5
u0=rand(3,N,2)
Gm=rand(N)

dims=size(u0)
zz=zeros(Float64, s, dims...)
U=VecArray{s,Float64,length(dims)+1}(zz)
dU=deepcopy(U)

indices=eachindex(u0)
for k in indices
    U[k]=u0[k]
end    

@btime NbodyODE!(dU,U,Gm,0.)

1 Like

Edit to @btime NbodyODE!($dU,$U,$Gm,$(0.)) so you don't run into compilation artifacts for non-const globals and atypical constants.

Tuple(::Generator{..}) isn't inlined in 1.11. There are many calls to this.
Add a call-site @inline to get rid of them:

@inline function Base.getindex(v::VecArray{s,T,dim},k...) where {s,T,dim}
    Vec{s,T}(@inline NTuple{s,T}(@inbounds v.data[is,k...] for is=1:s))
end

3 Likes

…or use the ntuple function (perhaps more idiomatic?):

@inline function Base.getindex(v::VecArray{s,T,dim},k...) where {s,T,dim}
    Vec{s,T}(ntuple(is -> @inbounds(v.data[is,k...]), Val(s)))
end

4 Likes

I guess Val here is unnecessary, right? s is already a type parameter, and this could be just ntuple(..., s) with the function specialized for each s anyway.

1 Like

The implementations are different:

  • ntuple(f, Val(n)) is always[1] type stable
  • ntuple(f, n) is special-cased for every n <= 10 to be type stable if n is constpropped, while for n > 10 it falls back to an explicitly @noinlined ([f(i) for i in 1:n]...,)

When n is a literal or type parameter I always prefer the Val version for predictable performance without gotchas.
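
A minimal illustration of the difference, with lane as a made-up element function:

using Test: @inferred

lane(i) = 2.0 * i

# Val form: the length is part of the type, so the result is always inferred
# as a concrete NTuple{16, Float64}.
t1 = @inferred ntuple(lane, Val(16))

# Integer form: special-cased and fine for n <= 10 (or when n const-props),
# but for larger runtime lengths it goes through the @noinline fallback and
# is not inferred as concretely.
n = 16
t2 = ntuple(lane, n)

t1 == t2   # true; the values agree, only inference/codegen differs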



  1. Except if the optionally-generated function chooses the non-generated version, but I don't think that ever happens in normal usage. ↩︎

2 Likes

And it's documented:

By taking a Val(N) argument, it is possible that this version of ntuple may generate more efficient code than the version taking the length as an integer. But ntuple(f, N) is preferable to ntuple(f, Val(N)) in cases where N cannot be determined at compile time.

Though it's worth noting that ntuple(f, Val(N)) is implemented with an optionally-generated function. I think even with ntuple inlining, the ungenerated Tuple call could end up type-unstable.

1 Like

Here's the only comment I've been able to find regarding when the generated vs. non-generated version is used in an optionally-generated function. It's been a while, but I'm assuming this is still true, meaning it's safe to assume that you always get the generated version in normal usage.

Currently the @generated branch is always used. In the future, which branch is used will mostly depend on whether the JIT compiler is enabled and available, and if it's not available, then it will depend on how much we were able to compile before the compiler was taken away. So I think it will mostly be a concern for those that might need static compilation and JIT-less deployment.

I have tested the implementation with the recommended modification (understanding that this is the preferred option),

@inline function Base.getindex(v::VecArray{s,T,dim},k...) where {s,T,dim}
    Vec{s,T}(ntuple(is -> @inbounds(v.data[is,k...]), s))
end 

and I now get good CPU times in both versions of Julia.
I am sincerely very grateful for your help,
Mikel

4 Likes