Poor performance of SIMD vectorization in the latest version of Julia (v1.11.2)

I have observed poor performance of our SIMD-vectorized numerical integration method when using the latest version of Julia (1.11.2).

To show this, I am attaching two precision diagrams obtained by running this Jupyter document:

Thanks

I don't know about others, but I have little clue what I'm looking at or how to infer the poor performance. What is log10(CPU) on the x-axis? Is that time?

1 Like

Looks like the x-axis is logarithmic time; red and blue are non-SIMD and SIMD, respectively. On 1.10.7 the SIMD version is an order of magnitude faster than the non-SIMD one, while on 1.11.2 there is little difference, and it is the SIMD version that has become slower.

I'm not super eager to dig into Jupyter docs; is it possible to extract the relevant code parts and paste them here, @Mikel_Antonana?

2 Likes

Extra strange, because at first glance the package uses SIMD.jl, suggesting explicit vectorization rather than the automatic vectorization that's undergoing a large ecosystem shift.

Yes, our implementation applies explicit SIMD vectorization. The experiment consists of running many integrations with different step sizes, and for each of them we record the error in energy and the CPU time.

The code where we obtain the CPU time is the following:

        # warm-up solve, also used to check that the integration succeeded
        solx = solve(prob, alg, dt=dt0, adaptive=adaptiveB, save_everystep=false)
        if solx.retcode == ReturnCode.Success
           # average the elapsed time of nruns repeated solves
           cpus[i] = 0.
           for k in 1:nruns
               cpus[i] += @elapsed solve(prob, alg, dt=dt0, adaptive=adaptiveB, save_everystep=false)
           end
           cpus[i] = cpus[i]/nruns
        else
           cpus[i] = Inf   # failed integrations are marked with Inf
        end

Sorry, that's not helpful.

It is unclear what you mean by "explicit SIMD vectorization" – people use this term to mean anything from "sprinkle @simd over the code" to "oh yeah, that's hand-written assembly / llvmcall".

The very first thing to look at is which package versions you have on old and new Julia. You do that by looking into your Project.toml and Manifest.toml. (I'm not very firm with Pkg – you can find out the path via Pkg.status(), but there should be a REPL command to dump your environment.) Then you diff them – maybe some package versions changed?
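
For example, a rough sketch of locating the environment files (the paths below are the default per-version environments and may differ on your machine):

import Pkg
Pkg.status()                    # summary of the active environment
println(Base.active_project())  # path to its Project.toml; Manifest.toml lives next to it

# then, from a shell, diff the stored manifests of the two Julia versions, e.g.
# diff ~/.julia/environments/v1.10/Manifest.toml ~/.julia/environments/v1.11/Manifest.toml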

An especially important candidate is LoopVectorization.jl, i.e. people who use the @turbo macro. That could explain everything, since that isn't really working anymore on 1.12.

I did some of your work – NbodyIRKGL16.jl/Manifest.toml at 65a2ee801939e89315e3a97b48477bc3d51a5900 · mikelehu/NbodyIRKGL16.jl · GitHub

So that gives us a good guess.

Next, you need to figure out where the code is slower. For that you profile your code and look at a flamegraph. Most probably, there is one obvious function that takes all the time in both old and new Julia versions. If that code is using @turbo, then you have found the culprit!
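
For instance, a minimal profiling sketch (reusing prob, alg, dt0, and adaptiveB from the timing snippet above; ProfileView.jl is just one way to render a flame graph):

using Profile

Profile.clear()
@profile for _ in 1:100
    solve(prob, alg, dt=dt0, adaptive=adaptiveB, save_everystep=false)
end
Profile.print(format=:flat, sortedby=:count)   # flat text summary of sample counts

# or, interactively:
# using ProfileView
# @profview solve(prob, alg, dt=dt0, adaptive=adaptiveB, save_everystep=false)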

Then, you should link that code here. Somebody will need to look at the code and improve it, i.e. do some of the transformations from the old @turbo by hand.

4 Likes

The slowdown is on 1.11 vs 1.10, not on 1.12. As far as I know, @turbo does work on 1.11.

3 Likes

LV isn't in their Project.toml, and from a cursory look their code just uses SIMD.jl directly. Maybe another direct dependency uses LV? It's not obvious to me whether their code or an imported routine is taking up runtime.

Definitely worth narrowing down and benchmarking the hot spots, whatever the cause.

1 Like

Yes, the vectorization in our implementation is based only on the SIMD.jl package. I verified that all package versions installed in Julia 1.11.2 are identical to those in Julia 1.10:

Status ~/.julia/environments/v1.11/Project.toml
[2b5f629d] DiffEqBase v6.161.0
[7034ab61] FastBroadcast v0.3.5
[7073ff75] IJulia v1.26.0
[1dea7af3] OrdinaryDiffEq v6.90.1
[d96e819e] Parameters v0.12.3
[91a5bcdd] Plots v1.40.9
[731186ca] RecursiveArrayTools v3.27.4
[189a3867] Reexport v1.2.2
[fdea26ae] SIMD v3.7.0
[0bca4576] SciMLBase v2.70.0

foobar_lv2's concern is that the code you benchmarked may primarily use an imported method that does use @turbo, even indirectly, and LoopVectorization does show up in the stored Manifest.toml. But it's too soon to say whether that's the case before more profiling. It was also too soon to say the difference is in SIMD; its performance becoming closer to the non-SIMD(?) version could be caused by a regression in a completely independent aspect. The fact that you use manual vectorization actually leans away from that hypothesis, because it's much more stable than autovectorization, but it hardly disproves it either, and it's just as plausible that v1.11's switch to LLVM 16 changed something about the compiled SIMD code.

1 Like

If using SIMD.jl regressed, the first thing I'd do is profile, like others suggested.

Assuming the two code versions are almost the same, including uses of external packages, differing only in use of SIMD.jl vs not within your own code, then at least we can be reasonably confident that the changes are with respect to your own code, limiting the need to search other packages.

Once you've profiled to find the guilty functions that slowed down, I'd look at them.
You can define _a = Ref{Any}() in your REPL, and then use Revise.jl to add _a[] = (args, to, the, function....) to get all the args of even deeply nested function calls in your REPL.
Then, remove that line after running the code to get the args, and inspect/microbenchmark from your REPL.
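
A hypothetical sketch of the capture step (inner_step! and its arguments are made-up stand-ins for whatever function you want to inspect):

_a = Ref{Any}()   # defined once in the REPL

function inner_step!(cache, u, p, t)
    _a[] = (cache, u, p, t)   # temporary line added via Revise; remove it afterwards
    # ... original body of the function ...
end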

julia> args = _a[];

julia> @code_typed YourPackage.your_functions(args...)

julia> Cthulhu.@descend YourPackage.your_functions(args...)

Of course, you could also Cthulhu.@descend directly if everything is type stable from your starting point.

One thing you can look for is if things stopped inlining that used to inline before. SIMD.jl is vulnerable to causing small functions not to inline, and that could cause major regressions if there was a slight change in heuristics.
If you find some culprits, you can use @inline to fix it.

The problem is likely to be something else entirely, of course, but you'd have to look to find it. Not much use for us to speculate blindly.

8 Likes

I have prepared the following code, which might help us identify the problem.

When I run @btime NbodyODE!(dU, U, Gm, 0.) on my computer, the CPU times are as follows:

  • Version 1.10.7: 180.753 ns (0 allocations: 0 bytes)
  • Version 1.11.2: 1.290 μs (0 allocations: 0 bytes)

This means version 1.11.2 is approximately 7 times slower (1.290e-6 / 1.80753e-7 ≈ 7.1).

I greatly appreciate your interest and assistance.
Thank you!

using BenchmarkTools
using SIMD


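# `data` holds the `s` SIMD lanes contiguously as the first array dimension, so the
# getindex/setindex! methods below load and store a whole Vec{s,T} per logical element.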
struct VecArray{s_,T,dim}
    data::Array{T,dim}
end


Base.eltype(::Type{VecArray{s,T,dim}}) where {s,T,dim} = T


@inline function Base.getindex(v::VecArray{s,T,dim},k...) where {s,T,dim}
    Vec{s,T}(NTuple{s,T}(@inbounds v.data[is,k...] for is=1:s))
end


@inline function Base.setindex!(v::VecArray{s,T,dim},vi::Vec{s,T},k...) where {s,T,dim}
    @inbounds for is in 1:s
         v.data[is,k...] = vi[is]
    end
    return nothing
end


@inline function Base.setindex!(v::VecArray{s,T,dim},vk::T2,k...) where {s,T,T2,dim}
    vk_ = convert(T,vk)
    @inbounds for is in 1:s
         v.data[is,k...] = vk_
    end
    return nothing
end



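# N-body right-hand side: accelerations are accumulated into F[:, :, 2] and the
# velocities stored in u[:, :, 2] are copied into F[:, :, 1]; with VecArray
# arguments, every read and write below operates on a whole Vec of s values at once.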
function NbodyODE!(F,u,Gm,t)
     N = length(Gm)
     for i in 1:N
        for k in 1:3
            F[k, i, 2] = 0
        end
     end
     for i in 1:N
        xi = u[1,i,1]
        yi = u[2,i,1]
        zi = u[3,i,1]
        Gmi = Gm[i]
        for j in i+1:N
            xij = xi - u[1,j,1]
            yij = yi - u[2,j,1]
            zij = zi - u[3,j,1]
            Gmj = Gm[j]
            dotij = (xij*xij+yij*yij+zij*zij)
            auxij = 1/(sqrt(dotij)*dotij)
            Gmjauxij = Gmj*auxij
            F[1,i,2] -= Gmjauxij*xij
            F[2,i,2] -= Gmjauxij*yij
            F[3,i,2] -= Gmjauxij*zij
            Gmiauxij = Gmi*auxij
            F[1,j,2] += Gmiauxij*xij
            F[2,j,2] += Gmiauxij*yij
            F[3,j,2] += Gmiauxij*zij
        end
     end
     for i in 1:3, j in 1:N
        F[i,j,1] = u[i,j,2]
     end
    return nothing
end


s=8
N=5
u0=rand(3,N,2)
Gm=rand(N)

dims=size(u0)
zz=zeros(Float64, s, dims...)
U=VecArray{s,Float64,length(dims)+1}(zz)
dU=deepcopy(U)

indices=eachindex(u0)
for k in indices
    U[k]=u0[k]
end    

@btime NbodyODE!(dU,U,Gm,0.)

1 Like

Edit to @btime NbodyODE!($dU,$U,$Gm,$(0.)) so you don't run into compilation artifacts for non-const globals and atypical constants.

Tuple(::Generator{..}) isn't inlined in 1.11. There are many calls to this.
Add a call-site @inline to get rid of them:

@inline function Base.getindex(v::VecArray{s,T,dim},k...) where {s,T,dim}
    Vec{s,T}(@inline NTuple{s,T}(@inbounds v.data[is,k...] for is=1:s))
end

3 Likes

…or use the ntuple function (perhaps more idiomatic?):

@inline function Base.getindex(v::VecArray{s,T,dim},k...) where {s,T,dim}
    Vec{s,T}(ntuple(is -> @inbounds(v.data[is,k...]), Val(s)))
end

4 Likes

I guess Val here is unnecessary, right? s is already a type parameter, and this could be just ntuple(..., s) with the function specialized for each s anyway.

1 Like

The implementations are different:

  • ntuple(f, Val(n)) is always[1] type stable
  • ntuple(f, n) is special-cased for every n <= 10 to be type stable if n is constpropped, while for n > 10 it falls back to an explicitly @noinlined ([f(i) for i in 1:n]...,)

When n is a literal or type parameter I always prefer the Val version for predictable performance without gotchas.
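
A minimal illustration of the difference, with lane as a made-up element function:

using Test: @inferred

lane(i) = 2.0 * i

# Val form: the length is part of the type, so the result is always inferred
# as a concrete NTuple{16, Float64}.
t1 = @inferred ntuple(lane, Val(16))

# Integer form: special-cased and fine for n <= 10 (or when n const-props),
# but for larger runtime lengths it goes through the @noinline fallback and
# is not inferred as concretely.
n = 16
t2 = ntuple(lane, n)

t1 == t2   # true; the values agree, only inference/codegen differs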



  1. Except if the optionally-generated function chooses the non-generated version, but I don't think that ever happens in normal usage. ↩︎

2 Likes

And it's documented:

By taking a Val(N) argument, it is possible that this version of ntuple may generate more efficient code than the version taking the length as an integer. But ntuple(f, N) is preferable to ntuple(f, Val(N)) in cases where N cannot be determined at compile time.

Though it's worth noting that ntuple(f, Val(N)) is implemented with an optionally-generated function. I think even with ntuple inlining, the ungenerated Tuple call could end up type-unstable.

1 Like

Here's the only comment I've been able to find regarding when the generated vs. non-generated version is used in an optionally-generated function. It's been a while, but I'm assuming this is still true, meaning it's safe to assume that you always get the generated version in normal usage.

Currently the @generated branch is always used. In the future, which branch is used will mostly depend on whether the JIT compiler is enabled and available, and if it's not available, then it will depend on how much we were able to compile before the compiler was taken away. So I think it will mostly be a concern for those that might need static compilation and JIT-less deployment.

I have tested the implementation with the recommended modification (understanding that this is the preferred option),

@inline function Base.getindex(v::VecArray{s,T,dim},k...) where {s,T,dim}
    Vec{s,T}(ntuple(is -> @inbounds(v.data[is,k...]), s))
end 

and I now get good CPU times in both versions of Julia.
I am sincerely very grateful for your help,
Mikel

4 Likes