Nbabel nbody integrator speed up

Something fishy is going on with this benchmarks.

First of all, it is really strange that python implementation is so much faster than C++/Fortran.

Secondly, after I fix various allocations issues, speedup was minimal, just a few seconds. After profiling it turns out that compute_acceleration is taking huge chunk of time, which is reasonable, since it is O(N^2) and everything else is just O(N).

But after I throw in @Threads.threads in compute_acceleration loop, timing became much much better:

  • 18 sec for 1k
  • 55 sec for 2k

So, my suspicions that either python implementation uses more sophisticated process of accelerations calculation, or it uses multi-threading too.

Code is here: https://gist.github.com/Arkoniak/b44d3e61b6b24d5241fc5d6d173b9e1b

1 Like

There’s a risk that if i != j is preventing SIMD utilization in the innermost loop.

1 Like

I’m definitely look into that, thank you!

Yes, I’m Brazilian, in fact I’m a masters candidate at UNICAMP’s Computing Institute. I am definitely look into that pdf, thank you professor.

1 Like

That python benchmark uses several tricks, including Pythran and another library, which beats (so far) several implementations.

Edit: Thank you so much for editing it and putting in a gist!

In the Fortran and Python implementations the loop to compute accelerations is done properly (N(N-1)/2) cycles:

     do i = 1, P%nstars - 1
         do j = i + 1, P%nstars
   for index_p0 in range(nb_particules - 1):
        for index_p1 in range(index_p0 + 1, nb_particules):

while in the Julia implementation it is not (N^2 cycles):

    @inbounds @Threads.threads for i = 1:N
        @fastmath @inbounds @simd for j = 1:N
            if i != j

Concerning the code:

My “base time is” 661 ms (8.41 MB allocations), for the code provided.

Removing all the decorations (simd, inbounds, threads, etc), actually made the code faster (586 ms).

Fixing the loop above and using static arrays took the time to 581 ms and allocations went to 174 kb. No allocations occur anymore in the main routines.

Now, adding back the @inbounds flag to the acceleration and energy computation loops, time goes to 500 ms, 171.78 KiB.

If one considers this speedup reliable (~30%), that would put the Julia benchmark there in pair with the python and rust codes.

I am running with this data, with

using BenchmarkTools
include("./nbabel.jl")
using .NB
@btime NBabel("./input128.dat")
Code


module NB

"""
This is an implementation of the NBabel N-body problem.
See nbabel.org for more information.
This is 'naive & native' Julia with type and loop annotations for
fastmath and vectorization, because this is really is no effort.
Without annotations the code will be as slow as naive Python/IDL.
    include("nbabel.jl")
    NBabel("IC/input128", show=true)
Julius Donnert INAF-IRA 2017
"""

using Printf
using DelimitedFiles
using StaticArrays

tbeg = 0.0
tend = 10.0
dt = 0.001

function NBabel(fname::String; tend = tend, dt = dt, show=false)

    if show
	    println("Reading file : $fname")
    end

    ID, mass, pos, vel = read_ICs(fname)

    return NBabelCalcs(ID, mass, pos, vel, tend, dt, show)
end

function NBabelCalcs(ID, mass, pos, vel, tend = tend, dt = dt, show=false)
    acc = similar(vel)
    acc = compute_acceleration(pos, mass, acc)
    last_acc = copy(acc)

    Ekin, Epot = compute_energy(pos, vel, mass)
    Etot_ICs = Ekin + Epot

    t = 0.0
    nstep = 0

    while t < tend

        pos = update_positions(pos, vel, acc, dt)

        acc, last_acc = last_acc, acc

        acc = compute_acceleration(pos, mass, acc)

        vel = update_velocities(vel, acc, last_acc, dt)
        
        t += dt
        nstep += 1

        if show && nstep%100 == 0

            Ekin, Epot = compute_energy(pos, vel, mass)
            Etot = Ekin + Epot
            dE = (Etot - Etot_ICs)/Etot_ICs

            @printf "t = %g, Etot=%g, Ekin=%g, Epot=%g, dE=%g \n" t Etot Ekin Epot dE
        end
    end

    Ekin, Epot = compute_energy(pos, vel, mass)
    Etot = Ekin + Epot
    return (; Ekin, Epot, Etot)
    # return size(pos,1)
end

function update_positions(pos, vel, acc, dt)
    N = length(pos)
    for i in 1:N
        pos[i] = (0.5 * acc[i] * dt + vel[i])*dt + pos[i]
    end
    return pos
end

function update_velocities(vel, acc, last_acc, dt)
    N = length(vel)
    for i in 1:N
        vel[i] = vel[i] + 0.5 * dt * (acc[i] + last_acc[i])
    end
    return vel
end

"""
Force calculation.
"""
function compute_acceleration(pos, mass, acc)
    N = length(pos)
    for i in 1:N
      acc[i] = zeros(eltype(acc)) 
    end
    @inbounds for i = 1:N-1
        for j = i+1:N
            dr = pos[i] - pos[j]
            rinv3 = 1/sqrt(sum(dr .^ 2)) ^ 3
            acc[i] = acc[i] - mass[i] * rinv3 * dr
            acc[j] = acc[j] + mass[j] * rinv3 * dr
        end
    end
    return acc
end


"""
Kinetic and potential energy.
"""
function compute_energy(pos, vel, mass)
    N = length(vel)

    Ekin = 0.0

    for i = 1:N
        Ekin += 0.5 * mass[i] * sum(vel[i] .^ 2)
    end

    Epot = 0.0

    @inbounds for i = 1:N-1
        for j = i+1:N
            dr = pos[i] - pos[j]
            rinv = 1/sqrt(sum(dr .^ 2))
            Epot -= mass[i] * mass[j] * rinv
        end
    end

    return Ekin, Epot
end

function read_ICs(fname::String)

    ICs = readdlm(fname)

    N = size(ICs,1)

    pos = Array{SVector{3,Float64}}(undef, N)
    vel = Array{SVector{3,Float64}}(undef, N)

    id = @view ICs[:, 1]
    mass = @view ICs[:, 2]

    for i in axes(ICs, 1)
        pos[i] = SVector{3,Float64}(ICs[i, 3], ICs[i, 4], ICs[i, 5])
    end

    for i in axes(ICs, 1)
        vel[i] = SVector{3,Float64}(ICs[i, 6], ICs[i, 7], ICs[i, 8])
    end

    return id, mass, pos, vel
end

export NBabel

end


#function main(args)
#    NBabel(args[1], show=true)
#end
#main(ARGS)
6 Likes

If you’re referring to these:

function update_positions(pos::Array{Float64,2}, vel::Array{Float64,2},
                          acc::Array{Float64,2}, dt::Float64)

    @fastmath @inbounds @simd for i=1:size(pos,1)

        pos[i,1] += dt * vel[i,1] + 0.5 * dt^2 * acc[i,1]
        pos[i,2] += dt * vel[i,2] + 0.5 * dt^2 * acc[i,2]
        pos[i,3] += dt * vel[i,3] + 0.5 * dt^2 * acc[i,3]

    end

    return pos
end

Then no, this is actually ideal.

That is, assuming the arrays have a lot of rows. Given that removing @simd and friends improved performance, I’m guessing they have <16?

EDIT: Actually, assuming these arrays have only 3 columns…

function update_positions(pos::Array{Float64,2}, vel::Array{Float64,2},
                          acc::Array{Float64,2}, dt::Float64)

    @fastmath @inbounds @simd for i=eachindex(pos)

        pos[i] += dt * vel[i] + 0.5 * dt^2 * acc[i]

    end

    return pos
end

Or use 1:3size(pos,1) assuming all of them have the same number of rows. My point is linear indexing is best, and means memory layout doesn’t matter in this part of the code, so long as it is dense.

Meaning if another part of the code is better off with these transposed, then it’s worth it as this part is unaffected.

2 Likes

Now I noticed that I was working on top of your version already!

The original version has this benchmark:

julia> @btime NBabel("./input128.dat")
  799.363 ms (2662599 allocations: 308.29 MiB)

While my final version has:

julia> @btime NBabel("./input128.dat")
  510.348 ms (3076 allocations: 171.78 KiB)
(Ekin = 5.542372639657371, Epot = -0.4094795146623831, Etot = 5.132893124994988)

Thus, if we take this comparison seriously, the speedup is of about 56%, thus the Julia version would be as fast or faster than the Python version of the published benchmark.

1 Like

Wow, that’s impressive. All of it with a few tweaks!

“input128.dat” is not very relevant here for two reasons

  1. Threads have more or less constant overhead which is less noticeable if the number of particles is large. But for the small number of particles, speedup from parallel execution is smaller than overhead from parallelization, so you can see a slowdown.

  2. Acceleration is O(N^2), while all other operations are O(N). Which means that for different numbers of particles proportion between acceleration calculation and everything else will be different. And in the end it means that timing will grow nonlinear. Things that were important for small numbers of particles will be irrelevant for large and vice versa.

But yes, with Threads removing (but everything else remaining in place) and fixed acceleration calculations I get 30 sec for 1k and 119 sec for 2k, but this is including compilation and initial data construction which can be speed up too.

Now, interesting question, who can write multi-threading version, so it run even faster :slight_smile:

2 Likes

Fair points. Redoing the benchmarks with 2k particles now…

I was, but then I’m confused, wouldn’t it in general be faster to access them in the order

pos[1,i]
pos[2,i]
pos[3,i]

and then increment, i, i.e., accessing one column at the time? I assume that pos[1,i] and pos[2,i] are next to each other in memory. Could there be cases where it is not more efficient to access contiguous memory?

2 Likes

One reasonable way to do that is to linearize the indexes of of the loops that do N*(N-1)/2 operatons, and split the loop in two, one on the number of threads and the other on the fraction of the indexes for that thread.

(anyway, this is a very naive implementation of a particle simulation, other things should be done before actually pretending this is a competitive code: linked cells, multipoles, etc)

For 2K I get, in the original implementation:

julia> @btime NBabel("./input2k.dat")
  210.316 s (41123336 allocations: 4.74 GiB)
(Ekin = 0.2428869241248313, Epot = -0.4780190257062157, Etot = -0.2351321015813844)

And in the latest one:

julia> @btime NBabel("./input2k.dat")
  172.240 s (70714 allocations: 2.40 MiB)
(Ekin = 0.24191235979106984, Epot = -0.46958

Thus, a 210/172 = 22% speedup. They report a 40% difference in favor to the python implementation, so there still room for improvement (although in their time-scale this would take Julia to 141s, which is already faster than the C++ or Fortran versions, thus clearly there are some catches in the benchmark, we are not comparing the same thing in all cases).

Edit: The Fortran version published there takes here 217s.

I do not know how to benchmark the python version because I do not have jit, etc, installed.

1 Like
using BenchmarkTools

pos = rand(128,3); dest = similar(pos);
pos_t = copy(pos'); dest_t = similar(pos_t);

function foo!(dest, pos)
    @inbounds @simd ivdep for i in axes(pos, 2)
        dest[1,i] = pos[1,i] * 2
        dest[2,i] = pos[2,i] * 3
        dest[3,i] = pos[3,i] * 4
    end
end

@benchmark foo!($dest_t, $pos_t)
@benchmark foo!($dest', $pos')

This yields:

julia> @benchmark foo!($dest_t, $pos_t)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     68.267 ns (0.00% GC)
  median time:      68.645 ns (0.00% GC)
  mean time:        72.310 ns (0.00% GC)
  maximum time:     187.716 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     976

julia> @benchmark foo!($dest', $pos')
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.311 ns (0.00% GC)
  median time:      17.783 ns (0.00% GC)
  mean time:        18.416 ns (0.00% GC)
  maximum time:     52.715 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

The transposed version (i.e., equivalent to accessing them pos[i,1], pos[i,2], etc, was about 4x faster.

This is because 3 is not a good number for SIMD, so accessing them

pos[1,i]
pos[2,i]
pos[3,i]

doesn’t let you SIMD well.
As a rule of thumb, it’s a lot easier to SIMD between loop iterations.
Meaning, for best performance, you want successive iterations of the loop to access successive indexes.

5 Likes

I see, that’s very good to know! Does this reasoning extend to SVector{3} as well, which I assume is packed into a vector like a matrix of 3 x N? I could then potentially slow down my code by switching from the N x 3 matrix layout to a vector of SVector?

I did a similar benchmark to yours, and found that the N x 4 layout was faster than the 4 x N as well, but 4 should be a good value to SIMD on? I had to go up to around N x 16 before what I previously thought was the better memory layout to actually be faster. Is it so hard for the compiler to apply SIMD within a single loop iteration that the arrays have to be large enough for cache locality to become important before the column-major access wins out?

2 Likes

Well, there are two things that should be done to proper multithread it.

  1. Split in chunks as you said, which can be done in multiple ways. In similar situations I’ve used nonlinear splitting over index of outer loop, so every chunk has more or less equal number of calculations and then run them with @spawn. It usually worked the best.

  2. One should be careful with race conditions - different threads will update the same elements of the acceleration array, so it’s best to introduce nthreads arrays and update each of them in separate thread and as a final step add them together to a final array.

2 Likes

You could. HybridArrays may give you the best of both worlds, by making them N x 3. A slice A[n,:] should still return an SVector. If everything inlines, it should still be able to SIMD across loop iterations, while giving you the convenience of expressing some operations on the vectors instead making everything loops.

Another factor to consider is unrolling. In the example with N x 4 vs 4 x N, the N x 4 case will SIMD and be 4x unrolled (one per column). The 4 x N will only do a single operation per loop iteration.

Theoretically, when you don’t have dependencies (i.e, s += x[i], where each iteration depends on the previous), your CPU should be able to execute different loop iterations in parallel via out of order processing + speculative execution, but in practice, I normally find some unrolling tends to help.
Maybe it’s because of better out of order, or maybe it’s because of better density of relevant instructions, vs things like incrementing and checking loop counters.

4 Likes

Thanks for the details Chris. Stuff is never as simple as one wish it was, but it’s great to have knowledgeable people around to learn from :slight_smile:

2 Likes

SVector{3} is not packed into vector, it lives on a stack, that’s the main point.