Nbabel nbody integrator speed up

Hi folks, I found this implementation of a Nbabel integrator in Julia, which I believe it’s a very simple one. The author report a runtime of `2.57`s against `0.48`s of Pythran. Would any of you help me improve this? I believe we can leverage packages like LoopVectorization and Tullio (perhaps?).

``````"""
This is an implementation of the NBabel N-body problem.

This is 'naive & native' Julia with type and loop annotations for
fastmath and vectorization, because this is really is no effort.

Without annotations the code will be as slow as naive Python/IDL.

include("nbabel.jl")
NBabel("IC/input128", show=true)

Julius Donnert INAF-IRA 2017
"""

using Printf
using DelimitedFiles

tbeg = 0.0
tend = 10.0
dt = 0.001

function NBabel(fname::String; show=false)

if show
end

ID, mass, pos, vel = read_ICs(fname)

acc = similar(vel)
acc = compute_acceleration(pos, mass, acc)
last_acc = copy(acc)

Ekin, Epot = compute_energy(pos, vel, mass)
Etot_ICs = Ekin + Epot

t = 0.0
nstep = 0

while t < tend

pos = update_positions(pos, vel, acc, dt)

last_acc[:,:] = acc[:,:]

acc = compute_acceleration(pos, mass, acc)

vel = update_velocities(vel, acc, last_acc, dt)

t += dt
nstep += 1

if show && nstep%100 == 0

Ekin, Epot = compute_energy(pos, vel, mass)
Etot = Ekin + Epot
dE = (Etot - Etot_ICs)/Etot_ICs

@printf "t = %g, Etot=%g, Ekin=%g, Epot=%g, dE=%g \n" t Etot Ekin Epot dE
end
end

return size(pos,1)
end

function update_positions(pos::Array{Float64,2}, vel::Array{Float64,2},
acc::Array{Float64,2}, dt::Float64)

@fastmath @inbounds @simd for i=1:size(pos,1)

pos[i,1] += dt * vel[i,1] + 0.5 * dt^2 * acc[i,1]
pos[i,2] += dt * vel[i,2] + 0.5 * dt^2 * acc[i,2]
pos[i,3] += dt * vel[i,3] + 0.5 * dt^2 * acc[i,3]

end

return pos
end

function update_velocities(vel::Array{Float64,2}, acc::Array{Float64,2},
last_acc::Array{Float64,2}, dt::Float64)

@fastmath @inbounds @simd for i=1:size(vel,1)

vel[i,1] += 0.5 * dt * (acc[i,1] + last_acc[i,1])
vel[i,2] += 0.5 * dt * (acc[i,2] + last_acc[i,2])
vel[i,3] += 0.5 * dt * (acc[i,3] + last_acc[i,3])

end

return vel
end

"""
Force calculation.
"""
function compute_acceleration(pos::Array{Float64,2}, mass::Array{Float64,1},
acc::Array{Float64,2})
N = size(pos,1)

acc[:,:] .= 0

pos_i = [0.0,0,0]
acc_i = [0.0,0,0]

@inbounds for i = 1:N

pos_i[:] = pos[i,:]
acc_i[:] .= 0

@fastmath @inbounds @simd for j = 1:N
if i != j

dx = pos_i[1] - pos[j,1]
dy = pos_i[2] - pos[j,2]
dz = pos_i[3] - pos[j,3]

rinv3 = 1/sqrt((dx^2 + dy^2 + dz^2)^3)

acc_i[1] -= mass[j] * rinv3 * dx
acc_i[2] -= mass[j] * rinv3 * dy
acc_i[3] -= mass[j] * rinv3 * dz
end
end

acc[i,:] = acc_i[:]
end

return acc
end

"""
Kinetic and potential energy.
"""
function compute_energy(pos::Array{Float64,2}, vel::Array{Float64,2}, mass::Array{Float64})

N = size(vel,1)

Ekin = 0

@simd for i = 1:N
@inbounds Ekin += 0.5 * mass[i] * (vel[i,1]^2 + vel[i,2]^2 + vel[i,3]^2)
end

Epot = 0.0

@inbounds for i = 1:N-1
@fastmath @inbounds for j = i+1:N

dx = pos[i,1] - pos[j,1]
dy = pos[i,2] - pos[j,2]
dz = pos[i,3] - pos[j,3]

rinv = 1/sqrt(dx^2 + dy^2 + dz^2)
Epot -= mass[i] * mass[j] * rinv
end
end

return Ekin, Epot
end

N = size(ICs,1)

id = Array{Float64}(undef, N)
mass = Array{Float64}(undef, N)
pos = Array{Float64}(undef, N,3)
vel = Array{Float64}(undef, N,3)

id[:] = ICs[:,1]
mass[:] = ICs[:,2]

pos[:,1] = ICs[:,3]
pos[:,2] = ICs[:,4]
pos[:,3] = ICs[:,5]

vel[:,1] = ICs[:,6]
vel[:,2] = ICs[:,7]
vel[:,3] = ICs[:,8]

return id, mass, pos, vel
end

function main(args)
NBabel(args[1], show=true)
end
main(ARGS)
``````
1 Like

Have you read the performance tips in the Julia manual? There are a few simple tricks that will help you out. For starters, the memory layout of the matrices should be transposed for better memory access performance.

2 Likes

Since it is all 3 dimensional, one should use `SVector` or even tuples for point calculations. And double buffering instead of copying `last_acc[:, :] = acc[:, :]`

3 Likes

There are quite a few problems in that code, not only related to using Julia. For instance the loop that computes accelerations passes twice for each pair of particles. Fixing that only will reduce the time significantly. If you are willing to use this code as a learning experience, start removing all those flags of simd, inbounds, fast math, not only they are redundant, but the code has to be improved in many other ways before trying those.

Edit. From your name I guess you speak Portuguese. If that is the case, you might be interested in the course on particle simulations I am teaching. The material is here: https://github.com/m3g/CKP/blob/master/disciplina/simulacoes2.pdf

4 Likes

Something fishy is going on with this benchmarks.

First of all, it is really strange that python implementation is so much faster than C++/Fortran.

Secondly, after I fix various allocations issues, speedup was minimal, just a few seconds. After profiling it turns out that `compute_acceleration` is taking huge chunk of time, which is reasonable, since it is O(N^2) and everything else is just O(N).

But after I throw in `@Threads.threads` in compute_acceleration loop, timing became much much better:

• 18 sec for 1k
• 55 sec for 2k

So, my suspicions that either python implementation uses more sophisticated process of accelerations calculation, or it uses multi-threading too.

1 Like

There’s a risk that `if i != j` is preventing SIMD utilization in the innermost loop.

1 Like

I’m definitely look into that, thank you!

Yes, I’m Brazilian, in fact I’m a masters candidate at UNICAMP’s Computing Institute. I am definitely look into that pdf, thank you professor.

1 Like

That python benchmark uses several tricks, including Pythran and another library, which beats (so far) several implementations.

Edit: Thank you so much for editing it and putting in a gist!

In the Fortran and Python implementations the loop to compute accelerations is done properly (N(N-1)/2) cycles:

``````     do i = 1, P%nstars - 1
do j = i + 1, P%nstars
``````
``````   for index_p0 in range(nb_particules - 1):
for index_p1 in range(index_p0 + 1, nb_particules):
``````

while in the Julia implementation it is not (N^2 cycles):

``````    @inbounds @Threads.threads for i = 1:N
@fastmath @inbounds @simd for j = 1:N
if i != j
``````

Concerning the code:

My “base time is” 661 ms (8.41 MB allocations), for the code provided.

Removing all the decorations (simd, inbounds, threads, etc), actually made the code faster (586 ms).

Fixing the loop above and using static arrays took the time to 581 ms and allocations went to 174 kb. No allocations occur anymore in the main routines.

Now, adding back the `@inbounds` flag to the acceleration and energy computation loops, time goes to 500 ms, 171.78 KiB.

If one considers this speedup reliable (~30%), that would put the Julia benchmark there in pair with the python and rust codes.

I am running with this data, with

``````using BenchmarkTools
include("./nbabel.jl")
using .NB
@btime NBabel("./input128.dat")
``````
Code
``````

module NB

"""
This is an implementation of the NBabel N-body problem.
This is 'naive & native' Julia with type and loop annotations for
fastmath and vectorization, because this is really is no effort.
Without annotations the code will be as slow as naive Python/IDL.
include("nbabel.jl")
NBabel("IC/input128", show=true)
Julius Donnert INAF-IRA 2017
"""

using Printf
using DelimitedFiles
using StaticArrays

tbeg = 0.0
tend = 10.0
dt = 0.001

function NBabel(fname::String; tend = tend, dt = dt, show=false)

if show
end

ID, mass, pos, vel = read_ICs(fname)

return NBabelCalcs(ID, mass, pos, vel, tend, dt, show)
end

function NBabelCalcs(ID, mass, pos, vel, tend = tend, dt = dt, show=false)
acc = similar(vel)
acc = compute_acceleration(pos, mass, acc)
last_acc = copy(acc)

Ekin, Epot = compute_energy(pos, vel, mass)
Etot_ICs = Ekin + Epot

t = 0.0
nstep = 0

while t < tend

pos = update_positions(pos, vel, acc, dt)

acc, last_acc = last_acc, acc

acc = compute_acceleration(pos, mass, acc)

vel = update_velocities(vel, acc, last_acc, dt)

t += dt
nstep += 1

if show && nstep%100 == 0

Ekin, Epot = compute_energy(pos, vel, mass)
Etot = Ekin + Epot
dE = (Etot - Etot_ICs)/Etot_ICs

@printf "t = %g, Etot=%g, Ekin=%g, Epot=%g, dE=%g \n" t Etot Ekin Epot dE
end
end

Ekin, Epot = compute_energy(pos, vel, mass)
Etot = Ekin + Epot
return (; Ekin, Epot, Etot)
# return size(pos,1)
end

function update_positions(pos, vel, acc, dt)
N = length(pos)
for i in 1:N
pos[i] = (0.5 * acc[i] * dt + vel[i])*dt + pos[i]
end
return pos
end

function update_velocities(vel, acc, last_acc, dt)
N = length(vel)
for i in 1:N
vel[i] = vel[i] + 0.5 * dt * (acc[i] + last_acc[i])
end
return vel
end

"""
Force calculation.
"""
function compute_acceleration(pos, mass, acc)
N = length(pos)
for i in 1:N
acc[i] = zeros(eltype(acc))
end
@inbounds for i = 1:N-1
for j = i+1:N
dr = pos[i] - pos[j]
rinv3 = 1/sqrt(sum(dr .^ 2)) ^ 3
acc[i] = acc[i] - mass[i] * rinv3 * dr
acc[j] = acc[j] + mass[j] * rinv3 * dr
end
end
return acc
end

"""
Kinetic and potential energy.
"""
function compute_energy(pos, vel, mass)
N = length(vel)

Ekin = 0.0

for i = 1:N
Ekin += 0.5 * mass[i] * sum(vel[i] .^ 2)
end

Epot = 0.0

@inbounds for i = 1:N-1
for j = i+1:N
dr = pos[i] - pos[j]
rinv = 1/sqrt(sum(dr .^ 2))
Epot -= mass[i] * mass[j] * rinv
end
end

return Ekin, Epot
end

N = size(ICs,1)

pos = Array{SVector{3,Float64}}(undef, N)
vel = Array{SVector{3,Float64}}(undef, N)

id = @view ICs[:, 1]
mass = @view ICs[:, 2]

for i in axes(ICs, 1)
pos[i] = SVector{3,Float64}(ICs[i, 3], ICs[i, 4], ICs[i, 5])
end

for i in axes(ICs, 1)
vel[i] = SVector{3,Float64}(ICs[i, 6], ICs[i, 7], ICs[i, 8])
end

return id, mass, pos, vel
end

export NBabel

end

#function main(args)
#    NBabel(args[1], show=true)
#end
#main(ARGS)
``````
6 Likes

If you’re referring to these:

``````function update_positions(pos::Array{Float64,2}, vel::Array{Float64,2},
acc::Array{Float64,2}, dt::Float64)

@fastmath @inbounds @simd for i=1:size(pos,1)

pos[i,1] += dt * vel[i,1] + 0.5 * dt^2 * acc[i,1]
pos[i,2] += dt * vel[i,2] + 0.5 * dt^2 * acc[i,2]
pos[i,3] += dt * vel[i,3] + 0.5 * dt^2 * acc[i,3]

end

return pos
end
``````

Then no, this is actually ideal.

That is, assuming the arrays have a lot of rows. Given that removing `@simd` and friends improved performance, I’m guessing they have <16?

EDIT: Actually, assuming these arrays have only 3 columns…

``````function update_positions(pos::Array{Float64,2}, vel::Array{Float64,2},
acc::Array{Float64,2}, dt::Float64)

@fastmath @inbounds @simd for i=eachindex(pos)

pos[i] += dt * vel[i] + 0.5 * dt^2 * acc[i]

end

return pos
end
``````

Or use `1:3size(pos,1)` assuming all of them have the same number of rows. My point is linear indexing is best, and means memory layout doesn’t matter in this part of the code, so long as it is dense.

Meaning if another part of the code is better off with these transposed, then it’s worth it as this part is unaffected.

2 Likes

Now I noticed that I was working on top of your version already!

The original version has this benchmark:

``````julia> @btime NBabel("./input128.dat")
799.363 ms (2662599 allocations: 308.29 MiB)
``````

While my final version has:

``````julia> @btime NBabel("./input128.dat")
510.348 ms (3076 allocations: 171.78 KiB)
(Ekin = 5.542372639657371, Epot = -0.4094795146623831, Etot = 5.132893124994988)

``````

Thus, if we take this comparison seriously, the speedup is of about 56%, thus the Julia version would be as fast or faster than the Python version of the published benchmark.

1 Like

Wow, that’s impressive. All of it with a few tweaks!

“input128.dat” is not very relevant here for two reasons

1. Threads have more or less constant overhead which is less noticeable if the number of particles is large. But for the small number of particles, speedup from parallel execution is smaller than overhead from parallelization, so you can see a slowdown.

2. Acceleration is `O(N^2)`, while all other operations are `O(N)`. Which means that for different numbers of particles proportion between acceleration calculation and everything else will be different. And in the end it means that timing will grow nonlinear. Things that were important for small numbers of particles will be irrelevant for large and vice versa.

But yes, with Threads removing (but everything else remaining in place) and fixed `acceleration` calculations I get 30 sec for 1k and 119 sec for 2k, but this is including compilation and initial data construction which can be speed up too.

Now, interesting question, who can write multi-threading version, so it run even faster

2 Likes

Fair points. Redoing the benchmarks with 2k particles now…

I was, but then I’m confused, wouldn’t it in general be faster to access them in the order

``````pos[1,i]
pos[2,i]
pos[3,i]
``````

and then increment, `i`, i.e., accessing one column at the time? I assume that `pos[1,i]` and `pos[2,i]` are next to each other in memory. Could there be cases where it is not more efficient to access contiguous memory?

2 Likes

One reasonable way to do that is to linearize the indexes of of the loops that do N*(N-1)/2 operatons, and split the loop in two, one on the number of threads and the other on the fraction of the indexes for that thread.

(anyway, this is a very naive implementation of a particle simulation, other things should be done before actually pretending this is a competitive code: linked cells, multipoles, etc)

For 2K I get, in the original implementation:

``````julia> @btime NBabel("./input2k.dat")
210.316 s (41123336 allocations: 4.74 GiB)
(Ekin = 0.2428869241248313, Epot = -0.4780190257062157, Etot = -0.2351321015813844)
``````

And in the latest one:

``````julia> @btime NBabel("./input2k.dat")
172.240 s (70714 allocations: 2.40 MiB)
(Ekin = 0.24191235979106984, Epot = -0.46958
``````

Thus, a 210/172 = 22% speedup. They report a 40% difference in favor to the python implementation, so there still room for improvement (although in their time-scale this would take Julia to 141s, which is already faster than the C++ or Fortran versions, thus clearly there are some catches in the benchmark, we are not comparing the same thing in all cases).

Edit: The Fortran version published there takes here 217s.

I do not know how to benchmark the python version because I do not have jit, etc, installed.

1 Like
``````using BenchmarkTools

pos = rand(128,3); dest = similar(pos);
pos_t = copy(pos'); dest_t = similar(pos_t);

function foo!(dest, pos)
@inbounds @simd ivdep for i in axes(pos, 2)
dest[1,i] = pos[1,i] * 2
dest[2,i] = pos[2,i] * 3
dest[3,i] = pos[3,i] * 4
end
end

@benchmark foo!(\$dest_t, \$pos_t)
@benchmark foo!(\$dest', \$pos')
``````

This yields:

``````julia> @benchmark foo!(\$dest_t, \$pos_t)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     68.267 ns (0.00% GC)
median time:      68.645 ns (0.00% GC)
mean time:        72.310 ns (0.00% GC)
maximum time:     187.716 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     976

julia> @benchmark foo!(\$dest', \$pos')
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     17.311 ns (0.00% GC)
median time:      17.783 ns (0.00% GC)
mean time:        18.416 ns (0.00% GC)
maximum time:     52.715 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     997
``````

The transposed version (i.e., equivalent to accessing them `pos[i,1]`, `pos[i,2]`, etc, was about 4x faster.

This is because 3 is not a good number for SIMD, so accessing them

``````pos[1,i]
pos[2,i]
pos[3,i]
``````

doesn’t let you SIMD well.
As a rule of thumb, it’s a lot easier to SIMD between loop iterations.
Meaning, for best performance, you want successive iterations of the loop to access successive indexes.

5 Likes

I see, that’s very good to know! Does this reasoning extend to `SVector{3}` as well, which I assume is packed into a vector like a matrix of 3 x N? I could then potentially slow down my code by switching from the N x 3 matrix layout to a vector of `SVector`?

I did a similar benchmark to yours, and found that the N x 4 layout was faster than the 4 x N as well, but 4 should be a good value to SIMD on? I had to go up to around N x 16 before what I previously thought was the better memory layout to actually be faster. Is it so hard for the compiler to apply SIMD within a single loop iteration that the arrays have to be large enough for cache locality to become important before the column-major access wins out?

2 Likes