I was referring to a vector of SVector
being stored in memory like a matrix, but I was a bit sloppy with my use of words so that might not have been totally clear.
Tangential to the question (but still may be interesting for somebody who is interested in this topic): there is a package https://github.com/SciML/NBodySimulator.jl which is basically doing the same calculations, but uses DiffEq
internally, so it is more stable numerically than this simple implementation.
Also it is worth to know that the standard algorithm for integrating newton’s equations in particle simulations is velocityverlet, which is straight forward to implement in this package.
I could not resist testing some other things. First, a simple parallelization, in which I just avoid racing problems by creating an array for acc
for each thread. With 4 threads (my laptop has 4 cores), I get, now (from 172 s) of the latest version:
julia> @time NBabel("./input2k.dat")
114.693995 seconds (281.92 k allocations: 32.341 MiB, 0.00% gc time)
(Ekin = 0.24083906800778096, Epot = 0.47480580878465395, Etot = 0.233966740776873)
Code
module NB
"""
This is an implementation of the NBabel Nbody problem.
See nbabel.org for more information.
This is 'naive & native' Julia with type and loop annotations for
fastmath and vectorization, because this is really is no effort.
Without annotations the code will be as slow as naive Python/IDL.
include("nbabel.jl")
NBabel("IC/input128", show=true)
Julius Donnert INAFIRA 2017
"""
using Printf
using DelimitedFiles
using StaticArrays
tbeg = 0.0
tend = 10.0
dt = 0.001
function NBabel(fname::String; tend = tend, dt = dt, show=false)
if show
println("Reading file : $fname")
end
ID, mass, pos, vel = read_ICs(fname)
return NBabelCalcs(ID, mass, pos, vel, tend, dt, show)
end
function NBabelCalcs(ID, mass, pos, vel, tend = tend, dt = dt, show=false)
nthreads = Threads.nthreads()
acc_parallel = Array{eltype(vel)}(undef,length(vel),nthreads)
acc = compute_acceleration(pos, mass, acc_parallel)
last_acc = copy(acc)
Ekin, Epot = compute_energy(pos, vel, mass)
Etot_ICs = Ekin + Epot
t = 0.0
nstep = 0
while t < tend
pos = update_positions(pos, vel, acc, dt)
last_acc .= acc
acc = compute_acceleration(pos, mass, acc_parallel)
vel = update_velocities(vel, acc, last_acc, dt)
t += dt
nstep += 1
if show && nstep%100 == 0
Ekin, Epot = compute_energy(pos, vel, mass)
Etot = Ekin + Epot
dE = (Etot  Etot_ICs)/Etot_ICs
@printf "t = %g, Etot=%g, Ekin=%g, Epot=%g, dE=%g \n" t Etot Ekin Epot dE
end
end
Ekin, Epot = compute_energy(pos, vel, mass)
Etot = Ekin + Epot
return (; Ekin, Epot, Etot)
# return size(pos,1)
end
function update_positions(pos, vel, acc, dt)
N = length(pos)
for i in 1:N
pos[i] = (0.5 * acc[i] * dt + vel[i])*dt + pos[i]
end
return pos
end
function update_velocities(vel, acc, last_acc, dt)
N = length(vel)
for i in 1:N
vel[i] = vel[i] + 0.5 * dt * (acc[i] + last_acc[i])
end
return vel
end
"""
Force calculation.
"""
function compute_acceleration(pos, mass, acc)
nthreads = Threads.nthreads()
N = length(pos)
for i in 1:N
for j in 1:nthreads
acc[i,j] = zeros(eltype(acc))
end
end
@inbounds Threads.@threads for i = 1:N1
id = Threads.threadid()
for j = i+1:N
dr = pos[i]  pos[j]
rinv3 = 1/sqrt(sum(dr .^ 2)) ^ 3
acc[i,id] = acc[i,id]  mass[i] * rinv3 * dr
acc[j,id] = acc[j,id] + mass[j] * rinv3 * dr
end
end
for i in 1:N
for j in 2:nthreads
acc[i,1] += acc[i,j]
end
end
return @view acc[:,1]
end
"""
Kinetic and potential energy.
"""
function compute_energy(pos, vel, mass)
N = length(vel)
Ekin = 0.0
for i = 1:N
Ekin += 0.5 * mass[i] * sum(vel[i] .^ 2)
end
nthreads = Threads.nthreads()
Epot = zeros(nthreads)
@inbounds Threads.@threads for i = 1:N1
id = Threads.threadid()
for j = i+1:N
dr = pos[i]  pos[j]
rinv = 1/sqrt(sum(dr .^ 2))
Epot[id] = mass[i] * mass[j] * rinv
end
end
return Ekin, sum(Epot)
end
function read_ICs(fname::String)
ICs = readdlm(fname)
N = size(ICs,1)
pos = Array{SVector{3,Float64}}(undef, N)
vel = Array{SVector{3,Float64}}(undef, N)
id = @view ICs[:, 1]
mass = @view ICs[:, 2]
for i in axes(ICs, 1)
pos[i] = SVector{3,Float64}(ICs[i, 3], ICs[i, 4], ICs[i, 5])
end
for i in axes(ICs, 1)
vel[i] = SVector{3,Float64}(ICs[i, 6], ICs[i, 7], ICs[i, 8])
end
return id, mass, pos, vel
end
export NBabel
end
#function main(args)
# NBabel(args[1], show=true)
#end
#main(ARGS)
And rencently @Elrod taught me that one way to take advantage of simd was to define our structures with 4 fields, even the fourth field is not used in reality. For that I created a Particle
structure with four fields, and the operations that are required for that structure (I think there is a simpler way, I have to revise the lessons ):
struct Particle
x :: Float64
y :: Float64
z :: Float64
w :: Float64
end
Particle(x,y,z) = Particle(x,y,z,0)
norm2(p :: Particle) = p.x^2 + p.y^2 + p.z^2 + p.w^2
norm(p :: Particle) = sqrt(norm2(p))
import Base.+, Base., Base.*, Base.zero
(p1::Particle,p2::Particle) = Particle(p1.xp2.x,p1.yp2.y,p1.zp2.z,p1.wp2.w)
+(p1::Particle,p2::Particle) = Particle(p1.x+p2.x,p1.y+p2.y,p1.z+p2.z,p1.w+p2.w)
*(c,p1::Particle) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w)
*(p1::Particle,c) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w)
zero(T::Type{Particle}) = Particle(0,0,0,0)
The code was adapted accordingly, and now using a single thread I get (from 172s):
julia> @time NBabel("./input2k.dat")
150.969980 seconds (130.74 k allocations: 11.073 MiB)
(Ekin = 0.24191235979106984, Epot = 0.46958087700389634, Etot = 0.2276685172128265)
I do not exclude the possibility of Chis Elrod saying that what I did does not make sense and my benchmark is flawed , but if that is the case we will keep learning. We got a 22s speedup (12%):
And, finally, I using 4 threads:
julia> @time NBabel("./input2k.dat")
97.990265 seconds (281.44 k allocations: 32.498 MiB, 0.00% gc time)
(Ekin = 0.24083906800778096, Epot = 0.47480580878465395, Etot = 0.233966740776873)
Thus, from the 210s of the original version, we have now 151s for the singlethreaded execution. With the same improvement rate, that would match exactly the Pythran implementation (151/210)*173=124.4s (124s reported for the pythran version).
With the multiplethreaded version (which is way from optimal, because no care was taken to provide a balanced load, and competing process are present), I get here 98s.
Code
module NB
"""
This is an implementation of the NBabel Nbody problem.
See nbabel.org for more information.
This is 'naive & native' Julia with type and loop annotations for
fastmath and vectorization, because this is really is no effort.
Without annotations the code will be as slow as naive Python/IDL.
include("nbabel.jl")
NBabel("IC/input128", show=true)
Julius Donnert INAFIRA 2017
"""
using Printf
using DelimitedFiles
using StaticArrays
tbeg = 0.0
tend = 10.0
dt = 0.001
struct Particle
x :: Float64
y :: Float64
z :: Float64
w :: Float64
end
Particle(x,y,z) = Particle(x,y,z,0)
norm2(p :: Particle) = p.x^2 + p.y^2 + p.z^2 + p.w^2
norm(p :: Particle) = sqrt(norm2(p))
import Base.+, Base., Base.*, Base.zero
(p1::Particle,p2::Particle) = Particle(p1.xp2.x,p1.yp2.y,p1.zp2.z,p1.wp2.w)
+(p1::Particle,p2::Particle) = Particle(p1.x+p2.x,p1.y+p2.y,p1.z+p2.z,p1.w+p2.w)
*(c,p1::Particle) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w)
*(p1::Particle,c) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w)
zero(T::Type{Particle}) = Particle(0,0,0,0)
function NBabel(fname::String; tend = tend, dt = dt, show=false)
if show
println("Reading file : $fname")
end
ID, mass, pos, vel = read_ICs(fname)
return NBabelCalcs(ID, mass, pos, vel, tend, dt, show)
end
function NBabelCalcs(ID, mass, pos, vel, tend = tend, dt = dt, show=false)
nthreads = Threads.nthreads()
acc_parallel = Array{eltype(vel)}(undef,length(vel),nthreads)
acc = compute_acceleration(pos, mass, acc_parallel)
last_acc = copy(acc)
Ekin, Epot = compute_energy(pos, vel, mass)
Etot_ICs = Ekin + Epot
t = 0.0
nstep = 0
while t < tend
pos = update_positions(pos, vel, acc, dt)
last_acc .= acc
acc = compute_acceleration(pos, mass, acc_parallel)
vel = update_velocities(vel, acc, last_acc, dt)
t += dt
nstep += 1
if show && nstep%100 == 0
Ekin, Epot = compute_energy(pos, vel, mass)
Etot = Ekin + Epot
dE = (Etot  Etot_ICs)/Etot_ICs
@printf "t = %g, Etot=%g, Ekin=%g, Epot=%g, dE=%g \n" t Etot Ekin Epot dE
end
end
Ekin, Epot = compute_energy(pos, vel, mass)
Etot = Ekin + Epot
return (; Ekin, Epot, Etot)
# return size(pos,1)
end
function update_positions(pos, vel, acc, dt)
N = length(pos)
for i in 1:N
pos[i] = (0.5 * acc[i] * dt + vel[i])*dt + pos[i]
end
return pos
end
function update_velocities(vel, acc, last_acc, dt)
N = length(vel)
for i in 1:N
vel[i] = vel[i] + 0.5 * dt * (acc[i] + last_acc[i])
end
return vel
end
"""
Force calculation.
"""
function compute_acceleration(pos, mass, acc)
nthreads = Threads.nthreads()
N = length(pos)
for i in 1:N
for j in 1:nthreads
acc[i,j] = zero(eltype(acc))
end
end
@inbounds Threads.@threads for i = 1:N1
id = Threads.threadid()
@simd for j = i+1:N
dr = pos[i]  pos[j]
rinv3 = 1/norm(dr)^3
acc[i,id] = acc[i,id]  mass[i] * rinv3 * dr
acc[j,id] = acc[j,id] + mass[j] * rinv3 * dr
end
end
for i in 1:N
for j in 2:nthreads
acc[i,1] += acc[i,j]
end
end
return @view acc[:,1]
end
"""
Kinetic and potential energy.
"""
function compute_energy(pos, vel, mass)
N = length(vel)
Ekin = 0.0
for i = 1:N
Ekin += 0.5 * mass[i] * norm2(vel[i])
end
nthreads = Threads.nthreads()
Epot = zeros(nthreads)
@inbounds Threads.@threads for i = 1:N1
id = Threads.threadid()
@simd for j = i+1:N
dr = pos[i]  pos[j]
rinv = 1/norm(dr)
Epot[id] = mass[i] * mass[j] * rinv
end
end
return Ekin, sum(Epot)
end
function read_ICs(fname::String)
ICs = readdlm(fname)
N = size(ICs,1)
pos = Array{Particle}(undef, N)
vel = Array{Particle}(undef, N)
id = @view ICs[:, 1]
mass = @view ICs[:, 2]
for i in axes(ICs, 1)
pos[i] = Particle(ICs[i, 3], ICs[i, 4], ICs[i, 5])
end
for i in axes(ICs, 1)
vel[i] = Particle(ICs[i, 6], ICs[i, 7], ICs[i, 8])
end
return id, mass, pos, vel
end
export NBabel
end
#function main(args)
# NBabel(args[1], show=true)
#end
#main(ARGS)
The codes are: https://github.com/lmiq/nbabel/tree/master/julia
The times reported there were measured with the system time command, so they include compilation and everything (and used O3 checkbounds=no
, which improved the times quite a bit).

Adding fourth dimension for simd speed up is counterintuitive, but it works! Amazing, just amazing. Interestingly enough it also improve numerical stability.

There is a small mistake in your code: you declare
pos
andvel
asArray{Particle}
.Array
is an abstract type, so this container is very slow. It should beVector
, then you’ll get no extra allocations.
With this gist https://gist.github.com/Arkoniak/d729b119b0bdf218bbf282b847089bca I get on single thread
 27 sec for 1k
 95 sec for 2k
One more update: https://gist.github.com/Arkoniak/5e52e111082f39d21a2f00ba2dbd1049
This is threaded version with split balancing.
With 4 threads on i77700HQ CPU @ 2.80GHz
 14 sec for 1k
 35 sec for 2k
I’ve used command time julia O3 nbabel4.jl ../data/input1k
, so this timing includes all compilations.
Great!
One small note, the acceleration loop is going to N
instead of N1
(line 107):
@inbounds for i in 1:N
If you have some time, could you expand I bit on how you did the split balancing?
edit: just to be sure, the way you implemented it only will work if the number of particles is a multiple of the number of processors, is that correct? Otherwise there should be a remaining chunk there.
Of course is Chris Elrod that sould explain that, but the thing is that using that one can load a integer number of operations in the processor cache, which is of 512 bytes in these processors (and the struct is, with four fields, of 256 bytes), thus each processor cycle loads exactly 2 “particles”. The numerical stability is strange, though.
I have made that splitting procedure long time ago, so it took me some time to remember how it works. It is indeed slightly cryptic
Idea is simple though. We have following nested loops
for i in 1:N
for j in i+1:N
# Some operation
end
end
and we are assuming, that operation
inside this loop take the same time, irrelevant of index. Since each task has some time to get scheduled we want to have bare minimum of the tasks spawned. It leads us to the task of splitting external i
loop in k
chunks (where k
is the number of threads), such that amount of operations in each chunk is equal.
Now, it can be solved exactly, but I am making another assumption, that N >> 1
and we can throw away discrete nature of i
and change it to continuous variable x
. Then number of internal operations for each x
is just N  x
(one can easily see that, taking into account linearity of the problem and the fact that when x = 0
number of elements in internal loop is N
and when x = N
number of elements is 0
.
In this approach, number of operations is \int_{0}^{N} (N  x) dx = \frac{N^2}{2} and each chunk should contain \frac{N^2}{2k} elements. For chunk number m, number of elements in all chunks from 1 to m is given by
\int_0^{t_m} (N  x) dx = \frac{N^2}{2k}m
where t_m is the point where we should split another chunk.
After integration and substitution we obtain
N t_m  \frac{t_m^2}{2} = \frac{N^2}{2k}m
Solving this equation we get
t_m = N (1 \pm \sqrt{1  \frac{m}{k}})
and in order to satisfy boundary conditions t_0 = 0, t_k = N we choose minus sign.
As a last and to be honest unnecessary step I change the definition of m to m \rightarrow k  m and the answer is
t_m = N (1  \sqrt{\frac{m}{k}}), \forall{m = k..0}
Chunks are just intervals (t_m, t_{m  1}]
Of course it can be done more accurate and this assumptions are more imbalanced for small numbers of N, but multithreading for small N is meaningless anyway and deviations related to scheduling and other things will nullify more exact solutions (or at least that’s what I believe in, could be wrong).
–
One small note, the acceleration loop is going to
N
instead ofN1
(line 107):
I do not quite get it, since it is written as a loop from 1 to N, I do not see any N  1
.
That double loop should be a loop on the pairs of particles, therefore it should be
for i in 1:N1
for j in i+1:N
...
end
end
ah, but ok, the result is the same, because if i=N
then the next loop will not enter anyway.
And thank you very much for the detailed explanation. I will get pen an paper and study it. Very interesting.
Just to conclude: the codes I changed are here:
https://github.com/lmiq/nbabel/tree/master/julia
At the end, the serial version of the code takes in my computer ~30s against ~50s for 1k particles, and 120s vs. 200s for 2k particles. My parallel implementation only consists of adding a vector to avoid concurrency and adding the threads flags, and with that I get ~20s and ~80s, respectively for 1k and 2k.
It seems that the serial version ends up being slightly faster than the Python implementation of the original post (as now reported in the page of the developer), thus probably all the “easy” optimizations are likely done.
I wander if someone accepts the challenge of implementing a GPUaccelerated version I would try to do it to learn, but now I have to grade a hundred exams .

Has anybody doublechecked the reported runtimes of the pythontransonicpythran version? Also note that some benchmarks on nbabel run until t=1 as suggested by the page, others until t=10.

Has anyone checked whether the py version is multithreaded?

I wrote a GPU kernel a while ago. Just replace the lennard jones force by gravity. You find it here.
The runtimes were checked by the developer himself in comparison to the new Julia implementations (the serial version is faster than the fastest python version). Those are merged in the main repository. I was unable to run the fast Python version, I did not manage to install the dependencies correctly.
Hi, I created the mentioned benchmark repository. It’s great to see what you were able to obtain with Julia!
It’s interesting to see that the super optimized Julia version (sequential) is “only” ~ 1.1 to 1.2 faster than Python. IMHO, it shows that we are not too bad with our simple version using TransonicPythran.
However, it’s nice to see how the Julia version can be easily parallelized. I didn’t try to do that in TransonicPythran, which supports parallelization with OpenMP.
The strategy with 4 floats struct to allow more SIMD instructions is really nice! I wonder what is the effect on energy consumption (which was the subject of the initial paper).
Julia is really great in general and a great source of inspiration for efficient computing with Python
I’m surprised that you were not able to run the Python benchmarks. It should be sufficient to run:
conda create n env_py python pythran transonic pandas
conda activate env_py
cd nbabel/py
make bench1k
Hi @paugier, great work on Transonic! Did you have time to check how Rust SIMD performed compared to Julia and Pythran on you setup? If have any trouble there, let me know!
I appreciate the compliment, but nothing I can code can be called “superoptimized”, rs, I am coding in Julia for less than a year now. The sequential version without the “4element” struct trick is an ‘adequate’ version, taking care of the type stability that is required for the compiler to be able to do its optimizations. How to keep codes typestable is what one has to learn from the beginning to use Julia properly. Probably type annotations are behind the Python addons that allow the code to become fast as well. In Julia one learns to write code taking that in account from start, that is in the very core of the Julia functionality and performance.
The use of the 4field structs is a more advanced trick, indeed, and it may or may not work, depending on how the data is treated afterwards. I would guess (really, only guess), that in that case it saves energy, because fewer access to memory and processor cycles are performed.
On the other side, the parallel version that I posted is very elementary. Above you can find a better splitbalanced version which has better scaleup.
Welcome to the community! And congratulations for the work with Transonic! Really impressive.
It seems to me that there are few things in the best Julia versions that are quite “advanced” (not only “type stability”). At least one needs a good understanding of computing in general and a good knowledge of Julia to code that. In my point of view, the same kind of understanding and knowledge that are needed to write efficient modern C++. This is why I used “super optimized” but you are right, of course, I guess one still can go further in advanced optimization.
By the way, I said that the “super optimized Julia version (sequential) is “only” ~ 1.1 to 1.2 faster than Python” but I need to correct 2 things. First, not “super optimized” but only “optimized”. And second, the speed comparison strongly depends on the size: for 16384 particles (the size considered in the article), the Python implementation is actually still 20% faster on my PC (comparing sequential with sequential). For t_end = 0.2
: 144 s for Python and 178 s for Julia (1.5.3).
The use of the 4field structs is a more advanced trick, indeed, and it may or may not work, depending on how the data is treated afterwards. I would guess (really, only guess), that in that case it saves energy, because fewer access to memory and processor cycles are performed.
Of course, but some SIMD instructions are known to be less efficient in terms of energy consumption so it’s not so clear.
Probably type annotations are behind the Python addons that allow the code to become fast as well.
With Transonic, type annotations are needed only for aheadoftime compilation (in this case, only for the arguments of 1 function). But one would get the same performance with the @jit
decorator without these annotations.
Oh I had a problem to install the nightly rust with rustup. Because it does not seem compatible with rust installed with apt and I’d like to keep this rust for other stuff. And then I was too lazy to do it in a docker
The greatest difference between the “original” Julia version and the others is that the loop that computed accelerations was not being performed in the same way than in all the other versions. In this post I fixed that. In that version I also use a Julia package called StaticArrays
, which is useful when one is dealing with many small arrays and want them to be stackallocated. Not really comparable, but that is just using a package, like one has to use numpy
properly. The rest is just the natural way to write code in Julia. Of course each one, being more familiar to one or other language will find some things more natural than others. At the final version I don’t use the StaticArrays
package and manually define the 4field
struct, but the natural way to go would be to use it.
At the end, the function that computes the accelerations (and take most of the time), is very easy to read, and does not contain anything fancy, I think:
function compute_acceleration!(positions, masses, accelerations)
N = length(positions)
for i in eachindex(accelerations)
accelerations[i] = zero(eltype(accelerations))
end
for i = 1:N  1
for j = i + 1:N
dr = positions[i]  positions[j]
rinv3 = 1 / norm(dr)^3
accelerations[i] = masses[i] * rinv3 * dr
accelerations[j] += masses[j] * rinv3 * dr
end
end
return nothing
end
The “strange” things there are maybe the eachindex
which can be replaced by 1:length()
, and the zeros(eltype())
, which could be replaced as well by something syntactically simpler. Those are not for performance, but for generality. That same function can be used for “points” of any dimension, for example, if the appropriate function defining what is the zero
of such point is defined. Actually it can be used to any kind of object
(the term does not really apply here) for which the programmer defines what are the zero, the norm, subtraction and the multiplication by a scalar.
This isn’t an example of advanced highperformance Julia code, except for the @simd
macros and knowing to use length 4 rather than 3. But it reads pretty clean and simple to me (I mean it’s good code!), but just functions and forloops. This is also the kind of simple code you should be able to get running fast with numpy as there isn’t much different that the compiler would do.
More advanced highperformance julia you will see much more use of the type system to pass more information to the compiler, maybe @generated
functions and things like the LoopVectorization.jl @avx
macro. But as someone who also writes high performance C++, julia is much easier.
Concerning the benchmarks, as for every of these comparisons, one has to be careful in not comparing things improperly. It does not make much sense that a C++ implementation of this is 40% slower than a Fortran implementation of the same thing, also it does not make much sense that any Julia implementation is faster than both.
I have myself fallen into that trap recently, thinking that a Julia code was faster than my own Fortran implementation of the same thing, and an advice of a advanced Julia user taught me that I was simply using the wrong compiler flags for the compilation of the Fortran code. With that fixed the codes ran at the same speed.
There is not much than can be improved in those implementations without going to parallel or cuda versions, thus it does not make much sense that any of the alternatives (Julia, C++, Fortran, Java) are much slower than the other.
It is also natural that the experienced Python developer that implemented the Python version got a better implementation than the suboptimal implementations of the other languages. I could not implement the same thing in Python as of today.
That being said, of course one would expect that the Python version was slower, and the efficiency that @paugier gets with Transonic shows what a good job they are doing in bringing fast computing to Python. Kudos to them! That certainly puts pressure in all other languages that want to occupy that space!