Nbabel nbody integrator speed up

I was referring to a vector of SVector being stored in memory like a matrix, but I was a bit sloppy with my use of words so that might not have been totally clear.

1 Like

Tangential to the question (but still may be interesting for somebody who is interested in this topic): there is a package https://github.com/SciML/NBodySimulator.jl which is basically doing the same calculations, but uses DiffEq internally, so it is more stable numerically than this simple implementation.

4 Likes

Also it is worth to know that the standard algorithm for integrating newton’s equations in particle simulations is velocity-verlet, which is straight forward to implement in this package.

3 Likes

I could not resist testing some other things. First, a simple parallelization, in which I just avoid racing problems by creating an array for acc for each thread. With 4 threads (my laptop has 4 cores), I get, now (from 172 s) of the latest version:

julia> @time NBabel("./input2k.dat")
114.693995 seconds (281.92 k allocations: 32.341 MiB, 0.00% gc time)
(Ekin = 0.24083906800778096, Epot = -0.47480580878465395, Etot = -0.233966740776873)


Code

module NB

"""
This is an implementation of the NBabel N-body problem.
This is 'naive & native' Julia with type and loop annotations for
fastmath and vectorization, because this is really is no effort.
Without annotations the code will be as slow as naive Python/IDL.
include("nbabel.jl")
NBabel("IC/input128", show=true)
Julius Donnert INAF-IRA 2017
"""

using Printf
using DelimitedFiles
using StaticArrays

tbeg = 0.0
tend = 10.0
dt = 0.001

function NBabel(fname::String; tend = tend, dt = dt, show=false)

if show
println("Reading file : $fname") end ID, mass, pos, vel = read_ICs(fname) return NBabelCalcs(ID, mass, pos, vel, tend, dt, show) end function NBabelCalcs(ID, mass, pos, vel, tend = tend, dt = dt, show=false) nthreads = Threads.nthreads() acc_parallel = Array{eltype(vel)}(undef,length(vel),nthreads) acc = compute_acceleration(pos, mass, acc_parallel) last_acc = copy(acc) Ekin, Epot = compute_energy(pos, vel, mass) Etot_ICs = Ekin + Epot t = 0.0 nstep = 0 while t < tend pos = update_positions(pos, vel, acc, dt) last_acc .= acc acc = compute_acceleration(pos, mass, acc_parallel) vel = update_velocities(vel, acc, last_acc, dt) t += dt nstep += 1 if show && nstep%100 == 0 Ekin, Epot = compute_energy(pos, vel, mass) Etot = Ekin + Epot dE = (Etot - Etot_ICs)/Etot_ICs @printf "t = %g, Etot=%g, Ekin=%g, Epot=%g, dE=%g \n" t Etot Ekin Epot dE end end Ekin, Epot = compute_energy(pos, vel, mass) Etot = Ekin + Epot return (; Ekin, Epot, Etot) # return size(pos,1) end function update_positions(pos, vel, acc, dt) N = length(pos) for i in 1:N pos[i] = (0.5 * acc[i] * dt + vel[i])*dt + pos[i] end return pos end function update_velocities(vel, acc, last_acc, dt) N = length(vel) for i in 1:N vel[i] = vel[i] + 0.5 * dt * (acc[i] + last_acc[i]) end return vel end """ Force calculation. """ function compute_acceleration(pos, mass, acc) nthreads = Threads.nthreads() N = length(pos) for i in 1:N for j in 1:nthreads acc[i,j] = zeros(eltype(acc)) end end @inbounds Threads.@threads for i = 1:N-1 id = Threads.threadid() for j = i+1:N dr = pos[i] - pos[j] rinv3 = 1/sqrt(sum(dr .^ 2)) ^ 3 acc[i,id] = acc[i,id] - mass[i] * rinv3 * dr acc[j,id] = acc[j,id] + mass[j] * rinv3 * dr end end for i in 1:N for j in 2:nthreads acc[i,1] += acc[i,j] end end return @view acc[:,1] end """ Kinetic and potential energy. """ function compute_energy(pos, vel, mass) N = length(vel) Ekin = 0.0 for i = 1:N Ekin += 0.5 * mass[i] * sum(vel[i] .^ 2) end nthreads = Threads.nthreads() Epot = zeros(nthreads) @inbounds Threads.@threads for i = 1:N-1 id = Threads.threadid() for j = i+1:N dr = pos[i] - pos[j] rinv = 1/sqrt(sum(dr .^ 2)) Epot[id] -= mass[i] * mass[j] * rinv end end return Ekin, sum(Epot) end function read_ICs(fname::String) ICs = readdlm(fname) N = size(ICs,1) pos = Array{SVector{3,Float64}}(undef, N) vel = Array{SVector{3,Float64}}(undef, N) id = @view ICs[:, 1] mass = @view ICs[:, 2] for i in axes(ICs, 1) pos[i] = SVector{3,Float64}(ICs[i, 3], ICs[i, 4], ICs[i, 5]) end for i in axes(ICs, 1) vel[i] = SVector{3,Float64}(ICs[i, 6], ICs[i, 7], ICs[i, 8]) end return id, mass, pos, vel end export NBabel end #function main(args) # NBabel(args[1], show=true) #end #main(ARGS)  And rencently @Elrod taught me that one way to take advantage of simd was to define our structures with 4 fields, even the fourth field is not used in reality. For that I created a Particle structure with four fields, and the operations that are required for that structure (I think there is a simpler way, I have to revise the lessons ): struct Particle x :: Float64 y :: Float64 z :: Float64 w :: Float64 end Particle(x,y,z) = Particle(x,y,z,0) norm2(p :: Particle) = p.x^2 + p.y^2 + p.z^2 + p.w^2 norm(p :: Particle) = sqrt(norm2(p)) import Base.+, Base.-, Base.*, Base.zero -(p1::Particle,p2::Particle) = Particle(p1.x-p2.x,p1.y-p2.y,p1.z-p2.z,p1.w-p2.w) +(p1::Particle,p2::Particle) = Particle(p1.x+p2.x,p1.y+p2.y,p1.z+p2.z,p1.w+p2.w) *(c,p1::Particle) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w) *(p1::Particle,c) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w) zero(T::Type{Particle}) = Particle(0,0,0,0)  The code was adapted accordingly, and now using a single thread I get (from 172s): julia> @time NBabel("./input2k.dat") 150.969980 seconds (130.74 k allocations: 11.073 MiB) (Ekin = 0.24191235979106984, Epot = -0.46958087700389634, Etot = -0.2276685172128265)  I do not exclude the possibility of Chis Elrod saying that what I did does not make sense and my benchmark is flawed , but if that is the case we will keep learning. We got a 22s speedup (12%): And, finally, I using 4 threads: julia> @time NBabel("./input2k.dat") 97.990265 seconds (281.44 k allocations: 32.498 MiB, 0.00% gc time) (Ekin = 0.24083906800778096, Epot = -0.47480580878465395, Etot = -0.233966740776873)  Thus, from the 210s of the original version, we have now 151s for the single-threaded execution. With the same improvement rate, that would match exactly the Pythran implementation (151/210)*173=124.4s (124s reported for the pythran version). With the multiple-threaded version (which is way from optimal, because no care was taken to provide a balanced load, and competing process are present), I get here 98s. Code  module NB """ This is an implementation of the NBabel N-body problem. See nbabel.org for more information. This is 'naive & native' Julia with type and loop annotations for fastmath and vectorization, because this is really is no effort. Without annotations the code will be as slow as naive Python/IDL. include("nbabel.jl") NBabel("IC/input128", show=true) Julius Donnert INAF-IRA 2017 """ using Printf using DelimitedFiles using StaticArrays tbeg = 0.0 tend = 10.0 dt = 0.001 struct Particle x :: Float64 y :: Float64 z :: Float64 w :: Float64 end Particle(x,y,z) = Particle(x,y,z,0) norm2(p :: Particle) = p.x^2 + p.y^2 + p.z^2 + p.w^2 norm(p :: Particle) = sqrt(norm2(p)) import Base.+, Base.-, Base.*, Base.zero -(p1::Particle,p2::Particle) = Particle(p1.x-p2.x,p1.y-p2.y,p1.z-p2.z,p1.w-p2.w) +(p1::Particle,p2::Particle) = Particle(p1.x+p2.x,p1.y+p2.y,p1.z+p2.z,p1.w+p2.w) *(c,p1::Particle) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w) *(p1::Particle,c) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w) zero(T::Type{Particle}) = Particle(0,0,0,0) function NBabel(fname::String; tend = tend, dt = dt, show=false) if show println("Reading file :$fname")
end

ID, mass, pos, vel = read_ICs(fname)

return NBabelCalcs(ID, mass, pos, vel, tend, dt, show)
end

function NBabelCalcs(ID, mass, pos, vel, tend = tend, dt = dt, show=false)
acc = compute_acceleration(pos, mass, acc_parallel)
last_acc = copy(acc)

Ekin, Epot = compute_energy(pos, vel, mass)
Etot_ICs = Ekin + Epot

t = 0.0
nstep = 0

while t < tend

pos = update_positions(pos, vel, acc, dt)

last_acc .= acc

acc = compute_acceleration(pos, mass, acc_parallel)

vel = update_velocities(vel, acc, last_acc, dt)

t += dt
nstep += 1

if show && nstep%100 == 0

Ekin, Epot = compute_energy(pos, vel, mass)
Etot = Ekin + Epot
dE = (Etot - Etot_ICs)/Etot_ICs

@printf "t = %g, Etot=%g, Ekin=%g, Epot=%g, dE=%g \n" t Etot Ekin Epot dE
end
end

Ekin, Epot = compute_energy(pos, vel, mass)
Etot = Ekin + Epot
return (; Ekin, Epot, Etot)
# return size(pos,1)
end

function update_positions(pos, vel, acc, dt)
N = length(pos)
for i in 1:N
pos[i] = (0.5 * acc[i] * dt + vel[i])*dt + pos[i]
end
return pos
end

function update_velocities(vel, acc, last_acc, dt)
N = length(vel)
for i in 1:N
vel[i] = vel[i] + 0.5 * dt * (acc[i] + last_acc[i])
end
return vel
end

"""
Force calculation.
"""
function compute_acceleration(pos, mass, acc)
N = length(pos)
for i in 1:N
acc[i,j] = zero(eltype(acc))
end
end
@simd for j = i+1:N
dr = pos[i] - pos[j]
rinv3 = 1/norm(dr)^3
acc[i,id] = acc[i,id] - mass[i] * rinv3 * dr
acc[j,id] = acc[j,id] + mass[j] * rinv3 * dr
end
end
for i in 1:N
acc[i,1] += acc[i,j]
end
end
return @view acc[:,1]
end

"""
Kinetic and potential energy.
"""
function compute_energy(pos, vel, mass)
N = length(vel)

Ekin = 0.0

for i = 1:N
Ekin += 0.5 * mass[i] * norm2(vel[i])
end

@simd for j = i+1:N
dr = pos[i] - pos[j]
rinv = 1/norm(dr)
Epot[id] -= mass[i] * mass[j] * rinv
end
end

return Ekin, sum(Epot)
end

N = size(ICs,1)

pos = Array{Particle}(undef, N)
vel = Array{Particle}(undef, N)

id = @view ICs[:, 1]
mass = @view ICs[:, 2]

for i in axes(ICs, 1)
pos[i] = Particle(ICs[i, 3], ICs[i, 4], ICs[i, 5])
end

for i in axes(ICs, 1)
vel[i] = Particle(ICs[i, 6], ICs[i, 7], ICs[i, 8])
end

return id, mass, pos, vel
end

export NBabel

end

#function main(args)
#    NBabel(args[1], show=true)
#end
#main(ARGS)


The codes are: https://github.com/lmiq/nbabel/tree/master/julia

The times reported there were measured with the system time command, so they include compilation and everything (and used -O3 -check-bounds=no, which improved the times quite a bit).

7 Likes
1. Adding fourth dimension for simd speed up is counterintuitive, but it works! Amazing, just amazing. Interestingly enough it also improve numerical stability.

2. There is a small mistake in your code: you declare pos and vel as Array{Particle}. Array is an abstract type, so this container is very slow. It should be Vector, then you’ll get no extra allocations.

With this gist https://gist.github.com/Arkoniak/d729b119b0bdf218bbf282b847089bca I get on single thread

• 27 sec for 1k
• 95 sec for 2k
4 Likes

This is threaded version with split balancing.

With 4 threads on i7-7700HQ CPU @ 2.80GHz

• 14 sec for 1k
• 35 sec for 2k

I’ve used command time julia -O3 nbabel4.jl ../data/input1k, so this timing includes all compilations.

5 Likes

Great!

One small note, the acceleration loop is going to N instead of N-1 (line 107):

 @inbounds for i in 1:N


If you have some time, could you expand I bit on how you did the split balancing?

edit: just to be sure, the way you implemented it only will work if the number of particles is a multiple of the number of processors, is that correct? Otherwise there should be a remaining chunk there.

Of course is Chris Elrod that sould explain that, but the thing is that using that one can load a integer number of operations in the processor cache, which is of 512 bytes in these processors (and the struct is, with four fields, of 256 bytes), thus each processor cycle loads exactly 2 “particles”. The numerical stability is strange, though.

1 Like

I have made that splitting procedure long time ago, so it took me some time to remember how it works. It is indeed slightly cryptic

Idea is simple though. We have following nested loops

for i in 1:N
for j in i+1:N
# Some operation
end
end


and we are assuming, that operation inside this loop take the same time, irrelevant of index. Since each task has some time to get scheduled we want to have bare minimum of the tasks spawned. It leads us to the task of splitting external i loop in k chunks (where k is the number of threads), such that amount of operations in each chunk is equal.

Now, it can be solved exactly, but I am making another assumption, that N >> 1 and we can throw away discrete nature of i and change it to continuous variable x. Then number of internal operations for each x is just N - x (one can easily see that, taking into account linearity of the problem and the fact that when x = 0 number of elements in internal loop is N and when x = N number of elements is 0.

In this approach, number of operations is \int_{0}^{N} (N - x) dx = \frac{N^2}{2} and each chunk should contain \frac{N^2}{2k} elements. For chunk number m, number of elements in all chunks from 1 to m is given by

\int_0^{t_m} (N - x) dx = \frac{N^2}{2k}m

where t_m is the point where we should split another chunk.

After integration and substitution we obtain

N t_m - \frac{t_m^2}{2} = \frac{N^2}{2k}m

Solving this equation we get

t_m = N (1 \pm \sqrt{1 - \frac{m}{k}})

and in order to satisfy boundary conditions t_0 = 0, t_k = N we choose minus sign.

As a last and to be honest unnecessary step I change the definition of m to m \rightarrow k - m and the answer is

t_m = N (1 - \sqrt{\frac{m}{k}}), \forall{m = k..0}

Chunks are just intervals (t_m, t_{m - 1}]

Of course it can be done more accurate and this assumptions are more imbalanced for small numbers of N, but multithreading for small N is meaningless anyway and deviations related to scheduling and other things will nullify more exact solutions (or at least that’s what I believe in, could be wrong).

One small note, the acceleration loop is going to N instead of N-1 (line 107):

I do not quite get it, since it is written as a loop from 1 to N, I do not see any N - 1.

6 Likes

That double loop should be a loop on the pairs of particles, therefore it should be

for i in 1:N-1
for j in i+1:N
...
end
end


ah, but ok, the result is the same, because if i=N then the next loop will not enter anyway.

And thank you very much for the detailed explanation. I will get pen an paper and study it. Very interesting.

2 Likes

Just to conclude: the codes I changed are here:

https://github.com/lmiq/nbabel/tree/master/julia

At the end, the serial version of the code takes in my computer ~30s against ~50s for 1k particles, and 120s vs. 200s for 2k particles. My parallel implementation only consists of adding a vector to avoid concurrency and adding the threads flags, and with that I get ~20s and ~80s, respectively for 1k and 2k.

It seems that the serial version ends up being slightly faster than the Python implementation of the original post (as now reported in the page of the developer), thus probably all the “easy” optimizations are likely done.

I wander if someone accepts the challenge of implementing a GPU-accelerated version I would try to do it to learn, but now I have to grade a hundred exams .

2 Likes
1. Has anybody double-checked the reported runtimes of the python-transonic-pythran version? Also note that some benchmarks on nbabel run until t=1 as suggested by the page, others until t=10.

2. Has anyone checked whether the py version is multithreaded?

3. I wrote a GPU kernel a while ago. Just replace the lennard jones force by gravity. You find it here.

2 Likes

The runtimes were checked by the developer himself in comparison to the new Julia implementations (the serial version is faster than the fastest python version). Those are merged in the main repository. I was unable to run the fast Python version, I did not manage to install the dependencies correctly.

2 Likes

Hi, I created the mentioned benchmark repository. It’s great to see what you were able to obtain with Julia!

It’s interesting to see that the super optimized Julia version (sequential) is “only” ~ 1.1 to 1.2 faster than Python. IMHO, it shows that we are not too bad with our simple version using Transonic-Pythran.

However, it’s nice to see how the Julia version can be easily parallelized. I didn’t try to do that in Transonic-Pythran, which supports parallelization with OpenMP.

The strategy with 4 floats struct to allow more SIMD instructions is really nice! I wonder what is the effect on energy consumption (which was the subject of the initial paper).

Julia is really great in general and a great source of inspiration for efficient computing with Python

I’m surprised that you were not able to run the Python benchmarks. It should be sufficient to run:

conda create -n env_py python pythran transonic pandas
conda activate env_py
cd nbabel/py
make bench1k

3 Likes

Hi @paugier, great work on Transonic! Did you have time to check how Rust SIMD performed compared to Julia and Pythran on you setup? If have any trouble there, let me know!

I appreciate the compliment, but nothing I can code can be called “super-optimized”, rs, I am coding in Julia for less than a year now. The sequential version without the “4-element” struct trick is an ‘adequate’ version, taking care of the type stability that is required for the compiler to be able to do its optimizations. How to keep codes type-stable is what one has to learn from the beginning to use Julia properly. Probably type annotations are behind the Python add-ons that allow the code to become fast as well. In Julia one learns to write code taking that in account from start, that is in the very core of the Julia functionality and performance.

The use of the 4-field structs is a more advanced trick, indeed, and it may or may not work, depending on how the data is treated afterwards. I would guess (really, only guess), that in that case it saves energy, because fewer access to memory and processor cycles are performed.

On the other side, the parallel version that I posted is very elementary. Above you can find a better split-balanced version which has better scale-up.

Welcome to the community! And congratulations for the work with Transonic! Really impressive.

3 Likes

It seems to me that there are few things in the best Julia versions that are quite “advanced” (not only “type stability”). At least one needs a good understanding of computing in general and a good knowledge of Julia to code that. In my point of view, the same kind of understanding and knowledge that are needed to write efficient modern C++. This is why I used “super optimized” but you are right, of course, I guess one still can go further in advanced optimization.

By the way, I said that the “super optimized Julia version (sequential) is “only” ~ 1.1 to 1.2 faster than Python” but I need to correct 2 things. First, not “super optimized” but only “optimized”. And second, the speed comparison strongly depends on the size: for 16384 particles (the size considered in the article), the Python implementation is actually still 20% faster on my PC (comparing sequential with sequential). For t_end = 0.2: 144 s for Python and 178 s for Julia (1.5.3).

The use of the 4-field structs is a more advanced trick, indeed, and it may or may not work, depending on how the data is treated afterwards. I would guess (really, only guess), that in that case it saves energy, because fewer access to memory and processor cycles are performed.

Of course, but some SIMD instructions are known to be less efficient in terms of energy consumption so it’s not so clear.

Probably type annotations are behind the Python add-ons that allow the code to become fast as well.

With Transonic, type annotations are needed only for ahead-of-time compilation (in this case, only for the arguments of 1 function). But one would get the same performance with the @jit decorator without these annotations.

Oh I had a problem to install the nightly rust with rustup. Because it does not seem compatible with rust installed with apt and I’d like to keep this rust for other stuff. And then I was too lazy to do it in a docker

1 Like

The greatest difference between the “original” Julia version and the others is that the loop that computed accelerations was not being performed in the same way than in all the other versions. In this post I fixed that. In that version I also use a Julia package called StaticArrays, which is useful when one is dealing with many small arrays and want them to be stack-allocated. Not really comparable, but that is just using a package, like one has to use numpy properly. The rest is just the natural way to write code in Julia. Of course each one, being more familiar to one or other language will find some things more natural than others. At the final version I don’t use the StaticArrays package and manually define the 4-field struct, but the natural way to go would be to use it.

At the end, the function that computes the accelerations (and take most of the time), is very easy to read, and does not contain anything fancy, I think:

function compute_acceleration!(positions, masses, accelerations)
N = length(positions)

for i in eachindex(accelerations)
accelerations[i] = zero(eltype(accelerations))
end

for i = 1:N - 1
for j = i + 1:N
dr = positions[i] - positions[j]
rinv3 = 1 / norm(dr)^3
accelerations[i] -= masses[i] * rinv3 * dr
accelerations[j] += masses[j] * rinv3 * dr
end
end
return nothing
end


The “strange” things there are maybe the eachindex which can be replaced by 1:length(), and the zeros(eltype()), which could be replaced as well by something syntactically simpler. Those are not for performance, but for generality. That same function can be used for “points” of any dimension, for example, if the appropriate function defining what is the zero of such point is defined. Actually it can be used to any kind of object (the term does not really apply here) for which the programmer defines what are the zero, the norm, subtraction and the multiplication by a scalar.

2 Likes

This isn’t an example of advanced high-performance Julia code, except for the @simd macros and knowing to use length 4 rather than 3. But it reads pretty clean and simple to me (I mean it’s good code!), but just functions and for-loops. This is also the kind of simple code you should be able to get running fast with numpy as there isn’t much different that the compiler would do.

More advanced high-performance julia you will see much more use of the type system to pass more information to the compiler, maybe @generated functions and things like the LoopVectorization.jl @avx macro. But as someone who also writes high performance C++, julia is much easier.

5 Likes

Concerning the benchmarks, as for every of these comparisons, one has to be careful in not comparing things improperly. It does not make much sense that a C++ implementation of this is 40% slower than a Fortran implementation of the same thing, also it does not make much sense that any Julia implementation is faster than both.

I have myself fallen into that trap recently, thinking that a Julia code was faster than my own Fortran implementation of the same thing, and an advice of a advanced Julia user taught me that I was simply using the wrong compiler flags for the compilation of the Fortran code. With that fixed the codes ran at the same speed.

There is not much than can be improved in those implementations without going to parallel or cuda versions, thus it does not make much sense that any of the alternatives (Julia, C++, Fortran, Java) are much slower than the other.

It is also natural that the experienced Python developer that implemented the Python version got a better implementation than the sub-optimal implementations of the other languages. I could not implement the same thing in Python as of today.

That being said, of course one would expect that the Python version was slower, and the efficiency that @paugier gets with Transonic shows what a good job they are doing in bringing fast computing to Python. Kudos to them! That certainly puts pressure in all other languages that want to occupy that space!

1 Like