NBabel N-body integrator speed up

Well, there are two things that should be done to properly multithread it.

  1. Split the work into chunks, as you said, which can be done in multiple ways. In similar situations I’ve used a nonlinear split over the index of the outer loop, so that every chunk has a roughly equal number of calculations, and then run the chunks with @spawn. That usually worked best.

  2. One should be careful with race conditions: different threads will update the same elements of the acceleration array, so it’s best to introduce nthreads arrays, update each of them in a separate thread, and as a final step add them together into a final array (see the sketch below).
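
A minimal sketch of the per-thread accumulator idea from point 2 (the names and the pairwise term are placeholders, not the actual force):

using Base.Threads

function pairwise_threaded(pos::Vector{Float64})
    N = length(pos)
    acc = zeros(N, nthreads())      # one column per thread: no shared writes
    @threads for i in 1:N-1
        id = threadid()
        for j in i+1:N
            f = pos[i] - pos[j]     # stand-in for the pairwise force term
            acc[i, id] -= f
            acc[j, id] += f
        end
    end
    return vec(sum(acc, dims=2))    # final step: add the per-thread arrays
end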

2 Likes

You could. HybridArrays may give you the best of both worlds, by making them N x 3. A slice A[n,:] should still return an SVector. If everything inlines, it should still be able to SIMD across loop iterations, while giving you the convenience of expressing some operations on the vectors instead of making everything loops.
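
Something along these lines, if I remember the HybridArrays API correctly (a sketch, not tested here):

using HybridArrays, StaticArrays

N = 128
A = HybridArray{Tuple{StaticArrays.Dynamic(),3}}(randn(N, 3))
A[1, :] isa SVector{3,Float64}   # true: the statically sized dimension survives slicing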

Another factor to consider is unrolling. In the example of N x 4 vs 4 x N, the N x 4 case will SIMD and be 4x unrolled (one vector operation per column). The 4 x N case will only do a single operation per loop iteration.

Theoretically, when you don’t have dependencies (unlike s += x[i], where each iteration depends on the previous one), your CPU should be able to execute different loop iterations in parallel via out-of-order processing + speculative execution, but in practice I normally find that some unrolling tends to help.
Maybe it’s because of better out-of-order execution, or maybe it’s because of a better density of relevant instructions versus things like incrementing and checking loop counters.
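
As an illustration, here is what manual unrolling with independent accumulators looks like (a sketch; @simd or LoopVectorization normally do this for you):

function sum4(x::Vector{Float64})
    s1 = s2 = s3 = s4 = 0.0
    i = 1
    # four independent accumulators break the s += x[i] dependency chain
    while i + 3 <= length(x)
        s1 += x[i]; s2 += x[i+1]; s3 += x[i+2]; s4 += x[i+3]
        i += 4
    end
    while i <= length(x)            # remainder loop
        s1 += x[i]
        i += 1
    end
    return (s1 + s2) + (s3 + s4)
end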

4 Likes

Thanks for the details Chris. Stuff is never as simple as one wishes it were, but it’s great to have knowledgeable people around to learn from :slight_smile:

2 Likes

SVector{3} is not packed into a SIMD vector; it lives on the stack, that’s the main point.

I was referring to a vector of SVectors being stored in memory like a matrix, but I was a bit sloppy with my wording, so that might not have been totally clear.
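
Concretely, this quick check shows the layout (illustrative only):

using StaticArrays

v = [SVector(1.0, 2.0, 3.0), SVector(4.0, 5.0, 6.0)]
# the SVectors are stored contiguously, so the same memory can be viewed
# as a 3 x N matrix with one column per element
reshape(reinterpret(Float64, v), 3, :)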

1 Like

Tangential to the question (but possibly interesting for somebody following this topic): there is a package https://github.com/SciML/NBodySimulator.jl which does essentially the same calculations, but uses DiffEq internally, so it is more stable numerically than this simple implementation.

4 Likes

Also, it is worth knowing that the standard algorithm for integrating Newton’s equations in particle simulations is velocity Verlet, which is straightforward to implement in this package.
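
For reference, one velocity-Verlet step looks like this (a generic sketch; accel(x) stands for whatever acceleration routine you use):

function verlet_step!(x, v, a_old, accel, dt)
    @. x += v * dt + 0.5 * a_old * dt^2    # position update
    a_new = accel(x)
    @. v += 0.5 * (a_old + a_new) * dt     # velocity update with averaged accelerations
    return a_new                           # becomes a_old on the next step
end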

3 Likes

I could not resist testing some other things. First, a simple parallelization, in which I just avoid race conditions by creating one acc array per thread. With 4 threads (my laptop has 4 cores), I now get (down from the 172 s of the latest version):

julia> @time NBabel("./input2k.dat")
114.693995 seconds (281.92 k allocations: 32.341 MiB, 0.00% gc time)
(Ekin = 0.24083906800778096, Epot = -0.47480580878465395, Etot = -0.233966740776873)

Code

module NB

"""
This is an implementation of the NBabel N-body problem.
See nbabel.org for more information.
This is 'naive & native' Julia with type and loop annotations for
fastmath and vectorization, because this really is no effort.
Without annotations the code will be as slow as naive Python/IDL.
    include("nbabel.jl")
    NBabel("IC/input128", show=true)
Julius Donnert INAF-IRA 2017
"""

using Printf
using DelimitedFiles
using StaticArrays

tbeg = 0.0
tend = 10.0
dt = 0.001

function NBabel(fname::String; tend = tend, dt = dt, show=false)

    if show
	    println("Reading file : $fname")
    end

    ID, mass, pos, vel = read_ICs(fname)

    return NBabelCalcs(ID, mass, pos, vel, tend, dt, show)
end

function NBabelCalcs(ID, mass, pos, vel, tend = tend, dt = dt, show=false)
    nthreads = Threads.nthreads()
    acc_parallel = Array{eltype(vel)}(undef,length(vel),nthreads)
    acc = compute_acceleration(pos, mass, acc_parallel)
    last_acc = copy(acc)

    Ekin, Epot = compute_energy(pos, vel, mass)
    Etot_ICs = Ekin + Epot

    t = 0.0
    nstep = 0

    while t < tend

        pos = update_positions(pos, vel, acc, dt)

        last_acc .= acc

        acc = compute_acceleration(pos, mass, acc_parallel)

        vel = update_velocities(vel, acc, last_acc, dt)
        
        t += dt
        nstep += 1

        if show && nstep%100 == 0

            Ekin, Epot = compute_energy(pos, vel, mass)
            Etot = Ekin + Epot
            dE = (Etot - Etot_ICs)/Etot_ICs

            @printf "t = %g, Etot=%g, Ekin=%g, Epot=%g, dE=%g \n" t Etot Ekin Epot dE
        end
    end

    Ekin, Epot = compute_energy(pos, vel, mass)
    Etot = Ekin + Epot
    return (; Ekin, Epot, Etot)
    # return size(pos,1)
end

function update_positions(pos, vel, acc, dt)
    N = length(pos)
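    # velocity-Verlet position update: x += v*dt + 0.5*a*dt^2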
    for i in 1:N
        pos[i] = (0.5 * acc[i] * dt + vel[i])*dt + pos[i]
    end
    return pos
end

function update_velocities(vel, acc, last_acc, dt)
    N = length(vel)
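    # velocity-Verlet velocity update, averaging old and new accelerations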
    for i in 1:N
        vel[i] = vel[i] + 0.5 * dt * (acc[i] + last_acc[i])
    end
    return vel
end

"""
Force calculation.
"""
function compute_acceleration(pos, mass, acc)
    nthreads = Threads.nthreads()
    N = length(pos)
    for i in 1:N
      for j in 1:nthreads
        acc[i,j] = zero(eltype(acc))
      end
    end
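    # each thread writes only to its own column of acc (indexed by threadid),
    # avoiding data races; the columns are summed into column 1 below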
    @inbounds Threads.@threads for i = 1:N-1
        id = Threads.threadid()
        for j = i+1:N
            dr = pos[i] - pos[j]
            rinv3 = 1/sqrt(sum(dr .^ 2)) ^ 3
            acc[i,id] = acc[i,id] - mass[i] * rinv3 * dr
            acc[j,id] = acc[j,id] + mass[j] * rinv3 * dr
        end
    end
    for i in 1:N
      for j in 2:nthreads
        acc[i,1] += acc[i,j]
      end
    end
    return @view acc[:,1]
end


"""
Kinetic and potential energy.
"""
function compute_energy(pos, vel, mass)
    N = length(vel)

    Ekin = 0.0

    for i = 1:N
        Ekin += 0.5 * mass[i] * sum(vel[i] .^ 2)
    end

    nthreads = Threads.nthreads()
    Epot = zeros(nthreads)

    @inbounds Threads.@threads for i = 1:N-1
        id = Threads.threadid()
        for j = i+1:N
            dr = pos[i] - pos[j]
            rinv = 1/sqrt(sum(dr .^ 2))
            Epot[id] -= mass[i] * mass[j] * rinv
        end
    end

    return Ekin, sum(Epot)
end

function read_ICs(fname::String)

    ICs = readdlm(fname)

    N = size(ICs,1)

    pos = Array{SVector{3,Float64}}(undef, N)
    vel = Array{SVector{3,Float64}}(undef, N)

    id = @view ICs[:, 1]
    mass = @view ICs[:, 2]

    for i in axes(ICs, 1)
        pos[i] = SVector{3,Float64}(ICs[i, 3], ICs[i, 4], ICs[i, 5])
    end

    for i in axes(ICs, 1)
        vel[i] = SVector{3,Float64}(ICs[i, 6], ICs[i, 7], ICs[i, 8])
    end

    return id, mass, pos, vel
end

export NBabel

end


#function main(args)
#    NBabel(args[1], show=true)
#end
#main(ARGS)

And recently @Elrod taught me that one way to take advantage of SIMD is to define our structures with 4 fields, even if the fourth field is not actually used. For that I created a Particle structure with four fields, and the operations that are required for that structure (I think there is a simpler way, I have to revise the lessons :slight_smile: ):

struct Particle
  x :: Float64
  y :: Float64
  z :: Float64
  w :: Float64
end
Particle(x,y,z) = Particle(x,y,z,0)
norm2(p :: Particle) = p.x^2 + p.y^2 + p.z^2 + p.w^2
norm(p :: Particle) = sqrt(norm2(p))
import Base.+, Base.-, Base.*, Base.zero
-(p1::Particle,p2::Particle) = Particle(p1.x-p2.x,p1.y-p2.y,p1.z-p2.z,p1.w-p2.w)
+(p1::Particle,p2::Particle) = Particle(p1.x+p2.x,p1.y+p2.y,p1.z+p2.z,p1.w+p2.w)
*(c,p1::Particle) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w)
*(p1::Particle,c) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w)
zero(T::Type{Particle}) = Particle(0,0,0,0)

The code was adapted accordingly, and now using a single thread I get (down from 172 s):

julia> @time NBabel("./input2k.dat")
150.969980 seconds (130.74 k allocations: 11.073 MiB)
(Ekin = 0.24191235979106984, Epot = -0.46958087700389634, Etot = -0.2276685172128265)

I do not exclude the possibility of Chris Elrod saying that what I did does not make sense and my benchmark is flawed :slight_smile: , but if that is the case we will keep learning. We got a 22 s speedup (12%).

And, finally, using 4 threads:

julia> @time NBabel("./input2k.dat")
 97.990265 seconds (281.44 k allocations: 32.498 MiB, 0.00% gc time)
(Ekin = 0.24083906800778096, Epot = -0.47480580878465395, Etot = -0.233966740776873)

Thus, from the 210 s of the original version, we are now at 151 s for the single-threaded execution. With the same improvement rate, that would match the Pythran implementation almost exactly: (151/210)*173 = 124.4 s (124 s is reported for the Pythran version).

With the multi-threaded version (which is far from optimal, because no care was taken to balance the load, and competing processes were present), I get 98 s here.

Code

module NB

"""
This is an implementation of the NBabel N-body problem.
See nbabel.org for more information.
This is 'naive & native' Julia with type and loop annotations for
fastmath and vectorization, because this really is no effort.
Without annotations the code will be as slow as naive Python/IDL.
    include("nbabel.jl")
    NBabel("IC/input128", show=true)
Julius Donnert INAF-IRA 2017
"""

using Printf
using DelimitedFiles
using StaticArrays

tbeg = 0.0
tend = 10.0
dt = 0.001

struct Particle
  x :: Float64
  y :: Float64
  z :: Float64
  w :: Float64
end
Particle(x,y,z) = Particle(x,y,z,0)
norm2(p :: Particle) = p.x^2 + p.y^2 + p.z^2 + p.w^2
norm(p :: Particle) = sqrt(norm2(p))
import Base.+, Base.-, Base.*, Base.zero
-(p1::Particle,p2::Particle) = Particle(p1.x-p2.x,p1.y-p2.y,p1.z-p2.z,p1.w-p2.w)
+(p1::Particle,p2::Particle) = Particle(p1.x+p2.x,p1.y+p2.y,p1.z+p2.z,p1.w+p2.w)
*(c,p1::Particle) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w)
*(p1::Particle,c) = Particle(c*p1.x,c*p1.y,c*p1.z,c*p1.w)
zero(T::Type{Particle}) = Particle(0,0,0,0)

function NBabel(fname::String; tend = tend, dt = dt, show=false)

    if show
	    println("Reading file : $fname")
    end

    ID, mass, pos, vel = read_ICs(fname)

    return NBabelCalcs(ID, mass, pos, vel, tend, dt, show)
end

function NBabelCalcs(ID, mass, pos, vel, tend = tend, dt = dt, show=false)
    nthreads = Threads.nthreads()
    acc_parallel = Array{eltype(vel)}(undef,length(vel),nthreads)
    acc = compute_acceleration(pos, mass, acc_parallel)
    last_acc = copy(acc)

    Ekin, Epot = compute_energy(pos, vel, mass)
    Etot_ICs = Ekin + Epot

    t = 0.0
    nstep = 0

    while t < tend

        pos = update_positions(pos, vel, acc, dt)

        last_acc .= acc

        acc = compute_acceleration(pos, mass, acc_parallel)

        vel = update_velocities(vel, acc, last_acc, dt)
        
        t += dt
        nstep += 1

        if show && nstep%100 == 0

            Ekin, Epot = compute_energy(pos, vel, mass)
            Etot = Ekin + Epot
            dE = (Etot - Etot_ICs)/Etot_ICs

            @printf "t = %g, Etot=%g, Ekin=%g, Epot=%g, dE=%g \n" t Etot Ekin Epot dE
        end
    end

    Ekin, Epot = compute_energy(pos, vel, mass)
    Etot = Ekin + Epot
    return (; Ekin, Epot, Etot)
    # return size(pos,1)
end

function update_positions(pos, vel, acc, dt)
    N = length(pos)
    for i in 1:N
        pos[i] = (0.5 * acc[i] * dt + vel[i])*dt + pos[i]
    end
    return pos
end

function update_velocities(vel, acc, last_acc, dt)
    N = length(vel)
    for i in 1:N
        vel[i] = vel[i] + 0.5 * dt * (acc[i] + last_acc[i])
    end
    return vel
end

"""
Force calculation.
"""
function compute_acceleration(pos, mass, acc)
    nthreads = Threads.nthreads()
    N = length(pos)
    for i in 1:N
      for j in 1:nthreads
        acc[i,j] = zero(eltype(acc))
      end
    end
    @inbounds Threads.@threads for i = 1:N-1
        id = Threads.threadid()
        @simd for j = i+1:N
            dr = pos[i] - pos[j]
            rinv3 = 1/norm(dr)^3
            acc[i,id] = acc[i,id] - mass[i] * rinv3 * dr
            acc[j,id] = acc[j,id] + mass[j] * rinv3 * dr
        end
    end
    for i in 1:N
      for j in 2:nthreads
        acc[i,1] += acc[i,j]
      end
    end
    return @view acc[:,1]
end


"""
Kinetic and potential energy.
"""
function compute_energy(pos, vel, mass)
    N = length(vel)

    Ekin = 0.0

    for i = 1:N
        Ekin += 0.5 * mass[i] * norm2(vel[i])
    end

    nthreads = Threads.nthreads()
    Epot = zeros(nthreads)

    @inbounds Threads.@threads for i = 1:N-1
        id = Threads.threadid()
        @simd for j = i+1:N
            dr = pos[i] - pos[j]
            rinv = 1/norm(dr)
            Epot[id] -= mass[i] * mass[j] * rinv
        end
    end

    return Ekin, sum(Epot)
end

function read_ICs(fname::String)

    ICs = readdlm(fname)

    N = size(ICs,1)

    pos = Array{Particle}(undef, N)
    vel = Array{Particle}(undef, N)

    id = @view ICs[:, 1]
    mass = @view ICs[:, 2]

    for i in axes(ICs, 1)
        pos[i] = Particle(ICs[i, 3], ICs[i, 4], ICs[i, 5])
    end

    for i in axes(ICs, 1)
        vel[i] = Particle(ICs[i, 6], ICs[i, 7], ICs[i, 8])
    end

    return id, mass, pos, vel
end

export NBabel

end


#function main(args)
#    NBabel(args[1], show=true)
#end
#main(ARGS)

The codes are: https://github.com/lmiq/nbabel/tree/master/julia

The times reported there were measured with the system time command, so they include compilation and everything (and used -O3 -check-bounds=no, which improved the times quite a bit).

7 Likes
  1. Adding a fourth dimension for SIMD speedup is counterintuitive, but it works! Amazing, just amazing. Interestingly enough, it also improves numerical stability.

  2. There is a small mistake in your code: you declare pos and vel as Array{Particle}. Array is an abstract type (the number of dimensions is left unspecified), so this container is very slow. It should be Vector{Particle}; then you’ll get no extra allocations.

With this gist (“NBabel with 4th dimension for simd calculations” on GitHub) I get, on a single thread:

  • 27 sec for 1k
  • 95 sec for 2k
4 Likes

One more update: https://gist.github.com/Arkoniak/5e52e111082f39d21a2f00ba2dbd1049

This is a threaded version with split balancing.

With 4 threads on i7-7700HQ CPU @ 2.80GHz

  • 14 sec for 1k
  • 35 sec for 2k

I’ve used the command time julia -O3 nbabel4.jl ../data/input1k, so this timing includes all compilation.

5 Likes

Great!

One small note: the acceleration loop goes to N instead of N-1 (line 107):

 @inbounds for i in 1:N

If you have some time, could you expand a bit on how you did the split balancing?

edit: just to be sure, the way you implemented it will only work if the number of particles is a multiple of the number of processors, is that correct? Otherwise there should be a remaining chunk.

Of course it is Chris Elrod who should explain that, but the point is that this way one can load an integer number of particles into the processor’s SIMD registers, which are 512 bits wide on these processors (and the struct, with four Float64 fields, is 256 bits), thus each register holds exactly 2 “particles”. The numerical stability is strange, though.
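
The struct size can be checked directly; four Float64 fields give 32 bytes, i.e. 256 bits, so two Particles fill one 512-bit register:

julia> sizeof(Particle)
32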

1 Like

I made that splitting procedure a long time ago, so it took me some time to remember how it works. It is indeed slightly cryptic :slight_smile:

The idea is simple, though. We have the following nested loops

for i in 1:N
  for j in i+1:N
    # Some operation
  end
end

and we assume that the operation inside this loop takes the same time regardless of the index. Since each task takes some time to get scheduled, we want to spawn the bare minimum of tasks. This leads us to the problem of splitting the outer i loop into k chunks (where k is the number of threads), such that the number of operations in each chunk is equal.

Now, this could be solved exactly, but I make another assumption: that N >> 1, so we can throw away the discrete nature of i and replace it with a continuous variable x. Then the number of inner-loop operations for each x is just N - x (one can easily see that, taking into account the linearity of the problem and the fact that when x = 0 the number of elements in the inner loop is N, and when x = N the number of elements is 0).

In this approach, the total number of operations is \int_{0}^{N} (N - x) dx = \frac{N^2}{2}, so each chunk should contain \frac{N^2}{2k} operations. For chunk number m, the number of operations in all chunks from 1 to m is given by

\int_0^{t_m} (N - x) dx = \frac{N^2}{2k}m

where t_m is the point at which we should split off the next chunk.

After integration and substitution we obtain

N t_m - \frac{t_m^2}{2} = \frac{N^2}{2k}m

Solving this equation we get

t_m = N (1 \pm \sqrt{1 - \frac{m}{k}})

and in order to satisfy the boundary conditions t_0 = 0, t_k = N, we choose the minus sign.

As a last and, to be honest, unnecessary step, I change the definition of m to m \rightarrow k - m, and the answer is

t_m = N \left(1 - \sqrt{\frac{m}{k}}\right), \quad m = k, k-1, \ldots, 0

The chunks are just the intervals (t_m, t_{m - 1}].

Of course this could be done more accurately, and these assumptions are more imbalanced for small N, but multithreading for small N is meaningless anyway, and deviations related to scheduling and other things will nullify more exact solutions (or at least that’s what I believe; I could be wrong).
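
In code, the splitting could look something like this (my sketch of the formula above, not the exact implementation from the gist):

# chunk boundaries t_m = N(1 - sqrt(m/k)) for m = k, k-1, ..., 0,
# so each chunk (t_m, t_{m-1}] holds ~N^2/(2k) pair operations
function balanced_chunks(N, k)
    bounds = [round(Int, N * (1 - sqrt(m / k))) for m in k:-1:0]
    return [bounds[m]+1:bounds[m+1] for m in 1:k]
end

balanced_chunks(1000, 4)   # 1:134, 135:293, 294:500, 501:1000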

One small note: the acceleration loop goes to N instead of N-1 (line 107):

I do not quite get it: it is written as a loop from 1 to N, so I do not see any N - 1.

7 Likes

That double loop should run over the pairs of particles, therefore it should be

for i in 1:N-1
   for j in i+1:N
      ...
   end
end

ah, but OK, the result is the same, because if i = N the inner loop is not entered anyway.

And thank you very much for the detailed explanation. I will get pen and paper and study it. Very interesting.

2 Likes

Just to conclude: the codes I changed are here:

https://github.com/lmiq/nbabel/tree/master/julia

In the end, the serial version of the code now takes ~30 s on my computer, down from ~50 s, for 1k particles, and 120 s vs. 200 s for 2k particles. My parallel implementation only consists of adding per-thread arrays to avoid race conditions and adding the threads flags, and with that I get ~20 s and ~80 s for 1k and 2k, respectively.

It seems that the serial version ends up slightly faster than the Python implementation of the original post (as now reported on the developer’s page), so all the “easy” optimizations are probably done.

I wonder if someone accepts the challenge of implementing a GPU-accelerated version :slight_smile: I would try to do it to learn, but now I have to grade a hundred exams :frowning: .

2 Likes
  1. Has anybody double-checked the reported runtimes of the Python-Transonic-Pythran version? Also note that some benchmarks on nbabel run until t=1, as suggested by the page, and others until t=10.

  2. Has anyone checked whether the py version is multithreaded?

  3. I wrote a GPU kernel a while ago. Just replace the Lennard-Jones force with gravity. You can find it here.

2 Likes

The runtimes were checked by the developer himself in comparison with the new Julia implementations (the serial version is faster than the fastest Python version). Those are merged in the main repository. I was unable to run the fast Python version; I did not manage to install the dependencies correctly.

2 Likes

Hi, I created the mentioned benchmark repository. It’s great to see what you were able to obtain with Julia!

It’s interesting to see that the super-optimized Julia version (sequential) is “only” ~1.1 to 1.2× faster than Python. IMHO, it shows that we are not too bad with our simple version using Transonic-Pythran.

However, it’s nice to see how easily the Julia version can be parallelized. I didn’t try to do that with Transonic-Pythran, which supports parallelization with OpenMP.

The strategy of a struct with 4 floats to allow more SIMD instructions is really nice! I wonder what the effect is on energy consumption (which was the subject of the initial paper).

Julia is really great in general and a great source of inspiration for efficient computing with Python :slight_smile:

I’m surprised that you were not able to run the Python benchmarks. It should be sufficient to run:

conda create -n env_py python pythran transonic pandas
conda activate env_py
cd nbabel/py
make bench1k
3 Likes

Hi @paugier, great work on Transonic! Did you have time to check how Rust SIMD performed compared to Julia and Pythran on your setup? If you have any trouble there, let me know!

I appreciate the compliment, but nothing I can code can be called “super-optimized”, haha, I have been coding in Julia for less than a year now. The sequential version without the “4-element” struct trick is an ‘adequate’ version, taking care of the type stability that is required for the compiler to be able to do its optimizations. Keeping code type-stable is what one has to learn from the beginning to use Julia properly. Probably type annotations are behind the Python add-ons that allow the code to become fast as well. In Julia one learns to write code taking that into account from the start; that is at the very core of Julia’s functionality and performance.

The use of the 4-field structs is a more advanced trick, indeed, and it may or may not work, depending on how the data is treated afterwards. I would guess (really, only guess) that in that case it saves energy, because fewer memory accesses and processor cycles are performed.

On the other hand, the parallel version that I posted is very elementary. Above you can find a better split-balanced version, which scales better.

Welcome to the community! And congratulations on the work with Transonic! Really impressive.

4 Likes

It seems to me that there are a few things in the best Julia versions that are quite “advanced” (not only “type stability”). At least one needs a good understanding of computing in general and a good knowledge of Julia to code that; in my view, the same kind of understanding and knowledge that are needed to write efficient modern C++. This is why I used “super optimized”, but you are right, of course; I guess one can still go further in advanced optimization.

By the way, I said that the “super optimized Julia version (sequential) is ‘only’ ~1.1 to 1.2× faster than Python”, but I need to correct two things. First, not “super optimized” but only “optimized”. And second, the speed comparison strongly depends on the size: for 16384 particles (the size considered in the article), the Python implementation is actually still 20% faster on my PC (comparing sequential with sequential). For t_end = 0.2: 144 s for Python and 178 s for Julia (1.5.3).

The use of the 4-field structs is a more advanced trick, indeed, and it may or may not work, depending on how the data is treated afterwards. I would guess (really, only guess) that in that case it saves energy, because fewer memory accesses and processor cycles are performed.

Of course, but some SIMD instructions are known to be less efficient in terms of energy consumption, so it’s not so clear.

Probably type annotations are behind the Python add-ons that allow the code to become fast as well.

With Transonic, type annotations are needed only for ahead-of-time compilation (in this case, only for the arguments of one function). But one would get the same performance with the @jit decorator, without these annotations.