Nbabel nbody integrator speed up

To better understand the performance differences and learn a little Python and Pythran, I have simplified the code to a single “acceleration”-like double loop. I will discuss some points that are decisive for obtaining good performance in Julia:

  1. The most naive Julia implementation of the loop is (in all cases I compute the sum of the components of the forces, just to be sure that I am doing the same thing every time):
function forces!(x,f)
  n = length(x)
  for i in 1:n
    f[i] .= 0.
    for j in i+1:n
      dx = x[i] - x[j]  # <-- Allocates a new vector on every iteration
      f[i] = f[i] + dx # <-- also allocates
      f[j] = f[j] - dx  # <-- also allocates
    end
  end
  fsum = 0.
  for i in 1:n
    for j in 1:3
      fsum += f[i][j] 
    end
  end
  return fsum
end

N = 20_000
x = [ Float64[i,i,i] for i in 1:N ]
f = [ zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f))
    

This code runs in ~40s here. (All times are measured with time julia test.jl, in bash.)

I have marked one critical line, which allocates a new vector on every iteration. This has to be managed by hand, because the lengths of the vectors in x are not known until runtime.
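The same trade-off can be illustrated in numpy terms (a rough sketch, not part of the benchmark): an expression like `x[i] - x[j]` builds a fresh temporary array on every evaluation, while `np.subtract` with `out=` writes into a preallocated buffer instead:

```python
import numpy as np

x = np.random.rand(10, 3)
dx = np.empty(3)  # preallocated buffer

# x[1] - x[2] creates a fresh temporary array on every evaluation
tmp = x[1] - x[2]
assert tmp is not dx

# np.subtract with out= reuses the preallocated buffer: no new array
out = np.subtract(x[1], x[2], out=dx)
assert out is dx
assert np.allclose(dx, x[1] - x[2])
```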

  2. The fix for that line is to preallocate the dx vector:
function forces!(x,f)
  n = length(x)
  dx = zeros(3) # <-- preallocate dx
  for i in 1:n
    f[i] .= 0.
    for j in i+1:n
      dx .= x[i] .- x[j] # <-- broadcast assignment
      f[i] .= f[i] .+ dx  # <-- broadcast (this could have been done in the previous version as well)
      f[j] .= f[j] .- dx   # <-- broadcast
    end
  end
  fsum = 0.
  for i in 1:n
    for j in 1:3
      fsum += f[i][j] 
    end
  end
  return fsum
end

N = 20_000
x = [ Float64[i,i,i] for i in 1:N ]
f = [ zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f))  

Now execution takes ~4s. A 10x speedup just because of that.
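For readers more familiar with Python: numpy's in-place row update plays a role loosely analogous to Julia's broadcast assignment here (a sketch, not a benchmark). Since `f[1]` is a view into `f`, `+=` updates the row without rebinding it:

```python
import numpy as np

f = np.zeros((5, 3))
dx = np.ones(3)

# f[1] is a view into f, so += updates the row in place,
# roughly like Julia's broadcast assignment f[i] .= f[i] .+ dx
f[1] += dx
assert np.allclose(f[1], [1.0, 1.0, 1.0])
assert np.shares_memory(f, f[1])
```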

  3. Still, the allocation of that vector and the assignment of its elements can be avoided altogether if one uses tuples, static vectors or, more simply, just numbers:
function forces!(x,f)
  n = length(x)
  for i in 1:n
    f[i] .= 0.
    for j in i+1:n
      for k in 1:3 # <-- now we do a loop here
        dxₖ = x[i][k] - x[j][k]
        f[i][k] = f[i][k] + dxₖ
        f[j][k] = f[j][k] - dxₖ
      end
    end
  end
  fsum = 0.
  for i in 1:n
    for j in 1:3
      fsum += f[i][j] 
    end
  end
  return fsum
end

N = 20_000
x = [ Float64[i,i,i] for i in 1:N ]
f = [ zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f)) 

Execution time is reduced to ~2s! A further 2x speedup from avoiding that assignment. This is the huge benefit of working with static variables.

  4. Finally, the actually good Julian way to do all this is to use StaticArrays for x and f. The code is much cleaner, and dx is naturally static in this case:
using StaticArrays

function forces!(x,f)
  n = length(x)
  for i in 1:n
    f[i] = zero(eltype(f))
    for j in i+1:n
      dx = x[i] - x[j]
      f[i] = f[i] + dx
      f[j] = f[j] - dx
    end
  end
  fsum = 0.
  for i in 1:n
    for j in 1:3
      fsum += f[i][j]
    end
  end
  return fsum
end

N = 20_000
x = [ @SVector Float64[i,i,i] for i in 1:N ]
f = [ @SVector zeros(3) for i in 1:N ]
println("sum of forces = ",forces!(x,f))

Execution time goes to slightly less than ~2s now.

Thus, by taking care of allocations and using static arrays correctly, one can see differences of 20x in execution time relative to the first “naive” implementation.

I have tried to write a python-pythran version of the same code, which I post below. I edited the code to avoid allocating or assigning to the critical vector, as in the third Julia version. It now takes an impressive 1.34s. Of course, this does not include compilation (as with the Julia versions).

Compilation and test with this Makefile:

CC=gcc
CXX=g++

test: __pythran__/test.py
        time python test.py 

__pythran__/test.py: test.py
        transonic test.py -af "-march=native -Ofast"

and the code is:

import numpy as np
from transonic import boost

@boost
def forces(x:"float[:,:]",f:"float[:,:]"):
    n = x.shape[0]
    for i in range(n):
        for k in range(3):
            f[i][k] = 0.
        for j in range(i + 1, n):
            for k in range(3):  # <-- Loop as in the non-allocating Julia version
                vk = x[i][k] - x[j][k]
                f[i][k] = f[i][k] + vk
                f[j][k] = f[j][k] - vk
    fsum = 0.
    for i in range(n):
        for j in range(3):
            fsum += f[i][j]

    return fsum

if __name__ == "__main__":

    import sys

    N = 20000
    x = np.array([[i,i,i] for i in range(N)],np.float64)
    f = np.array([np.zeros(3) for i in range(N)],np.float64)

    print("sum of forces = ",forces(x,f))
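At small N, the loop can also be checked against a closed form (a plain-numpy sketch, independent of transonic): since f[i] accumulates x[i] - x[j] over all j ≠ i, it reduces to N*x[i] - Σⱼ x[j]:

```python
import numpy as np

def forces_loop(x, f):
    # Same pairwise double loop as the pythran version, in plain numpy
    n = x.shape[0]
    f[:] = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            f[i] += dx
            f[j] -= dx
    return f.sum()

N = 50
x = np.array([[i, i, i] for i in range(N)], np.float64)
f = np.zeros((N, 3))
forces_loop(x, f)

# For this force law, f[i] = sum over j != i of (x[i] - x[j]) = N*x[i] - sum_j x[j]
expected = N * x - x.sum(axis=0)
assert np.allclose(f, expected)
```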

May be of your interest, @paugier .

Since at the end the times are similar and comparable to compilation time, I ran the python-pythran implementation and the static-arrays julia versions with 100_000 points. I get 20s for python-pythran and 16s for julia-staticarrays.

edit: I noticed that I was initializing x as a vector of integers in all cases. Fixing that reduced times of all Julia implementations.

new edit: fixed some assignments in the second version, indicated below. The problem was that f[i] = f[i] + dx allocates, while f[i] .= f[i] .+ dx does not. The second version is now much improved.

Also, @simd, @inbounds, and the trick of using 4-element vectors instead of 3-element vectors did not make any difference worth mentioning.

New edit:

The Fortran equivalent takes 0.5s for the 20k case and 11s for the 100k case, thus beating everyone here:

Fortran Code

Compiled with: gfortran -O3

program main
  implicit none
  integer :: i, j, k
  integer, parameter :: n = 100000
  double precision :: x(n,3), f(n,3), dx(3), fsum
  do i = 1,n
    do j = 1,3
      x(i,j) = i
    end do
  end do
  do i = 1,n
    do j = 1, 3
      f(i,j) = 0.
    end do
    do j = i+1,n
      do k = 1,3
        dx(k) = x(i,k) - x(j,k)
        f(i,k) = f(i,k) + dx(k)
        f(j,k) = f(j,k) - dx(k)
      end do
    end do
  end do
  fsum = 0.d0
  do i = 1,n
    do j = 1,3
      fsum = fsum + f(i,j) 
    end do
  end do
  write(*,*) " sum of forces = ", fsum
end


A couple of notes. I may be wrong, since I am not near a computer, so take my word with a grain of salt.

  1. In your second example, f[i] = f[i] + dx still allocates, which is why you are gaining only a 2x speedup. It should be f[i] .= f[i] .+ dx. The same goes for the next line.

  2. In your third example, it is less relevant that you are using StaticArrays and calculating data “on the stack”; more important is that a vector of vectors generates a large number of cache misses (i.e., it is a bad structure for high-performance calculations). By utilizing memory in a more CPU-friendly way, you can achieve better results. One way is to use tuples or struct arrays, or you can use a Matrix and should get the same result.
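The layout point above can be sketched in numpy terms (an illustration, not a benchmark): a list of small arrays is N separate heap allocations scattered in memory, while an (N, 3) matrix is one contiguous block whose rows are views into the same buffer:

```python
import numpy as np

N = 1000
# "Vector of vectors": N separate small arrays, scattered across the heap
vov = [np.zeros(3) for _ in range(N)]
assert not np.shares_memory(vov[0], vov[1])

# (N, 3) matrix: a single contiguous block, friendlier to the CPU cache
mat = np.zeros((N, 3))
assert mat.flags["C_CONTIGUOUS"]
# Rows are views into the same buffer, not separate allocations
assert np.shares_memory(mat[0], mat)
```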


Right. Fixing that, the time there goes to 4s.

Yes, probably using a Matrix is better in this case, where access to the values is contiguous, and that is why the Fortran version I wrote turns out to be faster.

Actually, in the third example I am still using the vector of vectors, but avoiding all allocations and assignments associated with dx. At least in the small test it didn't perform badly. Later I will test it in a larger example (I have numbered the tests now).

I just found out that this thread contributed to a comment in Nature Astronomy:

https://www.nature.com/articles/s41550-021-01342-y.epdf?sharing_token=D6WDBKpfOOcWHERGZz4AbtRgN0jAjWel9jnR3ZoTv0Pc7q-oiK1_CxsdnLnqzBdV9-Xj6CQQ98qvVv9fAgz7ySxWpoPJr5C4XgOXuetbs26fddzM8jDAaP_RUyzuPDdOG_uCV2N67FsOGrLM5IQku9K2si3Kw7cSPET-ES0qGQc%3D

And that we were acknowledged 🙂


So, according to Nature, Julia is a friend of Greta and of the environment.
