Help speeding up a test progam from a book

The derivation of Rosenbrock methods assumes the Jacobian is exactly correct in its order derivation. Thus if it’s wrong, you get order loss. ROSW, or Rosenbrock-W methods, don’t need this but “exactly” but still need a good Jacobian and it’s tied to the error. If these Jacobians are too far off you will see more error.

Implicit methods only use the Jacobian in the implicit solver so that’s more of a line search which is why they can re-use Jacobians. The middle ground is (E)SDIRK methods which are line-by-line implicit while having multiple steps all using the same Jacobian I - gamma*J. These methods are thus similar to Rosenbrock methods (in fact, Rosenbrock methods can be thought of as special SDIRK methods where a single Newton step is done at each stage of the method) but robust to Jacobian reuse and thus are good when the second derivative is tame. I plan on mentioning this in a review on the methods soon.

You must have changed the structure planet because there is no v member - only vx, vy, vz.

I think he/she is using my StaticArrays version

Though I appreciate all of the work folks have doing trying to speed this program up - I think the point was missed. I was trying to do basically a clone of the original C/C++ version of the program. On the Julia home page under “A Summary of Features” it states: “Good performance, approaching that of statically-compiled languages like C.” So, I was looking for something like “compiler options” to see if the program could be optimized - not completely re-writing the code. The author states in his book that the compiled C++ version of this little program took ~6.5 seconds and my Julia version took ~20 seconds. That’s nearly 3 times slower.

For me, the c code ran in 0.75 seconds and julia in 1 second.

1 Like

That’s damn good! Great!

Did you find the -O3 option?

That shaved about 4 seconds off - the original code I posted. That’s a big improvement.

Also, did you recompile your system image? That helps, especially with -O3.

https://docs.julialang.org/en/stable/devdocs/sysimg/

We are discussing single precision arithmentic, and indeed variable presision arithmentic.
In the original source on the benchmarksgame.alioth website for example the mass of Jupiter is given to 17 decimal places. That constant seems to be pretty well used in lots of N-body benchmarks if you Google for it.
Is the mass of Jupiter really known to 17 places? I tried Wikipedia but that just points to a paper.
I guess that number is accurate, and I woudl be interested to hear from an astronomer how it is obtained.

Also relevant I read a paper a couple of years ago where the energy cost of computations in various formats was being worked out, eg. picojoules per floating point operation. It is not really that esoteric if we think about exascale computing, where every energy efficiency will have to be used unless we want to couple up our supercomputers to the nearest power station (I read the other day that the working limit is being taken as 20MW for an exascale machine BTW).
Sadly I have lost the reference to that paper, which made a lot of sense regardign choosing an appropriate type for computations and not just flinging double floats at all problems.

Well, the integrators the benchmarks are using are definitely not getting anywhere near 17 places of precision, so it doesn’t matter. Especially for long-time integrations of N-body problems, the error due to missing a large asteroid could easily matter at the lower digits.

Nope, that’s not an asteroid. Its the Death Star lurking out beyond Jupiter… As discovered in my Hollywood movie plot when some young, good looking computer programmers break into NASA clutching a magnetic tape containing a new computer language called ‘Julia’. Julia codes reveal that NASA has been hiding the presence of the Death Star from us for decades.

Seriously though these constants seem to be widely used in this n-body benchmark. I’m just wondering what is the ultimate source.

Exactly.

Re this missing the point of the exercise: Sure, reordering

dist=sqrt(dist2);mag=dt/(dist*dist*dist);
#into
dist=sqrt(dist2);mag=dt/(dist2*dist2);mag*=dist; 

is not julia-specific but rather something that should be done in C as well.

This is not an entirely different algorithm, just a way of writing the same algorithm that is more friendly to out-of-order execution. It is an optimization that would be permitted in fastmath for a C compiler, even though I don’t think clang, julia or gcc are smart enough to do it automatically (who knows about icc?). And keeping out-of-order execution in mind when writing julia code is good practice, since julia is low-level enough for such considerations to make sense- the same as keeping cache prediction in mind, and the same way as when writing C code (as opposed to e.g. python, where you should keep CPython internals in mind, instead of your processor manual).

Switching to lower precision math is beyond the scope of the language benchmark game, this I will shamelessly admit. Even if weird square-root games are done in many C compilers on many architectures, if you allow fastmath.

I recently looked a little at llvm; just fyi (especially @ChrisRackauckas for mixed precision math in other projects), you can get a big speedup by using

rsqrtss(x::Float32) =
   Base.llvmcall(("declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>) nounwind readnone", 
 "%v = insertelement <4 x float> undef, float %0, i32 0
  %resv = call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %v)
  %res = extractelement <4 x float> %resv, i32 0
  ret float %res"), Float32, Tuple{Float32}, x)  

This computes a very fast 12-15 bit reciprocal squareroot (single instruction on x86, inlined). The llvm declaration for this intrinsic is kinda weird, but llvm is smart enough to only use one xmm register and not invalidate the others. Still, the resulting native code makes somewhat strange choices in register allocation (unnecessary movps); apparently there are not a lot of people using this intrinsic.

Using this instead of the division and square-root gives a speedup of ca. 2x vs double precision and 1.5x vs single precision on my machine (at the cost of abysmal precision).

3 Likes

Simply using StaticArrays instead of a mutable struct can give you more than 10X speed up. I doubt even C++ can give you the same. Of course allow -O3 and disable bounds-checking when running the following equivalent code (some changes to suite SVectors, of course):

using StaticArrays
const planet = SVector{7, Float64}

const solar_mass = (4 * pi * pi)
const days_per_year = 365.24

function advance(nbodies, bodies, dt)
  for i = 1:nbodies
    b = bodies[i]
    for j = i+1:nbodies
      b2 = bodies[j]
      dx = b[1] - b2[1]
      dy = b[2] - b2[2]
      dz = b[3] - b2[3]
      distance = sqrt(dx * dx + dy * dy + dz * dz)
      mag = dt / (distance * distance * distance)
      b4  = dx * b2[7] * mag
      b5  = dy * b2[7] * mag
      b6  = dz * b2[7] * mag
      b24 = dx * b[7]  * mag
      b25 = dy * b[7]  * mag
      b26 = dz * b[7]  * mag
      b  -= planet(0, 0, 0, b4, b5, b6, 0)
      b2 += planet(0, 0, 0, b24,b25,b26, 0)
    end
  end
  for i = 1:nbodies
    b = bodies[i]
    # b[1] += dt * b[4]
    # b[2] += dt * b[5]
    # b[3] += dt * b[6]
    b += planet(dt*b[4], dt*b[5], dt*b[6], 0, 0, 0, 0)
  end
end

function energy(nbodies::Int, bodies)
  e = 0.0
  for i = 1:nbodies
    b::planet = bodies[i]
    e += 0.5 * b[7] * (b[4] * b[4] + b[5] * b[5] + b[6] * b[6])
    for j = i + 1:nbodies
      b2::planet = bodies[j]
      dx = b[1] - b2[1]
      dy = b[2] - b2[2]
      dz = b[3] - b2[3]
      distance = sqrt(dx * dx + dy * dy + dz * dz)
      e -= (b[7] * b2[7]) / distance
     end
  end 
  return e
end

function offset_momentum(nbodies::Int, bodies::Vector{planet})
  px = 0.0
  py = 0.0
  pz = 0.0

  for i = 1:nbodies
    px += bodies[i][4] * bodies[i][7]
    py += bodies[i][5] * bodies[i][7]
    pz += bodies[i][6] * bodies[i][7]
  end
  # bodies[1][4] = -px / solar_mass
  # bodies[1][5] = -py / solar_mass
  # bodies[1][6] = -pz / solar_mass
  bodies[1] -= planet(0,0,0, px/solar_mass, py/solar_mass, pz/solar_mass, 0)
end

const NBODIES = 5

bodies =  Array{planet}(NBODIES) 
bodies[1] = planet( # sun
          0, 0, 0, 0, 0, 0, solar_mass
          )
bodies[2] = planet(   #jupiter
          4.84143144246472090e+00,
          -1.16032004402742839e+00,
          -1.03622044471123109e-01,
          1.66007664274403694e-03  * days_per_year,
          7.69901118419740425e-03  * days_per_year,
          -6.90460016972063023e-05 * days_per_year,
          9.54791938424326609e-04  * solar_mass
          )
bodies[3] = planet( #saturn
          8.34336671824457987e+00,
          4.12479856412430479e+00,
          -4.03523417114321381e-01,
          -2.76742510726862411e-03 * days_per_year,
          4.99852801234917238e-03  * days_per_year,
          2.30417297573763929e-05  * days_per_year,
          2.85885980666130812e-04  * solar_mass
          )
bodies[4] = planet( #uranus
          1.28943695621391310e+01,
          -1.51111514016986312e+01,
          -2.23307578892655734e-01,
          2.96460137564761618e-03  * days_per_year,
          2.37847173959480950e-03  * days_per_year,
          -2.96589568540237556e-05 * days_per_year,
          4.36624404335156298e-05  * solar_mass          
          )
bodies[5] = planet( #neptune
          1.53796971148509165e+01,
          -2.59193146099879641e+01,
          1.79258772950371181e-01,
          2.68067772490389322e-03  * days_per_year,
          1.62824170038242295e-03  * days_per_year,
          -9.51592254519715870e-05 * days_per_year,
          5.15138902046611451e-05  * solar_mass
          )

function main(iterations::Int)
  n = iterations
  offset_momentum(NBODIES, bodies);
  @printf("%.9f\n", energy(NBODIES, bodies))
  for i = 1:n
  advance(NBODIES, bodies, 0.01)
  end
  @printf("%.9f\n", energy(NBODIES, bodies))
end

@time main(5000_000)

On my machine, this takes up:
-0.169075164
0.053141 seconds (26 allocations: 944 bytes)

and your original code takes up (again with -O3 and disabling bounds-checking):

-0.169075164
0.588650 seconds (28 allocations: 928 bytes)

So, I think Julia with StaticArrays challenges the fastest compiled languages.

3 Likes

Interesting. On my machine, your version is 6X faster than the version by @kristoffer.carlsson (both with -O3). His also uses StaticArrays. Why is your version so much faster?

I think this is because he used a mutable struct which is sub-optimal, but if you replace it with struct you should have the same speed as my implementation. I didn’t test it myself, though.

1 Like

Not to necro this thread too much, however just wanted to point out that reassignment to bodies was missing, e.g. bodies[i] = b, since b and b2 are no longer references.

This caused Julia compiler to actually eliminate all the computation in advance. After fixing this up, I did not see a speedup compared to @kristoffer.carlsson’s version.

1 Like

Thanks for the note. An important step (that is easy to forget) when optimizing things is to check that you actually get sensible results in the end :).

2 Likes

Thanks for the correction, I fixed that and compared against C++ version by the OP. Here are the results which, again, show how powerful Julia is using SArrays:

C++ (with -Ofast):

-0.169075164
-0.169083134
Process returned 0 (0x0)   execution time : 0.485 s 

Julia (with -O3 --check-bounds=no):

-0.169075164
-0.169083134
  0.391008 seconds (24 allocations: 1.125 KiB)

The full versions for reproducability:

using Printf
using StaticArrays
const planet = SVector{7, Float64}

const solar_mass = 4 * π^2 
const days_per_year = 365.24

function advance(nbodies, bodies, dt)
  for i = 1:nbodies
    b = bodies[i]
    for j = i+1:nbodies
      b2 = bodies[j]
      dx = b[1] - b2[1]
      dy = b[2] - b2[2]
      dz = b[3] - b2[3]
      dsq = dx*dx + dy*dy + dz*dz
      distance = sqrt(dsq)
      mag = dt / (dsq * distance)

      b4  = dx * b2[7] * mag
      b5  = dy * b2[7] * mag
      b6  = dz * b2[7] * mag
      b24 = dx * b[7]  * mag
      b25 = dy * b[7]  * mag
      b26 = dz * b[7]  * mag
      bodies[i] -= planet(0, 0, 0, b4, b5, b6, 0)
      bodies[j] += planet(0, 0, 0, b24,b25,b26, 0)
    end
  end
  for i = 1:nbodies
    b = bodies[i]
    bodies[i] += planet(dt*b[4], dt*b[5], dt*b[6], 0, 0, 0, 0)
  end
end

function energy(nbodies::Int, bodies)
  e = 0.0
  for i = 1:nbodies
    b::planet = bodies[i]
    e += 0.5 * b[7] * (b[4] * b[4] + b[5] * b[5] + b[6] * b[6])
    for j = i + 1:nbodies
      b2::planet = bodies[j]
      dx = b[1] - b2[1]
      dy = b[2] - b2[2]
      dz = b[3] - b2[3]
      distance = sqrt(dx*dx + dy*dy + dz*dz)
      e -= (b[7] * b2[7]) / distance
     end
  end 
  return e
end

function offset_momentum(nbodies::Int, bodies::Vector{planet})
  px = 0.0
  py = 0.0
  pz = 0.0

  for i = 1:nbodies
    px += bodies[i][4] * bodies[i][7]
    py += bodies[i][5] * bodies[i][7]
    pz += bodies[i][6] * bodies[i][7]
  end
  bodies[1] -= planet(0,0,0, px/solar_mass, py/solar_mass, pz/solar_mass, 0)
end

function main(iterations::Int)
    NBODIES = 5

    bodies =  Array{planet}(undef, NBODIES) 
    bodies[1] = planet( # sun
          0, 0, 0, 0, 0, 0, solar_mass
          )
    bodies[2] = planet(   #jupiter
          4.84143144246472090e+00,
          -1.16032004402742839e+00,
          -1.03622044471123109e-01,
          1.66007664274403694e-03  * days_per_year,
          7.69901118419740425e-03  * days_per_year,
          -6.90460016972063023e-05 * days_per_year,
          9.54791938424326609e-04  * solar_mass
          )
    bodies[3] = planet( #saturn
          8.34336671824457987e+00,
          4.12479856412430479e+00,
          -4.03523417114321381e-01,
          -2.76742510726862411e-03 * days_per_year,
          4.99852801234917238e-03  * days_per_year,
          2.30417297573763929e-05  * days_per_year,
          2.85885980666130812e-04  * solar_mass
          )
    bodies[4] = planet( #uranus
          1.28943695621391310e+01,
          -1.51111514016986312e+01,
          -2.23307578892655734e-01,
          2.96460137564761618e-03  * days_per_year,
          2.37847173959480950e-03  * days_per_year,
          -2.96589568540237556e-05 * days_per_year,
          4.36624404335156298e-05  * solar_mass          
          )
    bodies[5] = planet( #neptune
          1.53796971148509165e+01,
          -2.59193146099879641e+01,
          1.79258772950371181e-01,
          2.68067772490389322e-03  * days_per_year,
          1.62824170038242295e-03  * days_per_year,
          -9.51592254519715870e-05 * days_per_year,
          5.15138902046611451e-05  * solar_mass
          )

    n = iterations
    offset_momentum(NBODIES, bodies);
    
    @printf("%.9f\n", energy(NBODIES, bodies))
    for i = 1:n
        advance(NBODIES, bodies, 0.01)
    end
    @printf("%.9f\n", energy(NBODIES, bodies))
end

main(500)
@time main(5_000_000)

and

/* The Computer Language Benchmarks Game
 * http://benchmarksgame.alioth.debian.org/
 *
 * contributed by Christoph Bauer
 *
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define pi 3.141592653589793
#define solar_mass (4 * pi * pi)
#define days_per_year 365.24
struct planet {
  double x, y, z;
  double vx, vy, vz;
  double mass;
};

void advance(int nbodies, struct planet * bodies, double dt)
{
  int i, j;

  for (i = 0; i < nbodies; i++) {
    struct planet * b = &(bodies[i]);
    for (j = i + 1; j < nbodies; j++) {
      struct planet * b2 = &(bodies[j]);
      double dx = b->x - b2->x;
      double dy = b->y - b2->y;
      double dz = b->z - b2->z;
      double distance = sqrt(dx * dx + dy * dy + dz * dz);
      double mag = dt / (distance * distance * distance);
      b->vx -= dx * b2->mass * mag;
      b->vy -= dy * b2->mass * mag;
      b->vz -= dz * b2->mass * mag;
      b2->vx += dx * b->mass * mag;
      b2->vy += dy * b->mass * mag;
      b2->vz += dz * b->mass * mag;
    }
  }
  for (i = 0; i < nbodies; i++) {
    struct planet * b = &(bodies[i]);
    b->x += dt * b->vx;
    b->y += dt * b->vy;
    b->z += dt * b->vz;
  }
}

double energy(int nbodies, struct planet * bodies)
{
  double e;
  int i, j;

  e = 0.0;
  for (i = 0; i < nbodies; i++) {
    struct planet * b = &(bodies[i]);
    e += 0.5 * b->mass * (b->vx * b->vx + b->vy * b->vy + b->vz * b->vz);
    for (j = i + 1; j < nbodies; j++) {
      struct planet * b2 = &(bodies[j]);
      double dx = b->x - b2->x;
      double dy = b->y - b2->y;
      double dz = b->z - b2->z;
      double distance = sqrt(dx * dx + dy * dy + dz * dz);
      e -= (b->mass * b2->mass) / distance;
    }
  }
  return e;
}

void offset_momentum(int nbodies, struct planet * bodies)
{
  double px = 0.0, py = 0.0, pz = 0.0;
  int i;
  for (i = 0; i < nbodies; i++) {
    px += bodies[i].vx * bodies[i].mass;
    py += bodies[i].vy * bodies[i].mass;
    pz += bodies[i].vz * bodies[i].mass;
  }
  bodies[0].vx = - px / solar_mass;
  bodies[0].vy = - py / solar_mass;
  bodies[0].vz = - pz / solar_mass;
}

#define NBODIES 5
struct planet bodies[NBODIES] = {
  {                               /* sun */
    0, 0, 0, 0, 0, 0, solar_mass
  },
  {                               /* jupiter */
    4.84143144246472090e+00,
    -1.16032004402742839e+00,
    -1.03622044471123109e-01,
    1.66007664274403694e-03 * days_per_year,
    7.69901118419740425e-03 * days_per_year,
    -6.90460016972063023e-05 * days_per_year,
    9.54791938424326609e-04 * solar_mass
  },
  {                               /* saturn */
    8.34336671824457987e+00,
    4.12479856412430479e+00,
    -4.03523417114321381e-01,
    -2.76742510726862411e-03 * days_per_year,
    4.99852801234917238e-03 * days_per_year,
    2.30417297573763929e-05 * days_per_year,
    2.85885980666130812e-04 * solar_mass
  },
  {                               /* uranus */
    1.28943695621391310e+01,
    -1.51111514016986312e+01,
    -2.23307578892655734e-01,
    2.96460137564761618e-03 * days_per_year,
    2.37847173959480950e-03 * days_per_year,
    -2.96589568540237556e-05 * days_per_year,
    4.36624404335156298e-05 * solar_mass
  },
  {                               /* neptune */
    1.53796971148509165e+01,
    -2.59193146099879641e+01,
    1.79258772950371181e-01,
    2.68067772490389322e-03 * days_per_year,
    1.62824170038242295e-03 * days_per_year,
    -9.51592254519715870e-05 * days_per_year,
    5.15138902046611451e-05 * solar_mass
  }
};

int main(int argc, char ** argv)
{
  int n = 5000000; // atoi(argv[1]);
  int i;

  offset_momentum(NBODIES, bodies);
  printf ("%.9f\n", energy(NBODIES, bodies));
  for (i = 1; i <= n; i++)
    advance(NBODIES, bodies, 0.01);
  printf ("%.9f\n", energy(NBODIES, bodies));
  return 0;
}
1 Like