Help speeding up a test progam from a book

Frank_Applin · October 12, 2017, 8:13pm

You must have changed the structure planet because there is no v member - only vx, vy, vz.

kristoffer.carlsson · October 13, 2017, 8:18am

I think he/she is using my StaticArrays version

Frank_Applin · October 13, 2017, 1:04pm

Though I appreciate all of the work folks have doing trying to speed this program up - I think the point was missed. I was trying to do basically a clone of the original C/C++ version of the program. On the Julia home page under “A Summary of Features” it states: “Good performance, approaching that of statically-compiled languages like C.” So, I was looking for something like “compiler options” to see if the program could be optimized - not completely re-writing the code. The author states in his book that the compiled C++ version of this little program took ~6.5 seconds and my Julia version took ~20 seconds. That’s nearly 3 times slower.

kristoffer.carlsson · October 13, 2017, 1:10pm

For me, the c code ran in 0.75 seconds and julia in 1 second.

Frank_Applin · October 13, 2017, 1:17pm

That’s damn good! Great!

ChrisRackauckas · October 13, 2017, 2:06pm

Did you find the -O3 option?

Frank_Applin · October 13, 2017, 2:19pm

That shaved about 4 seconds off - the original code I posted. That’s a big improvement.

tshort · October 13, 2017, 2:43pm

Also, did you recompile your system image? That helps, especially with -O3.

https://docs.julialang.org/en/stable/devdocs/sysimg/

John_Hearns · October 13, 2017, 3:03pm

We are discussing single precision arithmentic, and indeed variable presision arithmentic.
In the original source on the benchmarksgame.alioth website for example the mass of Jupiter is given to 17 decimal places. That constant seems to be pretty well used in lots of N-body benchmarks if you Google for it.
Is the mass of Jupiter really known to 17 places? I tried Wikipedia but that just points to a paper.
I guess that number is accurate, and I woudl be interested to hear from an astronomer how it is obtained.

Also relevant I read a paper a couple of years ago where the energy cost of computations in various formats was being worked out, eg. picojoules per floating point operation. It is not really that esoteric if we think about exascale computing, where every energy efficiency will have to be used unless we want to couple up our supercomputers to the nearest power station (I read the other day that the working limit is being taken as 20MW for an exascale machine BTW).
Sadly I have lost the reference to that paper, which made a lot of sense regardign choosing an appropriate type for computations and not just flinging double floats at all problems.

ChrisRackauckas · October 14, 2017, 3:28am

Well, the integrators the benchmarks are using are definitely not getting anywhere near 17 places of precision, so it doesn’t matter. Especially for long-time integrations of N-body problems, the error due to missing a large asteroid could easily matter at the lower digits.

John_Hearns · October 14, 2017, 6:40am

Nope, that’s not an asteroid. Its the Death Star lurking out beyond Jupiter… As discovered in my Hollywood movie plot when some young, good looking computer programmers break into NASA clutching a magnetic tape containing a new computer language called ‘Julia’. Julia codes reveal that NASA has been hiding the presence of the Death Star from us for decades.

Seriously though these constants seem to be widely used in this n-body benchmark. I’m just wondering what is the ultimate source.

foobar_lv · October 14, 2017, 5:26pm

Exactly.

Re this missing the point of the exercise: Sure, reordering

dist=sqrt(dist2);mag=dt/(dist*dist*dist);
#into
dist=sqrt(dist2);mag=dt/(dist2*dist2);mag*=dist;

is not julia-specific but rather something that should be done in C as well.

This is not an entirely different algorithm, just a way of writing the same algorithm that is more friendly to out-of-order execution. It is an optimization that would be permitted in fastmath for a C compiler, even though I don’t think clang, julia or gcc are smart enough to do it automatically (who knows about icc?). And keeping out-of-order execution in mind when writing julia code is good practice, since julia is low-level enough for such considerations to make sense- the same as keeping cache prediction in mind, and the same way as when writing C code (as opposed to e.g. python, where you should keep CPython internals in mind, instead of your processor manual).

Switching to lower precision math is beyond the scope of the language benchmark game, this I will shamelessly admit. Even if weird square-root games are done in many C compilers on many architectures, if you allow fastmath.

foobar_lv · November 4, 2017, 8:29pm

I recently looked a little at llvm; just fyi (especially @ChrisRackauckas for mixed precision math in other projects), you can get a big speedup by using

rsqrtss(x::Float32) =
   Base.llvmcall(("declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>) nounwind readnone", 
 "%v = insertelement <4 x float> undef, float %0, i32 0
  %resv = call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %v)
  %res = extractelement <4 x float> %resv, i32 0
  ret float %res"), Float32, Tuple{Float32}, x)

This computes a very fast 12-15 bit reciprocal squareroot (single instruction on x86, inlined). The llvm declaration for this intrinsic is kinda weird, but llvm is smart enough to only use one xmm register and not invalidate the others. Still, the resulting native code makes somewhat strange choices in register allocation (unnecessary movps); apparently there are not a lot of people using this intrinsic.

Using this instead of the division and square-root gives a speedup of ca. 2x vs double precision and 1.5x vs single precision on my machine (at the cost of abysmal precision).

Seif_Shebl · November 6, 2017, 10:15pm

Simply using StaticArrays instead of a mutable struct can give you more than 10X speed up. I doubt even C++ can give you the same. Of course allow -O3 and disable bounds-checking when running the following equivalent code (some changes to suite SVectors, of course):

using StaticArrays
const planet = SVector{7, Float64}

const solar_mass = (4 * pi * pi)
const days_per_year = 365.24

function advance(nbodies, bodies, dt)
  for i = 1:nbodies
    b = bodies[i]
    for j = i+1:nbodies
      b2 = bodies[j]
      dx = b[1] - b2[1]
      dy = b[2] - b2[2]
      dz = b[3] - b2[3]
      distance = sqrt(dx * dx + dy * dy + dz * dz)
      mag = dt / (distance * distance * distance)
      b4  = dx * b2[7] * mag
      b5  = dy * b2[7] * mag
      b6  = dz * b2[7] * mag
      b24 = dx * b[7]  * mag
      b25 = dy * b[7]  * mag
      b26 = dz * b[7]  * mag
      b  -= planet(0, 0, 0, b4, b5, b6, 0)
      b2 += planet(0, 0, 0, b24,b25,b26, 0)
    end
  end
  for i = 1:nbodies
    b = bodies[i]
    # b[1] += dt * b[4]
    # b[2] += dt * b[5]
    # b[3] += dt * b[6]
    b += planet(dt*b[4], dt*b[5], dt*b[6], 0, 0, 0, 0)
  end
end

function energy(nbodies::Int, bodies)
  e = 0.0
  for i = 1:nbodies
    b::planet = bodies[i]
    e += 0.5 * b[7] * (b[4] * b[4] + b[5] * b[5] + b[6] * b[6])
    for j = i + 1:nbodies
      b2::planet = bodies[j]
      dx = b[1] - b2[1]
      dy = b[2] - b2[2]
      dz = b[3] - b2[3]
      distance = sqrt(dx * dx + dy * dy + dz * dz)
      e -= (b[7] * b2[7]) / distance
     end
  end 
  return e
end

function offset_momentum(nbodies::Int, bodies::Vector{planet})
  px = 0.0
  py = 0.0
  pz = 0.0

  for i = 1:nbodies
    px += bodies[i][4] * bodies[i][7]
    py += bodies[i][5] * bodies[i][7]
    pz += bodies[i][6] * bodies[i][7]
  end
  # bodies[1][4] = -px / solar_mass
  # bodies[1][5] = -py / solar_mass
  # bodies[1][6] = -pz / solar_mass
  bodies[1] -= planet(0,0,0, px/solar_mass, py/solar_mass, pz/solar_mass, 0)
end

const NBODIES = 5

bodies =  Array{planet}(NBODIES) 
bodies[1] = planet( # sun
          0, 0, 0, 0, 0, 0, solar_mass
          )
bodies[2] = planet(   #jupiter
          4.84143144246472090e+00,
          -1.16032004402742839e+00,
          -1.03622044471123109e-01,
          1.66007664274403694e-03  * days_per_year,
          7.69901118419740425e-03  * days_per_year,
          -6.90460016972063023e-05 * days_per_year,
          9.54791938424326609e-04  * solar_mass
          )
bodies[3] = planet( #saturn
          8.34336671824457987e+00,
          4.12479856412430479e+00,
          -4.03523417114321381e-01,
          -2.76742510726862411e-03 * days_per_year,
          4.99852801234917238e-03  * days_per_year,
          2.30417297573763929e-05  * days_per_year,
          2.85885980666130812e-04  * solar_mass
          )
bodies[4] = planet( #uranus
          1.28943695621391310e+01,
          -1.51111514016986312e+01,
          -2.23307578892655734e-01,
          2.96460137564761618e-03  * days_per_year,
          2.37847173959480950e-03  * days_per_year,
          -2.96589568540237556e-05 * days_per_year,
          4.36624404335156298e-05  * solar_mass          
          )
bodies[5] = planet( #neptune
          1.53796971148509165e+01,
          -2.59193146099879641e+01,
          1.79258772950371181e-01,
          2.68067772490389322e-03  * days_per_year,
          1.62824170038242295e-03  * days_per_year,
          -9.51592254519715870e-05 * days_per_year,
          5.15138902046611451e-05  * solar_mass
          )

function main(iterations::Int)
  n = iterations
  offset_momentum(NBODIES, bodies);
  @printf("%.9f\n", energy(NBODIES, bodies))
  for i = 1:n
  advance(NBODIES, bodies, 0.01)
  end
  @printf("%.9f\n", energy(NBODIES, bodies))
end

@time main(5000_000)

On my machine, this takes up:
-0.169075164
0.053141 seconds (26 allocations: 944 bytes)

and your original code takes up (again with -O3 and disabling bounds-checking):

-0.169075164
0.588650 seconds (28 allocations: 928 bytes)

So, I think Julia with StaticArrays challenges the fastest compiled languages.

tshort · November 7, 2017, 1:30am

Interesting. On my machine, your version is 6X faster than the version by @kristoffer.carlsson (both with -O3). His also uses StaticArrays. Why is your version so much faster?

Seif_Shebl · November 7, 2017, 8:02am

I think this is because he used a mutable struct which is sub-optimal, but if you replace it with struct you should have the same speed as my implementation. I didn’t test it myself, though.

smallnamespace · August 23, 2019, 11:31am

Not to necro this thread too much, however just wanted to point out that reassignment to bodies was missing, e.g. bodies[i] = b, since b and b2 are no longer references.

This caused Julia compiler to actually eliminate all the computation in advance. After fixing this up, I did not see a speedup compared to @kristoffer.carlsson’s version.

kristoffer.carlsson · August 23, 2019, 11:34am

Thanks for the note. An important step (that is easy to forget) when optimizing things is to check that you actually get sensible results in the end :).

Seif_Shebl · August 23, 2019, 8:14pm

Thanks for the correction, I fixed that and compared against C++ version by the OP. Here are the results which, again, show how powerful Julia is using SArrays:

C++ (with -Ofast):

-0.169075164
-0.169083134
Process returned 0 (0x0)   execution time : 0.485 s

Julia (with -O3 --check-bounds=no):

-0.169075164
-0.169083134
  0.391008 seconds (24 allocations: 1.125 KiB)

The full versions for reproducability:

using Printf
using StaticArrays
const planet = SVector{7, Float64}

const solar_mass = 4 * π^2 
const days_per_year = 365.24

function advance(nbodies, bodies, dt)
  for i = 1:nbodies
    b = bodies[i]
    for j = i+1:nbodies
      b2 = bodies[j]
      dx = b[1] - b2[1]
      dy = b[2] - b2[2]
      dz = b[3] - b2[3]
      dsq = dx*dx + dy*dy + dz*dz
      distance = sqrt(dsq)
      mag = dt / (dsq * distance)

      b4  = dx * b2[7] * mag
      b5  = dy * b2[7] * mag
      b6  = dz * b2[7] * mag
      b24 = dx * b[7]  * mag
      b25 = dy * b[7]  * mag
      b26 = dz * b[7]  * mag
      bodies[i] -= planet(0, 0, 0, b4, b5, b6, 0)
      bodies[j] += planet(0, 0, 0, b24,b25,b26, 0)
    end
  end
  for i = 1:nbodies
    b = bodies[i]
    bodies[i] += planet(dt*b[4], dt*b[5], dt*b[6], 0, 0, 0, 0)
  end
end

function energy(nbodies::Int, bodies)
  e = 0.0
  for i = 1:nbodies
    b::planet = bodies[i]
    e += 0.5 * b[7] * (b[4] * b[4] + b[5] * b[5] + b[6] * b[6])
    for j = i + 1:nbodies
      b2::planet = bodies[j]
      dx = b[1] - b2[1]
      dy = b[2] - b2[2]
      dz = b[3] - b2[3]
      distance = sqrt(dx*dx + dy*dy + dz*dz)
      e -= (b[7] * b2[7]) / distance
     end
  end 
  return e
end

function offset_momentum(nbodies::Int, bodies::Vector{planet})
  px = 0.0
  py = 0.0
  pz = 0.0

  for i = 1:nbodies
    px += bodies[i][4] * bodies[i][7]
    py += bodies[i][5] * bodies[i][7]
    pz += bodies[i][6] * bodies[i][7]
  end
  bodies[1] -= planet(0,0,0, px/solar_mass, py/solar_mass, pz/solar_mass, 0)
end

function main(iterations::Int)
    NBODIES = 5

    bodies =  Array{planet}(undef, NBODIES) 
    bodies[1] = planet( # sun
          0, 0, 0, 0, 0, 0, solar_mass
          )
    bodies[2] = planet(   #jupiter
          4.84143144246472090e+00,
          -1.16032004402742839e+00,
          -1.03622044471123109e-01,
          1.66007664274403694e-03  * days_per_year,
          7.69901118419740425e-03  * days_per_year,
          -6.90460016972063023e-05 * days_per_year,
          9.54791938424326609e-04  * solar_mass
          )
    bodies[3] = planet( #saturn
          8.34336671824457987e+00,
          4.12479856412430479e+00,
          -4.03523417114321381e-01,
          -2.76742510726862411e-03 * days_per_year,
          4.99852801234917238e-03  * days_per_year,
          2.30417297573763929e-05  * days_per_year,
          2.85885980666130812e-04  * solar_mass
          )
    bodies[4] = planet( #uranus
          1.28943695621391310e+01,
          -1.51111514016986312e+01,
          -2.23307578892655734e-01,
          2.96460137564761618e-03  * days_per_year,
          2.37847173959480950e-03  * days_per_year,
          -2.96589568540237556e-05 * days_per_year,
          4.36624404335156298e-05  * solar_mass          
          )
    bodies[5] = planet( #neptune
          1.53796971148509165e+01,
          -2.59193146099879641e+01,
          1.79258772950371181e-01,
          2.68067772490389322e-03  * days_per_year,
          1.62824170038242295e-03  * days_per_year,
          -9.51592254519715870e-05 * days_per_year,
          5.15138902046611451e-05  * solar_mass
          )

    n = iterations
    offset_momentum(NBODIES, bodies);
    
    @printf("%.9f\n", energy(NBODIES, bodies))
    for i = 1:n
        advance(NBODIES, bodies, 0.01)
    end
    @printf("%.9f\n", energy(NBODIES, bodies))
end

main(500)
@time main(5_000_000)

and

/* The Computer Language Benchmarks Game
 * http://benchmarksgame.alioth.debian.org/
 *
 * contributed by Christoph Bauer
 *
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define pi 3.141592653589793
#define solar_mass (4 * pi * pi)
#define days_per_year 365.24
struct planet {
  double x, y, z;
  double vx, vy, vz;
  double mass;
};

void advance(int nbodies, struct planet * bodies, double dt)
{
  int i, j;

  for (i = 0; i < nbodies; i++) {
    struct planet * b = &(bodies[i]);
    for (j = i + 1; j < nbodies; j++) {
      struct planet * b2 = &(bodies[j]);
      double dx = b->x - b2->x;
      double dy = b->y - b2->y;
      double dz = b->z - b2->z;
      double distance = sqrt(dx * dx + dy * dy + dz * dz);
      double mag = dt / (distance * distance * distance);
      b->vx -= dx * b2->mass * mag;
      b->vy -= dy * b2->mass * mag;
      b->vz -= dz * b2->mass * mag;
      b2->vx += dx * b->mass * mag;
      b2->vy += dy * b->mass * mag;
      b2->vz += dz * b->mass * mag;
    }
  }
  for (i = 0; i < nbodies; i++) {
    struct planet * b = &(bodies[i]);
    b->x += dt * b->vx;
    b->y += dt * b->vy;
    b->z += dt * b->vz;
  }
}

double energy(int nbodies, struct planet * bodies)
{
  double e;
  int i, j;

  e = 0.0;
  for (i = 0; i < nbodies; i++) {
    struct planet * b = &(bodies[i]);
    e += 0.5 * b->mass * (b->vx * b->vx + b->vy * b->vy + b->vz * b->vz);
    for (j = i + 1; j < nbodies; j++) {
      struct planet * b2 = &(bodies[j]);
      double dx = b->x - b2->x;
      double dy = b->y - b2->y;
      double dz = b->z - b2->z;
      double distance = sqrt(dx * dx + dy * dy + dz * dz);
      e -= (b->mass * b2->mass) / distance;
    }
  }
  return e;
}

void offset_momentum(int nbodies, struct planet * bodies)
{
  double px = 0.0, py = 0.0, pz = 0.0;
  int i;
  for (i = 0; i < nbodies; i++) {
    px += bodies[i].vx * bodies[i].mass;
    py += bodies[i].vy * bodies[i].mass;
    pz += bodies[i].vz * bodies[i].mass;
  }
  bodies[0].vx = - px / solar_mass;
  bodies[0].vy = - py / solar_mass;
  bodies[0].vz = - pz / solar_mass;
}

#define NBODIES 5
struct planet bodies[NBODIES] = {
  {                               /* sun */
    0, 0, 0, 0, 0, 0, solar_mass
  },
  {                               /* jupiter */
    4.84143144246472090e+00,
    -1.16032004402742839e+00,
    -1.03622044471123109e-01,
    1.66007664274403694e-03 * days_per_year,
    7.69901118419740425e-03 * days_per_year,
    -6.90460016972063023e-05 * days_per_year,
    9.54791938424326609e-04 * solar_mass
  },
  {                               /* saturn */
    8.34336671824457987e+00,
    4.12479856412430479e+00,
    -4.03523417114321381e-01,
    -2.76742510726862411e-03 * days_per_year,
    4.99852801234917238e-03 * days_per_year,
    2.30417297573763929e-05 * days_per_year,
    2.85885980666130812e-04 * solar_mass
  },
  {                               /* uranus */
    1.28943695621391310e+01,
    -1.51111514016986312e+01,
    -2.23307578892655734e-01,
    2.96460137564761618e-03 * days_per_year,
    2.37847173959480950e-03 * days_per_year,
    -2.96589568540237556e-05 * days_per_year,
    4.36624404335156298e-05 * solar_mass
  },
  {                               /* neptune */
    1.53796971148509165e+01,
    -2.59193146099879641e+01,
    1.79258772950371181e-01,
    2.68067772490389322e-03 * days_per_year,
    1.62824170038242295e-03 * days_per_year,
    -9.51592254519715870e-05 * days_per_year,
    5.15138902046611451e-05 * solar_mass
  }
};

int main(int argc, char ** argv)
{
  int n = 5000000; // atoi(argv[1]);
  int i;

  offset_momentum(NBODIES, bodies);
  printf ("%.9f\n", energy(NBODIES, bodies));
  for (i = 1; i <= n; i++)
    advance(NBODIES, bodies, 0.01);
  printf ("%.9f\n", energy(NBODIES, bodies));
  return 0;
}

Topic		Replies	Views
Trying to understand low performance compared to C++ Performance	13	330	October 2, 2024
Benchmark game challenge and some optimization questions Performance	29	2788	January 13, 2024
Why Julia can be faster than C? New to Julia	3	731	October 11, 2022
Julia speed Performance question	28	2212	March 14, 2024
Help to get my slow Julia code to run as fast as Rust/Java/Lisp Performance	100	4587	August 6, 2021

Help speeding up a test progam from a book

Related topics