You must have changed the structure planet because there is no v member - only vx, vy, vz.
I think he/she is using my StaticArrays
version
Though I appreciate all of the work folks have doing trying to speed this program up - I think the point was missed. I was trying to do basically a clone of the original C/C++ version of the program. On the Julia home page under āA Summary of Featuresā it states: āGood performance, approaching that of statically-compiled languages like C.ā So, I was looking for something like ācompiler optionsā to see if the program could be optimized - not completely re-writing the code. The author states in his book that the compiled C++ version of this little program took ~6.5 seconds and my Julia version took ~20 seconds. Thatās nearly 3 times slower.
For me, the c code ran in 0.75 seconds and julia in 1 second.
Thatās damn good! Great!
Did you find the -O3
option?
That shaved about 4 seconds off - the original code I posted. Thatās a big improvement.
Also, did you recompile your system image? That helps, especially with -O3.
We are discussing single precision arithmentic, and indeed variable presision arithmentic.
In the original source on the benchmarksgame.alioth website for example the mass of Jupiter is given to 17 decimal places. That constant seems to be pretty well used in lots of N-body benchmarks if you Google for it.
Is the mass of Jupiter really known to 17 places? I tried Wikipedia but that just points to a paper.
I guess that number is accurate, and I woudl be interested to hear from an astronomer how it is obtained.
Also relevant I read a paper a couple of years ago where the energy cost of computations in various formats was being worked out, eg. picojoules per floating point operation. It is not really that esoteric if we think about exascale computing, where every energy efficiency will have to be used unless we want to couple up our supercomputers to the nearest power station (I read the other day that the working limit is being taken as 20MW for an exascale machine BTW).
Sadly I have lost the reference to that paper, which made a lot of sense regardign choosing an appropriate type for computations and not just flinging double floats at all problems.
Well, the integrators the benchmarks are using are definitely not getting anywhere near 17 places of precision, so it doesnāt matter. Especially for long-time integrations of N-body problems, the error due to missing a large asteroid could easily matter at the lower digits.
Nope, thatās not an asteroid. Its the Death Star lurking out beyond Jupiterā¦ As discovered in my Hollywood movie plot when some young, good looking computer programmers break into NASA clutching a magnetic tape containing a new computer language called āJuliaā. Julia codes reveal that NASA has been hiding the presence of the Death Star from us for decades.
Seriously though these constants seem to be widely used in this n-body benchmark. Iām just wondering what is the ultimate source.
Exactly.
Re this missing the point of the exercise: Sure, reordering
dist=sqrt(dist2);mag=dt/(dist*dist*dist);
#into
dist=sqrt(dist2);mag=dt/(dist2*dist2);mag*=dist;
is not julia-specific but rather something that should be done in C as well.
This is not an entirely different algorithm, just a way of writing the same algorithm that is more friendly to out-of-order execution. It is an optimization that would be permitted in fastmath for a C compiler, even though I donāt think clang, julia or gcc are smart enough to do it automatically (who knows about icc?). And keeping out-of-order execution in mind when writing julia code is good practice, since julia is low-level enough for such considerations to make sense- the same as keeping cache prediction in mind, and the same way as when writing C code (as opposed to e.g. python, where you should keep CPython internals in mind, instead of your processor manual).
Switching to lower precision math is beyond the scope of the language benchmark game, this I will shamelessly admit. Even if weird square-root games are done in many C compilers on many architectures, if you allow fastmath.
I recently looked a little at llvm; just fyi (especially @ChrisRackauckas for mixed precision math in other projects), you can get a big speedup by using
rsqrtss(x::Float32) =
Base.llvmcall(("declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>) nounwind readnone",
"%v = insertelement <4 x float> undef, float %0, i32 0
%resv = call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %v)
%res = extractelement <4 x float> %resv, i32 0
ret float %res"), Float32, Tuple{Float32}, x)
This computes a very fast 12-15 bit reciprocal squareroot (single instruction on x86, inlined). The llvm declaration for this intrinsic is kinda weird, but llvm is smart enough to only use one xmm register and not invalidate the others. Still, the resulting native code makes somewhat strange choices in register allocation (unnecessary movps); apparently there are not a lot of people using this intrinsic.
Using this instead of the division and square-root gives a speedup of ca. 2x vs double precision and 1.5x vs single precision on my machine (at the cost of abysmal precision).
Simply using StaticArrays
instead of a mutable struct
can give you more than 10X
speed up. I doubt even C++ can give you the same. Of course allow -O3
and disable bounds-checking
when running the following equivalent code (some changes to suite SVector
s, of course):
using StaticArrays
const planet = SVector{7, Float64}
const solar_mass = (4 * pi * pi)
const days_per_year = 365.24
function advance(nbodies, bodies, dt)
for i = 1:nbodies
b = bodies[i]
for j = i+1:nbodies
b2 = bodies[j]
dx = b[1] - b2[1]
dy = b[2] - b2[2]
dz = b[3] - b2[3]
distance = sqrt(dx * dx + dy * dy + dz * dz)
mag = dt / (distance * distance * distance)
b4 = dx * b2[7] * mag
b5 = dy * b2[7] * mag
b6 = dz * b2[7] * mag
b24 = dx * b[7] * mag
b25 = dy * b[7] * mag
b26 = dz * b[7] * mag
b -= planet(0, 0, 0, b4, b5, b6, 0)
b2 += planet(0, 0, 0, b24,b25,b26, 0)
end
end
for i = 1:nbodies
b = bodies[i]
# b[1] += dt * b[4]
# b[2] += dt * b[5]
# b[3] += dt * b[6]
b += planet(dt*b[4], dt*b[5], dt*b[6], 0, 0, 0, 0)
end
end
function energy(nbodies::Int, bodies)
e = 0.0
for i = 1:nbodies
b::planet = bodies[i]
e += 0.5 * b[7] * (b[4] * b[4] + b[5] * b[5] + b[6] * b[6])
for j = i + 1:nbodies
b2::planet = bodies[j]
dx = b[1] - b2[1]
dy = b[2] - b2[2]
dz = b[3] - b2[3]
distance = sqrt(dx * dx + dy * dy + dz * dz)
e -= (b[7] * b2[7]) / distance
end
end
return e
end
function offset_momentum(nbodies::Int, bodies::Vector{planet})
px = 0.0
py = 0.0
pz = 0.0
for i = 1:nbodies
px += bodies[i][4] * bodies[i][7]
py += bodies[i][5] * bodies[i][7]
pz += bodies[i][6] * bodies[i][7]
end
# bodies[1][4] = -px / solar_mass
# bodies[1][5] = -py / solar_mass
# bodies[1][6] = -pz / solar_mass
bodies[1] -= planet(0,0,0, px/solar_mass, py/solar_mass, pz/solar_mass, 0)
end
const NBODIES = 5
bodies = Array{planet}(NBODIES)
bodies[1] = planet( # sun
0, 0, 0, 0, 0, 0, solar_mass
)
bodies[2] = planet( #jupiter
4.84143144246472090e+00,
-1.16032004402742839e+00,
-1.03622044471123109e-01,
1.66007664274403694e-03 * days_per_year,
7.69901118419740425e-03 * days_per_year,
-6.90460016972063023e-05 * days_per_year,
9.54791938424326609e-04 * solar_mass
)
bodies[3] = planet( #saturn
8.34336671824457987e+00,
4.12479856412430479e+00,
-4.03523417114321381e-01,
-2.76742510726862411e-03 * days_per_year,
4.99852801234917238e-03 * days_per_year,
2.30417297573763929e-05 * days_per_year,
2.85885980666130812e-04 * solar_mass
)
bodies[4] = planet( #uranus
1.28943695621391310e+01,
-1.51111514016986312e+01,
-2.23307578892655734e-01,
2.96460137564761618e-03 * days_per_year,
2.37847173959480950e-03 * days_per_year,
-2.96589568540237556e-05 * days_per_year,
4.36624404335156298e-05 * solar_mass
)
bodies[5] = planet( #neptune
1.53796971148509165e+01,
-2.59193146099879641e+01,
1.79258772950371181e-01,
2.68067772490389322e-03 * days_per_year,
1.62824170038242295e-03 * days_per_year,
-9.51592254519715870e-05 * days_per_year,
5.15138902046611451e-05 * solar_mass
)
function main(iterations::Int)
n = iterations
offset_momentum(NBODIES, bodies);
@printf("%.9f\n", energy(NBODIES, bodies))
for i = 1:n
advance(NBODIES, bodies, 0.01)
end
@printf("%.9f\n", energy(NBODIES, bodies))
end
@time main(5000_000)
On my machine, this takes up:
-0.169075164
0.053141 seconds (26 allocations: 944 bytes)
and your original code takes up (again with -O3
and disabling bounds-checking
):
-0.169075164
0.588650 seconds (28 allocations: 928 bytes)
So, I think Julia with StaticArrays
challenges the fastest compiled languages.
Interesting. On my machine, your version is 6X faster than the version by @kristoffer.carlsson (both with -O3). His also uses StaticArrays. Why is your version so much faster?
I think this is because he used a mutable struct
which is sub-optimal, but if you replace it with struct
you should have the same speed as my implementation. I didnāt test it myself, though.
Not to necro this thread too much, however just wanted to point out that reassignment to bodies
was missing, e.g. bodies[i] = b
, since b
and b2
are no longer references.
This caused Julia compiler to actually eliminate all the computation in advance
. After fixing this up, I did not see a speedup compared to @kristoffer.carlssonās version.
Thanks for the note. An important step (that is easy to forget) when optimizing things is to check that you actually get sensible results in the end :).
Thanks for the correction, I fixed that and compared against C++ version by the OP. Here are the results which, again, show how powerful Julia is using SArrays:
C++ (with -Ofast
):
-0.169075164
-0.169083134
Process returned 0 (0x0) execution time : 0.485 s
Julia (with -O3 --check-bounds=no
):
-0.169075164
-0.169083134
0.391008 seconds (24 allocations: 1.125 KiB)
The full versions for reproducability:
using Printf
using StaticArrays
const planet = SVector{7, Float64}
const solar_mass = 4 * Ļ^2
const days_per_year = 365.24
function advance(nbodies, bodies, dt)
for i = 1:nbodies
b = bodies[i]
for j = i+1:nbodies
b2 = bodies[j]
dx = b[1] - b2[1]
dy = b[2] - b2[2]
dz = b[3] - b2[3]
dsq = dx*dx + dy*dy + dz*dz
distance = sqrt(dsq)
mag = dt / (dsq * distance)
b4 = dx * b2[7] * mag
b5 = dy * b2[7] * mag
b6 = dz * b2[7] * mag
b24 = dx * b[7] * mag
b25 = dy * b[7] * mag
b26 = dz * b[7] * mag
bodies[i] -= planet(0, 0, 0, b4, b5, b6, 0)
bodies[j] += planet(0, 0, 0, b24,b25,b26, 0)
end
end
for i = 1:nbodies
b = bodies[i]
bodies[i] += planet(dt*b[4], dt*b[5], dt*b[6], 0, 0, 0, 0)
end
end
function energy(nbodies::Int, bodies)
e = 0.0
for i = 1:nbodies
b::planet = bodies[i]
e += 0.5 * b[7] * (b[4] * b[4] + b[5] * b[5] + b[6] * b[6])
for j = i + 1:nbodies
b2::planet = bodies[j]
dx = b[1] - b2[1]
dy = b[2] - b2[2]
dz = b[3] - b2[3]
distance = sqrt(dx*dx + dy*dy + dz*dz)
e -= (b[7] * b2[7]) / distance
end
end
return e
end
function offset_momentum(nbodies::Int, bodies::Vector{planet})
px = 0.0
py = 0.0
pz = 0.0
for i = 1:nbodies
px += bodies[i][4] * bodies[i][7]
py += bodies[i][5] * bodies[i][7]
pz += bodies[i][6] * bodies[i][7]
end
bodies[1] -= planet(0,0,0, px/solar_mass, py/solar_mass, pz/solar_mass, 0)
end
function main(iterations::Int)
NBODIES = 5
bodies = Array{planet}(undef, NBODIES)
bodies[1] = planet( # sun
0, 0, 0, 0, 0, 0, solar_mass
)
bodies[2] = planet( #jupiter
4.84143144246472090e+00,
-1.16032004402742839e+00,
-1.03622044471123109e-01,
1.66007664274403694e-03 * days_per_year,
7.69901118419740425e-03 * days_per_year,
-6.90460016972063023e-05 * days_per_year,
9.54791938424326609e-04 * solar_mass
)
bodies[3] = planet( #saturn
8.34336671824457987e+00,
4.12479856412430479e+00,
-4.03523417114321381e-01,
-2.76742510726862411e-03 * days_per_year,
4.99852801234917238e-03 * days_per_year,
2.30417297573763929e-05 * days_per_year,
2.85885980666130812e-04 * solar_mass
)
bodies[4] = planet( #uranus
1.28943695621391310e+01,
-1.51111514016986312e+01,
-2.23307578892655734e-01,
2.96460137564761618e-03 * days_per_year,
2.37847173959480950e-03 * days_per_year,
-2.96589568540237556e-05 * days_per_year,
4.36624404335156298e-05 * solar_mass
)
bodies[5] = planet( #neptune
1.53796971148509165e+01,
-2.59193146099879641e+01,
1.79258772950371181e-01,
2.68067772490389322e-03 * days_per_year,
1.62824170038242295e-03 * days_per_year,
-9.51592254519715870e-05 * days_per_year,
5.15138902046611451e-05 * solar_mass
)
n = iterations
offset_momentum(NBODIES, bodies);
@printf("%.9f\n", energy(NBODIES, bodies))
for i = 1:n
advance(NBODIES, bodies, 0.01)
end
@printf("%.9f\n", energy(NBODIES, bodies))
end
main(500)
@time main(5_000_000)
and
/* The Computer Language Benchmarks Game
* http://benchmarksgame.alioth.debian.org/
*
* contributed by Christoph Bauer
*
*/
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define pi 3.141592653589793
#define solar_mass (4 * pi * pi)
#define days_per_year 365.24
struct planet {
double x, y, z;
double vx, vy, vz;
double mass;
};
void advance(int nbodies, struct planet * bodies, double dt)
{
int i, j;
for (i = 0; i < nbodies; i++) {
struct planet * b = &(bodies[i]);
for (j = i + 1; j < nbodies; j++) {
struct planet * b2 = &(bodies[j]);
double dx = b->x - b2->x;
double dy = b->y - b2->y;
double dz = b->z - b2->z;
double distance = sqrt(dx * dx + dy * dy + dz * dz);
double mag = dt / (distance * distance * distance);
b->vx -= dx * b2->mass * mag;
b->vy -= dy * b2->mass * mag;
b->vz -= dz * b2->mass * mag;
b2->vx += dx * b->mass * mag;
b2->vy += dy * b->mass * mag;
b2->vz += dz * b->mass * mag;
}
}
for (i = 0; i < nbodies; i++) {
struct planet * b = &(bodies[i]);
b->x += dt * b->vx;
b->y += dt * b->vy;
b->z += dt * b->vz;
}
}
double energy(int nbodies, struct planet * bodies)
{
double e;
int i, j;
e = 0.0;
for (i = 0; i < nbodies; i++) {
struct planet * b = &(bodies[i]);
e += 0.5 * b->mass * (b->vx * b->vx + b->vy * b->vy + b->vz * b->vz);
for (j = i + 1; j < nbodies; j++) {
struct planet * b2 = &(bodies[j]);
double dx = b->x - b2->x;
double dy = b->y - b2->y;
double dz = b->z - b2->z;
double distance = sqrt(dx * dx + dy * dy + dz * dz);
e -= (b->mass * b2->mass) / distance;
}
}
return e;
}
void offset_momentum(int nbodies, struct planet * bodies)
{
double px = 0.0, py = 0.0, pz = 0.0;
int i;
for (i = 0; i < nbodies; i++) {
px += bodies[i].vx * bodies[i].mass;
py += bodies[i].vy * bodies[i].mass;
pz += bodies[i].vz * bodies[i].mass;
}
bodies[0].vx = - px / solar_mass;
bodies[0].vy = - py / solar_mass;
bodies[0].vz = - pz / solar_mass;
}
#define NBODIES 5
struct planet bodies[NBODIES] = {
{ /* sun */
0, 0, 0, 0, 0, 0, solar_mass
},
{ /* jupiter */
4.84143144246472090e+00,
-1.16032004402742839e+00,
-1.03622044471123109e-01,
1.66007664274403694e-03 * days_per_year,
7.69901118419740425e-03 * days_per_year,
-6.90460016972063023e-05 * days_per_year,
9.54791938424326609e-04 * solar_mass
},
{ /* saturn */
8.34336671824457987e+00,
4.12479856412430479e+00,
-4.03523417114321381e-01,
-2.76742510726862411e-03 * days_per_year,
4.99852801234917238e-03 * days_per_year,
2.30417297573763929e-05 * days_per_year,
2.85885980666130812e-04 * solar_mass
},
{ /* uranus */
1.28943695621391310e+01,
-1.51111514016986312e+01,
-2.23307578892655734e-01,
2.96460137564761618e-03 * days_per_year,
2.37847173959480950e-03 * days_per_year,
-2.96589568540237556e-05 * days_per_year,
4.36624404335156298e-05 * solar_mass
},
{ /* neptune */
1.53796971148509165e+01,
-2.59193146099879641e+01,
1.79258772950371181e-01,
2.68067772490389322e-03 * days_per_year,
1.62824170038242295e-03 * days_per_year,
-9.51592254519715870e-05 * days_per_year,
5.15138902046611451e-05 * solar_mass
}
};
int main(int argc, char ** argv)
{
int n = 5000000; // atoi(argv[1]);
int i;
offset_momentum(NBODIES, bodies);
printf ("%.9f\n", energy(NBODIES, bodies));
for (i = 1; i <= n; i++)
advance(NBODIES, bodies, 0.01);
printf ("%.9f\n", energy(NBODIES, bodies));
return 0;
}