Help speeding up a test program from a book

You are right about that. Yuyichao knows his stuff!!

Presumably you’ve seen this, but in case not:

https://juliaberry.github.io/

I don’t know what “some focus” means, exactly, but please understand that it is amazing that Julia on ARM is simply an apt-get away these days (thanks to Yichao, Viral, Avik, Simon, and many others). When we started working on ARM, the only option was a local build which took several days if you were lucky and there were no errors halfway through requiring a reconfigure and rebuild.

2 Likes

Wow. I apologize - no I didn’t see that. That’s great!

The stuff I’ve been working on focuses more on sysfs (which is slower), but it is common (and somewhat standard) across linux devices.

Hi,
I got significant speedups from the following changes:

function advance(bodies::AbstractVector{Planet}, dt::Float64)
    nbodies = length(bodies)
    for i = 1:nbodies
        b = bodies[i]
        for j = (i + 1):nbodies
            b2 = bodies[j]
            dx = b.x - b2.x
            d2 = sum(dx .* dx)
            dist = sqrt(d2)
            mag = dt / (d2 * d2)
            mag *= dist

            b.v  -= dx * (b2.mass * mag)
            b2.v += dx * (b.mass  * mag)
        end
    end
    for i = 1:nbodies
        b = bodies[i]
        b.x += dt * b.v
    end
end


function main(iterations::Int)
    n = iterations

    offset_momentum(bodies);
    #@printf("%.9f\n", energy(bodies))
    for i = 1:n
        advance(bodies, 0.01)
    end
    #@printf("%.9f\n", energy(bodies));
    return nothing
end
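
For reference, the code above assumes a mutable, StaticArrays-based Planet along these lines (my sketch of the assumed layout, not the definition from the book):

using StaticArrays

mutable struct Planet
    x::SVector{3, Float64}   # position
    v::SVector{3, Float64}   # velocity
    mass::Float64
end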

I did three things: Firstly, I added parentheses to the update of the velocities. This made no difference in the generated code, because LLVM is smart enough to do it anyway; however, you want to compute the product mass*mag only once, not three times, and with the parentheses the code is obviously correct rather than merely happening to be correct.

Secondly, I removed the @printf calls in main. I do not know why, but the inliner was extremely unhappy when they were left in.

Thirdly, you will see that I changed the computation of mag. If you look at the generated native code, the original version emitted two high-latency instructions (square root and division) that depend on each other; in the rewritten form the division and the square root can be computed in parallel.
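
To make that concrete, the original presumably computed mag along these lines, while the rewritten form breaks the dependency chain:

# presumably the original: the division has to wait for the square root
dist = sqrt(d2)
mag  = dt / (d2 * dist)

# rewritten: sqrt(d2) and dt / (d2 * d2) are independent,
# so the two long-latency operations can overlap
mag  = dt / (d2 * d2)
mag *= sqrt(d2)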

In practice, you might consider doing the division and square root in single precision. I know that this defies the point of this exercise, but Float64 division and square root are really, really slow, so consider whether you can get away with cheaper ops. I am not so sure about Hamiltonian integrators (preserving energy, momentum, angular momentum; not my field), but the errors accumulated by the time discretization are much, much larger than the floating-point errors anyway.
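
As a rough sketch of what I mean (the names are mine, and whether the accuracy is acceptable depends on the problem), the expensive operations can be done in Float32 while the state stays in Float64:

d2f = Float32(d2)                                      # do the slow ops in single precision
mag = Float64(Float32(dt) / (d2f * d2f) * sqrt(d2f))   # promote back before updating the velocities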

1 Like

Mixed-precision algorithms are a very hot topic in applied math right now.

https://nickhigham.wordpress.com/2015/10/20/the-rise-of-mixed-precision-arithmetic/

I think we should explore this more in Julia.

It’s a common misconception that symplectic integrators “exactly” preserve energy, angular momentum, etc. They don’t at all. Instead, they ensure that they integrate on some symplectic manifold, meaning that they have a form of almost-exact periodicity. But since it’s not necessarily the right symplectic manifold (it could be off by the order of the error), you will see (small) oscillations in the energy and other properties. Some examples can be found in the SciML/SciMLTutorials.jl tutorials on GitHub.

5 Likes

What you really want to do is emit the juicy RSQRTSS and RCPSS instructions (on x86) and their vectorized counterparts. In terms of speedup, single precision is chump change compared to these beautiful bastards, which compute an approximate reciprocal square root and an approximate reciprocal of floats for almost the same price as a multiplication (see http://www.agner.org/optimize/instruction_tables.pdf).

Which brings me to the question: is there a sensible way of accessing these instructions in Julia? Like @really_fast_math? Or as @inline RSQRTSS(x::Float32) = something, which hopefully compiles to the single fast instruction on architectures where it exists, and otherwise to 1/sqrt(x)? (We would need to figure out whether LLVM understands this operation and possibly do some version of asm, but Julia/LLVM need to understand it well enough to optimize out boilerplate, select a free register, etc., so a raw native x86 emit does not cut it.)

Re symplectic integrators: I vaguely remember that the long-term energy non-conservation due to time discretization is exponentially small in the time step, because all sorts of things cancel. Floating-point errors are different, and hence might blow up in our face. If you say that this is not an issue, then I will believe you; I do not remember the detailed proofs for approximate energy conservation, but I do remember writing in the margins of my textbook that its proof was not directly applicable to floating-point errors. This does not preclude better theorems being available in the literature; as I said, symplectic integrators are not my field.
(Approximate evaluations of an exact function are not the same as exact evaluations of an approximate function, especially if the function and its approximation have extra structure like smoothness, analyticity, equivariance under some group, symplecticity, exactness (as a differential form), or closedness (exact and no de Rham).)

This question becomes very relevant if we perform the square root with the ~15 bits of precision given by RSQRTSS.
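
For what it’s worth, here is a pure-Julia stand-in for that kind of low-precision reciprocal square root (the classic bit-trick plus one Newton step; it is not the hardware instruction, just roughly the same precision regime):

# Approximate 1/sqrt(x) for Float32: bit-manipulation initial guess,
# then one Newton-Raphson refinement. The raw guess is good to a few
# percent; the refinement brings the relative error well under 1%.
function approx_rsqrt(x::Float32)
    i = reinterpret(UInt32, x)
    i = 0x5f3759df - (i >> 1)               # magic-constant initial guess
    y = reinterpret(Float32, i)
    return y * (1.5f0 - 0.5f0 * x * y * y)  # one Newton step
end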

There is @fastmath which can replace a bunch of operations with faster things like FMA and less precise but faster math functions. I’d check to see if that does anything here?
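
Something like this is what I have in mind (inv_dist is just a name I made up), and @code_native shows what actually gets emitted on your machine:

# @fastmath relaxes IEEE semantics for the annotated expression,
# allowing LLVM to substitute cheaper approximations where it sees fit.
inv_dist(d2) = @fastmath 1 / sqrt(d2)

# in the REPL: @code_native inv_dist(2.0)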

Well, there are two different things going on here. There’s divergence from the symplectic manifold, which is analytically zero but technically non-zero due to floating-point issues. But symplecticity and energy conservation are not the same thing. That’s shown very clearly in the Kepler example, where a 6th-order symplectic integrator is shown oscillating the energy (by a small amount, proportional to the error) while displaying a form of periodicity (but with the slight linear drift mentioned before).

http://nbviewer.jupyter.org/github/JuliaDiffEq/DiffEqTutorials.jl/blob/master/PhysicalModels/KeplerProblem.ipynb

Floating point errors are not even close to the errors one gets from almost any DE discretization unless you’re working in the extreme.

2 Likes

I thought it might be helpful to state this as simply as possible based on some reading I just did, for the rest of us non-experts (I’m sure @ChrisRackauckas will correct me if I’m mistaken):

Symplectic integrators conserve the symplectic two-form dq \wedge dp by construction (subject only to machine error), but typically introduce an error term to the physical Hamiltonian: H \to H + O(\Delta t).
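
In the language of backward error analysis (my wording, for what it’s worth): up to exponentially small remainders, a symplectic method of order p follows the exact flow of a modified Hamiltonian

\tilde H = H + \Delta t^{p} H_{p} + \Delta t^{p+1} H_{p+1} + \cdots

so what is (almost) exactly conserved is \tilde H; the physical H oscillates around it at the order of the discretization error.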

3 Likes

Unfortunately, we don’t get the low-precision version with @fastmath (on my machine @fastmath even produced a slow-down).

In fact I could not get Julia to emit this at all; LLVM wants to emit a reciprocal square root plus iterative refinement instead of 1/sqrt. Whether this is an improvement over the built-in square-root instruction and one divide is very architecture-dependent. Only if our application is happy with the abysmal precision of the machine reciprocal square root do we get the really big improvements; as far as I know, no C compiler optimizes anything down to a naked rsqrtss, under any compiler options.

I don’t know whether there is any way of getting LLVM to emit a naked rsqrtss.
See https://www.bountysource.com/issues/13241137-optimize-reciprocal-square-root-with-fast-math-x86

Floating point errors are not even close to the errors one gets from almost any DE discretization unless you’re working in the extreme.

Exactly. However, I had guessed that ~15 bits of precision for the fast reciprocal sqrt, versus analytically zero drift from the symplectic manifold, is pretty extreme, especially if you don’t really care about the (typically ill-posed) initial value problem but instead want to look at other properties (and hence don’t really care about discretization errors, which almost preserve the conserved quantities, attractors, ergodic measures on attractors, Lyapunov exponents, relevant bifurcations, relevant symmetries, or whatever you actually care about).

On the other hand, at least in this example, approximate energy conservation over this timescale still works if we do the square root in Float16. And switching the square root and division to Float32 gives another nice performance boost (still slower than what rsqrtss should give us, but meh).

The point where sloppy divisions and square roots should really shine is systems without extra structure: if your discretization error is of order 10 bits, then everything above 16 bits of precision in your flops is wasted.

Since you obviously know more about integrators than me and are here already, a slightly off-topic question: why do typical linearly implicit solvers (like Rosenbrock) not reuse the LU factorization of the Jacobian from previous time steps?

Such an approach should work for problems / phase-space regions where the second derivative is tame. We still need to compute the first derivative by auto-diff in order to check whether we need a new LU and to correct the errors from using the LU of the wrong Jacobian, but we can save the time spent computing the LU.

Alternative formulation: are there analogs of Rosenbrock methods for iterative/approximate linear solvers? We have really good starting points for the iterations from the previous time steps; it would be a shame to let these go to waste by using a from-scratch (Gaussian elimination) solver.
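
To make the question concrete, this is the kind of reuse I have in mind (a toy backward-Euler step; all names are mine):

using LinearAlgebra

# Reuse one LU factorization of W = I - dt*J across Newton iterations
# (and, ideally, across several time steps while the Jacobian is tame).
function be_step_reusing(f, u, dt, Wfact; tol = 1e-10, maxiter = 10)
    unew = copy(u)
    for _ in 1:maxiter
        r = unew - u - dt * f(unew)   # residual of u_{n+1} = u_n + dt*f(u_{n+1})
        norm(r) < tol && break
        unew -= Wfact \ r             # stale factorization instead of a fresh LU
    end
    return unew
end

# usage sketch: factor once, refresh only when convergence degrades
# Wfact = lu(I - dt * J(u))
# u = be_step_reusing(f, u, dt, Wfact)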

The derivation of Rosenbrock methods assumes the Jacobian is exactly correct in its order derivation, so if it’s wrong, you get order loss. ROSW, or Rosenbrock-W, methods don’t need the Jacobian to be exact, but they still need a good Jacobian, and it’s tied to the error: if the Jacobians are too far off, you will see more error.

Implicit methods only use the Jacobian inside the implicit solver, so there it’s more like a line search, which is why they can reuse Jacobians. The middle ground is (E)SDIRK methods, which are implicit stage by stage while having all stages use the same I - gamma*J. These methods are thus similar to Rosenbrock methods (in fact, Rosenbrock methods can be thought of as special SDIRK methods where a single Newton step is done at each stage), but they are robust to Jacobian reuse and thus are good when the second derivative is tame. I plan on mentioning this in a review of these methods soon.
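
A minimal sketch of that structure (my own toy code, not the DifferentialEquations.jl internals): each stage runs its simplified Newton iterations against the same factorized I - gamma*dt*J, so one factorization is amortized over all stages and, with reuse, over several steps.

using LinearAlgebra

# One simplified-Newton solve for an SDIRK stage increment z, where every
# iteration (and every stage of the step) reuses the same factorization Wfact.
function sdirk_stage(f, Wfact, u, dt, gamma; iters = 5)
    z = zero(u)                        # stage increment guess
    for _ in 1:iters
        r = z - gamma * dt * f(u + z)  # stage residual
        z -= Wfact \ r                 # same Wfact for all stages and iterations
    end
    return z
end

# Wfact = lu(I - gamma * dt * J)       # factor once per step, or less often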

You must have changed the Planet struct, because there is no v member - only vx, vy, vz.

I think he/she is using my StaticArrays version

Though I appreciate all of the work folks have been doing to try to speed this program up, I think the point was missed. I was trying to do basically a clone of the original C/C++ version of the program. On the Julia home page under “A Summary of Features” it states: “Good performance, approaching that of statically-compiled languages like C.” So I was looking for something like compiler options to see if the program could be optimized, not a complete rewrite of the code. The author states in his book that the compiled C++ version of this little program took ~6.5 seconds, and my Julia version took ~20 seconds. That’s about 3 times slower.

For me, the C code ran in 0.75 seconds and Julia in 1 second.

1 Like

That’s damn good! Great!

Did you find the -O3 option?
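
For reference, it’s passed on the command line (the script name here is made up), and you can confirm it took effect from inside the session:

# launch with:  julia -O3 nbody.jl
# then, inside Julia, check the optimization level in effect:
Base.JLOptions().opt_level   # 3 when started with -O3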

That shaved about 4 seconds off the original code I posted. That’s a big improvement.

Also, did you recompile your system image? That helps, especially with -O3.

https://docs.julialang.org/en/stable/devdocs/sysimg/

We are discussing single-precision arithmetic, and indeed variable-precision arithmetic.
In the original source on the benchmarksgame.alioth website, for example, the mass of Jupiter is given to 17 decimal places. That constant seems to be pretty well used in lots of N-body benchmarks if you Google for it.
Is the mass of Jupiter really known to 17 places? I tried Wikipedia, but that just points to a paper.
I guess that number is accurate, and I would be interested to hear from an astronomer how it is obtained.

Also relevant: I read a paper a couple of years ago where the energy cost of computations in various formats was worked out, e.g. picojoules per floating-point operation. It is not really that esoteric if we think about exascale computing, where every energy efficiency will have to be used unless we want to couple our supercomputers to the nearest power station (I read the other day that the working limit is being taken as 20 MW for an exascale machine, BTW).
Sadly I have lost the reference to that paper, which made a lot of sense regarding choosing an appropriate type for computations and not just flinging double floats at all problems.

Well, the integrators the benchmarks are using are definitely not getting anywhere near 17 places of precision, so it doesn’t matter. Especially for long-time integrations of N-body problems, the error due to missing a large asteroid could easily matter at the lower digits.