Help speeding up a test progam from a book

What you really want to do is emit the juicy RSQRTSS and RCPSS instructions (on x86) and their vectorized counterparts. In terms of speedup, single precision is chump change compared to these beautiful bastards (compute approximate reciprocal square-root and approx reciprocal of floats for almost the same price as multipliation, as of http://www.agner.org/optimize/instruction_tables.pdf).

Which brings me to the question: Is there a sensible way of accessing these instructions in Julia? Like @really_fast_math? Or as @inline RSQRTSS(x::Float) = something, which hopefully compiles to the single fast instruction on architectures where it exists, and otherwise to 1/sqrt(x)? (we would need to figure out whether llvm understands this operation and possibly do some version of asm - but julia/llvm need to understand this well enough to optimize out boilerplate/ select a free register/ etc, so a native x86 emit does not cut the cake)

Re symplectic integrators: I kinda remember that the long-term energy non-conservation due to time-discretization is exponentially small in the time-step, because all sorts of things cancel. Floating point errors are different, and hence might blow up in our face. If you say that this is not an issue, then I will believe you; I do not remember the detailed proofs for approximate energy conservation, but I do remember writing into the margins of my textbook that its proof was not directly applicable to floating point errors. This does not preclude better theorems being available in the literature; as I said, symplectic integrators are not my field.
(approximate evaluations of an exact function are not the same as exact evaluations of an approximate function, especially if this function and the approximate function have extra structure like smoothness, analyticity, equivariance under some group, symplecticity, exactness (as a differential form), closedness (exact and no de-rham)).

This question becomes very relevant if we perform the square root with the ~15 bits of precision given by RSQRTSS.