Compiler optimizations on integer exponentation

The following code contains two functions that compute Lennard-Jones potentials, U(r) = 4\varepsilon \left(\frac{\sigma^{12}}{r^{12}} - \frac{\sigma^6}{r^6}\right) for all pairs of a set of 1000 particles.

The first function is the “obvious” one, which directly translates the formula above. In the second one I manually optimize the computation of the powers of \sigma and r .

The difference is way too large:

 77.724 ms (0 allocations: 0 bytes)
  1.802 ms (0 allocations: 0 bytes)
Code
using StaticArrays, LinearAlgebra, BenchmarkTools, Test

function lj1(x)
  eps = 2.0
  sig = 1.0
  u = 0.
  for i in 1:length(x)-1
    for j in i+1:length(x)
      r = norm(x[j] - x[i])
      u += 4*eps*(sig^12/r^12 - sig^6/r^6)
    end
  end
  return u
end

function lj2(x)
  eps = 2.0
  sig = 1.0
  sig6 = sig^6
  sig12 = sig6^2
  u = 0.
  for i in 1:length(x)-1
    for j in i+1:length(x)
      r = norm(x[j] - x[i])
      r3 = r^2*r
      r6 = r3^2
      r12 = r6^2
      u += 4*eps*(sig12/r12 - sig6/r6)
    end
  end
  return u
end


x = rand(SVector{2,Float64},1000)

@test lj1(x) ≈ lj2(x)

@btime lj1($x)

@btime lj2($x)

I would not be very concerned about this if this was actually a very performance-critical code, manually optimizing those exponents is not a big drama.

However, I am writing this code for a demonstration, hands-on, course on molecular dynamics, so having a code as close as possible as the text-formulas is important. In this case, in particular, the difference is the simulation taking 1 minute vs. taking 20 minutes, thus completely changing the flow of the course.

As a side-note, gfortran does seem to optimize these exponentiations, and the code is fast in the “natural” version.

Thus, my questions are: is there any important/fundamental that is impairing Julia to optimize these exponents, is it just an implementation detail that is on the list, or what?

See this open PR (inline ^ for literal powers of numbers by stevengj · Pull Request #20637 · JuliaLang/julia · GitHub) and the FastPow.jl package (GitHub - JuliaMath/FastPow.jl: optimal addition-chain exponentiation for Julia).

(Coincidentally, it seems like it would make sense to rewrite sigma^n/r^n to (sigma/r)^n both for performance and to avoid overflow at high powers)

3 Likes

The optimized exponentiations are less accurate, which is part of the reason why this did not get implemented in Base (see inline ^ for literal powers of numbers by stevengj · Pull Request #20637 · JuliaLang/julia · GitHub). However, LLVM implements it for some types (hardware floating-point types) in @fastmath mode.

However, I have a package FastPow.jl that provides a macro @fastpow to optionally enable this in a block of code.

julia> using FastPow

julia> @fastpow foo(x) = 3x^-12 - 4x^-6

If you inspect the code with @code_llvm foo(1.0), it shows that it computes the two exponents with the optimal 1 division and 4 multiplications, just like your hand-coded routine. (Actually, your hand-coded routine uses 2 divisions.)

In this case, the compiler was able to pull out the common subexpressions from the two exponents. More generally, optimally computing a set of exponents is a hard, indeed NP-hard, problem (see sets of exponents · Issue #4 · JuliaMath/FastPow.jl · GitHub).

5 Likes

I was wandering about that possibility, thanks. Indeed, @fastpow is slightly faster than my hand-make code:

  1.487 ms (0 allocations: 0 bytes)
  1.801 ms (0 allocations: 0 bytes)