Compiler optimizations on integer exponentation

The following code contains two functions that compute Lennard-Jones potentials, U(r) = 4\varepsilon \left(\frac{\sigma^{12}}{r^{12}} - \frac{\sigma^6}{r^6}\right) for all pairs of a set of 1000 particles.

The first function is the “obvious” one, which directly translates the formula above. In the second one I manually optimize the computation of the powers of \sigma and r .

The difference is way too large:

 77.724 ms (0 allocations: 0 bytes)
  1.802 ms (0 allocations: 0 bytes)
Code
using StaticArrays, LinearAlgebra, BenchmarkTools, Test

function lj1(x)
  eps = 2.0
  sig = 1.0
  u = 0.
  for i in 1:length(x)-1
    for j in i+1:length(x)
      r = norm(x[j] - x[i])
      u += 4*eps*(sig^12/r^12 - sig^6/r^6)
    end
  end
  return u
end

function lj2(x)
  eps = 2.0
  sig = 1.0
  sig6 = sig^6
  sig12 = sig6^2
  u = 0.
  for i in 1:length(x)-1
    for j in i+1:length(x)
      r = norm(x[j] - x[i])
      r3 = r^2*r
      r6 = r3^2
      r12 = r6^2
      u += 4*eps*(sig12/r12 - sig6/r6)
    end
  end
  return u
end


x = rand(SVector{2,Float64},1000)

@test lj1(x) ≈ lj2(x)

@btime lj1($x)

@btime lj2($x)

I would not be very concerned about this if this was actually a very performance-critical code, manually optimizing those exponents is not a big drama.

However, I am writing this code for a demonstration, hands-on, course on molecular dynamics, so having a code as close as possible as the text-formulas is important. In this case, in particular, the difference is the simulation taking 1 minute vs. taking 20 minutes, thus completely changing the flow of the course.

As a side-note, gfortran does seem to optimize these exponentiations, and the code is fast in the “natural” version.

Thus, my questions are: is there any important/fundamental that is impairing Julia to optimize these exponents, is it just an implementation detail that is on the list, or what?

See this open PR (inline ^ for literal powers of numbers by stevengj · Pull Request #20637 · JuliaLang/julia · GitHub) and the FastPow.jl package (https://github.com/JuliaMath/FastPow.jl).

(Coincidentally, it seems like it would make sense to rewrite sigma^n/r^n to (sigma/r)^n both for performance and to avoid overflow at high powers)

The optimized exponentiations are less accurate, which is part of the reason why this did not get implemented in Base (see inline ^ for literal powers of numbers by stevengj · Pull Request #20637 · JuliaLang/julia · GitHub). However, LLVM implements it for some types (hardware floating-point types) in @fastmath mode.

However, I have a package FastPow.jl that provides a macro @fastpow to optionally enable this in a block of code.

julia> using FastPow

julia> @fastpow foo(x) = 3x^-12 - 4x^-6

If you inspect the code with @code_llvm foo(1.0), it shows that it computes the two exponents with the optimal 1 division and 4 multiplications, just like your hand-coded routine. (Actually, your hand-coded routine uses 2 divisions.)

In this case, the compiler was able to pull out the common subexpressions from the two exponents. More generally, optimally computing a set of exponents is a hard, indeed NP-hard, problem (see sets of exponents · Issue #4 · JuliaMath/FastPow.jl · GitHub).

I was wandering about that possibility, thanks. Indeed, @fastpow is slightly faster than my hand-make code:

  1.487 ms (0 allocations: 0 bytes)
  1.801 ms (0 allocations: 0 bytes)