Performance Regression on Micro-Benchmark (Calculus, ForwardDiff)

I noticed an unexpected performance regression when porting a package from 0.6.4 to 0.7 and reduced it to the example script below; the output on v0.6.4, v0.7 and v1.0 follows. The performance of ForwardDiff improves remarkably, while the symbolic differentiation via Calculus.differentiate deteriorates significantly. The reason is clear: the generated derivative expression contains lots of multiplications by zero that are not optimised away. Is there an optimisation that I should perform?

using ForwardDiff, Calculus, BenchmarkTools

e0, A, r0 = 1.234, 3.456, 1.012
fex = :( $e0 * (exp(-2*$A*(r/$r0-1.0)) - 2.0*exp(-$A*(r/$r0-1.0))) )   # parameters interpolated as literals
f = eval(:( r -> $fex ))
dfex = Calculus.differentiate(fex, :r)   # symbolic derivative w.r.t. r
df = eval(:( r -> $dfex ))

x = 1.0+rand()
print("     f: "); @btime ($f($x))
print("    f': "); @btime ($df($x))
print("FwDiff: "); @btime ForwardDiff.derivative($f, $x)
println("\nCalculus.differentiate expression:"); @show dfex

OUTPUT (j6 = v0.6.4, j7 = v0.7.0, j = v1.0.3):

Fuji-2:scratch ortner$ j6 -O3 calculus_test.jl
     f:   18.502 ns (0 allocations: 0 bytes)
    f':   17.873 ns (0 allocations: 0 bytes)
FwDiff:   73.476 ns (2 allocations: 32 bytes)
Calculus.differentiate expression:
dfex = :(1.234 * (-6.830039525691699 * exp(-6.912 * (r / 1.012 - 1.0)) - 2.0 * (-3.4150197628458496 * exp(-3.456 * (r / 1.012 - 1.0)))))

Fuji-2:scratch ortner$ j7 -O3 calculus_test.jl
     f:   17.845 ns (0 allocations: 0 bytes)
    f':   42.202 ns (0 allocations: 0 bytes)
FwDiff:   21.106 ns (0 allocations: 0 bytes)
Calculus.differentiate expression:
dfex = :(0 * (exp(-2 * 3.456 * (r / 1.012 - 1.0)) - 2.0 * exp(-3.456 * (r / 1.012 - 1.0))) + 1.234 * ((0 * 3.456 * (r / 1.012 - 1.0) + -2 * 0 * (r / 1.012 - 1.0) + -2 * 3.456 * (1 / 1.012)) * exp(-2 * 3.456 * (r / 1.012 - 1.0)) - (0 * exp(-3.456 * (r / 1.012 - 1.0)) + 2.0 * ((0 * (r / 1.012 - 1.0) + -3.456 * (1 / 1.012)) * exp(-3.456 * (r / 1.012 - 1.0))))))

Fuji-2:scratch ortner$ j -O3 calculus_test.jl
     f:   17.797 ns (0 allocations: 0 bytes)
    f':   42.823 ns (0 allocations: 0 bytes)
FwDiff:   21.431 ns (0 allocations: 0 bytes)
Calculus.differentiate expression:
dfex = :(0 * (exp(-2 * 3.456 * (r / 1.012 - 1.0)) - 2.0 * exp(-3.456 * (r / 1.012 - 1.0))) + 1.234 * ((0 * 3.456 * (r / 1.012 - 1.0) + -2 * 0 * (r / 1.012 - 1.0) + -2 * 3.456 * (1 / 1.012)) * exp(-2 * 3.456 * (r / 1.012 - 1.0)) - (0 * exp(-3.456 * (r / 1.012 - 1.0)) + 2.0 * ((0 * (r / 1.012 - 1.0) + -3.456 * (1 / 1.012)) * exp(-3.456 * (r / 1.012 - 1.0))))))

Multiplying by zero might not get optimized away if it is possible for the other operand to be NaN or Inf (see the quick REPL check after the IR below). If you enable @fastmath, these multiplications should get optimized away:

> df_fast = eval(:(@fastmath r -> $dfex ))
#20 (generic function with 1 method)

> @code_llvm df_fast(x)
define double @"julia_#20_13198"(double) #0 {
top:
  %1 = fmul fast double %0, 0x401B51F5E1A4EECC
  %2 = fsub fast double 6.912000e+00, %1
  %3 = call double @julia_exp_13025(double %2)
  %4 = fmul fast double %0, 0x400B51F5E1A4EECC
  %5 = fsub fast double 3.456000e+00, %4
  %6 = call double @julia_exp_13025(double %5)
  %7 = call double @julia_exp_13025(double %2)
  %8 = call double @julia_exp_13025(double %5)
  %9 = call double @julia_exp_13025(double %5)
  %10 = fsub fast double %7, %9
  %11 = fmul fast double %10, 0xC020DB460B7A7FE2
  ret double %11
}
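
The reason the compiler cannot do this on its own is that, under IEEE semantics, multiplying by zero does not always give zero, so the 0 * exp(...) terms in dfex really do have to be evaluated:

> 0.0 * Inf
NaN

> 0.0 * NaN
NaN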

For best performance (assuming purity), we should also run CSE (common subexpression elimination) to fold all those calls to exp with the same argument, using CommonSubexpressions.jl (https://github.com/rdeits/CommonSubexpressions.jl):

> using CommonSubexpressions

> dfex_cse = CommonSubexpressions.cse(Calculus.differentiate(fex, :r));

> df_cse = eval(:(@fastmath r -> $dfex_cse ));

> @code_llvm df_cse(x)
define double @"julia_#28_13557"(double) #0 {
top:
  %1 = fmul fast double %0, 0x401B51F5E1A4EECC
  %2 = fsub fast double 6.912000e+00, %1
  %3 = call double @julia_exp_13025(double %2)
  %4 = fmul fast double %0, 0x400B51F5E1A4EECC
  %5 = fsub fast double 3.456000e+00, %4
  %6 = call double @julia_exp_13025(double %5)
  %7 = fsub fast double %3, %6
  %8 = fmul fast double %7, 0xC020DB460B7A7FE2
  ret double %8
}
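
To confirm that this recovers the lost performance, the same micro-benchmark can be re-run on the transformed functions (timings omitted here; they will depend on the machine):

> @btime $df_fast($x)
> @btime $df_cse($x)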

Thank you for the useful suggestions; I’ll explore these right away.

Regarding @fastmath: I was under the impression that this may cause numerical instability, since it gives up IEEE semantics?

Calculus.jl already tries to do these simplifications: https://github.com/JuliaMath/Calculus.jl/blob/4da2f9b6d29d3c30bb761b9dcab9e6528492d231/src/symbolic.jl. It just happens to fail on 0.7+ for some reason.
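
Until that is fixed, one possible stopgap is to fold the zero terms out of the generated expression before eval'ing it. Below is a minimal sketch (fold_zeros and iszeroliteral are just names made up for this illustration, not Calculus.jl's own simplify); note that, like @fastmath, it assumes 0 * x == 0, which is not IEEE-correct when x can be Inf or NaN:

using Calculus

# Minimal zero-folding pass over an Expr tree (illustrative sketch only).
iszeroliteral(x) = x isa Number && x == 0

function fold_zeros(ex)
    ex isa Expr || return ex
    args = map(fold_zeros, ex.args)
    if ex.head == :call && length(args) >= 3
        op = args[1]
        if op == :* && any(iszeroliteral, args[2:end])
            return 0.0                              # 0 * x -> 0 (not IEEE-safe if x is Inf/NaN)
        elseif op == :+
            rest = filter(a -> !iszeroliteral(a), args[2:end])
            isempty(rest) && return 0.0
            return length(rest) == 1 ? rest[1] : Expr(:call, :+, rest...)
        elseif op == :- && length(args) == 3 && iszeroliteral(args[3])
            return args[2]                          # x - 0 -> x
        end
    end
    return Expr(ex.head, args...)
end

dfex_folded = fold_zeros(dfex)
df_folded   = eval(:( r -> $dfex_folded ))

Applied to the dfex above, this should leave only the two distinct exp calls, so the resulting function stays close to the @fastmath + CSE version while keeping IEEE semantics for all the remaining arithmetic.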