# Performance Regression on Micro-Benchmark (Calculus, ForwardDiff)

I noticed an unexpected performance regression when porting a package from Julia 0.6.4 to 0.7 and reduced it to the script below; outputs on v0.6.4, v0.7.0, and v1.0.3 follow. The performance of `ForwardDiff` is remarkably improved, while the symbolic differentiation via `Calculus.differentiate` deteriorates significantly. The reason is clear: the derivative expression contains many multiplications by zero that are not optimised away. Is there an optimisation I should perform?

```julia
using ForwardDiff, Calculus, BenchmarkTools

e0, A, r0 = 1.234, 3.456, 1.012
fex = :( $e0 * (exp(-2*$A*(r/$r0-1.0)) - 2.0*exp(-$A*(r/$r0-1.0))) )
f = eval(:( r -> $fex ))
dfex = Calculus.differentiate(fex, :r)
df = eval(:( r -> $dfex ))

x = 1.0 + rand()
print("     f: "); @btime ($f($x))
print("    f': "); @btime ($df($x))
print("FwDiff: "); @btime ForwardDiff.derivative($f, $x)
println("\nCalculus.differentiate expression:"); @show dfex
```

Output (`j6` = v0.6.4, `j7` = v0.7.0, `j` = v1.0.3):

```text
Fuji-2:scratch ortner$ j6 -O3 calculus_test.jl
     f:   18.502 ns (0 allocations: 0 bytes)
    f':   17.873 ns (0 allocations: 0 bytes)
FwDiff:   73.476 ns (2 allocations: 32 bytes)
Calculus.differentiate expression:
dfex = :(1.234 * (-6.830039525691699 * exp(-6.912 * (r / 1.012 - 1.0)) - 2.0 * (-3.4150197628458496 * exp(-3.456 * (r / 1.012 - 1.0)))))

Fuji-2:scratch ortner$ j7 -O3 calculus_test.jl
     f:   17.845 ns (0 allocations: 0 bytes)
    f':   42.202 ns (0 allocations: 0 bytes)
FwDiff:   21.106 ns (0 allocations: 0 bytes)
Calculus.differentiate expression:
dfex = :(0 * (exp(-2 * 3.456 * (r / 1.012 - 1.0)) - 2.0 * exp(-3.456 * (r / 1.012 - 1.0))) + 1.234 * ((0 * 3.456 * (r / 1.012 - 1.0) + -2 * 0 * (r / 1.012 - 1.0) + -2 * 3.456 * (1 / 1.012)) * exp(-2 * 3.456 * (r / 1.012 - 1.0)) - (0 * exp(-3.456 * (r / 1.012 - 1.0)) + 2.0 * ((0 * (r / 1.012 - 1.0) + -3.456 * (1 / 1.012)) * exp(-3.456 * (r / 1.012 - 1.0))))))

Fuji-2:scratch ortner$ j -O3 calculus_test.jl
     f:   17.797 ns (0 allocations: 0 bytes)
    f':   42.823 ns (0 allocations: 0 bytes)
FwDiff:   21.431 ns (0 allocations: 0 bytes)
Calculus.differentiate expression:
dfex = :(0 * (exp(-2 * 3.456 * (r / 1.012 - 1.0)) - 2.0 * exp(-3.456 * (r / 1.012 - 1.0))) + 1.234 * ((0 * 3.456 * (r / 1.012 - 1.0) + -2 * 0 * (r / 1.012 - 1.0) + -2 * 3.456 * (1 / 1.012)) * exp(-2 * 3.456 * (r / 1.012 - 1.0)) - (0 * exp(-3.456 * (r / 1.012 - 1.0)) + 2.0 * ((0 * (r / 1.012 - 1.0) + -3.456 * (1 / 1.012)) * exp(-3.456 * (r / 1.012 - 1.0))))))
```

Multiplication by zero might not get optimized away if the other operand could be `NaN` or `Inf` (under IEEE arithmetic, `0 * Inf` is `NaN`, so `0 * x` is not identically zero). If you enable fastmath, these terms should get optimized away:

```julia
> df_fast = eval(:(@fastmath r -> $dfex ))
#20 (generic function with 1 method)

> @code_llvm df_fast(x)
define double @"julia_#20_13198"(double) #0 {
top:
  %1 = fmul fast double %0, 0x401B51F5E1A4EECC
  %2 = fsub fast double 6.912000e+00, %1
  %3 = call double @julia_exp_13025(double %2)
  %4 = fmul fast double %0, 0x400B51F5E1A4EECC
  %5 = fsub fast double 3.456000e+00, %4
  %6 = call double @julia_exp_13025(double %5)
  %7 = call double @julia_exp_13025(double %2)
  %8 = call double @julia_exp_13025(double %5)
  %9 = call double @julia_exp_13025(double %5)
  %10 = fsub fast double %7, %9
  %11 = fmul fast double %10, 0xC020DB460B7A7FE2
  ret double %11
}
```
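The underlying reason is easy to see at the REPL: under IEEE-754 semantics `0.0 * x` is not identically zero, so without the fastmath flags LLVM must keep those multiplications (a minimal sketch):

```julia
# Under IEEE-754 arithmetic, multiplying by zero does not always yield zero,
# so the compiler cannot fold `0 * expr` away without fastmath assumptions.
@show 0.0 * Inf     # NaN  -- zero times infinity is undefined
@show 0.0 * -1.0    # -0.0 -- the sign of zero can flip
@show 0.0 * 2.5     # 0.0  -- only safe for finite, positive operands
```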

For best performance (assuming purity), we should also run common subexpression elimination (CSE) to fold all those calls to `exp` with the same argument, using CommonSubexpressions.jl (rdeits/CommonSubexpressions.jl on GitHub):

```julia
> using CommonSubexpressions

> dfex_cse = CommonSubexpressions.cse(Calculus.differentiate(fex, :r));

> df_cse = eval(:(@fastmath r -> $dfex_cse ));

> @code_llvm df_cse(x)
define double @"julia_#28_13557"(double) #0 {
top:
  %1 = fmul fast double %0, 0x401B51F5E1A4EECC
  %2 = fsub fast double 6.912000e+00, %1
  %3 = call double @julia_exp_13025(double %2)
  %4 = fmul fast double %0, 0x400B51F5E1A4EECC
  %5 = fsub fast double 3.456000e+00, %4
  %6 = call double @julia_exp_13025(double %5)
  %7 = fsub fast double %3, %6
  %8 = fmul fast double %7, 0xC020DB460B7A7FE2
  ret double %8
}
```
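As a quick sanity check (a sketch reusing the definitions from the original script, with `cse` applied exactly as above), the fastmath + CSE derivative can be compared against `ForwardDiff`:

```julia
using ForwardDiff, Calculus, CommonSubexpressions

# Rebuild the example function and its CSE'd symbolic derivative
e0, A, r0 = 1.234, 3.456, 1.012
fex = :( $e0 * (exp(-2*$A*(r/$r0-1.0)) - 2.0*exp(-$A*(r/$r0-1.0))) )
f = eval(:( r -> $fex ))
dfex_cse = CommonSubexpressions.cse(Calculus.differentiate(fex, :r))
df_cse = eval(:( @fastmath r -> $dfex_cse ))

# The symbolic and AD derivatives should agree to rounding error
x = 1.0 + rand()
@assert isapprox(df_cse(x), ForwardDiff.derivative(f, x); rtol = 1e-8)
```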

Thank you for the useful suggestions, I’ll explore these right away.

Regarding `@fastmath`: I was under the impression that it may cause numerical instability, since it gives up IEEE semantics?

Calculus.jl already tries to do these simplifications (see `src/symbolic.jl` in JuliaMath/Calculus.jl at commit 4da2f9b6d29d3c30bb761b9dcab9e6528492d231 on GitHub). It just happens to fail on 0.7+ for some reason.
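Until that is fixed, one workaround is a small hand-rolled cleanup pass over the generated expression before `eval`-ing it. This is a sketch, not Calculus.jl's own `simplify`; it only handles the `0 * x`, `x + 0`, and `x - 0` patterns that show up above:

```julia
# Recursively strip the zero terms that Calculus.differentiate leaves behind.
iszeroex(x) = x isa Number && iszero(x)

simplify0(x) = x          # numbers and symbols pass through unchanged
function simplify0(ex::Expr)
    ex.head == :call || return ex
    args = map(simplify0, ex.args)
    op = args[1]
    if op == :* && any(iszeroex, args[2:end])
        return 0                               # 0 * x  ->  0
    elseif op == :+
        rest = filter(!iszeroex, args[2:end])  # drop zero summands
        isempty(rest)     && return 0
        length(rest) == 1 && return rest[1]
        return Expr(:call, :+, rest...)
    elseif op == :- && length(args) == 3 && iszeroex(args[3])
        return args[2]                         # x - 0  ->  x
    end
    return Expr(:call, args...)
end
```

Applying `simplify0` to `dfex` before building `df` with `eval` removes the `0 * (...)` products without needing `@fastmath`.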