Performance Regression on Micro-Benchmark (Calculus, ForwardDiff)

I noticed an unexpected performance regression when porting a package from 0.6.4 to 0.7 and reduced it to the example script below; the output on v0.6.4, v0.7 and v1.0 follows. The performance of ForwardDiff improves remarkably, while the symbolic differentiation via Calculus.differentiate deteriorates significantly. The reason is clear: the generated derivative expression contains lots of multiplications by zero that are not optimised away. Is there an optimisation that I should perform?

using ForwardDiff, Calculus, BenchmarkTools

e0, A, r0 = 1.234, 3.456, 1.012
fex = :( $e0 * (exp(-2*$A*(r/$r0-1.0)) - 2.0*exp(-$A*(r/$r0-1.0))) )   # parameters interpolated as literals
f = eval(:( r -> $fex ))
dfex = Calculus.differentiate(fex, :r)   # symbolic derivative w.r.t. r
df = eval(:( r -> $dfex ))

x = 1.0+rand()
print("     f: "); @btime ($f($x))
print("    f': "); @btime ($df($x))
print("FwDiff: "); @btime ForwardDiff.derivative($f, $x)
println("\nCalculus.differentiate expression:"); @show dfex

OUTPUT (j6 = v0.6.4, j7 = v0.7.0, j = v1.0.3):

Fuji-2:scratch ortner$ j6 -O3 calculus_test.jl
     f:   18.502 ns (0 allocations: 0 bytes)
    f':   17.873 ns (0 allocations: 0 bytes)
FwDiff:   73.476 ns (2 allocations: 32 bytes)
Calculus.differentiate expression:
dfex = :(1.234 * (-6.830039525691699 * exp(-6.912 * (r / 1.012 - 1.0)) - 2.0 * (-3.4150197628458496 * exp(-3.456 * (r / 1.012 - 1.0)))))

Fuji-2:scratch ortner$ j7 -O3 calculus_test.jl
     f:   17.845 ns (0 allocations: 0 bytes)
    f':   42.202 ns (0 allocations: 0 bytes)
FwDiff:   21.106 ns (0 allocations: 0 bytes)
Calculus.differentiate expression:
dfex = :(0 * (exp(-2 * 3.456 * (r / 1.012 - 1.0)) - 2.0 * exp(-3.456 * (r / 1.012 - 1.0))) + 1.234 * ((0 * 3.456 * (r / 1.012 - 1.0) + -2 * 0 * (r / 1.012 - 1.0) + -2 * 3.456 * (1 / 1.012)) * exp(-2 * 3.456 * (r / 1.012 - 1.0)) - (0 * exp(-3.456 * (r / 1.012 - 1.0)) + 2.0 * ((0 * (r / 1.012 - 1.0) + -3.456 * (1 / 1.012)) * exp(-3.456 * (r / 1.012 - 1.0))))))

Fuji-2:scratch ortner$ j -O3 calculus_test.jl
     f:   17.797 ns (0 allocations: 0 bytes)
    f':   42.823 ns (0 allocations: 0 bytes)
FwDiff:   21.431 ns (0 allocations: 0 bytes)
Calculus.differentiate expression:
dfex = :(0 * (exp(-2 * 3.456 * (r / 1.012 - 1.0)) - 2.0 * exp(-3.456 * (r / 1.012 - 1.0))) + 1.234 * ((0 * 3.456 * (r / 1.012 - 1.0) + -2 * 0 * (r / 1.012 - 1.0) + -2 * 3.456 * (1 / 1.012)) * exp(-2 * 3.456 * (r / 1.012 - 1.0)) - (0 * exp(-3.456 * (r / 1.012 - 1.0)) + 2.0 * ((0 * (r / 1.012 - 1.0) + -3.456 * (1 / 1.012)) * exp(-3.456 * (r / 1.012 - 1.0))))))

Multiplying by zero might not get optimized away if it is possible for the other operand to be NaN or Inf (see the quick REPL check after the IR below). If you enable @fastmath, these multiplications should get optimized away:

> df_fast = eval(:(@fastmath r -> $dfex ))
#20 (generic function with 1 method)

> @code_llvm df_fast(x)
define double @"julia_#20_13198"(double) #0 {
top:
  %1 = fmul fast double %0, 0x401B51F5E1A4EECC
  %2 = fsub fast double 6.912000e+00, %1
  %3 = call double @julia_exp_13025(double %2)
  %4 = fmul fast double %0, 0x400B51F5E1A4EECC
  %5 = fsub fast double 3.456000e+00, %4
  %6 = call double @julia_exp_13025(double %5)
  %7 = call double @julia_exp_13025(double %2)
  %8 = call double @julia_exp_13025(double %5)
  %9 = call double @julia_exp_13025(double %5)
  %10 = fsub fast double %7, %9
  %11 = fmul fast double %10, 0xC020DB460B7A7FE2
  ret double %11
}
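
The reason the compiler cannot do this on its own is that, under IEEE semantics, multiplying by zero does not always give zero, so the 0 * exp(...) terms in dfex really do have to be evaluated:

> 0.0 * Inf
NaN

> 0.0 * NaN
NaN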

For best performance (assuming purity), we should also run CSE (common subexpression elimination) to fold all those calls to exp with the same argument, using CommonSubexpressions.jl (https://github.com/rdeits/CommonSubexpressions.jl):

> using CommonSubexpressions

> dfex_cse = CommonSubexpressions.cse(Calculus.differentiate(fex, :r));

> df_cse = eval(:(@fastmath r -> $dfex_cse ));

> @code_llvm df_cse(x)
define double @"julia_#28_13557"(double) #0 {
top:
  %1 = fmul fast double %0, 0x401B51F5E1A4EECC
  %2 = fsub fast double 6.912000e+00, %1
  %3 = call double @julia_exp_13025(double %2)
  %4 = fmul fast double %0, 0x400B51F5E1A4EECC
  %5 = fsub fast double 3.456000e+00, %4
  %6 = call double @julia_exp_13025(double %5)
  %7 = fsub fast double %3, %6
  %8 = fmul fast double %7, 0xC020DB460B7A7FE2
  ret double %8
}
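
To confirm that this recovers the lost performance, the same micro-benchmark can be re-run on the transformed functions (timings omitted here; they will depend on the machine):

> @btime $df_fast($x)
> @btime $df_cse($x)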

Thank you for the useful suggestions; I’ll explore these right away.

Regarding @fastmath: I was under the impression that this may cause numerical instability, since it gives up IEEE semantics?

Calculus.jl already tries to do these simplifications: https://github.com/JuliaMath/Calculus.jl/blob/4da2f9b6d29d3c30bb761b9dcab9e6528492d231/src/symbolic.jl. It just happens to fail on 0.7+ for some reason.
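
Until that is fixed, one possible stopgap is to fold the zero terms out of the generated expression before eval'ing it. Below is a minimal sketch (fold_zeros and iszeroliteral are just names made up for this illustration, not Calculus.jl's own simplify); note that, like @fastmath, it assumes 0 * x == 0, which is not IEEE-correct when x can be Inf or NaN:

using Calculus

# Minimal zero-folding pass over an Expr tree (illustrative sketch only).
iszeroliteral(x) = x isa Number && x == 0

function fold_zeros(ex)
    ex isa Expr || return ex
    args = map(fold_zeros, ex.args)
    if ex.head == :call && length(args) >= 3
        op = args[1]
        if op == :* && any(iszeroliteral, args[2:end])
            return 0.0                              # 0 * x -> 0 (not IEEE-safe if x is Inf/NaN)
        elseif op == :+
            rest = filter(a -> !iszeroliteral(a), args[2:end])
            isempty(rest) && return 0.0
            return length(rest) == 1 ? rest[1] : Expr(:call, :+, rest...)
        elseif op == :- && length(args) == 3 && iszeroliteral(args[3])
            return args[2]                          # x - 0 -> x
        end
    end
    return Expr(ex.head, args...)
end

dfex_folded = fold_zeros(dfex)
df_folded   = eval(:( r -> $dfex_folded ))

Applied to the dfex above, this should leave only the two distinct exp calls, so the resulting function stays close to the @fastmath + CSE version while keeping IEEE semantics for all the remaining arithmetic.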