Why is my ForwardDiff derivative calculation 5x as slow as just evaluating the function, when allocations are only 2x?

So, I am trying to understand what impacts the speed of calculating the derivative of a function using ForwardDiff. I am evaluating the mean of several simulated paths of a stochastic variable. However, evaluating the derivative of the function that computes the mean with respect to some parameter a is 5x as slow as just evaluating the mean function itself, even though @btime shows that memory allocation is only ~2x higher. So what causes it to be 5x as slow? (I have read elsewhere that I should only expect about a 2x slowdown.)

a = -0.43;
@btime mean = computing_volterra_mean(a)
g(x) = ForwardDiff.derivative(computing_volterra_mean, x)
@btime deriv_mean = g(a)

112.625 ms (48002 allocations: 36.90 MiB)

552.525 ms (75011 allocations: 64.84 MiB)

Code:

using ForwardDiff
using DualNumbers
using BenchmarkTools
using DiffBase


function b(a, k)
  return (k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0);
end

function get_sigma_Z1(n)
  return 1.0/sqrt(n);
end

function get_sigma_Z2(a,n)
  return 1.0/sqrt(((2.0*a+1.0)*n^(2.0*a+1.0)))
end

function get_rho(a)
  return sqrt(2.0*a+1.0)/(a+1.0)
end

function gen_G(a,n,s)
  G = zeros(eltype(a),s+1);
  for i = 2:s+1
    G[i] = b(a,Float64(i))/(n^a)
  end
  return G;
end

function direct_convolution(x, y, s, a)
    c = zeros(eltype(a),2*s)
    @inbounds @simd for j = 1:s
        @inbounds @simd for k = 1:(s+1)
            c[j+k-1] = c[j+k-1] + x[j]*y[k]
        end
    end
    return c
end

function gen_volterra(a,s,n,G,rand_1,rand_2,rho,sigma_Z1,sigma_Z2)
  Z_1 = zeros(eltype(a),s); # 1st normal
  Z_2 = zeros(eltype(a),s); # 2nd normal

  @inbounds @simd for i = 1:s
    Z_1[i] = sigma_Z1*rand_1[i] # set correct variance of random variate
    Z_2[i] = sigma_Z2*(rho*rand_1[i]+sqrt(1.0-rho^2.0)*rand_2[i]); # set correct variance of random variate
  end
  return [0;(direct_convolution(Z_1,G,s,a)[1:s]+Z_2)];
end

function computing_volterra_mean(a)
  num_paths = 1000;
  n = 500;
  s = 500;

  G = gen_G(a,n,s); # generate kernel
  rho = get_rho(a); # precompute rho and the stddevs
  sigma_Z1 = get_sigma_Z1(n);
  sigma_Z2 = get_sigma_Z2(a,n)
  result = zero(a);

  @inbounds @simd for i = 1:num_paths
    rand1 = randn(s);
    rand2 = randn(s);
    result += gen_volterra(a,s,n,G,rand1,rand2,rho,sigma_Z1,sigma_Z2)[s+1]
  end

  return result/num_paths;
end

Try profiling, especially with ProfileView.jl, paying attention to red segments (type instability).
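A minimal profiling session could look roughly like this (assuming ProfileView.jl is installed; on recent Julia versions Profile is a stdlib):

using Profile, ProfileView

g(a)                 # run once first so compilation is not part of the profile
Profile.clear()
@profile g(a)        # collect samples for the derivative call
ProfileView.view()   # red bars correspond to runtime dispatch / type instability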

Thanks for the tip @Tamas_Papp. I was not aware of the ProfileView.jl package, so I am not experienced in reading these profile charts, but as far as I have understood, ‘red’ bars mean type instability. Below are the profiles of the function evaluation and the derivative calculation. There is not much difference between them, but maybe you see something I don’t?

Profile of computing_volterra_mean(a):

Profile of deriv_mean = g(a):

You won’t get SIMD with dual numbers unless you start Julia with -O3 (see the Advanced Usage Guide in the ForwardDiff documentation).

Depending on whether your bottleneck is actually in the tight loops or not, it might help.
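If you want to double-check what the IDE actually passed to the runtime, something along these lines should work (JLOptions is an internal options struct, so treat this as a convenience check rather than a stable API):

# started from a shell this would be:  julia -O3 myscript.jl
Base.JLOptions().opt_level   # should report 3 if the session is effectively running with -O3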

Now you have to work through your code to make sure it is type stable and optimize the bottlenecks. gen_volterra, for example, may not be type stable because you are concatenating a 0 (not a zero(...)). Type stability is usually necessary, but not sufficient, for optimal performance; see the performance tips in the manual.
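For example, the last line of gen_volterra could be written along these lines (untested sketch) so the concatenated element already has the right type:

# before:
#   return [0;(direct_convolution(Z_1,G,s,a)[1:s]+Z_2)];
# after: concatenate a zero of the same element type instead of an Int literal
return [zero(eltype(a)); direct_convolution(Z_1,G,s,a)[1:s] + Z_2];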

I am using Atom, but I have already set the optimisation level to 3. Is this equal to starting Julia with -O3?

Most likely.

Concatenations of numbers and arrays promote to a common element type. And the allocations when switching to dual numbers have only grown by a factor of ~2, which seems reasonable since a dual number occupies twice the size.
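A quick way to check the promotion (illustrative snippet):

using ForwardDiff

v = [1.5, 2.5]
eltype([0; v])                  # Float64 -- the Int literal 0 is promoted
d = ForwardDiff.Dual(1.0, 1.0)  # a dual number carrying one partial
eltype([0; [d, d]])             # the Dual type -- again a single concrete element type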

Why are you assuming that there is a type instability?

Looks like direct_convolution is the bottleneck and is just ~4x slower for Dual numbers.
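One way to check that in isolation is to time the kernel with plain floats and with 1-partial duals (a sketch; the Dual setup below is just illustrative, not exactly what ForwardDiff.derivative constructs internally):

using ForwardDiff, BenchmarkTools

s  = 500
x  = randn(s);  y = randn(s + 1)
xd = ForwardDiff.Dual.(x, 1.0)   # the same data wrapped in 1-partial duals
yd = ForwardDiff.Dual.(y, 1.0)

@btime direct_convolution($x,  $y,  $s, 1.0)
@btime direct_convolution($xd, $yd, $s, ForwardDiff.Dual(1.0, 1.0))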

Mistake on my part.

Why would it be 4x slower with dual numbers and not 2x?