Why is my ForwardDiff derivative caluculation 5x as slow as just evaluating the function, when allocations are just 2x?

vgdev · August 25, 2017, 11:12am

So, I am trying to understand what impacts the speed of calculating the derivative of a function using ForwardDiff. I am evaluating the mean of several simulated paths of a stochastic variable. However, evaluating the derivative of the function calculating the mean with respect to some parameter a is 5x as slow as just evaluating the mean function it self, but from @btime I see that memory allocation is just ~2x. Thus, I am wondering what causes is to be 5x as slow? (I have read elsewhere that I should only expect about x2 slowdown).

a = -0.43;
@btime mean = computing_volterra_mean(a)
g(x) = ForwardDiff.derivative(computing_volterra_mean, x)
@btime deriv_mean = g(a)

112.625 ms (48002 allocations: 36.90 MiB)

552.525 ms (75011 allocations: 64.84 MiB)

Code:

using ForwardDiff
using DualNumbers
using BenchmarkTools
using DiffBase


> function b(a, k)
>   return (k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0);
> end
> 
> function get_sigma_Z1(n)
>   return 1.0/sqrt(n);
> end
> 
> function get_sigma_Z2(a,n)
>   return 1.0/sqrt(((2.0*a+1.0)*n^(2.0*a+1.0)))
> end
> 
> function get_rho(a)
>   return sqrt(2.0*a+1.0)/(a+1.0)
> end
> 
> function gen_G(a,n,s)
>   G =zeros(eltype(a),s+1);
>   for  i =2:s+1
>     G[i] = b(a,Float64(i))/(n^a)
>   end
>   return G;
> end
> 
> function direct_convolution(x, y, s, a)
>     c = zeros(eltype(a),2*s)
>      @inbounds @simd for j=1:s
>          @inbounds @simd for k=1:(s+1)
>             c[j+k-1] = c[j+k-1]+ x[j]*y[k]
>         end
>     end
>     return c
> end
> 
> function gen_volterra(a,s,n,G,rand_1,rand_2,rho, sigma_Z1,sigma_Z2)
>   Z_1 = zeros(eltype(a),s); #1st normal
>   Z_2 = zeros(eltype(a),s); # 2nd normal
> 
>    @inbounds @simd for i = 1:s
>      Z_1[i] = sigma_Z1*rand_1[i] #set correct variance of random variate
>      Z_2[i] = sigma_Z2*(rho*rand_1[i]+sqrt(1.0-rho^2.0)*rand_2[i]); #set correct variance of random variate
>   end
>   return [0;(direct_convolution(Z_1,G,s,a)[1:s]+Z_2)];
> end
> 
> function computing_volterra_mean(a)
>   num_paths =1000;
>   n =500;
>   s =500;
> 
>   G = gen_G(a,n,s); #Generate kernel
>   rho = get_rho(a); # precomute rho, and stddevs
>   sigma_Z1 =get_sigma_Z1(n);
>   sigma_Z2 = get_sigma_Z2(a,n)
>   result =zero(a);
> 
>   @inbounds @simd for i =1:num_paths
>     rand1 =randn(s);
>     rand2 =randn(s);
>     result+=gen_volterra(a,s,n,G,rand1,rand2,rho,sigma_Z1,sigma_Z2)[s+1]
>   end
> 
>   return result/num_paths;
> end

Tamas_Papp · August 25, 2017, 12:04pm

Try profiling, especially with ProfileView.jl, paying attention to red segments (type instability).

vgdev · August 25, 2017, 1:51pm

Thanks for the tip @Tamas_Papp. Not aware of the ProfileView.jl package. As such I am not experienced in reading these profile charts, but as far as I have understood ‘red’ bars means type instability. Below are the profiles of the function evaluation and the derivative calculation. Its not much difference, but maybe you see something I dont?

Profile of computing_volterra_mean(a):

Profile of deriv_mean = g(a):

kristoffer.carlsson · August 25, 2017, 1:54pm

You won’t get SIMD with dual numbers unless you start Julia with -O3. Advanced Usage Guide · ForwardDiff

Depending on if your bottle neck is actually in the tight loops or not, it might help.

Tamas_Papp · August 25, 2017, 2:07pm

Now you have to work through your code to make sure it is type stable and optimize bottlenecks. gen_volterra, for example, may not be type stable because you are concatenating a 0 (not a zero(...)). Type stability is usually necessary, but not sufficient for optimal performance, see the performance tips in the manual.

vgdev · August 25, 2017, 2:07pm

I am using Atom, but I have set the Optimisation level to 3 already. Is this equal to staring Julia with -O3 ?

kristoffer.carlsson · August 25, 2017, 2:14pm

Most likely.

kristoffer.carlsson · August 25, 2017, 2:17pm

Concatenations of numbers and arrays promote to a common element type. The allocations going to dual numbers has only allocated by a factor of 2 which seems reasonable since a dual number occupies twice the size.

Why are you assuming that there is a type instability?

kristoffer.carlsson · August 25, 2017, 2:27pm

Looks like direct_convolution is the bottle neck and is just ~4x slower for Dual numbers.

Tamas_Papp · August 25, 2017, 2:29pm

Mistake on my part.

vgdev · August 25, 2017, 3:50pm

Why would it be x4 slower with dual numbers and not x2?