Optim: What optimiser is best if your gradient computation is slow?



I have an objective function that is the result of N stochastic simulations, i.e. some function f with the sample mean of my simulations as input. I am using the ForwardDiff package to produce the gradient of my objective function with respect to three (3) parameters. However, the computation of the gradient is slow, so I was wondering: is it possible to say something about which optimizer in the Optim package gives you the most value for actually providing a gradient, compared to just using finite differences? Thanks!


How many parameters do you have? Is the function actually smooth or is it noisy?


In JuliaDiffEq we did some extensive testing of parameter estimation from stochastic differential equations via StochasticDiffEq.jl + DiffEqParamEstim.jl (encompassed in DifferentialEquations.jl) and found that NLopt.jl was able to much more efficiently reproduce the parameters in generated data than other methods I tried (including Optim.jl, Evolutionary.jl, BlackBoxOptim.jl, etc.). Global optimization methods were found to be pretty essential here. This was using a method-of-moments type of fitting with high-order adaptive integrators; using more error-prone methods like Euler-Maruyama or Milstein variants did lead to noticeable errors (or slower convergence by using larger Monte Carlo calculations of weak properties at each step). Autodifferentiation does work through the whole SDE solver, but we found that this wasn’t too beneficial: some of the global derivative-free optimizers did really well, at least to get to the general area before finishing with a local optimizer. At least that’s the result we got when attempting to replicate input data, but every model is different, so YMMV.


If you are doing indirect inference, you may want to fix the “noise” terms, making your function continuous, if that is possible.

Look at the advanced usage section of the manual and optimize your calls to ForwardDiff. Naive applications are often not type stable, so at the very least assert a ::Matrix{Float64} for the Jacobian; configuring the chunk size and SIMD often help a lot (independently).
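For a problem with 3 parameters, the chunk-size configuration is simple: a chunk of 3 computes the whole gradient in a single dual-number pass. A minimal sketch with a stand-in objective (the function and starting point here are illustrative, not from the thread):

```julia
using ForwardDiff

# Hypothetical stand-in objective with 3 parameters
f(x) = sum(abs2, x)
x0 = [-0.3, -0.5, 1.5]

# Chunk{3} evaluates all 3 partials in one pass; building the config once
# up front also avoids reallocating it on every gradient call.
cfg = ForwardDiff.GradientConfig(f, x0, ForwardDiff.Chunk{3}())
g = ForwardDiff.gradient(f, x0, cfg)  # == 2 .* x0 for this f
```

Reusing `cfg` across optimizer iterations is one of the cheapest wins when the gradient is called thousands of times.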

For local convergence, if you can obtain derivatives with AD, using an algorithm that takes advantage of them (like BFGS) is almost always worth it. Global convergence, especially if you have many local optima, is an orthogonal issue.
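In Optim.jl, hooking a ForwardDiff gradient into BFGS is a one-keyword change. A sketch on a toy objective (the Rosenbrock function stands in for the real one):

```julia
using Optim

# Toy objective; `autodiff = :forward` tells Optim to build the gradient
# via ForwardDiff instead of finite differences.
rosen(x) = (1.0 - x[1])^2 + 100.0*(x[2] - x[1]^2)^2
res = optimize(rosen, [-1.2, 1.0], BFGS(); autodiff = :forward)
Optim.minimizer(res)  # ≈ [1.0, 1.0]
```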


As I said, I have 3 parameters. The objective function is the RMSE between call options market data and the call options produced by the model, i.e. the sample mean of \max(S^{n}_T-K,0) where S^{n}_T is the simulated stock price for path n and K the strike.


However, what I am experiencing is that evaluating the objective function takes 0.1s, while calling ForwardDiff to produce the gradient takes 10s (using BenchmarkTools). So it is almost 100 times slower (?)

I did some test runs with BFGS with finite differences and BFGS with AD. The AD version obviously takes longer, but looking at the number of function/derivative calls, it is not converging faster than the FD run, and I cannot understand why.

BFGS with FD:

Results of Optimization Algorithm

  • Algorithm: Fminbox with BFGS
  • Starting Point: [-0.3,-0.5,1.5]
    (Lower = [-0.4999, -0.9999,1.00])
    (Upper = [0.0, 0.9999,3.00])
  • Minimizer: [-0.36851226394666786,-0.9880360255423059, 2.9999999999291562]
  • Minimum: 8.545390e-01
  • Iterations: 4
  • Convergence: true
    • |x - x’| < 1.0e-32: false
      |x - x’| = 6.87e-16
    • |f(x) - f(x’)| / |f(x)| < 1.0e-32: true
      |f(x) - f(x’)| / |f(x)| = 3.42e-14
    • |g(x)| < 1.0e-08: false
      |g(x)| = 4.07e+03
    • stopped by an increasing objective: false
    • Reached Maximum Number of Iterations: false
  • Objective Calls: 1384
  • Gradient Calls: 1384

BFGS with AD:

Results of Optimization Algorithm

  • Algorithm: Fminbox with BFGS
  • Starting Point: [-0.3,-0.5,1.5]
    (Lower = [-0.4999, -0.9999,1.00])
    (Upper = [0.0, 0.9999,3.00])
  • Minimizer: [-0.36599647187650547,-0.9875285450280383, 2.999999824307108]
  • Minimum: 8.544151e-01
  • Iterations: 4
  • Convergence: true
    • |x - x’| < 1.0e-32: false
      |x - x’| = 6.30e-16
    • |f(x) - f(x’)| / |f(x)| < 1.0e-32: true
      |f(x) - f(x’)| / |f(x)| = 1.86e-14
    • |g(x)| < 1.0e-08: false
      |g(x)| = 2.46e-01
    • stopped by an increasing objective: false
    • Reached Maximum Number of Iterations: false
  • Objective Calls: 1837
  • Gradient Calls: 1837


Try benchmarking and optimizing

g(x) = ForwardDiff.derivative(f, x)

paying close attention to type stability (Base.Test.@inferred, @code_warntype, etc), profile it, and then supply a function-and-derivative object to Optim (look at the manual). I found DiffBase.DiffResult particularly useful for a similar problem.
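For reference, the `DiffBase.DiffResult` object mentioned above now lives in the DiffResults.jl package (DiffBase was later split up). The point is that one AD pass fills in both the value and the gradient, so a function-and-derivative callback doesn't pay for two evaluations. A sketch with a stand-in objective:

```julia
using ForwardDiff, DiffResults

f(x) = sum(abs2, x)              # stand-in objective
x = [-0.3, -0.5, 1.5]

# One gradient! call populates both the primal value and the gradient.
res = DiffResults.GradientResult(x)
res = ForwardDiff.gradient!(res, f, x)
DiffResults.value(res)           # f(x)
DiffResults.gradient(res)        # ∇f(x) == 2 .* x for this f
```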


Also cool that the DiffResults can return all the lower-order derivatives.
Not that it matters much if the gradient calculation is 100x more expensive than the function, though.

@vgdev To expand on type stability, does your code ever do something like sum up values over a loop and start with a value of 0.0, 1.0, etc? Ie,

function f(arg::AbstractVector{T}, ...) where T
  x = 0.0          # x starts out as a Float64
  x += arg[i]      # x becomes a Dual here when T is a dual-number type
end

is not type stable when T <: dual numbers. You could instead use some variation of:

y = zero(T)
z = one(eltype(arg))

If things are 100x slower while using autodiff and your gradient only contains 3 values, it sounds like you may have introduced type instabilities that aren’t there when you normally call your function with Float64s.
I’d look the code over and make sure there isn’t anything like the “x” above.
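To make the problem concrete, here is a runnable before/after pair (the function names and data are illustrative):

```julia
using ForwardDiff

# Accumulator starts as Float64 and turns into a Dual inside the loop:
# the variable changes type mid-function, which kills performance under AD.
function sum_scaled_bad(a, arg)
  x = 0.0
  for v in arg
    x += a*v
  end
  return x
end

# Accumulator matches the parameter's type from the start, so it stays
# the same type (Dual or Float64) throughout.
function sum_scaled_good(a::T, arg) where T
  x = zero(T)
  for v in arg
    x += a*v
  end
  return x
end

d = ForwardDiff.derivative(a -> sum_scaled_good(a, [1.0, 2.0, 3.0]), 0.5)  # == 6.0
```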


@Elrod @Tamas_Papp I believe I have tried to take care of this, without speedup. I have found the place in my code that is causing the slowdown. But I am a rookie with respect to Julia and ForwardDiff, so I would very much appreciate some help with why just evaluating gen_volterra(a,s,n,gen_G(a,n,s),rand1,rand2) is ~100 times faster than getting the derivative with respect to the parameter a in the code below. Thank you

using ForwardDiff
using DualNumbers
using BenchmarkTools

function b(a, k)
  return (k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0)
end

function get_sigma_Z1(n)
  return 1.0/sqrt(n)
end

function get_sigma_Z2(a,n)
  return 1.0/sqrt((2.0*a+1.0)*n^(2.0*a+1.0))
end

function get_rho(a)
  return sqrt(2.0*a+1.0)/(a+1.0)
end

function gen_G(a,n,s) ## Generate kernel vector
  return [0; [b(a,i+1)/(n^a) for i in 1:s]]
end

function direct_convolution(x, y, s, a)
    c = zeros(eltype(a), 2*s)
    @fastmath @inbounds for j = 1:s
        @simd for k = 1:(s+1)
            c[j+k-1] = c[j+k-1] + x[j]*y[k]
        end
    end
    return c
end

function gen_volterra(a,s,n,G,rand_1,rand_2)
  Z_1 = zeros(eltype(a),s) # first normal
  Z_2 = zeros(eltype(a),s) # second normal
  @fastmath @inbounds @simd for i = 1:s
    Z_1[i] = get_sigma_Z1(n)*rand_1[i] # set correct variance of random variate
    Z_2[i] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)*get_rho(a))*rand_2[i]) # set correct variance of random variate
  end
  return [0; (direct_convolution(Z_1,G,s,a)[1:s] + Z_2)] # convolve
end

n =500;
s =500;
rand1 =randn(n);
rand2 =randn(n);
a = -0.43

@time result = gen_volterra(a,s,n,gen_G(a,n,s),rand1,rand2) # Evaluate function
@time volterra_gradient = ForwardDiff.derivative(a -> gen_volterra(a,s,n,gen_G(a,n,s),rand1,rand2), -0.43) # Get derivative


NLopt.jl was able to much more efficiently reproduce the parameters in generated data than other methods I tried

Just curious - which NLopt algorithm(s) did you use? Any recommendations?


Use @btime or at least precompile so you don’t include compilation time.

So you have a single parameter, and are taking 500 derivatives with respect to it.
Delete all the @fastmath tags. They make math slower far more often than faster, here included.
I have about 260 microseconds median time for gen_volterra vs 750 for the derivative.

Much better than 100x slower, though @code_warntype for the derivative still doesn’t look good.
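For reference, the interpolation syntax matters when benchmarking against global variables; a minimal sketch (stand-in function, not gen_volterra):

```julia
using BenchmarkTools

f(x) = sum(abs2, x)   # stand-in for the real function
x = randn(500)

# $-interpolation makes BenchmarkTools treat x as a local, so you measure
# the function itself rather than untyped-global overhead; @btime also
# discards the first (compiling) run automatically.
@btime f($x)
```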


Well, no: the output of this Volterra process will be used to model the volatility of the stock price, which is then used to compute the call price by averaging over all paths, resulting in the “model” option price that I try to calibrate to market option prices. So in the end there are 3 parameters and one objective function, so 3 derivatives.

I thought that the convolution part of my code was the bottleneck, but I should obviously have used @btime to check this. However, measured with @btime, the gradient call of my full code is still indeed ~100x slower than just evaluating the objective function. What can I do with @code_warntype?


We ran the whole gamut. I hope to get notebooks up on DiffEqBenchmarks.jl for this soon. It’ll probably just show the results on ODEs since those run much quicker (no need to Monte Carlo), and the results as to which optimizers do well (and the effect of tolerance on the ability to retrieve the correct solution) were pretty much the same. This is a preview of what will become the notebook:

LN_BOBYQA got the right answer in 0.04 seconds vs the fastest BlackBoxOptim taking 4.5 seconds. I forget why Optim isn’t in this file but it’s closer to BlackBoxOptim. Then the “longer version” requires global optimizers to do well. GN_ESCH is a decent choice. While this is on a chaotic equation (Lorenz equation), with only 3 variables I’d assume the difficulty is close to what you’d expect from real problems with more variables.
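For reference, an LN_BOBYQA run in NLopt.jl looks roughly like this (toy quadratic objective, not the SDE fit):

```julia
using NLopt

# NLopt objectives take (x, grad); derivative-free methods like BOBYQA
# never touch grad, so it can be ignored here.
function toy_objective(x, grad)
    return (x[1] - 1.0)^2 + (x[2] + 0.5)^2
end

opt = Opt(:LN_BOBYQA, 2)
opt.lower_bounds = [-5.0, -5.0]
opt.upper_bounds = [5.0, 5.0]
opt.xtol_rel = 1e-8
opt.min_objective = toy_objective

minf, minx, ret = NLopt.optimize(opt, [0.0, 0.0])  # minx ≈ [1.0, -0.5]
```

Swapping `:LN_BOBYQA` for `:GN_ESCH` (plus a stopping criterion like `maxeval`) gives the global variant discussed above.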

The SDE versions of these just puts them in a MonteCarloProblem and the cost function automatically uses method of moments for fitting. There’s a test in DiffEqParamEstim.jl that makes sure this works with Optim.jl, but in general this method is slower so I’ve had less patience to do benchmarking with it than the ODEs :wink:.

Many publications mention the necessity of global optimization methods for this, and we have been able to confirm it on this 3-parameter Lorenz example and 4-parameter Lotka-Volterra equations. Anything with more than about 3 parameters (even on ODEs) will start to find alternative minima far too easily, so you’d have to at least run local optimizers with many initial conditions to get something trustworthy on problems that aren’t trivial.
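The multistart pattern mentioned at the end is only a few lines in Optim.jl. A toy sketch (names and the objective are illustrative; a real run would use the calibration objective and bounds):

```julia
using Optim

# Run a local optimizer from several random initial points in a box
# and keep the best result.
f(x) = (1.0 - x[1])^2 + 100.0*(x[2] - x[1]^2)^2
starts = [4 .* rand(2) .- 2 for _ in 1:20]          # 20 points in [-2, 2]^2
runs = [optimize(f, x0, BFGS(); autodiff = :forward) for x0 in starts]
best = runs[argmin(Optim.minimum.(runs))]
Optim.minimizer(best)  # ≈ [1.0, 1.0]
```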


Thanks for all the information, Chris!