Optim: What optimiser is best if your gradient computation is slow?

vgdev · August 21, 2017, 8:17pm

I have a objective function that is a result of N stochastic simulations, i.e. some function f with the sample mean of my simulations as input. I am using the ForwarDiff package to produce the gradient of my objective function wrt. to three (3) parameters. However, the computation of the gradient is slow, so I was wondering is it is possible to say something about what optimizer in the Optim package that gives you the best value for actually providing a gradient compared to just using finite differences? thx

stevengj · August 21, 2017, 10:56pm

How many parameters do you have? Is the function actually smooth or is it noisy?

ChrisRackauckas · August 22, 2017, 1:18am

In JuliaDiffEq we did some extensive testing of parameter estimation from stochastic differential equations via StochasticDiffEq.jl + DiffEqParamEstim.jl (encompassed in DifferentialEquations.jl) and found that NLopt.jl was able to much more efficiently reproduce the parameters in generated data than other methods I tried (included Optim.jl, Evolutionary.jl, BlackBoxOptim.jl, etc.). Global optimization methods were found to be pretty essential here. This was using a method of moments type of fitting with high order adaptive integrators, and using more error prone methods like Euler-Maruyama or Milstein variants did lead to noticeable errors (or slower convergence by using larger Monte Carlo calculations of weak properties at each step). Autodifferentiation does work through the whole SDE solver but we found that this wasn’t too beneficial: some of the global derivative-free optimizers did really well, at least to get to the general area and finish with a local optimizer. At least that’s the results we got when attempting to replicate input data, but every model is different so YMMV.

Tamas_Papp · August 22, 2017, 5:41am

If you are doing indirect inference, you may want to fix the “noise” terms, making your function continuous, if that is possible.

Look at the advanced usage section in the manual, and optimize your calls to ForwardDiff. Naive applications are often not type stable, so at least assert a ::Matrix{Float64} for the Jacobian, but configuring chunk size and SIMD often help a lot (independently).

For local convergence, if you can obtain derivatives with AD, using algorithms that take advantage of them (like BFGS) are almost always worth it. Global convergence, especially if you have many local optima, is an orthogonal issue.

vgdev · August 22, 2017, 8:48am

As I said, I have 3 parameters. The objective function is the RMSE between call options market data and the call options produced by the model, i.e. the sample mean of \max(S^{n}_T-K,0) where S^{n}_T is the simulated stock price for path n and K the strike.

vgdev · August 22, 2017, 9:23am

However, what I am experiencing is that the time for evaluation of the objective function is 0.1s, while calling forwarddiff to produce the gradient runs in 10s (using benchmarktools). So it is almost 100 times slower (?)

I did some test runs with BFGS with finite differences and BFGS with AD. The AD version obviously takes longer time, but looking at the number of function/derivative calls it is not converging faster than the FD run, which I cannot understand why.

BFGS with FD:

Results of Optimization Algorithm

Algorithm: Fminbox with BFGS

Starting Point: [-0.3,-0.5,1.5]
(Lower = [-0.4999, -0.9999,1.00])
(Upper = [0.0, 0.9999,3.00])

Minimizer: [-0.36851226394666786,-0.9880360255423059, 2.9999999999291562]

Minimum: 8.545390e-01

Iterations: 4

Convergence: true

|x - x’| < 1.0e-32: false
|x - x’| = 6.87e-16

|f(x) - f(x’)| / |f(x)| < 1.0e-32: true
|f(x) - f(x’)| / |f(x)| = 3.42e-14

|g(x)| < 1.0e-08: false
|g(x)| = 4.07e+03

stopped by an increasing objective: false

Reached Maximum Number of Iterations: false

Objective Calls: 1384

Gradient Calls: 1384

BFGS with AD:

Results of Optimization Algorithm

Algorithm: Fminbox with BFGS

Starting Point: [-0.3,-0.5,1.5]
(Lower = [-0.4999, -0.9999,1.00])
(Upper = [0.0, 0.9999,3.00])

Minimizer: [-0.36599647187650547,-0.9875285450280383, 2.999999824307108]

Minimum: 8.544151e-01

Iterations: 4

Convergence: true

|x - x’| < 1.0e-32: false
|x - x’| = 6.30e-16

|f(x) - f(x’)| / |f(x)| < 1.0e-32: true
|f(x) - f(x’)| / |f(x)| = 1.86e-14

|g(x)| < 1.0e-08: false
|g(x)| = 2.46e-01

stopped by an increasing objective: false

Reached Maximum Number of Iterations: false

Objective Calls: 1837

Gradient Calls: 1837

Tamas_Papp · August 22, 2017, 10:07am

Try benchmarking and optimizing

g(x) = ForwardDiff.derivative(f, x)

paying close attention to type stability (Base.Test.@inferred, @code_warntype, etc), profile it, and then supply a function-and-derivative object to Optim (look at the manual). I found DiffBase.DiffResult particularly useful for a similar problem.

Elrod · August 22, 2017, 10:54am

Also cool that the DiffResults can return all the lower order derivatives.
Not that important if the gradient calculation is 100x more expensive than the function though.

@vgdev To expand on type stability, does your code ever do something like sum up values over a loop and start with a value of 0.0, 1.0, etc? Ie,

f(...) where T
x = 0.0
x += arg[i]

is not type stable when T <: dual numbers. You could instead use some variation of:

y = zero(T)
z = one(eltype(arg))

If things are 100x slower while using AutoDiff and your gradient only contains 3 values, it sounds like may have introduced type instabilities that aren’t there when you normally call your function with Float64s.
I’d look the code over and make sure there isn’t anything like the “x” above.

vgdev · August 22, 2017, 11:56am

@Elrod @Tamas_Papp I believe I have tried to take care of this, without speedup. I have found the place in my code that is causing the slowdown. But I am a rookie wrt. to Julia and Forwarddiff, so I would very much appreciate some help with why just the evaluation of gen_volterra(a,s,n,gen_G(a,n,s),x,y) is ~100 times faster than getting the derivative wrt to the parameter a in the code below. Thank you

using ForwardDiff
using DualNumbers
using BenchmarkTools

function b(a, k)
  return (k^(a+1.0)-(k-1.0)^(a+1.0))/(a+1.0);
end

function get_sigma_Z1(n)
  return 1.0/sqrt(n);
end

function get_sigma_Z2(a,n)
  return 1.0/sqrt(((2.0*a+1.0)*n^(2.0*a+1.0)))
end

function get_rho(a)
  return sqrt(2.0*a+1.0)/(a+1.0)
end


function gen_G(a,n,s) ## Generate kernel vector
  i = range(1,s)
  return [0;map(i->b(a,i+1)/(n^a),i)]
end

function direct_convolution(x, y, s, a)
    c = zeros(eltype(a),2*s)
    @fastmath @inbounds @simd for j=1:s
        @fastmath @inbounds @simd for k=1:(s+1)
            c[j+k-1] = c[j+k-1]+ x[j]*y[k]
        end
    end
    return c
end 

function gen_volterra(a,s,n,G,rand_1,rand_2)
  Z_1 = zeros(eltype(a),s); #first normal
  Z_2 = zeros(eltype(a),s); # 2nd normal
  @fastmath @inbounds @simd for i = 1:s
    Z_1[i] = get_sigma_Z1(n)*rand_1[i] #set correct variance of random variate
    Z_2[i] = get_sigma_Z2(a,n)*(get_rho(a)*rand_1[i]+sqrt(1-get_rho(a)*get_rho(a))*rand_2[i]); #set correct variance of random variate
  end
  return [0;(direct_convolution(Z_1,G,s,a)[1:s]+Z_2)]; #Convolute
end


n =500;
s =500;
rand1 =randn(n);
rand2 =randn(n);
a = -0.43

@time result = gen_volterra(a,s,n,gen_G(a,n,s),x,y) #Evaluate function
@time volterra_gradient = ForwardDiff.derivative(a -> gen_volterra(a,s,n,gen_G(a,n,s),rand1,rand2),-0.43) #Get derivative

oschulz · August 22, 2017, 12:05pm

NLopt.jl was able to much more efficiently reproduce the parameters in generated data than other methods I tried

Just curious - which NLopt algorithm(s) did you use? Any recommendations?

Elrod · August 22, 2017, 12:18pm

Use @btime or at least precompile so you don’t get compile time.

So you have a single parameter, and are taking 500 derivatives with respect to it.
Delete all the @fastmath tags. They make math slower far more often than faster, here included.
I have about 260 microseconds median time for gen_volterra vs 750 for the derivative.

Much better than 100x slower, @code_warntype for the derivative still doesn’t look good.

vgdev · August 22, 2017, 4:03pm

Well, no, the output of this Volterra process will be used to model the volatility of the stock-price, which is then used to compute the call price by averaging over all paths, resulting in the ‘model’ option price that I try to calibrate to market options prices. So in the end, there is 3 parameters and one objective function, so 3 derivatives.

I thought that the convolution part of my code was the bottleneck, but I should obviously have used btime to check this. However, the gradient call of my full code is still indeed ~100x slower using btime, than just the evaluation of the objective function. What can I do with @code_warntype ?

ChrisRackauckas · August 22, 2017, 4:56pm

We ran the whole gambit. I hope to get notebooks up on DiffEqBenchmarks.jl for this soon. It’ll probably just show the results on ODEs since those run much quicker (no need to Monte Carlo) and the results as to what optimizers do well (and the effect of tolerance on the ability to retrieve the correct solution) was pretty much the same. This is a preview of what will become the notebook:

github.com

SciML/DiffEqParamEstim.jl/blob/master/test/lorenz_true_test.jl

################################################################################

# Setup

# Note: This version is against the true solution, so adaptive timestepping is
# used. This means that the option which is set to get the correct timepoints
# is saveat, not tstops!

using RecursiveArrayTools, ParameterizedFunctions, SciMLSensitivity, ModelingToolkit
using NLopt, DiffEqParamEstim, OptimizationBBO, OptimizationOptimJL

Xiang2015Bounds = Tuple{Float64, Float64}[(9, 11), (20, 30), (2, 3)] # for local optimizations
xlow_bounds = [9.0, 20.0, 2.0]
xhigh_bounds = [11.0, 30.0, 3.0]
LooserBounds = Tuple{Float64, Float64}[(0, 22), (0, 60), (0, 6)] # for global optimization
GloIniPar = [0.0, 0.5, 0.1] # for global optimizations
LocIniPar = [9.0, 20.0, 2.0] # for local optimization

g1 = @ode_def LorenzExample begin
    dx = σ * (y - x)

This file has been truncated. show original

LN_BOBYQA got the right answer in 0.04 seconds vs the fastest BlackBoxOptim taking 4.5 seconds. I forget why Optim isn’t in this file but it’s closer to BlackBoxOptim. Then the “longer version” requires global optimizers to do well. GN_ESCH is a decent choice. While this is on a chaotic equation (Lorenz equation), with only 3 variables I’d assume the difficulty is close to what you’d expect from real problems with more variables.

The SDE versions of these just puts them in a MonteCarloProblem and the cost function automatically uses method of moments for fitting. There’s a test in DiffEqParamEstim.jl that makes sure this works with Optim.jl, but in general this method is slower so I’ve had less patience to do benchmarking with it than the ODEs .

Many publications mention the necessity of global optimization methods for this, and we have been able to confirm it on this 3 parameter Lorenz example and 4 parameter Lokta-Volterra equations. Anything more than like 3 parameters (even on ODEs) will start to find alternative minima far too easily, so you’d have to at least run local optimizers with many initial conditions to get something trustworthy on problems that aren’t trivial.

oschulz · August 22, 2017, 7:24pm

Thanks for all the information, Chris!

Topic		Replies	Views
Optim, Forward Differentiation and OnceDifferentiable with MLE Optimization (Mathematical) optim	18	2675	November 28, 2020
Latest recommendations for global optimization Optimization (Mathematical) package	92	12894	August 23, 2022
Optim user-supplied gradient and hessian MLE Optimization (Mathematical) optim	3	971	February 22, 2022
Is there something like ScipyOptimizer in Julia? General Usage question , diffeq	10	1629	July 11, 2019
Why Optim.jl does not allow for backwards mode autodifferentiation? Optimization (Mathematical) autodiff	6	538	June 15, 2023

Optim: What optimiser is best if your gradient computation is slow?

Related topics