# Nelder Mead tolerance vastly higher when I restart optimization with updated guess for initial parameters

Is it normal for the tolerance to jump when you restart Optim with the parameters initialized to equal the parameters from the last iteration of the last Optim call?

I am minimizing a function f(b, x) using Nelder-Mead, where b is a vector of parameters and x is data. I start the optimization with a zero vector as the initial value for b. I then stop the optimization at the Nth step, save the parameters b_N from that step, and re-run the optimizer with b_N as the new guess. As expected, the (N+1)th iteration has a value of f very similar to the Nth iteration's; however, the tolerance metric is vastly different: around 1e-5 at the Nth iteration versus around 1e2 at the (N+1)th.

Is this expected behaviour?

Below is an example of how I call the Nelder-Mead optimization.

```julia
function f(b, x)
    # b are parameters, x is data
    # return some function of x, b to be minimized
end

## Optim call round 1
Optim.minimizer(optimize(b -> f(b, x0), b0))
# suppose the Julia session is killed when we reach the Nth iteration,
# so we must restart Optim using the Nth iterate above as the new initial point

## Optim call round 2
b_N = param[size(param, 1), :]  # grab the last iterate of round 1 (assumes iterates were saved to `param`)

Optim.minimizer(optimize(b -> f(b, x0), b_N))  # restart Optim with b_N as the guess
```

I'm a bit confused. Why are you using autodiff with Nelder-Mead? That algorithm only needs function evaluations.

Thanks @ctkelley, that autodiff is indeed unnecessary here. I've removed it and it makes no difference, so it seems to have had no impact either way.

Nelder-Mead forms a new simplex when you start it with b_N as the initial point. That simplex is larger than the simplex it ended up with in the first round. The convergence criterion of Nelder-Mead is the standard deviation of the function evaluated at each point of the simplex. Since the initial simplex of the second round is larger, you end up with a larger value of the standard deviation. Btw, you can change the initial simplex, see Optim.jl, and you can also save the last simplex of the first round.
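To resume rather than restart, Optim.jl's documented `Simplexer` interface lets you supply the saved simplex explicitly instead of letting Nelder-Mead build a fresh one. A minimal sketch, assuming `saved_vertices` is a hypothetical variable holding the vertices recovered from the first round's trace:

```julia
using Optim

# A simplexer that returns pre-saved vertices instead of building a new simplex.
# `saved_vertices` is hypothetical: the last simplex recovered from round 1.
struct WarmStartSimplexer{T} <: Optim.Simplexer
    vertices::Vector{T}
end
Optim.simplexer(s::WarmStartSimplexer, initial_x) = s.vertices

# optimize(b -> f(b, x0), b_N,
#          NelderMead(initial_simplex = WarmStartSimplexer(saved_vertices)))
```

With the old simplex restored, the convergence metric should pick up roughly where the first round left off.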

Thanks @amrods. Is there a reason why the initial simplex is large in the second round? It is almost as large as the initial simplex in the first round.

I thought the simplex searched around the current parameter guess and would thus be a function only of that guess. But it seems to also depend on how long the Optim call has been running.

Generally speaking, when Nelder-Mead is improving the function, the next simplex it forms will be smaller. So as further iterations improve the function, the convergence criterion becomes smaller. The link I posted has a section on how the initial simplex is formed. Also, you can read about Nelder-Mead in this free book Algorithms for Optimization, which is very intuitive and written with Julia examples.
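The effect described above is easy to see directly: the reported metric is the standard deviation of the function values over the simplex vertices, so a freshly built (wide) simplex yields a much larger value than a late-stage (tiny) one, even when both are centered on the same point. A toy illustration on a quadratic:

```julia
using Statistics

g(b) = sum(abs2, b)  # toy objective with minimum at the origin

# Two simplices centered on the same point [1.0, 1.0]:
# a tiny late-stage simplex vs. a fresh, default-sized restart simplex.
vertices(c, h) = [c, c .+ [h, 0.0], c .+ [0.0, h]]

small = vertices([1.0, 1.0], 1e-4)
large = vertices([1.0, 1.0], 0.5)

std(g.(small))  # tiny: looks nearly "converged"
std(g.(large))  # orders of magnitude larger at the same center point
```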

Would you happen to know whether any part of the Nelder-Mead call can be parallelized?

I am using it in a maximum likelihood estimation to find the minimum of the negative log-likelihood. I currently parallelize the construction of the log-likelihood by distributing the summation of the likelihood contributions over processors. However, this all happens before the Optim call. I'm wondering whether the Optim call itself can be parallelized (e.g., by distributing the evaluation of the objective over the points of the simplex).
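The distributed likelihood sum described above can be sketched with the `Distributed` standard library; `loglik_i` here is a hypothetical toy contribution standing in for the real model:

```julia
using Distributed
addprocs(4)  # spawn worker processes

# Hypothetical per-observation log-likelihood contribution (toy Gaussian-like term).
@everywhere loglik_i(b, xi) = -(xi - b[1])^2 / (2 * b[2]^2) - log(abs(b[2]))

# Distribute the summation of contributions across workers, then negate
# to get the objective to be minimized.
negloglik(b, x) = -(@distributed (+) for i in eachindex(x)
    loglik_i(b, x[i])
end)

# optimize(b -> negloglik(b, data), b0, NelderMead())
```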

As I understand it, in general you cannot parallelize optimization algorithms. You could parallelize the objective function, as you have done. In principle you can parallelize Nelder-Mead as you mention, but I don't think `Optim.jl` supports that. The book I linked above has sample code for Nelder-Mead, which you can modify and parallelize at will. Another suggestion: try to optimize the computation of the objective function and log-likelihood with speed-up tools like `LoopVectorization.jl`. See here How to speed up these functions? for example.

Actually, now that I think about it, after a simplex is formed, Nelder-Mead only changes one vertex at a time, so there is no need to parallelize it.

Could you show me how to save the simplex for each iteration of Optim? I do not see the syntax for this in the documentation you attached.

Try playing with options `store_trace` and `extended_trace`: Optim.jl.

I really don't understand why people are still using Nelder–Mead. It was obsolete decades ago, and is not even guaranteed to converge to a local optimum.

If warnings about its obsolescence (and explanations of what is better) are not included in the documentation of the various packages (e.g. Optim.jl), people will certainly continue using it as the "obvious gradient-free method".

Which gradient free alternative would you recommend?

OK, I managed to retrieve the simplex for each iteration, but only after the Optim call completes successfully. That doesn't solve my problem, because I need to extract the simplex and save it to a CSV while the optimizer is running.

My issue is that my optimizer runs for 70+ hours, but the system I am on terminates the session after 50 hours. So I can't wait for the routine to complete successfully before exporting the trace, since the first call simply cannot finish within the system's constraints.

@stevengj I have seen your suggestion here on printing the function value from within the objective function: Optim.jl, showing trace. I imagine this can't be done in my case, because the simplex is determined outside the objective function?

```julia
options = Optim.Options(store_trace = true, extended_trace = true,
                        trace_simplex = true, iterations = 10^10)
res = optimize(b -> f(b, x0), b0, options)

for i = 1:Optim.iterations(res)
    # inspect the stored simplex at iteration i here
end
```

I believe you can use the `callback` option to supply a function that writes a CSV, though I have not done that myself. Also, writing a file at every iteration would presumably slow down the optimization considerably.
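A sketch of that idea, under two assumptions that should be checked against Optim.jl's docs: that with `store_trace = true` the callback receives the trace accumulated so far, and that with `trace_simplex = true` each state's `metadata` dictionary holds the simplex under the key `"simplex"`. The checkpoint interval and file name are arbitrary:

```julia
using Optim, DelimitedFiles

# Assumed: the callback gets the trace so far when store_trace = true, and
# metadata["simplex"] holds the vertices when trace_simplex = true.
function checkpoint(tr)
    st = last(tr)
    if st.iteration % 100 == 0  # write a checkpoint every 100 iterations
        open("simplex_checkpoint.csv", "w") do io
            writedlm(io, st.metadata["simplex"], ',')
        end
    end
    return false  # returning true would stop the optimization
end

options = Optim.Options(store_trace = true, extended_trace = true,
                        trace_simplex = true, callback = checkpoint)
res = optimize(b -> f(b, x0), b0, NelderMead(), options)
```

Writing only every 100th iteration keeps the file I/O from dominating the run time.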

Regarding the algorithms, I believe DIRECT and Sbplx are preferable over Nelder-Mead. Those are available in `NLopt.jl`, rather than `Optim.jl`. You may also spend some time getting familiar with `Optimization.jl`, which aims to provide an interface to all optimization packages in Julia.

Having said that, I remember reading somewhere that using derivative-free algorithms to optimize a function is a good way to spend one more year on a PhD (or, more generally, on a project).

One reason for Nelder-Mead's popularity is that it is easy to code yourself; the NM paper has been heavily cited for that reason. While it has minimal theoretical support and is known to fail on some very simple problems, when it works it works pretty well.
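To illustrate how little code it takes, here is a minimal textbook sketch of Nelder-Mead (reflection, expansion, inside contraction, shrink), with the same stopping criterion discussed above. This is not Optim.jl's adaptive variant, just a bare-bones version:

```julia
using Statistics

# Minimal Nelder-Mead: stops when the std of f over the vertices is below tol.
function nelder_mead(f, x0; step = 0.1, tol = 1e-8, maxiter = 10_000)
    n = length(x0)
    # initial simplex: x0 plus a perturbation along each coordinate axis
    simplex = [copy(x0)]
    for i in 1:n
        v = copy(x0); v[i] += step
        push!(simplex, v)
    end
    fvals = f.(simplex)
    for _ in 1:maxiter
        order = sortperm(fvals)                      # best first, worst last
        simplex, fvals = simplex[order], fvals[order]
        std(fvals) < tol && break                    # the criterion discussed above
        centroid = sum(simplex[1:n]) / n             # centroid of all but the worst
        xr = centroid + (centroid - simplex[end])    # reflect the worst vertex
        fr = f(xr)
        if fr < fvals[1]                             # very good: try expanding
            xe = centroid + 2(centroid - simplex[end])
            fe = f(xe)
            simplex[end], fvals[end] = fe < fr ? (xe, fe) : (xr, fr)
        elseif fr < fvals[n]                         # better than second-worst
            simplex[end], fvals[end] = xr, fr
        else
            xc = centroid + 0.5(simplex[end] - centroid)  # inside contraction
            fc = f(xc)
            if fc < fvals[end]
                simplex[end], fvals[end] = xc, fc
            else                                     # shrink toward the best vertex
                for i in 2:n+1
                    simplex[i] = 0.5(simplex[1] + simplex[i])
                    fvals[i] = f(simplex[i])
                end
            end
        end
    end
    best = argmin(fvals)
    return simplex[best], fvals[best]
end
```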

Having said that, I doubt that @amrods wants to code it, and NOMAD.jl may be a good choice.

Me? Perhaps you mean @structural.

Yes, I did.

Thanks, I will have to take a look at `NLopt.jl`, `Optimization.jl`, and `NOMAD.jl`.

When one cannot provide analytic gradients and Hessians, is it still faster to use algorithms that require gradients and/or Hessians than derivative-free solvers? I imagine there is a trade-off between the time it takes to numerically approximate the gradient/Hessian and the improvement in the search from having that information. Is there usually a net speed improvement from using approximations versus going derivative-free?
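One way to test that trade-off empirically: Optim.jl falls back to finite differences when no gradient is supplied (or forward-mode AD with `autodiff = :forward`), so the same objective can be run both ways and the function-evaluation counts compared. A sketch, reusing `f`, `x0`, and `b0` from earlier in the thread:

```julia
using Optim

res_nm   = optimize(b -> f(b, x0), b0, NelderMead())
res_bfgs = optimize(b -> f(b, x0), b0, BFGS())  # gradient via finite differences
# res_ad = optimize(b -> f(b, x0), b0, BFGS(); autodiff = :forward)

# Compare total objective evaluations (each finite-difference gradient
# costs extra evaluations, which this count includes).
Optim.f_calls(res_nm), Optim.f_calls(res_bfgs)
```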