Nelder Mead tolerance vastly higher when I restart optimization with updated guess for initial parameters

structural · September 28, 2022, 3:17pm

Is it normal for the tolerance to jump when you restart Optim with the parameters initialized to equal the parameters from the last iteration of the last Optim call?

I am minimizing some function f(b, x) using Nelder-Mead where b is a vector of parameters and x is data. I start the optimization with a zero-vector as the initial value for b. I then stop the optimization in the N^{th} step, save the parameters b_N in the last step, and re-run the optimizer with b_N as the new guess. As expected, the N+1^{th} iteration has a very similar value of f as the N^{th} iteration, however, the tolerance metric is vastly different. The tolerance in the N^{th} iteration is around 1E-5 whereas the tolerance in the N+1^{th} iteration is around 1E2.

Is this expected behaviour?

Below is an example of how I call the Nelder-Mead optimization.

function f(b, x)
    # b are parameters, x is data
    # return some function of x,b to be minimized
    CSV.write("parameter_estimates.csv", Tables.table(vec(b)), writeheader=false, append=true)
end

## Optim call round 1
Optim.minimizer(optimize(b -> f(b, x0), b0)
      # suppose the julia session is killed when we reach the Nth iteration
      # so we must restart optim using the Nth iteration above as the new initial point


## Optim call round 2
param = readdlm("parameter_estimates.csv", ',')
b_N = param[size(param,1), :] # grab last iteration of round 1 of optim call

Optim.minimizer(optimize(b -> f(b, x0), b_N) # restart optim with b_N as guess

ctkelley · September 28, 2022, 5:29pm

I’m a bit confused. Why are you using autodiff with Nelder-Mead? That algorithm only needs function evaluations.

structural · September 29, 2022, 10:37am

Thanks @ctkelley, that autodiff is indeed unnecessary here. It seems it had no impact one way or another as I’ve removed it and it makes no difference.

amrods · September 29, 2022, 10:57am

Nelder-Mead forms a new simplex when you start it with b_N as the initial point. That simplex is larger than the simplex it ended up with in the first round. The convergence criterion of Nelder-Mead is the standard deviation of the function evaluated at each point of the simplex. Since the initial simplex of the second round is larger, you end up with a larger value of the standard deviation. Btw, you can change the initial simplex, see Optim.jl, and you can also save the last simplex of the first round.

structural · September 29, 2022, 11:03am

Thanks @amrods. Is there a reason why the initial simplex is large in the second round? It is almost as large as the initial simplex in the first round.

I thought the simplex searched around the current parameters guess and thus would only be a function of current parameter guess. But it seems to also be influenced by how long the optim call has been running for.

amrods · September 29, 2022, 11:08am

Generally speaking, when Nelder-Mead is improving the function, the next simplex it forms will be smaller. So as further iterations improve the function, the convergence criterion becomes smaller. The link I posted has a section on how the initial simplex is formed. Also, you can read about Nelder-Mead in this free book Algorithms for Optimization, which is very intuitive and written with Julia examples.

structural · September 29, 2022, 11:20am

Would you happen to know whether any part of the Nelder-Mead call can be parallelized?

I am using it in a maximum likelihood estimation to find the minima of the negative of the log likelihood. I am currently parallelizing the construction of the log likelihood by distributing the summation of the likelihood contributions over processors. However this is all happening before the optim call. I’m wondering if the optim call itself can be parallelized (e.g., by distributing the evaluation of the objective over the points of the simplex).

amrods · September 29, 2022, 11:30am

As I understand, in general you cannot parallelize optimization algorithms. You could parallelize the objective function as you have done. In principle, however, you can parallelize Nelder-Mead as you mention, but I don’t think Optim.jl supports that. The book I linked above has sample code for Nelder-Mead, which you can modify and parallelize at will. Another suggestion I have is to try to optimize the computation of the objective function and log-likelihood and use speed-up tools like LoopVectorization.jl. See here How to speed up these functions? for example.

amrods · September 29, 2022, 11:37am

Actually, now that I think about it, after a simplex is formed, Nelder-Mead only changes one vertex at a time, so there is no need to parallelize it.

structural · September 30, 2022, 9:45pm

Could you show me how to save the simplex for each iteration of Optim? I do not see the syntax for this in the documentation you attached.

amrods · September 30, 2022, 11:08pm

Try playing with options store_trace and extended_trace: Optim.jl.

stevengj · October 1, 2022, 1:50am

I really don’t understand why people are still using Nelder–Mead. It was obsolete decades ago, and is not even guaranteed to converge to a local optimum.

Krastanov · October 1, 2022, 4:47am

If warnings about its obsolescence (and explanations of what is better) are not shared in the documentation of the various packages (e.g. Optim.jl ) people will certainly continue using it as the “obvious gradient-free method”.

structural · October 1, 2022, 8:59am

Which gradient free alternative would you recommend?

structural · October 1, 2022, 10:06am

Ok I managed to retrieve the simplex for each iteration but only after the Optim call completes successfully. But this doesn’t solve my problem because I need to extract the simplex and save to a csv while the optimizer is running.

My issue is that I have an optimizer which runs for 70+ hours but the system I am on terminates the session after 50 hours. So I can’t wait for the routine to complete successfully to export the trace since the first call simply cannot be completed within the constraints of the system.

@stevengj I have seen your suggestion here on printing the function value from within the objective function Optim.jl, showing trace. I imagine this can’t be done in my case because the simplex is determined outside the objective function?

options = Optim.Options(store_trace = true, extended_trace = true, trace_simplex = true, iterations = 10^10)
res = optimize(b -> f(b, x0), b0, options)

for i = 1:Optim.iterations(res)
    simp = get(Optim.trace(res)[i].metadata, "simplex", "na")
    simp_val = get(Optim.trace(res)[i].metadata, "simplex_values", "na")
end

amrods · October 1, 2022, 11:37am

I believe you can use the option callback to supply a function that writes a csv. I have not done that, though. Also, writing a file externally would presumably considerably slow down the optimization.

Regarding the algorithms, I believe DIRECT and Sbplx are preferable over Nelder-Mead. Those are available in NLopt.jl, rather than Optim.jl. You may also spend some time getting familiar with Optimization.jl, which aims to provide an interface to all optimization packages in Julia.

Having said that, I remember reading somewhere that using derivative-free algorithms for optimizing a function are a good way to spend one more year in a PhD (or more generally in a project).

ctkelley · October 1, 2022, 11:39am

One reason for Nelder-Mead’s popularity is that it is easy to code it yourself. The NM paper has been heavily cited for that reason. While it has minimal theoretical support and is known to fail on some very simple problems, when it works it works pretty well.

Having said that, I doubt that @amrods wants to code it and nomad.jl may be a good choice.

amrods · October 1, 2022, 11:41am

me? perhaps you mean @structural.

ctkelley · October 1, 2022, 12:20pm

Yes, I did.

structural · October 1, 2022, 2:59pm

Thanks I will have to take a look at NLopt.jl, Optimization.jl, and Nomad.jl.

When one does not have analytic gradients and hessians that one can provide, is it still faster to use algorithms which require gradients and/or hessians over derivative-free solvers? I imagine there is a tradeoff between the time it takes to numerically approximate the gradient/hessian and the improvement in search that results from having information about the gradient/hessian. Is there usually a net speed improvement from using approximations vs. going derivative-free?

Topic		Replies	Views
Forcing Neldermead from Optim.jl to abort when "stuck" Optimization (Mathematical) question , optim	8	1384	October 14, 2021
Optim: What optimiser is best if your gradient computation is slow? Optimization (Mathematical) optim	13	3198	August 22, 2017
Solving an optimal control problem with Optim.jl Optimization (Mathematical) question , optimization	4	1686	March 28, 2020
Why Nelder-Mead minimization without minimal property check? Optimization (Mathematical) optimization	27	2856	February 8, 2024
Providing More Initial Parameter Values than there are Parameters in the Function Being Optimized Optimization (Mathematical) question , diffeq	9	612	September 3, 2020

Nelder Mead tolerance vastly higher when I restart optimization with updated guess for initial parameters

Related topics