But the other thing I have found is that more robust optimizers often make it easier to get a sense of the domains of attraction from initial conditions (and, frequently, to find problems in my models that need to be fixed, or corners that I didn't anticipate would ever bind).
In particular, you can tell these optimizers to emphasize feasibility over optimality, and they will then carefully walk along the F(a,p) = 0 manifold, which might be essential for the stability of your other code. Another thing you can do is fix an initial a and tell the optimizer to find a feasible point without worrying about the objective. Once you know "the solution", you can reverse-engineer conditions that guide the less robust optimizers to it, and drop the commercial one.
But with all that, I completely understand your aversion to experimenting with commercial optimizers. And I suspect that black box derivative-free ones are never going to work with a problem of this type because feasibility on the constraint is too important.
Try setting some options in Ipopt, e.g. `start_with_resto = true` or `max_soc = 50`. You can also use second-order Ipopt or AugLag by setting the option `first_order = false`; the default in Nonconvex is first order.
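For concreteness, a minimal sketch of passing those options through Nonconvex.jl; the `model` and starting point `x0` are assumed to be set up already (objective M, equality constraint F(a,p) = 0), and the exact option spellings should be checked against the Nonconvex/Ipopt docs (raw Ipopt takes `start_with_resto` as the string "yes"):

```julia
using Nonconvex
Nonconvex.@load Ipopt

# `model` is assumed to be a Nonconvex model with objective M and the
# equality constraint F(a, p) = 0 added via add_eq_constraint!.
options = IpoptOptions(
    first_order = false,       # second-order mode (uses the Hessian)
    start_with_resto = "yes",  # begin in the feasibility-restoration phase
    max_soc = 50,              # allow many second-order correction steps
)
result = optimize(model, IpoptAlg(), x0; options = options)
```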
Note that you can differentiate M(a, p(a)) with respect to a very efficiently and easily using an adjoint method (a.k.a. the implicit-function theorem): given the solution of F(a,p) = 0 and the Jacobian \partial F/\partial p, you only have to solve one linear system, with \partial M/\partial p as the right-hand side. See e.g. section 3 of my course notes.
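(Not the snippet from the notes, just a minimal Julia sketch of the idea, assuming M(a, p) returns a scalar, F(a, p) returns a vector, and p already satisfies F(a, p) = 0; ForwardDiff is used purely for illustration.)

```julia
using ForwardDiff

# One adjoint solve gives the full gradient dM/da at a point where F(a, p) = 0.
function adjoint_gradient(M, F, a, p)
    dM_da = ForwardDiff.gradient(x -> M(x, p), a)   # partial ∂M/∂a
    dM_dp = ForwardDiff.gradient(y -> M(a, y), p)   # partial ∂M/∂p
    dF_da = ForwardDiff.jacobian(x -> F(x, p), a)   # Jacobian ∂F/∂a
    dF_dp = ForwardDiff.jacobian(y -> F(a, y), p)   # Jacobian ∂F/∂p
    λ = dF_dp' \ dM_dp            # the single adjoint linear solve
    return dM_da - dF_da' * λ     # total derivative dM/da
end
```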
You could, alternatively, use an optimization algorithm like SLSQP that supports nonlinear equality constraints directly, and just solve \min_{a,p} M(a,p) subject to F(a,p) = 0.
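A hedged sketch of that joint formulation with NLopt.jl's SLSQP (the `M`, `F`, `a0`, `p0` names are assumptions; gradients are filled with ForwardDiff for brevity):

```julia
using NLopt, ForwardDiff

function solve_joint(M, F, a0, p0; tol = 1e-8)
    na = length(a0)
    obj(x) = M(x[1:na], x[na+1:end])
    con(x) = F(x[1:na], x[na+1:end])
    m = length(con(vcat(a0, p0)))

    opt = Opt(:LD_SLSQP, na + length(p0))
    opt.min_objective = (x, grad) -> begin
        length(grad) > 0 && (grad .= ForwardDiff.gradient(obj, x))
        return obj(x)
    end
    # Vector-valued equality constraint F(a, p) = 0; `grad` is n-by-m.
    equality_constraint!(opt, (res, x, grad) -> begin
        length(grad) > 0 && (grad .= transpose(ForwardDiff.jacobian(con, x)))
        res .= con(x)
    end, fill(tol, m))
    opt.xtol_rel = 1e-8

    minf, minx, ret = optimize(opt, vcat(a0, p0))
    return minx[1:na], minx[na+1:end], minf, ret
end
```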
(For example, a better version of your “penalty” method is the augmented Lagrangian method, which adds both a penalty and a Lagrangian term and is designed so that the penalty does not need to become extremely large in order to enforce the constraint to a small tolerance — unlike your simple L2 penalty.)
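A toy sketch of that update (the method of multipliers), assuming some unconstrained inner solver `inner_minimize` (a placeholder name, e.g. a wrapper around Optim.jl); compare with a pure penalty, which needs μ → ∞ to drive ‖F‖ to zero:

```julia
using LinearAlgebra

# Method of multipliers: minimize L(x) = M(x) + λ'F(x) + (μ/2)‖F(x)‖²,
# then update λ ← λ + μ F(x). The λ term is what lets μ stay moderate.
function auglag(M, F, x0, inner_minimize; μ = 10.0, maxiter = 20, tol = 1e-8)
    x = x0
    λ = zeros(length(F(x0)))
    # Augmented Lagrangian; the closure sees the current λ at each call.
    Lμ(y) = M(y) + dot(λ, F(y)) + (μ / 2) * sum(abs2, F(y))
    for _ in 1:maxiter
        x = inner_minimize(Lμ, x)   # unconstrained inner solve
        norm(F(x)) < tol && break
        λ += μ * F(x)               # multiplier (dual) update
    end
    return x, λ
end
```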
A follow-up, if anyone is interested: there were two issues with the original problem, a conceptual one and a numerical one.
The conceptual problem was that the Stone-Geary preferences we used for modeling non-homotheticity shut down the service sector below some level of GDP: a lot of moments become 0, derivatives have a very different sparsity pattern, and in practice this seems to lead most methods astray whenever the region they are investigating is on or near that boundary.
This was solved by using the methods of a nice recent paper, Comin, D., Lashkari, D., & Mestieri, M. (2021), "Structural Change with Long-Run Income and Price Effects", coded up for AD. Now everything is continuously differentiable.
The numerical problem was occasional floating-point overflow for parameters which are nowhere near the optimum, but are nevertheless investigated by the algorithms. It is very easy to get `exp(x) == Inf` for a not-so-large x, and when two of these are subtracted, etc., you get NaNs. What was insidious is that this occurred mostly in the Hessian, which we did not test originally. While I recognize that coding up a reasonable objective is the responsibility of the user, it would be better if optimizers checked `all(isfinite, ...)` on everything they use and failed early. I opened an issue for JuliaSmoothOptimizers, which ended up as JuliaSmoothOptimizers/SolverTest.jl/issues/9.
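To illustrate the failure mode in Float64 (the overflow threshold is just under x ≈ 709.8):

```julia
julia> exp(710)                        # overflows Float64
Inf

julia> exp(710) - exp(710)             # Inf - Inf: the NaN that poisons the Hessian
NaN

julia> all(isfinite, [1.0, exp(710)])  # the cheap early check
false
```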
Some benchmarks:
- Percival.jl (either via Nonconvex.jl, or used directly) finds a global optimum 54% of the time, from randomly drawn points,
- Ipopt.jl gets it right 64% of the time, from the same points,
- I still can't get NLopt.jl to work (same failure as noted above).
Dear Steve, if possible, could you please share the Julia equivalent of the Matlab code snippet displayed in your notes on adjoint methods? I would like to save it with the course notes. Thanks.
Right, thanks for the citation. I looked through the chapter.
It makes perfect sense to me to simultaneously optimize (a,p,λ) for each μ, and anneal a sharp constraint…
L(a, p, λ | μ) = M + Σ_i λ_i F_i + (μ/2) Σ_i F_i^2
My question might be dumber than that. Why doesn't μ = 0, i.e. using

L(a, p, λ) = M + Σ_i λ_i F_i

as the objective function for (a, p, λ), just work? Is it because the second (partial) derivatives vanish? Are the KKT conditions not usually satisfied, or something?
(The quadratic term looks ugly to me because it isn't invariant to a rescaling of F.)
Sorry, I am not sure I understand what you are suggesting above. Depending on the sign of F_i, blindly optimizing the objective above would get you \lambda_i = \pm \infty.
You can show that for μ large enough, the augmented Lagrangian becomes a strongly convex function near the solution, even if the original objective and constraints are non-convex. This significantly improves the convergence behavior when minimizing it.
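A quick sketch of the standard argument (assuming second-order sufficiency; see e.g. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods): at a KKT point (x^*, λ^*) with F(x^*) = 0 and J = \partial F/\partial x,

\nabla^2_x L_μ(x^*, λ^*) = \nabla^2_x L_0(x^*, λ^*) + μ JᵀJ.

\nabla^2_x L_0 is only required to be positive definite on the null space of J, so at μ = 0 the point is generally a saddle; the μ JᵀJ term adds curvature exactly in the constraint-violating directions, and for μ large enough the sum becomes positive definite.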
Right, I see: for μ = 0 the optimum is a stationary point, but L is completely hyperbolic there. Apologies for asking for a geometry lesson on Julia discourse. Thanks for your patience!