I am glad to announce Nonconvex.jl, a package in which I implement and wrap a few constrained nonlinear optimization algorithms and packages, using automatic differentiation as a fundamental building block. There is no way to define gradients directly in Nonconvex.jl; instead, you define a custom adjoint rule using ChainRulesCore.jl. This is especially powerful because it lets you use Zygote.jl with your favorite nonlinear optimization algorithm. The following algorithms are available (a usage sketch follows the list):
- A pure Julia implementation of the original method of moving asymptotes (MMA), referred to as MMA87. This is a first order algorithm. The algorithm was generalized to handle infinite bounds.
- A pure Julia implementation of the globally convergent MMA, referred to as MMA02. This is a first order algorithm. The algorithm was generalized to handle infinite bounds.
- First and second order augmented Lagrangian algorithms as implemented in Percival.jl. Using the first order augmented Lagrangian algorithm together with Zygote is the star of the show because it enables efficient ODE/PDE-constrained optimization where the constraints are per element or per time point. I use it in TopOpt.jl, for example, to do stress-constrained topology optimization.
- First and second order interior point algorithms as available in Ipopt.jl.
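To give a flavor, here is a minimal usage sketch with a hand-written adjoint. The model-building calls follow the package README at the time of writing; `my_obj` and the option defaults are illustrative assumptions, and older ChainRulesCore versions spell `NoTangent()` as `NO_FIELDS`:

```julia
using Nonconvex, ChainRulesCore, Zygote

# Objective with a hand-written adjoint, so Zygote does not need to
# differentiate through its internals.
my_obj(x) = sqrt(x[2])

function ChainRulesCore.rrule(::typeof(my_obj), x::AbstractVector)
    y = my_obj(x)
    # Pullback for y = sqrt(x[2]): gradient is [0, 1/(2*sqrt(x[2]))].
    my_obj_pullback(ȳ) = (NoTangent(), ȳ .* [0.0, 1 / (2 * sqrt(x[2]))])
    return y, my_obj_pullback
end

# Inequality constraints g(x) <= 0, differentiated by Zygote as usual.
g(x, a, b) = (a * x[1] + b)^3 - x[2]

m = Model(my_obj)
addvar!(m, [0.0, 0.0], [10.0, 10.0])        # lower and upper variable bounds
add_ineq_constraint!(m, x -> g(x, 2, 0))
add_ineq_constraint!(m, x -> g(x, -1, 1))

r = Nonconvex.optimize(m, MMA87(), [1.234, 2.345], options = Nonconvex.MMAOptions())
r.minimum, r.minimizer
```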
Cool! This is a very handy algorithm. Note that the Svanberg (2002) paper on this describes a family of algorithms using different approximations, and as far as I can tell there is no advantage of the original “MMA” approximation compared to the simple linear + quadratic-penalty variant (called CCSA_QUADRATIC in NLopt), and the latter is far simpler so I tend to recommend it.
A big selling point of this algorithm is its simplicity, and in particular it is relatively straightforward to support objectives and constraints with sparse gradients (simply represented by sparse vectors in Julia). If you can support sparsity it would be extremely powerful, since then you can efficiently solve LASSO-like problems with L1 constraints and many other problems with huge numbers of constraints. (A student of mine once tried this for a class project and found it was within a factor of 2 of specialized LASSO algorithms, which is great because it is so much more general.)
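As a concrete (hypothetical) illustration of why sparsity buys you LASSO: the L1 ball constraint decomposes into many elementwise constraints whose gradients have only two nonzeros each. A minimal sketch:

```julia
using SparseArrays

# ‖x‖₁ ≤ s can be rewritten with auxiliary variables t as
#   x[i] - t[i] <= 0,   -x[i] - t[i] <= 0,   sum(t) <= s.
# Over the stacked variable z = [x; t], each elementwise constraint
# touches only two entries, so its gradient is a sparse vector:
n = 1000
grad_upper(i) = sparsevec([i, n + i], [1.0, -1.0], 2n)   # ∇(x[i] - t[i])
grad_lower(i) = sparsevec([i, n + i], [-1.0, -1.0], 2n)  # ∇(-x[i] - t[i])
# A CCSA implementation storing these sparsely pays O(nnz) per
# constraint instead of O(n), so 2n such constraints stay cheap.
```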
By the way, another nice property of the CCSA (e.g. MMA) algorithm is that you don’t need any other optimizer — you can solve the dual problem by calling itself recursively. On the second recursion it terminates because it hits a trivial 0-dimensional optimization problem. (I notice that you’re currently calling Optim.GradientDescent() for this, which is fine too of course. The Svanberg paper does something similar, but I always thought the recursive approach was more elegant.)
Ya I have been meaning to implement the rest but MMA is what people usually use in my domain so I started with that.
By the way, another nice property of the CCSA (e.g. MMA) algorithm is that you don’t need any other optimizer — you can solve the dual problem by calling itself recursively.
Cool! I haven’t thought about that before but it makes perfect sense! I will open an issue.
As far as I can tell there is no advantage in implementing more than one — they all have essentially equivalent convergence rates, because they are all basically first-order methods. And if you are going to implement only one, might as well implement the simplest (in order to easily generalize it later to things like sparse gradients).
I know very little about constrained optimization so apologies if this is a dumb question, but from what I can read these first order methods appear to be unaccelerated stationary methods (i.e. steepest descent type methods with no L-BFGS or CG type tricks). Is this really the case? Can't some of the standard unconstrained optimization tricks be applied there?
MMA and its friends rely on separable approximations such that the Hessian of the Lagrangian of the approximation is diagonal and positive definite. They also only work for inequality constraints; equality constraints are not supported. It turns out that problems of this form can be solved exactly quite efficiently, because the dual boils down to a box-constrained problem and the optimal primal solution of the approximation can be efficiently obtained from the optimal dual. This makes it a good candidate for a "sequential approximation" routine with trust regions, which is essentially the class of methods MMA belongs to. The approximation used is not a first or second order Taylor series expansion wrt x though; it is a per-variable first order Taylor series expansion wrt either 1 / (x_i - L_i) or 1 / (U_i - x_i), whichever leads to a positive second derivative (i.e. a convex approximation). L_i and U_i are known as the "moving asymptotes", hence the name of the method; they play the role of a trust region. The "globally convergent" variant basically tries to maintain primal feasibility at every step according to the original constraints.
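In symbols, one common way of writing the Svanberg (1987) approximation around the current iterate x^(k) (conventions differ slightly between papers):

```latex
f^{(k)}(x) = r^{(k)} + \sum_i \left( \frac{p_i^{(k)}}{U_i - x_i} + \frac{q_i^{(k)}}{x_i - L_i} \right),
\qquad
p_i^{(k)} = \bigl(U_i - x_i^{(k)}\bigr)^2 \max\!\left( \frac{\partial f}{\partial x_i}(x^{(k)}),\, 0 \right),
\qquad
q_i^{(k)} = -\bigl(x_i^{(k)} - L_i\bigr)^2 \min\!\left( \frac{\partial f}{\partial x_i}(x^{(k)}),\, 0 \right),
```

with r^(k) fixed by matching f(x^(k)). Each term depends on a single x_i and is convex on L_i < x_i < U_i, which is why the Hessian of the approximation is diagonal (implementations typically add a small positive term to keep it strictly positive definite).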
Thanks, that's also what I surmised from the paper. My point is more a meta-objection: any non-Newton numerical method of the form x_{n+1} = f(x_n) with a fixed f is asymptotically suboptimal by nature, because it ends up converging at a predictable pace (which could, e.g., be extrapolated). Typically this shows up as a very regular linear convergence in semilog convergence plots. Efficient methods always try to learn something from the history (curvature, conjugacy…). The black-box version of this is Anderson acceleration, which is super efficient on problems with regular convergence patterns.
So I was wondering whether this method was missing an extra step of trickery to make it converge faster. But maybe it’s not applicable because the active set of constraints keeps changing or something. OTOH I see in the paper that they solve the dual problem using CG, so maybe that’s my answer. Or maybe it’s geared towards problems that are more “LP-like” (ie relatively simple objective and constraints, but with non-trivial intersections) than “general unconstrained optimization-like” (complicated objective function but simple solution set).
In general my problem is that the literature on unconstrained optimization and on constrained optimization feel like two completely different fields with orthogonal concerns, and I’m not sure why.
My experience has been that sometimes when the problem is nasty enough, learning less can be more robust and more exploratory. Bad Hessian approximations can do more harm than good and good Hessian approximations can also sometimes do more harm than good because you converge too quickly to a local minimum or a locally infeasible solution. A good line search is in my experience sometimes more important than fancy search directions, e.g. CG or l-BFGS, in non-convex optimisation. Of course, if your problem is nice enough at least locally, learning more can be beneficial but I wouldn’t discard the simple algorithm just because it is simple. For quasi-Newton and Newton style algorithms, there are Ipopt and AugLag/Percival. But MMA is also generally quite fast and robust in my applications.
OTOH I see in the paper that they solve the dual problem using CG, so maybe that’s my answer.
So whenever vanilla gradient descent is used, one could use an accelerated version of it instead; is that your point?
This isn’t where most of the time is usually spent though. In my domain, it is safe to assume that most of the time is spent evaluating the function and its gradient. This can involve a heavy physics simulation, linear system solve, eigenvalue problem, etc. Solving the dual problem of the approximation costs close to nothing compared to that so the choice of the algorithm there is not super important as long as it converges robustly. No function evaluations happen there. The function and gradient evaluations happen when constructing the approximation. I found GradientDescent works better than CG and LBFGS for some reason (perhaps an Optim bug) but the point is it doesn’t really matter because this isn’t the bottleneck of MMA.
OK, good point about the dual problem not needing function evaluation.
I think you’re right, it mostly depends on the type of problems. If you have an optimization problem which is unconstrained and weakly nonlinear (so that a quadratic model is appropriate after an initial stage) but might be ill-conditioned, then you definitely want fancy search directions (that’s where I’m coming from). If you have a constrained optimization problem where the geometry of the feasible set plays a major role, then you want simpler LP-like linear approximations and don’t particularly care about speeding up the phase where the active set is located and the problem looks like a nice quadratic linearly constrained problem. That makes sense. Still might be worth plugging in more sophisticated search directions / Anderson accelerating it at the end stages if the user wants a good accuracy though.
This is a very important point. Also, if the problem has varying curvature and high dimension, then a Hessian itself has a lot of information that takes quite a few steps to “learn” well (ie obtain a reasonable approximation) so methods like BFGS will inevitably be working with an approximation that is outdated for the current iteration.
I am very happy to see a native Julia implementation of the MMA methods.
Ages ago I proposed this to @pkofod as something to think about for Optim, once we hit ChainRules 1.0, so that an rrule would take the place of passing g! to optimize.
It is cool to see this actually being done.
You might be interested in the discussion the JuMP team is having about revamping their nonlinear support.
CCSA/MMA is not really a fixed iteration f. There is a curvature term that is updated on each step and can essentially be thought of as a crude (diagonal) Hessian approximation, not to mention the trust region that is updated on each step. (CCSA is somewhat unusual in that it has a hypercube trust region with an adaptive per-variable diameter, rather than an isotropic diameter like the more common spherical trust region.)
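For reference, the simplest member of the family (what NLopt calls CCSA_QUADRATIC) replaces the fractional MMA terms with a linear model plus a separable quadratic penalty, roughly:

```latex
g^{(k)}(x) = f(x^{(k)}) + \nabla f(x^{(k)})^{T} \bigl(x - x^{(k)}\bigr)
           + \frac{\rho^{(k)}}{2} \sum_i \left( \frac{x_i - x_i^{(k)}}{\sigma_i^{(k)}} \right)^{2},
```

where ρ^(k) > 0 is the crude curvature estimate updated on each step and the per-variable σ_i^(k) are the hypercube trust-region half-widths mentioned above.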
(Also, realize that nonlinearly constrained optimization is something of a blend of optimization and root-finding — satisfying an active constraint is more like a root-finding problem, so first-order methods are more like Newton steps when it comes to the constraints. That's why NLP BFGS methods typically use BFGS only for the objective and not for the constraints, a la SQP.)
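Concretely, in a textbook SQP subproblem the BFGS model enters only through the quadratic term, while the constraints appear via their linearization (standard form, not specific to any package mentioned here):

```latex
\min_{d}\; \nabla f(x^{(k)})^{T} d + \tfrac{1}{2}\, d^{T} B^{(k)} d
\quad \text{s.t.} \quad g_i(x^{(k)}) + \nabla g_i(x^{(k)})^{T} d \le 0,
```

so for the (nearly) active constraints the step behaves like a Newton step on g_i(x) = 0, with curvature being learned only for the objective (or Lagrangian) model B^(k).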
Another nice property of CCSA is that it produces a sequence of strictly feasible iterates in its outer iterations, so you can stop it early (in many problems only a low accuracy is needed) while still satisfying the constraints. (Its main limitation is that it supports only nonlinear inequality constraints, not nonlinear equality constraints—the feasible set must have a non-empty interior.)
I am aware of the JuMP team's discussions. JuMP/MOI is a huge project though, so any re-design goes through long vetting and discussion processes. I like to move fast and often break things in Nonconvex. I plan to use some of MOI's constructs when generalising Nonconvex to handle convex and conic constraints efficiently, e.g. through generalised augmented Lagrangian, generalised interior point / barrier methods or proximal algorithms. Also I want to convert back and forth between MOI models to allow for user-friendly definitions of linear constraints (one of JuMP's main strengths) as well as using MINLP solvers like Juniper and Pavito which only have a JuMP/MOI API. I hope that I can contribute something to MOI/JuMP one day but for now, Nonconvex is my little pet project which I mostly use in TopOpt.jl. I am also not necessarily going to shy away from re-inventing some of GalacticOptim, JuliaSmoothOptimizers, or other packages along the way if I find it useful. Ideally, we should figure out a way to synergise, but features are more important than where they live in my opinion. Once the features are there, I am happy to move some code around if people want. Let's see.
Thanks everybody, that helps a lot, sorry for the tangent. So MMA is less efficient at learning the second-order geometry than standard unconstrained optimization algorithms, but that doesn’t really matter because it’s not really the main difficulty MMA aims to solve.
This is a very important point. Also, if the problem has varying curvature and high dimension, then a Hessian itself has a lot of information that takes quite a few steps to “learn” well (ie obtain a reasonable approximation) so methods like BFGS will inevitably be working with an approximation that is outdated for the current iteration.
I’m not sure I buy this argument: BFGS basically always outperforms steepest descent even on high dimensional, highly nonlinear unconstrained problems (although I agree it’s usually not spectacularly efficient during the initial stages). Learning something is better than learning nothing, if you make sure not to trust too much the data.
(Its main limitation is that it supports only nonlinear inequality constraints, not nonlinear equality constraints—the feasible set must have a non-empty interior.)
Thanks, I missed that (the standard trick people use is to reformulate f(x)=0 as f(x) >= 0 and f(x) <= 0, but I guess the algorithm won’t like that)
Numerically, the “standard” trick is to reparametrize your problem so that you get rid of one variable for each equality constraint. If all else fails and you cannot do this analytically, use a numerical rootfinder for it.
No, that won’t work. There needs to be a strictly feasible point.
A workaround is to use an epsilon relaxation to create a non-empty interior, e.g. replacing f(x) = 0 by -eps <= f(x) <= eps, and take eps to 0 in a continuation scheme.
That trick only works in proofs
Numerically, the “standard” trick is to reparametrize your problem so that you get rid of one variable for each equality constraint. If all else fails and you cannot do this analytically, use a numerical rootfinder for it.
While MMA can't handle equality constraints in general, linear equality constraints can in theory be handled using the re-parameterisation trick you mentioned, which for the linear case means you optimise wrt the coordinates of the nullspace. But you need to add new inequality constraints in place of the original variable bounds. Automating the re-parameterisation for the nonlinear case is more difficult though, because the root-finding algorithm may fail to find a root when the equality constraints cannot be satisfied for the given values of the free variables. If multiple roots exist, it also gets complicated because the "root function" is no longer a proper function.
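A minimal sketch of that nullspace reparameterisation for the linear case (plain LinearAlgebra; the names are illustrative, not a Nonconvex API):

```julia
using LinearAlgebra

# Reparameterise the feasible set of A*x == b.
A = [1.0 1.0 1.0]           # single constraint: x₁ + x₂ + x₃ = 1
b = [1.0]
x_part = A \ b              # any particular solution
N = nullspace(A)            # basis for ker(A); here a 3×2 matrix

# Every feasible point is x = x_part + N*z, so an objective f(x) becomes
# an equality-constraint-free objective in z; the original bounds
# l ≤ x ≤ u turn into linear *inequality* constraints on z.
f(x) = sum(abs2, x .- [0.2, 0.3, 0.5])
f_z(z) = f(x_part .+ N * z)
```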
I don't know enough about MMA to know if it could work there, but for steepest descent with linear constraints, you can simply project the search direction onto the nullspace, so you don't really need to make the change of variables explicitly. Following this idea to nonlinear constraints leads to Riemannian manifolds (there are a few packages supporting this in Julia), where you do this locally and use a retraction to go back to a feasible point. That works out fine even with more sophisticated L-BFGS-type methods, as long as the constraint is non-degenerate.
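For the linear case the projection is just the orthogonal projector onto ker(A); a minimal sketch (illustrative, not tied to any particular package):

```julia
using LinearAlgebra

A = [1.0 1.0 1.0]                         # linear constraint A*x == b
proj(d) = d .- A' * ((A * A') \ (A * d))  # orthogonal projection onto ker(A)

∇f = [0.4, -0.2, 0.1]
d = -proj(∇f)   # projected steepest-descent direction; satisfies A*d ≈ 0
```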