Lovely. JuliaPackageComparisons is nowhere close to what I want it to be, but quality content like this is exactly what is needed to get it there!
Just a note to say I’ve updated the table and graphs in the first post of this thread. They now include only those solvers that I was able to add to the benchmark code.
If you’d like to see a specific solver added to the table, please first add it to the benchmark.
What is a good, simple-to-use solver package for the case where a single cost function evaluation takes about 100 minutes? I know this depends on the function itself, but I don't know the function well yet… Should I start with BlackBoxOptim.jl?
Depends a lot on your function. How many parameters does it have? Is it differentiable? (If so, it’s ideal if you can compute derivatives, but even if not there are derivative-free algorithms that internally exploit the existence of derivatives.) What kind of constraints?
The Rosenbrock function has a global minimum of 0. You may want to include another variant of the Rosenbrock function, say something like the modified Rosenbrock problem with residuals

$$r(x) = \begin{pmatrix} 10\,(x_2 - x_1^2) \\ 1 - x_1 \\ \lambda \end{pmatrix},$$

which I got from “Methods for Non-Linear Least Squares Problems” by K. Madsen, H.B. Nielsen, and O. Tingleff (link: http://www2.imm.dtu.dk/pubdb/edoc/imm3215.pdf). Why? Because LM and some other NLS-specific algorithms can have poor convergence properties when the residual at the optimum is significantly different from 0. You can control the final residual by varying λ.
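In code, those residuals are just the following (my own snippet, not code from the benchmark repo); the minimiser stays at (1, 1), and the optimal cost is λ², so λ directly sets the residual at the optimum:

```julia
# Modified Rosenbrock residuals: the constant extra residual λ shifts the optimal
# cost away from zero without moving the minimiser from (1, 1).
residuals(x, λ) = [10.0 * (x[2] - x[1]^2), 1.0 - x[1], λ]
cost(x, λ) = sum(abs2, residuals(x, λ))

cost([1.0, 1.0], 0.0)   # 0.0  -> zero residual at the optimum
cost([1.0, 1.0], 5.0)   # 25.0 -> optimal cost is λ², residual norm is λ
```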
If you’re using this to solve nonlinear equations, you’re fine, because the residual will be 0 at the solution (convergence rates emerge from local properties, so it’s the residual at the solution that matters); otherwise you should maybe consider some curve-fitting (to data) examples as well. However, using nonlinear least squares to solve systems of equations has its own set of issues, because local solutions to the nonlinear least squares problem do not necessarily have zero residuals, and as such they may not solve the underlying nonlinear equations you’re trying to solve.
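A tiny, contrived illustration of that pitfall (my own example, not from any of the packages discussed here): a local method applied to the least-squares formulation can land on a stationary point with a nonzero residual, which is therefore not a root of the original equation.

```julia
# Trying to solve r(x) = 0 by minimising F(x) = r(x)^2 / 2.
r(x)  = x^3 - 2x + 2          # the only real root is near x ≈ -1.77
F(x)  = r(x)^2 / 2
dF(x) = r(x) * (3x^2 - 2)     # F'(x) = r(x) * r'(x)

# Simple gradient descent on F, to find the nearest local minimum.
function descend(x; steps = 1000, η = 0.05)
    for _ in 1:steps
        x -= η * dF(x)
    end
    return x
end

x = descend(1.0)
x, r(x)   # x ≈ 0.816, r(x) ≈ 0.91: a local minimum of F that does not solve r(x) = 0
```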
For several reasons, including the non-zero-residual issue above, I personally have had good experience with not using LM for NLS. There are many other algorithms out there for nonlinear optimization that can be more robust and often converge quite quickly. LM was an excellent invention that brought a trust-region-like approach to NLS, but I’m actually not sure that LM is still seen as a leading algorithm in 2023. Looking at Ceres, it seems like they have reached the same conclusion, since they include nonlinear CG, BFGS, etc. as well.
I never got around to reimplementing LM for NLSolvers.jl, but my point is just that “any optimizer” is an “NLLS/NLS solver”. Can you exploit structure? Yes. Is it necessarily the most important factor when picking a solver? Not sure. I’m not an expert on this topic, but the JuliaSmoothOptimizers people have their own solver and have actually published in the field, so maybe they have more to say. The scale of the problem also plays a role.
Thanks for this. Very interesting. I am sceptical about the claim, and have studied the reference you gave (which I’m familiar with). I really don’t understand how they can claim this. The quadratic problem at each step will be identical (given the same x), so the step taken can only be affected by the value of the damping factor. The damping factor update will only be impacted by the offset insofar as the offset reduces the accuracy of the computed change in cost, due to floating-point round-off. I would expect this to have minimal impact.

I’ve written a gist here to test the claim, and it does indeed show that the non-zero residual makes almost no difference. Most of the steps taken are identical; then, very close to the minimum, the damping factor of the augmented problem drops faster, so that optimizer actually converges faster. Here’s the output (I’ve adjusted the cost of the augmented problem in the graph, so the two costs are directly comparable):
The code is an interactive plot, allowing you to select different starting points in the state space.
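For the record, a stripped-down sketch of the same experiment (this is not the gist, just a bare-bones LM loop with a multiplicative damping update, written for this post) looks something like this:

```julia
using LinearAlgebra

residuals(x, λ) = [10.0 * (x[2] - x[1]^2), 1.0 - x[1], λ]   # Rosenbrock + constant residual
jacobian(x)     = [-20x[1] 10.0; -1.0 0.0; 0.0 0.0]         # λ is constant, so its row is zero

function lm(x, λ; iters = 100, μ = 1e-3)
    history = [copy(x)]
    for _ in 1:iters
        r, J = residuals(x, λ), jacobian(x)
        δ = -((J' * J + μ * I) \ (J' * r))       # damped Gauss-Newton step
        if sum(abs2, residuals(x + δ, λ)) < sum(abs2, r)
            x = x + δ                            # accept: relax damping
            μ /= 3
        else
            μ *= 3                               # reject: increase damping
        end
        push!(history, copy(x))
    end
    return history
end

steps0 = lm([-1.2, 1.0], 0.0)        # zero residual at the optimum
steps5 = lm([-1.2, 1.0], 5.0)        # residual of 5 at the optimum
maximum(norm.(steps0 .- steps5))     # tiny: the two runs take (almost) the same steps
```

Since the Jacobian row of the constant residual is zero, λ only enters through a constant λ² term in the cost, so in exact arithmetic the two runs are step-for-step identical; any difference is pure floating-point round-off, which is exactly the point made above.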
I think these solvers are included in Ceres more so that people can compare, rather than because they’re better. For the types of problems that people tend to solve with Ceres (visual geometry problems), which do end up with non-zero residuals, LM is usually the best. See Bundle Adjustment - A Modern Synthesis.
Yes. But they don’t exploit Gauss’ approximation to the Hessian. NLLS-specific solvers do, such that:
- The approximate Hessian can be computed using only the first derivatives of the residuals (as opposed to twice differentiating the squared cost function). It’s simpler and more efficient.
- The approximate Hessian is positive (semi-)definite, so the Gauss-Newton step is always a minimum of the quadratic approximation.
This is why (I believe) general solvers tend to have worse performance. But I’m very happy to be proved wrong. I’d love to find a better solver than LM.
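For reference, the approximation in question is the standard Gauss-Newton one: for a least-squares cost

$$F(x) = \tfrac{1}{2}\,\lVert r(x)\rVert^2, \qquad \nabla F(x) = J(x)^\top r(x), \qquad \nabla^2 F(x) = J(x)^\top J(x) + \sum_i r_i(x)\,\nabla^2 r_i(x),$$

where $J$ is the Jacobian of the residuals $r$. NLLS-specific solvers keep only the $J^\top J$ term, which requires first derivatives only and is positive semi-definite by construction; the dropped term is what becomes significant when the residuals at the solution are large.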
Your point about the specific example seems quite valid. I agree.
Perhaps it’s different if the non-zero residuals actually depend on the variables. Not sure. But I find in practice, on real problems with non-zero residuals, that LM works very well.
LM can work while being suboptimal. But yes, I took the example because it was easy; in the examples I had in mind, the residual would be nonzero in a slightly more complicated manner. But I’m happy to be proven wrong.
@TheLateKronos There’s a PR waiting to be merged.
Regarding non-Euclidean variables (especially those from non-Hadamard manifolds like the sphere or SO(3)), I think it is important to note that there are two standard ways of handling the issue of not being able to cover the entire manifold with one chart. The most common one is using retractions, vector transports and bases of tangent spaces. This is what Manopt.jl does. The other one is chart switching, which is easier to integrate with generic Euclidean solvers, and is the way Manifolds.jl handles integrating geodesics. I don’t see NLLSsolver.jl doing either?
Unfortunately I’m not familiar with these terms.
In NLLSsolver, each variable type has an `update` function implemented that essentially takes a Euclidean vector in the tangent space of the current variable state (i.e. passing in a vector of zeros for the update vector will return the current variable state), and updates the variable state accordingly. This generally involves a projection onto the manifold. Perhaps that’s what you mean by retraction? At each iteration, the Euclidean space that the linear solver works in is updated to the current tangent space of each variable - perhaps this is what you mean by chart switching?
To summarize, the `update` function takes in a variable type and a Euclidean vector in the current tangent space of the variable, and outputs a suitably updated variable. NLLSsolver is agnostic to what the update method does under the hood, or indeed anything else about the variable.
My implementation for SO(3) matrices can be seen here. The `rodrigues` method computes the exponential map from $\mathfrak{so}(3)$ to $SO(3)$.
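Roughly, it does the following (a simplified sketch rather than the actual code; the right-multiplication convention in `update` below is just for illustration):

```julia
using LinearAlgebra

# Exponential map so(3) -> SO(3) via the Rodrigues formula:
# ω is an axis-angle 3-vector, K its skew-symmetric (cross-product) matrix.
function rodrigues(ω::AbstractVector)
    θ = norm(ω)
    θ < eps() && return Matrix{Float64}(I, 3, 3)            # exp(0) is the identity
    K = [0.0 -ω[3] ω[2]; ω[3] 0.0 -ω[1]; -ω[2] ω[1] 0.0]
    return I + (sin(θ) / θ) * K + ((1 - cos(θ)) / θ^2) * K^2
end

# An update for an SO(3) variable composes the current rotation with the
# exponential of the tangent-space step.
update(R::AbstractMatrix, step::AbstractVector) = R * rodrigues(step)
```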
I think it would be easy to use any Manifolds.jl type, simply by implementing this `update` function, plus an `nvars` function that returns the dimensionality of the tangent space. Is there an abstract Manifolds type for which it can be defined?
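If Manifolds.jl exposes such a type, I'd expect it to be `AbstractManifold` (from ManifoldsBase.jl, which Manifolds.jl builds on). An untested sketch of what the glue might look like - the wrapper type, the orthonormal-basis choice and the use of the default retraction are all my assumptions, not existing NLLSsolver.jl code:

```julia
using Manifolds   # re-exports AbstractManifold, retract, get_vector, ... from ManifoldsBase.jl

# Hypothetical wrapper: a variable carrying both the manifold and the current point on it.
struct ManifoldVar{M<:AbstractManifold,P}
    manifold::M
    point::P
end

# Dimensionality of the tangent space, as required by the proposed nvars interface.
nvars(v::ManifoldVar) = manifold_dimension(v.manifold)

# Map the flat step vector to a tangent vector via a basis, then retract back onto the manifold.
function update(v::ManifoldVar, step::AbstractVector)
    X = get_vector(v.manifold, v.point, step, DefaultOrthonormalBasis())
    return ManifoldVar(v.manifold, retract(v.manifold, v.point, X))
end

v = ManifoldVar(Sphere(2), [0.0, 0.0, 1.0])   # e.g. a point on the unit sphere S²
update(v, [0.1, -0.2])                        # the result stays on the sphere
```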
Yes, `update` is similar to retraction.
No, chart switching is different. It sounds more similar to the retraction-based approach. SO(3) is not an ideal example here, because you can just use the Lie algebra, which simplifies things, but in general you can’t assume that a tangent vector at one point can just be used at another one. Vector transport is what adapts tangent vectors between points. Very often just doing a projection is enough (that’s what Optim.jl does), but Manopt.jl is more generic.
Unless I’ve misunderstood what you’re saying, my solver doesn’t transport tangent vectors (Euclidean vectors in the tangent space?) between different points/tangent spaces. But I’m very happy to engage in a private chat to get to the bottom of this.
I see, NLLSsolver.jl’s algorithms just don’t require vector transport. Thanks for the clarification.
Correct. NLLSsolver never uses information computed in one tangent space to compute an update step in another tangent space (indeed, I’m aware of the problem associated with doing this). Hence vector transport is never required.
The comparison I published in the original post is now available on juliapackagecomparisons.github.io.
Thanks, @TheLateKronos, for creating that resource. I hope it takes off.
The comparison has moved to here.