From this JuliaCon talk Fast and Robust Least Squares / Curve Fitting by @ChrisRackauckas, I gathered that if you have a L2 loss function then you should be using NonlinearLeastSquaresProblem from the NonlinearSolve.jl package.
Indeed, it should be faster and more numerically stable than optimization methods that do not use the full Jacobian information. Is my understanding correct?
Hence my question: can we use NonlinearLeastSquaresProblem for training neural networks on regression tasks where the loss is an MSE? In other words, can we use NonlinearLeastSquaresProblem instead of OptimizationProblem from Optimization.jl for training neural networks with an MSE loss?
That is to say, exit Adam, SGD, LBFGS and just use Levenberg-Marquardt and Gauss-Newton?
In my use case I have a neural ODE model where the loss is the MSE between the predicted and observed time series.
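For concreteness, here is a sketch of how I understand an MSE fit maps onto a NonlinearLeastSquaresProblem: instead of returning the scalar MSE, you return the vector of residuals and let the solver exploit the least-squares structure. (`predict`, `p0`, `ts`, and `ys` are placeholders for the model and data, not real API names.)

```julia
using NonlinearSolve

# Residual function: return the *vector* of residuals y_pred - y_obs,
# not the scalar MSE — the solver exploits the least-squares structure.
function residuals(p, data)
    ts, ys = data   # observation times and values (placeholders)
    return [predict(t, p) - y for (t, y) in zip(ts, ys)]  # `predict` is the model
end

prob = NonlinearLeastSquaresProblem(residuals, p0, (ts, ys))
sol = solve(prob, LevenbergMarquardt())   # or GaussNewton(), TrustRegion()
```

Minimizing the MSE and minimizing the residual 2-norm have the same minimizer, so this should be an equivalent formulation of the problem.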
On this example from the documentation, it is indeed about 6× faster and reaches better objective values.
On my real cases with neural ODEs, I am not yet sure I see such improvements, but I am still testing and trying to understand the best way to use it (Levenberg–Marquardt, Gauss–Newton, TrustRegion?). I have a few coupled ODEs (1 to 5) with small neural networks (~400 parameters).
I have some questions:
- Can one use a callback function to compute metrics during the optimization and save parameters? Is it possible to do so with NonlinearSolve? From the documentation it is not clear.
- Do you have recommendations for neural ODEs specifically? Is it better to use Levenberg–Marquardt, Gauss–Newton, or TrustRegion? And which autodiff method? I am currently using the same as the tutorial.
Yes it is definitely underused for this kind of thing.
Not at this time, but we should add a form of callbacks.
It can depend highly on many things. But TrustRegion is generally pretty good, and the right autodiff depends on the shape of the Jacobian: if it's wide you want reverse mode, if it's tall you want forward mode. So it comes down to the number of data points vs. the number of states, where data points >> states means you want forward mode, otherwise reverse mode.
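If I understand the NonlinearSolve.jl API correctly (please correct me if this has changed), the backend can be selected by passing an ADTypes choice to the solver, along the lines of:

```julia
using NonlinearSolve, ADTypes

# Tall Jacobian (data points ≫ parameters): forward mode
sol = solve(prob, TrustRegion(; autodiff = AutoForwardDiff()))

# Wide Jacobian (parameters ≫ data points): reverse mode, e.g. via Zygote
# sol = solve(prob, TrustRegion(; autodiff = AutoZygote()))
```

Here `prob` is an already-constructed NonlinearLeastSquaresProblem; `AutoZygote()` additionally requires Zygote to be loaded.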
One note of caution: if you are minimizing g(\vec{x}) = \Vert \vec{f}(\vec{x}) \Vert_2 for a function \vec{f}: \mathbb{R}^n \mapsto \mathbb{R}^m (n inputs and m outputs, the latter fed into an L2 norm), then Levenberg–Marquardt and similar methods that require the full Jacobian of \vec{f} require more effort to compute the derivatives than methods based on gradient descent that only require \nabla g.
Using reverse-mode/back-propagation/adjoint methods, you can compute \nabla g in roughly the cost of computing g (via \vec{f}) once. On the other hand, computing the full m \times n Jacobian of \vec{f} is comparable to computing \vec{f} about min(m,n) times in general (when m is smaller you use reverse mode, and when n is smaller you use forward mode, as @ChrisRackauckas mentioned above). So, it may be impractical if both m and n are large, unless the Jacobian has some special structure you can exploit.
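To spell out why the gradient is so much cheaper (using the squared norm, which has the same minimizer):

```latex
% Gradient of the squared-norm loss via a single vector-Jacobian product
g(\vec{x}) = \Vert \vec{f}(\vec{x}) \Vert_2^2 = \vec{f}(\vec{x})^T \vec{f}(\vec{x})
\quad\Longrightarrow\quad
\nabla g = 2 J^T \vec{f}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}.
```

One reverse-mode pass computes the product \(J^T \vec{v}\) for the single vector \(\vec{v} = 2\vec{f}\), at roughly the cost of one evaluation of \(\vec{f}\), whereas assembling all of \(J\) requires about \(\min(m,n)\) such passes.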
This is probably why machine learning typically uses gradient-descent methods rather than Jacobian-based methods even for least-squares loss functions? (There is also the issue that ML typically only randomly samples a small batch of training data on each step, see below.)
(Levenberg–Marquardt is popular and well-known in highly overdetermined nonlinear least-squares problems where you only have a few parameters, i.e. small n, in which case the Jacobian is not too costly — that’s what it was originally designed for.)
I should also note that if m \ll n, i.e. you have many fewer outputs than parameters (like a deep neural network with a zillion parameters where you are only trying to predict a small batch of training data on each step), then the problem is highly underdetermined. In this case the traditional Levenberg–Marquardt and Gauss–Newton methods are ill-suited — typical implementations form a huge n \times n matrix J^T J from the m \times n Jacobian J, which is expensive and rank-deficient (you avoid forming this explicitly with pivoted QR-based methods, but for L–M methods you want J^T J + \lambda I or its equivalent and you still end up with a large O(n^2) matrix somewhere if you’re not careful … you can avoid this with the SVD of J, but this seems to be rarely used in L–M methods, and it’s overkill for typical underdetermined applications where J^T J + \lambda I is well-conditioned and SPD thanks to the regularization). Of course, in an underdetermined problem there may be many solutions, but one might often want a solution close to the starting guess, and there may be no exact solutions due to the nonlinearity or due to constraints (e.g. bound constraints on \vec{x}).
It turns out that one can devise a variant of Levenberg–Marquardt that works well in such cases, based on the obvious idea of finding a regularized minimum-norm solution to the linearized underdetermined problem at each step, so that you use the small (m \times m), full-rank matrix J J^T instead (you can again use LQ to avoid squaring the condition number, but once you add the L–M regularization that is less of a concern). We described this algorithm in Chen et al. (2026), appendix B, and I wouldn’t be surprised if other authors have stumbled across the same technique too.
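A minimal sketch of the core idea in my notation (not the paper's exact algorithm, which also handles constraints and step control):

```julia
using LinearAlgebra

# One step of the underdetermined Levenberg–Marquardt idea:
# regularized minimum-norm step  δ = -Jᵀ (J Jᵀ + λI)⁻¹ f,
# which only ever forms the small m×m matrix J Jᵀ when m ≪ n,
# instead of the huge n×n matrix Jᵀ J + λI of the classical method.
function lm_step_underdetermined(J::AbstractMatrix, f::AbstractVector, λ::Real)
    return -(J' * ((J * J' + λ * I) \ f))
end
```

As \lambda \to 0 this recovers the minimum-norm Gauss–Newton step (a solution close to the current iterate), and for \lambda > 0 the m \times m system is SPD and well-conditioned.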
(Our application was for physics-driven topology optimization via Gridap.jl, not data-driven machine learning, but the underlying principles are generic. However, for stochastic optimization with sampled training data there is probably more work to be done to prove convergence?)
It might be nice to add this underdetermined L–M variant as an option to one of the optimization packages, not sure which one, I guess NonlinearSolve.jl? Right now our code for the paper is in a LeastSquaresOptim.jl fork at Inverse-Design-of-Resonances/LeastSquaresOptim/levenberg_marquardt.jl at main · aristeidis-karalis/Inverse-Design-of-Resonances · GitHub … I filed an issue: Levenberg–Marquardt variant for underdetermined least-square problems? · Issue #851 · SciML/NonlinearSolve.jl · GitHub
In this regime there seem to be some papers on stochastic L–M variants, e.g. Bergou (2021), but I don’t know enough about these methods to comment on them right now.
Is there a resource you could recommend to get a deeper insight on the topics you mention here?
Which topics, specifically?
Thank you both for your very interesting answers, it really helps!
AD for scientific ML, first order vs second order, overdetermined vs underdetermined.
Those are three very different topics! Gil Strang’s book Linear Algebra and Learning from Data is a nice general survey of many topics in this area, though: Linear Algebra and Learning from Data along with the accompanying OCW lectures: Matrix Methods in Data Analysis, Signal Processing, and Machine Learning | Mathematics | MIT OpenCourseWare
On AD and forward vs. reverse mode, there are many articles and books. We have a short course and course notes that cover some of these topics from a sophomore linear-algebra perspective.
Let’s add it and see where it goes.
Yeah, though for lots of things I’ve been finding first-order gradient-based methods can be quite unstable in comparison. Even a Newton–Krylov approach can keep things tame and get much better results, depending on the problem of course.
Though I’ll say, my main point was that NonlinearLeastSquares methods are heavily underutilized, but just like gradient-based optimizers, they aren’t a silver bullet either. They should always be considered when doing an L2 fit, but there are scenarios where they can be beaten. I would say you should always be thinking about them, and if you have an L2 loss, you should have a reason for going to a general optimizer instead of simply defaulting there.
Thank you very much! I am going through the matrixcalc course, what a fantastic resource.