I should also note that if m \ll n, i.e. you have many fewer outputs than parameters (like a deep neural network with a zillion parameters where you are only trying to predict a small batch of training data on each step), then the problem is highly underdetermined. In this case the traditional Levenberg–Marquardt and Gauss–Newton methods are ill-suited: typical implementations form a huge n \times n matrix J^T J from the m \times n Jacobian J, which is expensive and rank-deficient. (You can avoid forming J^T J explicitly with pivoted QR-based methods, but L–M methods want J^T J + \lambda I or its equivalent, and you still end up with a large O(n^2) matrix somewhere if you're not careful. You can avoid this with the SVD of J, but that seems to be rarely used in L–M implementations, and it's overkill for typical underdetermined applications where J^T J + \lambda I is well-conditioned and SPD thanks to the regularization.) Of course, an underdetermined problem may have many solutions, but one often wants a solution close to the starting guess, and there may be no exact solutions due to the nonlinearity or due to constraints (e.g. bound constraints on \vec{x}).
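To make the rank deficiency concrete, here is a minimal pure-Python sketch (the matrix entries are invented for illustration, not taken from any real problem): for an m \times n Jacobian with m < n, the n \times n matrix J^T J has rank at most m and is therefore singular, while J^T J + \lambda I is not.

```python
# A 2x3 Jacobian J (m = 2 outputs, n = 3 parameters) gives a 3x3
# matrix J^T J of rank at most 2, hence singular; the regularized
# J^T J + lambda*I is invertible. Numbers are made up for illustration.

J = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]  # m x n with m = 2, n = 3

def JtJ(J, lam=0.0):
    """Form J^T J + lam*I (an n x n matrix) from an m x n matrix J."""
    m, n = len(J), len(J[0])
    return [[sum(J[k][i] * J[k][j] for k in range(m))
             + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def det3(A):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (A[0][0] * (A[1][1]*A[2][2] - A[1][2]*A[2][1])
          - A[0][1] * (A[1][0]*A[2][2] - A[1][2]*A[2][0])
          + A[0][2] * (A[1][0]*A[2][1] - A[1][1]*A[2][0]))

print(det3(JtJ(J)))       # 0.0: rank <= m = 2, so J^T J is singular
print(det3(JtJ(J, 0.5)))  # > 0: the regularization restores full rank
```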
It turns out that one can devise a variant of Levenberg–Marquardt that works well in such cases, based on the obvious idea of finding a regularized minimum-norm solution to the linearized underdetermined problem at each step, so that you work with the small (m \times m), full-rank matrix J J^T instead (you can again use LQ to avoid squaring the condition number, but once you add the L–M regularization that is less of a concern). We described this algorithm in Chen et al. (2026), appendix B, and I wouldn’t be surprised if other authors have stumbled across the same technique too.
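A hedged pure-Python sketch of the idea (my own illustration, not the authors' actual code, with made-up numbers): instead of solving the n \times n system (J^T J + \lambda I) \delta x = -J^T r, solve the small m \times m system (J J^T + \lambda I) y = -r and set \delta x = J^T y. The two steps are algebraically identical, because J^T (J J^T + \lambda I)^{-1} = (J^T J + \lambda I)^{-1} J^T.

```python
# One damped step dx = J^T (J J^T + lam*I)^{-1} (-r), specialized to
# m = 2 so a tiny Cramer's-rule solve suffices for the sketch.
def lm_step_underdetermined(J, r, lam):
    m, n = len(J), len(J[0])
    # Form the small m x m matrix A = J J^T + lam*I (SPD for lam > 0).
    A = [[sum(J[i][k] * J[j][k] for k in range(n))
          + (lam if i == j else 0.0) for j in range(m)] for i in range(m)]
    # Solve A y = -r.
    b = [-r[0], -r[1]]
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    y = [(b[0]*A[1][1] - A[0][1]*b[1]) / det,
         (A[0][0]*b[1] - b[0]*A[1][0]) / det]
    # Lift back to parameter space: dx = J^T y (length n).
    return [sum(J[i][j] * y[i] for i in range(m)) for j in range(n)]

J = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # illustrative 2x3 Jacobian
r = [1.0, -1.0]                          # current residual
lam = 0.5
dx = lm_step_underdetermined(J, r, lam)

# Check: dx satisfies the usual n x n normal equations
# (J^T J + lam*I) dx = -J^T r, without ever forming the n x n matrix.
resid = [sum(sum(J[k][i]*J[k][j] for k in range(2)) * dx[j] for j in range(3))
         + lam*dx[i] + sum(J[k][i]*r[k] for k in range(2)) for i in range(3)]
assert max(abs(v) for v in resid) < 1e-9
```

For larger m one would presumably replace the Cramer's-rule solve with a Cholesky factorization of J J^T + \lambda I, which is SPD for \lambda > 0.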
(Our application was for physics-driven topology optimization via Gridap.jl, not data-driven machine learning, but the underlying principles are generic. However, for stochastic optimization with sampled training data there is probably more work to be done to prove convergence?)
It might be nice to add this underdetermined L–M variant as an option to one of the optimization packages; I’m not sure which one, maybe NonlinearSolve.jl? Right now our code for the paper is in a LeastSquaresOptim.jl fork at Inverse-Design-of-Resonances/LeastSquaresOptim/levenberg_marquardt.jl at main · aristeidis-karalis/Inverse-Design-of-Resonances · GitHub … I filed an issue: Levenberg–Marquardt variant for underdetermined least-square problems? · Issue #851 · SciML/NonlinearSolve.jl · GitHub