How to solve a large quadratic programming problem

I’m curious as to the origin of the problem. Matrix K seems to be wide, i.e. few rows and many columns. So this indicates an underdetermined problem with far fewer equations than unknowns.

Is the problem really to find f from:

Kf = g

with the added condition that f\ge 0?


@geoffroyleconte and @odow, your RipQP implementations work very well indeed. Thanks a ton for writing these down!

I suppose “solution” credit should go to @dpo for being the first to recommend using RipQP.

That’ll do for now, but I do plan to benchmark the other options recommended in this thread, whenever I manage to find some time to study how to use the different packages. My goal is to write a package similar to RegularizationTools.jl to automatically deal with these problems, so having multiple solvers would be beneficial.

Precisely, this is a textbook case of Tikhonov regularization.
The condition number in the K example above has also been reduced a lot by SVD truncation during the preprocessing of the data.
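In case it helps others reading along, here is a minimal sketch of the kind of SVD-truncation preprocessing mentioned above; the threshold tol and the variable names are placeholders for illustration, not the actual preprocessing code:

```julia
using LinearAlgebra

# Sketch: compress K and g by truncating small singular values.
# `tol` is an arbitrary relative threshold chosen for illustration.
U, s, V = svd(K)
r = count(>(tol * s[1]), s)          # keep singular values above tol * σ_max

K_c = Diagonal(s[1:r]) * V[:, 1:r]'  # compressed kernel (r × d)
g_c = U[:, 1:r]' * g                 # compressed data (length r)

# Solving with (K_c, g_c) instead of (K, g) keeps the dominant information
# while dropping the directions responsible for the huge condition number.
```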

This method sounds very interesting, thank you for the recommendation. I will try to go through the documentation and see how that could work.

The problem arises in the processing of NMR relaxation data. Your suggestion makes sense, it is effectively just Kf=g, but this problem is very ill-posed, and slight perturbations in g (experimental noise in this case) would lead to dramatically different f results if we just solved the problem in the aforementioned “naive” way.
This is why we use Tikhonov regularization and introduce this \alpha term, to make the results more consistent and the resulting f distribution more “smooth”. Larger \alpha values result in smoother distributions which are less affected by the noise in g. Of course, if you increase \alpha too much you begin to lose a lot of information, so you need to strike a balance between ignoring noise and retaining actual information.
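For readers who want to experiment, here is a minimal JuMP sketch of the regularized nonnegative problem described above, \min_{f \ge 0} \lVert Kf - g \rVert^2 + \lVert \alpha f \rVert^2. The choice of Ipopt as the solver is only an example, and K, g and α are assumed to be defined already:

```julia
using JuMP
import Ipopt  # example solver; any QP-capable optimizer should work

d = size(K, 2)
model = Model(Ipopt.Optimizer)
@variable(model, f[1:d] >= 0)                       # nonnegativity constraint
@objective(model, Min,
    sum((K * f .- g) .^ 2) + α^2 * sum(f .^ 2))     # Tikhonov objective
optimize!(model)
f_reg = value.(f)
```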


Here’s an example that is very similar to yours:
https://jump.dev/DiffOpt.jl/stable/examples/autotuning-ridge/


Is there then a chance that for the optimal solution, Kf=g is not satisfied?

I would imagine with data like these, equality would never be satisfied.
You want to strike a balance between minimizing both terms ||Kf-g||^2 and ||\alpha f||^2. The L-curve method illustrates this goal nicely. If you’re interested in more information, here is a link to a YouTube playlist with some lectures on the subject, which I found very helpful.
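To make the L-curve idea concrete, here is a rough sketch of how one might trace it (ignoring the nonnegativity constraint for simplicity, and with an arbitrary α grid):

```julia
using LinearAlgebra

alphas = 10.0 .^ range(-4, 2, length = 30)   # arbitrary grid of α values
residual_norms = Float64[]
solution_norms = Float64[]

for α in alphas
    # Unconstrained Tikhonov solution via the normal equations
    # (fine as a sketch; not the recommended approach for a huge, ill-conditioned K).
    f_α = (K' * K + α^2 * I) \ (K' * g)
    push!(residual_norms, norm(K * f_α - g))
    push!(solution_norms, norm(f_α))
end

# Plot log(residual_norms) against log(solution_norms); the "corner" of the
# resulting L-shaped curve suggests a good trade-off value of α.
```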

No, it’s the other way around. Since the system is underdetermined (more variables than equations), we are actually in a high-dimensional setting.

The space of f’s that satisfy Kf = g exactly is large. Among them, you want to choose the one with the smallest L2 norm.

Actually, here you should obtain the same result whatever the value of \alpha.

You should actually implement the problem with Kf=g as an equality constraint and the loss being just \lVert f \rVert^2: this form of the QP is equivalent in your case (K wide), but will probably be much faster. No need for \alpha.
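A sketch of that equality-constrained form, again in JuMP with an example solver choice (assuming K and g are in scope):

```julia
using JuMP
import Ipopt  # example solver choice only

d = size(K, 2)
model = Model(Ipopt.Optimizer)
@variable(model, f[1:d] >= 0)        # keep the nonnegativity constraint
@constraint(model, K * f .== g)      # enforce the data fit exactly
@objective(model, Min, sum(f .^ 2))  # minimum-norm objective; no α needed
optimize!(model)
f_minnorm = value.(f)
```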

||Kf-g||^2 and ||f||^2 seem to be the things most people try to minimize.
I don’t see Kf=g as a constraint being used (maybe I’m not looking hard enough).
Perhaps there’s something I’m missing here, or I’m getting something wrong in the methodology (not unlikely).
https://www.sciencedirect.com/science/article/pii/S0377042700004143

Well, consider the sizes of things: since the matrix K has size (n, d) with d > n, the linear system of equations Kf = g will always have a solution (in fact, assuming K has full row rank, the space of solutions is (d - n)-dimensional).

So there is a way for the optimizer to make the first part of the loss zero, namely \lVert Kf - g\rVert^{2} = 0. Among these solutions, the optimizer simply has to pick the one with the smallest \lVert f \rVert, regardless of the value of \alpha.

Put differently, ridge regression does not handle high-dimensional cases as well as lasso regression would.


Edit: I forgot about the positivity constraint. Depending on the context, sometimes the positivity constraint is satisfied out of the box and you can just “check” for it at the end of the unconstrained optimization. Otherwise, what I just said no longer holds and you have to use the full QP and find the right hyperparameter \alpha. But since the unconstrained solve is orders of magnitude faster, I would still make this check first in my implementation, just in case :slight_smile:
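A sketch of that cheap check, using the explicit minimum-norm formula (it assumes K has full row rank, and the tolerance is an arbitrary choice):

```julia
using LinearAlgebra

# Minimum-norm solution of the underdetermined system K f = g
f0 = K' * ((K * K') \ g)

if all(f0 .>= -1e-10)        # arbitrary numerical tolerance
    f = max.(f0, 0)          # lucky case: positivity holds, clip tiny negatives
else
    # fall back to the full nonnegative QP (RipQP, JuMP, NNLS, ...)
end
```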


So, in principle, one could just say:

f0 = K\g

which produces the f with minimum norm. If f0 >= 0, then the solution is found (perhaps poorly conditioned…).

What if f0 >= 0 is not satisfied? Could one do:

M = nullspace(K)
f = f0 + M*z

and search for z such that f >= 0? This would reduce (slightly) the size of the problem?
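A rough sketch of that reduction, purely illustrative, using the explicit minimum-norm formula for the particular solution rather than K\g:

```julia
using LinearAlgebra

f0 = K' * ((K * K') \ g)   # a particular solution of K*f = g (minimum norm)
M  = nullspace(K)          # d × (d - n) orthonormal basis of null(K)

# Every f = f0 + M*z satisfies K*f = g up to round-off, so the remaining task
# is to find z with f0 + M*z .>= 0, e.g. by minimizing norm(f0 + M*z)
# subject to that constraint: a smaller QP in d - n variables.
```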

The fact that K is underdetermined does not mean that Kf = g is consistent. But if it indeed is, you should be able to remove the r variable from my formulation of the problem. You can also remove \alpha. RipQP will continue to apply. But if there’s a chance that Kf = g is inconsistent, or nearly inconsistent, the formulation with r is more robust.

I’m wondering if I missed a turn in this discussion and am maybe misinterpreting. But are you talking about finding a minimum-norm solution or finding the solution with Tikhonov regularization? In the former case \alpha is not involved. But with Tikhonov regularization, even neglecting nonnegativity constraints, you won’t generally get an exact solution to Kf = g. In fact, as \alpha becomes large, you can expect the Tikhonov solution f to approach zero. Given that the problem is ill-posed and K is ill-conditioned, seeking an exact solution seems likely to give bad results, even if it is an exact solution of minimum norm.


That’s what I meant when I asked if there is a chance that Kf=g is not satisfied. And I agree with you.


In case it is useful, I solve regularized small-scale nonnegative least-squares problems in my package DECAES.jl using this NNLS submodule which was forked from NonNegLeastSquares.jl some years ago.

To use this for your problem, you would pass A = [K; sqrt(alpha) * I] and b = [g; zeros(n)] where size(K) = (m, n). Given that your K has many more columns than rows this is probably inefficient for your use case, though.
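For concreteness, a sketch of building that stacked system and handing it to an NNLS solver; the exact solver call (nonneg_lsq from NonNegLeastSquares.jl, or the NNLS routine in DECAES) is mentioned elsewhere in this thread, so treat the invocation as an assumption:

```julia
using LinearAlgebra

m, n = size(K)
A = [K; sqrt(alpha) * Matrix(I, n, n)]   # (m + n) × n stacked matrix
b = [g; zeros(n)]                        # length m + n

# Solve min ||A*x - b||^2 subject to x >= 0, e.g. with
# NonNegLeastSquares.nonneg_lsq(A, b) or the NNLS submodule in DECAES.
```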

Also, slight aside, but I saw an interesting twitter thread about regularization when you have many more columns than rows. They reference a paper showing that the optimal regularization parameter can sometimes be zero (or even negative!) due to implicit regularization in the least-norm solution f = K' * ((K * K') \ g).


Thanks for the suggestion @jondeuce.
Interestingly enough, we are solving a very similar problem here.
However, if I understand correctly you are interested in multiple T_2 distributions (one for each point in the k-space I guess?) while I’m looking for T_1-T_2 correlation maps, which is the 2D analogue without any magnetic field gradients involved.

I installed DECAES.jl and tried DECAES.nnls(A,b), but it seems the size of b as defined in your comment will not work here. Any insights into this?
Also, would the nonneg_lsq() solver from NonNegLeastSquares.jl work in a similar way here?

The paper you cite for the negative regularization parameter is very interesting, I’ll definitely have a read at some point.

Ahh cool, yeah probably DECAES as a whole is related but not totally relevant to you.

I edited my previous comment to make the sizes more clear; if size(K) = (m, n) then the padded arrays should be size(A) = (m+n, n) and size(b) = (m+n,).


It works indeed! Although it is a lot slower than RipQP, unfortunately.

The problem is that this is a huge matrix when K is “wide”, so you sacrifice a lot of efficiency by working in terms of A.


Yeah I figured this would probably be prohibitively costly. Also that code is single-threaded and doesn’t use BLAS, so would need to be tweaked to be competitive for large problems in any case.

A sparse factorization will not care about the \sqrt{\alpha} I block, and it improves conditioning. It’s just that your vectors are longer, so a very small price to pay.
