Differentiating through a non-negative least squares solver?

Hi, I’m wondering if there’s a better way to get the gradient of a function like this one:

using NonNegLeastSquares, DifferentiationInterface
import FiniteDiff

function f(aa, b)
    A = reshape(aa, size(b, 1), :)  # rebuild the basis matrix from its flattened form
    x = nonneg_lsq(A, b)            # non-negative least squares solve
    return sum(abs2, b .- A * x)    # squared residual of the fit
end

A = rand(5, 3)
aa = vec(A)
B = rand(5, 10)
gradient(aa -> f(aa, B), AutoFiniteDiff(), aa)

This works with FiniteDiff, but not other backends, due to Dual/Float64 conversion issues.

My objective is to optimize a set of non-negative basis vectors (i.e. A), as part of a more complex optimization model. For instance, the real function would be something like:

function f1(u, b) 
    aa = complicated_computation(u)
    return f(aa, b)
end

Maybe ImplicitDifferentiation.jl would fit the bill?

Yes, it’s “just” implicit differentiation, which you can also easily implement manually if you know the principles. The only complication here is that, because you have inequality constraints, the optimality conditions (the KKT conditions) are more complicated than in unconstrained optimization (and there can be a discontinuity in the derivative at points where a constraint changes from inactive to active or vice versa).

But basically you can just treat it as a least-squares problem in which you’ve removed all the variables corresponding to the active constraints, i.e. those where x[i] == 0 (ideally you’d check that the gradient is actually pushing into these constraints, but it’s fairly unlikely that x[i] == 0 while the constraint is inactive). That is, if you do the following after solving for x = nonneg_lsq(A, b):

inactive = findall(!=(0), x)
Ai = A[:, inactive]

you should have x[inactive] ≈ Ai \ b and be able to apply AD (or manual differentiation) to Ai \ b.
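
In case it’s useful, here is a minimal sketch of that active-set trick with ForwardDiff. The name f_active_set is made up, and the reduced problem is solved through the normal equations rather than Ai \ b so that dual numbers go through generic linear algebra; treat it as an illustration, not a polished implementation.

using NonNegLeastSquares
using ForwardDiff: ForwardDiff

function f_active_set(aa, b)
    A = reshape(aa, size(b, 1), :)
    # Identify the active set on plain Float64 data; ForwardDiff.value is a no-op on ordinary floats.
    x = nonneg_lsq(ForwardDiff.value.(A), ForwardDiff.value.(b))
    inactive = findall(!=(0), vec(x))
    Ai = A[:, inactive]             # possibly dual-valued columns of the inactive variables
    xi = (Ai' * Ai) \ (Ai' * b)     # reduced least squares via the normal equations
    return sum(abs2, b .- Ai * xi)
end

# e.g. ForwardDiff.gradient(aa -> f_active_set(aa, b), aa) for a single right-hand side b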

If you have lots of parameters (> 100), I would use reverse mode rather than forward mode (dual numbers).

If you wanted to try your hand at implicit differentiation, here’s the gist. We’re solving \min_u \lVert A u - b \rVert^2 subject to -u \leq 0. Let’s associate a Lagrange multiplier \lambda \geq 0 with the non-negativity constraint, so that the Lagrangian reads \mathcal{L} = \lVert A u - b \rVert^2 + \lambda^\top (-u). The KKT optimality conditions are:

  • Lagrangian stationarity: \nabla_u \mathcal{L} = 2A^\top (Au - b) - \lambda = 0
  • Complementary slackness: \lambda \odot (-u) = 0
  • Primal and dual feasibility: -u \leq 0 and \lambda \geq 0

We can use the first two conditions (which are equalities) to compute derivatives by solving a linear system. This linear system is given by the implicit function theorem (written out below, after the definitions), but ImplicitDifferentiation.jl takes care of it automagically; the only thing you need to provide is the optimality conditions.

Let us denote by x = (A, b) the inputs and y = (u, \lambda) the outputs of our optimization procedure (we include the Lagrange multiplier because it is involved in the optimality conditions). We define the functions corresponding to

  • the solver f(x) = (u^\star, \lambda^\star) giving the optimal values
  • the conditions c(x, y) = \begin{pmatrix} 2A^\top (Au - b) - \lambda \\ \lambda \odot (-u) \end{pmatrix} = 0
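
For reference, this is just the standard implicit function theorem applied to these definitions (nothing package-specific): differentiating c(x, f(x)) = 0 with respect to x gives the linear system

\partial_x c(x, y^\star) + \partial_y c(x, y^\star) \, \partial f(x) = 0 \quad \Longrightarrow \quad \partial f(x) = -\left[ \partial_y c(x, y^\star) \right]^{-1} \partial_x c(x, y^\star)

which is what gets solved behind the scenes when you ask for derivatives.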

The only issue here is that you would need a primal-dual solver, which also returns the value of the optimal Lagrange multipliers. I don’t think NonNegLeastSquares.jl provides that, which is why DiffOpt.jl (which makes implicit differentiation even more streamlined) uses JuMP.jl to solve the underlying optimization problems.

Once you have the optimum (by any method), you can easily compute the Lagrange multipliers after the fact by solving a linear system (Lagrangian stationarity). Then you can plug \lambda into complementary slackness to have your implicit-function equations.

In this particular case, this is trivial: from your formulas above, you have \lambda = 2A^T (Au - b). Equivalently, you only have a single system of implicit equations \boxed{u \odot A^T (Au - b) = 0} (i.e., in every component, either \nabla \Vert Au - b\Vert^2 is zero or u is zero).
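
As a quick numerical sanity check of that boxed condition (just a sketch on random data; the tolerance is illustrative):

using NonNegLeastSquares

A = rand(5, 3)
b = rand(5)
u = vec(nonneg_lsq(A, b))
λ = 2 * (A' * (A * u - b))   # Lagrange multipliers from Lagrangian stationarity
maximum(abs, u .* λ)         # complementary slackness: expect ≈ 0
all(λ .>= -1e-8)             # dual feasibility, up to solver tolerance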

However, in this particular problem, as I said, you could simplify in a different way because effectively you can just delete the (zero) variables/columns where the constraints are active and treat it (locally) as OLS for differentiation purposes.

Indeed I had missed that part! So with that in mind, here’s a complete example using ImplicitDifferentiation.jl:

using NonNegLeastSquares, DifferentiationInterface, ImplicitDifferentiation
using ForwardDiff: ForwardDiff
using FiniteDiff: FiniteDiff
using Zygote: Zygote

function f(aa, b)
    A = reshape(aa, length(b), :)
    return nonneg_lsq(A, b)[:, 1]   # forward solver: the NNLS solution as a vector
end

function kkt(aa, u, b)
    A = reshape(aa, length(b), :)
    grad = A' * (A * u - b)   # gradient of the squared residual (up to a factor of 2)
    return u .* grad          # complementary slackness: zero at the optimum
end

implicit = ImplicitFunction(f, kkt)

A = rand(5, 3)
aa = vec(A)
b = rand(5)

implicit(aa, b) == f(aa, b)  # check that the wrapper returns the same primal solution

J = jacobian(aa -> f(aa, b), AutoFiniteDiff(), aa)
jacobian(aa -> f(aa, b), AutoZygote(), aa)  # error
jacobian(aa -> f(aa, b), AutoForwardDiff(), aa)  # error

J1 = jacobian(aa -> implicit(aa, b), AutoZygote(), aa)
J2 = jacobian(aa -> implicit(aa, b), AutoForwardDiff(), aa)

isapprox(J, J1; rtol=1e-3)  # true
isapprox(J, J2; rtol=1e-3)  # true

Another way to derive optimality conditions, which is sometimes simpler, is to formulate the fixed point equation associated with an optimization method like projected gradient descent. See here for an example and [2105.15183] Efficient and Modular Implicit Differentiation for more theoretical insights.
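
For this problem, such a fixed-point condition could look like the following sketch (pg_conditions is a made-up name, and the step size is arbitrary since we only need the residual to vanish at the optimum):

function pg_conditions(aa, u, b)
    A = reshape(aa, length(b), :)
    η = 1.0                          # any positive step size works here
    grad = 2 * (A' * (A * u - b))    # gradient of the squared residual
    # at the optimum, one projected gradient step leaves u unchanged
    return u .- max.(u .- η .* grad, 0)
end

# implicit_pg = ImplicitFunction(f, pg_conditions)  # drop-in alternative to the kkt conditions above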

Do you even need to differentiate f for that?

I would simply define a new objective function:

function f2(u, b, x) 
    aa = complicated_computation(u)
    A = reshape(aa, size(b, 1), :)
    return sum(abs2, b .- A * x)
end

then minimize f2 with respect to both u and x, under the constraint x .>= 0.

The case where you do need to differentiate the NNLS least-squares solution is a more general bilevel optimization problem, where you want to minimize some loss function \mathcal{L}(p, x(p)) where x(p) is the NNLS solution for a parameterized matrix A(p), i.e. you want

\min_p \mathcal{L}\left(p, \arg\min_{x \succeq 0} \Vert A(p) x - b \Vert\right)

But if the loss function is simply \mathcal{L}(p, x) = \Vert A(p) x - b \Vert, then you’re right that you can combine the two into a single \min_{p, x\succeq 0} \Vert A(p) x - b \Vert problem. (You still might want to split them up, e.g. ADMM-style, because the x minimization is convex.)
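
For what it’s worth, here is a rough sketch of that kind of splitting: plain alternation between a gradient step on u (with x frozen) and an exact NNLS solve for x (with u frozen). It is not ADMM proper, and complicated_computation, the step size, and the iteration count are placeholders.

using NonNegLeastSquares
using ForwardDiff: ForwardDiff

complicated_computation(u) = u   # placeholder for the real parameterization

loss(u, x, B) = sum(abs2, B .- reshape(complicated_computation(u), size(B, 1), :) * x)

function alternating_fit(u0, B; iters=100, step=1e-2)
    u = copy(u0)
    x = nonneg_lsq(reshape(complicated_computation(u), size(B, 1), :), B)  # convex x-subproblem
    for _ in 1:iters
        u .-= step .* ForwardDiff.gradient(v -> loss(v, x, B), u)          # u-step with x frozen
        x = nonneg_lsq(reshape(complicated_computation(u), size(B, 1), :), B)
    end
    return u, x
end

# e.g. alternating_fit(vec(rand(5, 3)), rand(5, 10))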

Thanks @stevengj and @gdalle for the thorough examples. I defined a couple of outer cost functions for the second ImplicitDifferentiation example, and achieved what I need:

function g(aa, b)
    u = implicit(aa, b)
    A = reshape(aa, length(b), :)
    return sum(abs2, b .- A * u)
end

function h(aa, B)
    sum(g(aa, B[:, i]) for i in axes(B, 2))
end

gradient(aa -> h(aa, B), AutoZygote(), aa) # works!

I was then trying to figure out how to do this with just one nonneg_lsq call on the matrix B, rather than one call per column, which I was able to do with some creative reshape-ing. It seems like ImplicitFunction needs vector-valued outputs? Anyway, this works and is twice as fast as the above:

function f1(aa, B)
    A = reshape(aa, size(B, 1), :)
    return vec(nonneg_lsq(A, B))   # solve for all columns of B at once, flattened to a vector
end

function kkt1(aa, u, B)
    A = reshape(aa, size(B, 1), :)
    u1 = reshape(u, size(A, 2), size(B, 2))
    grad = A' * (A * u1 - B)
    return vec(u1 .* grad)   # columnwise complementary slackness, flattened
end

implicit1 = ImplicitFunction(f1, kkt1)

function h1(aa, B)
    A = reshape(aa, size(B, 1), :)
    u = reshape(implicit1(aa, B), size(A, 2), size(B, 2))
    return sum(abs2, B .- A * u)
end

using BenchmarkTools: @btime

@btime h(aa, B) # 52.634 μs
@btime h1(aa, B) # 22.454 μs
@btime gradient(aa -> h(aa, B), AutoZygote(), aa) #  431.179 μs
@btime gradient(aa -> h1(aa, B), AutoZygote(), aa) #  213.752 μs

Marking this as the solution unless someone else has a better idea. Thanks again for the help!

Yeah, it makes things easier on our end, and it’s not a fundamental limitation if you’re not afraid of reshape-ing.