Computing the Hessian of the Lagrangian efficiently with DI.jl

Hello,

The new MOI.VectorNonlinearOracle supports providing the Hessian of the Lagrangian function, defined, according to the MOI.jl docs, as:

\sum_i \mu_i \nabla^2 f_i(\mathbf{x})

but the Lagrangian typically also includes the objective function:

L( \mathbf{x},\boldsymbol{\mu}) = J(\mathbf{x}) + \boldsymbol{\mu} \cdot \mathbf{f}(\mathbf{x})

where \cdot is the dot product, J(\mathbf{x}) the objective function, and the nonlinear equality constraints are defined as \mathbf{f}(\mathbf{x}) = \mathbf{0}.

  1. My understanding is that MOI will do \nabla^2 J(\mathbf{x}) + \sum_i \mu_i \nabla^2 f_i(\mathbf{x}) internally before passing it to the solver (if the Hessian of the objective function is also provided), am I right?
  2. The MOI.VectorNonlinearOracle also requires functions to evaluate the equality constraint \mathbf{f}(\mathbf{x}) and its Jacobian \nabla \mathbf{f}(\mathbf{x}). When computing the Jacobian with e.g. DifferentiationInterface.jl and ForwardDiff.jl, we can compute both efficiently with value_and_jacobian. Now, if I compute the \sum_i \mu_i \nabla^2 f_i(\mathbf{x}) matrix with value_gradient_and_hessian, can I re-use the intermediate results (the value and the gradient) to construct the \mathbf{f}(\mathbf{x}) vector and the \nabla \mathbf{f}(\mathbf{x}) Jacobian? My feeling right now is no, so I should just compute \sum_i \mu_i \nabla^2 f_i(\mathbf{x}) with hessian directly, since the intermediate results are useless here.

Thanks!

There’s nothing built into DI at the moment.
Are your Jacobians and Hessians sparse or dense?

The Jacobian can be dense or sparse, depending on whether I use a single shooting or multiple shooting transcription. The Hessian will always be sparse.

But my question is not about sparsity, it is about avoiding redundant computations by using value_and_jacobian and value_gradient_and_hessian judiciously.

In other words, the VectorNonlinearOracle wants 3 functions that compute, respectively:

  1. The value of \mathbf{f}(\mathbf{x})
  2. The Jacobian of \mathbf{f}(\mathbf{x})
  3. The Hessian of the function \ell(\mathbf{x}, \boldsymbol{\mu}) = \boldsymbol{\mu} \cdot \mathbf{f}(\mathbf{x})

We can compute 1. and 2. efficiently in a single value_and_jacobian call. Now, are the value and the gradient of \ell(\mathbf{x}, \boldsymbol{\mu}) completely useless, or can they be used to compute 1. and 2.? My feeling right now is that they are useless, since the dot product makes it impossible to recover the \mathbf{f}(\mathbf{x}) value.
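
To make this concrete, here is a minimal sketch of the two DI calls in question, with a hypothetical constraint function f, hypothetical dimensions, and a dense backend for brevity:

using DifferentiationInterface, LinearAlgebra
import ForwardDiff

f(x) = [x[1]^2 + x[2], x[2] * x[3]]  # hypothetical constraint function
ā„“(x, μ) = dot(μ, f(x))              # scalar function whose Hessian the oracle wants
x, μ = rand(3), rand(2)
backend = AutoForwardDiff()

# 1. and 2. in a single call: the constraint value f(x) and its Jacobian
fx, Jf = value_and_jacobian(f, backend, x)

# 3.: the value of ā„“ is dot(μ, fx) and its gradient is Jf' * μ, so the dot product
# has already collapsed f(x) and the Jacobian; they cannot be recovered from ā„“x and gā„“
ā„“x, gā„“, Hā„“ = value_gradient_and_hessian(x -> ā„“(x, μ), backend, x)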

edit : I think @odow should be able to answer my question easily.

To answer my own question here: clearly it does. I had difficulty finding good references on this, but the MATLAB documentation is great on the subject. The Lagrangian of the equality-constrained optimization problem above is:

L(\mathbf{x}, \boldsymbol{\mu}) = J(\mathbf{x}) + \boldsymbol{\mu} \cdot \mathbf{f}(\mathbf{x})

meaning that the Hessian of the Lagrangian is really:

\nabla^2 L = \nabla^2 J(\mathbf{x}) + \sum_i \mu_i \nabla^2 f_i(\mathbf{x})
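
For example, with J(\mathbf{x}) = x_1^2 + x_2^2 and a single constraint f_1(\mathbf{x}) = x_1 x_2 - 1, this gives \nabla^2 L = 2 I + \mu_1 \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.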

I think it would be a good idea to add this little piece of information to the VectorNonlinearOracle docstring or somewhere else. Maybe just a sentence, e.g. ā€œThe Hessian of the Lagrangian for a given model is the sum of all VectorNonlinearOracle Hessians and all nonlinear operator Hessians.ā€

Ideally, you also want a multiplier for the objective, which the optimization solver may need to steer (for example in \ell_1 relaxation or feasibility restoration).

1 Like

Yeah, the idea is that you can sometimes reuse the results of a second-order derivative because it computes first-order quantities as a byproduct. Here, however, all constraints are mixed, which makes this impossible. You could do something if you were to compute the Hessian of each f_i individually, but that would be wasteful.
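
To illustrate the difference (with a hypothetical constraint function and a plain dense ForwardDiff backend, not your actual problem), the wasteful per-component route versus the single weighted Hessian would look like:

using DifferentiationInterface, LinearAlgebra
import ForwardDiff

f(x) = [x[1]^2 + x[2], x[2] * x[3]]  # hypothetical constraint function
x, μ = rand(3), rand(2)
backend = AutoForwardDiff()

# Wasteful: one Hessian call per constraint component, then the weighted sum
H_per_component = sum(μ[i] * hessian(x -> f(x)[i], backend, x) for i in eachindex(μ))

# Cheaper: a single Hessian of the already-weighted scalar dot(μ, f(x))
H_weighted = hessian(x -> dot(μ, f(x)), backend, x)

H_per_component ā‰ˆ H_weighted  # true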

1 Like

Not sure if I need to chime in with anything here, but:

My understanding is that MOI will do … before passing it to the solver

Correct.

In the Ipopt.jl case, we compute most of the Hessian of the Lagrangian for the objective and any quadratic and scalar nonlinear constraints, and then we add on the terms that correspond to the VectorNonlinearOracle constraints: Ipopt.jl/ext/IpoptMathOptInterfaceExt/MOI_wrapper.jl at 3a7aed2dbc012a83a227dc1726ac791110e5a6f8 Ā· jump-dev/Ipopt.jl Ā· GitHub

Ideally, you also want a multiplier for the objective

The MOI API has this Nonlinear programming Ā· MathOptInterface.

It’s not applicable for VectorNonlinearOracle because there is no objective component, so we just have \mu^\top f(x)

You could do something if you were to compute the Hessian of each f_i individually, but that would be wasteful.

Yeah, the best way is probably to define a new function like

lagrange(x, u) = u' * set.eval_f(x)

and then compute a single Hessian of lagrange, in pseudo-code:

eval_hessian_lagrangian(ret, x, u) = ret .= vec(DI.hessian(x -> lagrange(x, u), backend, x))

You’d want to re-use any work for different x and u, but I don’t know how to do that in DI. And you want sparsity if it exists.

2 Likes

Thanks a lot it does help!

About filling ret in the Hessian of the Lagrangian function, should I fill only the upper or lower triangular part? Or will it also work if I fill both parts?

See the docstring: Standard form Ā· MathOptInterface

ret is an AbstractVector{Float64}. You must supply the sparsity structure using the hessian_lagrangian_structure kwarg. Then ret holds the corresponding non-zeros. Since the matrix is symmetric, you do not need to provide both triangles. Matching elements (i, j) and (j, i) will be summed together.
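
As a sketch of that convention (not the actual MOI/Ipopt code), a solver could assemble the symmetric matrix from the structure and ret like this, with matching (i, j) / (j, i) entries summed:

function assemble_symmetric(structure, ret, n)
    H = zeros(n, n)
    for ((i, j), v) in zip(structure, ret)
        H[i, j] += v
        if i != j
            H[j, i] += v  # mirror off-diagonal entries; repeated indices keep accumulating
        end
    end
    return H
end

# Providing only one triangle is enough:
assemble_symmetric([(1, 1), (2, 1), (2, 2)], [2.0, 1.0, 4.0], 2)  # [2.0 1.0; 1.0 4.0]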

I can make a PR to improve the docstring and make this clearer.

1 Like

What I meant is: should I provide e.g. both (1, 2) and (2, 1) in the hessian_lagrangian_structure, or should just providing (1, 2) work?

Providing (1, 2) is enough

1 Like

So, for posterity, the efficient way of doing this in DifferentiationInterface.jl is to use both a Cache and a Constant context, respectively for the output of \mathbf{f}(\mathbf{x}) and the Lagrange multipliers \boldsymbol{\mu} (in pseudo-code):

using DifferentiationInterface, SparseArrays, SparseConnectivityTracer, SparseMatrixColorings
using LinearAlgebra: dot
import ForwardDiff
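# f! below is the in-place constraint function f!(f, x), assumed to be defined elsewhere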
function ā„“(x, μ, f)
    f!(f, x)
    return dot(μ, f)
end
nx, nf = 3, 4
backend = AutoSparse(
    AutoForwardDiff();
    sparsity_detector=TracerSparsityDetector(),
    coloring_algorithm=GreedyColoringAlgorithm(),
)
x, f, μ = zeros(nx), zeros(nf), zeros(nf)
strict = Val(true)
prep = prepare_hessian(ā„“, backend, x, Constant(μ), Cache(f); strict)
H = similar(sparsity_pattern(prep), eltype(x))
get_uppertriangle_indices(structure) = [(i,j) for (i,j) in structure if i ≤ j]
I, J = findnz(H)
hessian_lagrangian_structure = get_uppertriangle_indices(collect(zip(I, J)))
fill_ret!(ret, H, structure) = map!(i -> H[structure[i]...], ret, eachindex(ret))
function eval_hessian_lagrangian(ret, x_arg, μ_arg)
    x .= x_arg
    μ .= μ_arg
    # should be using the in-place `hessian!`, to save allocations:
    H = hessian(ā„“, prep, backend, x, Constant(μ), Cache(f))
    return fill_ret!(ret, H, hessian_lagrangian_structure)
end

Note that x .= x_arg and μ .= μ_arg are needed because the preparation uses strict=Val(true) and MOI will call eval_hessian_lagrangian with arguments that are not plain Vectors; copying them into the prepared buffers avoids a type mismatch.

In the end, both MathOptInterface.jl and DifferentiationInterface.jl offer all the tools to implement this efficiently, but it is surprisingly complex (handling sparsity, filling only the upper triangular part, using Constant and Cache contexts, etc.).
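
For completeness, here is a hypothetical in-place constraint f! (any will do; note it must be defined before the snippet above is run, since prepare_hessian traces ā„“ which calls f!), together with a quick sanity check of eval_hessian_lagrangian against a dense ForwardDiff Hessian:

# Hypothetical in-place constraint function matching nx = 3, nf = 4 above
function f!(y, x)
    y[1] = x[1]^2
    y[2] = x[2] * x[3]
    y[3] = x[3]^2
    y[4] = x[1] + x[2]
    return nothing
end

x_test, μ_test = rand(nx), rand(nf)
ret = zeros(length(hessian_lagrangian_structure))
eval_hessian_lagrangian(ret, x_test, μ_test)

# Dense reference Hessian of dot(μ, f(x)), restricted to the provided structure
H_ref = ForwardDiff.hessian(x -> dot(μ_test, (y = zeros(eltype(x), nf); f!(y, x); y)), x_test)
ret ā‰ˆ [H_ref[i, j] for (i, j) in hessian_lagrangian_structure]  # true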

1 Like

You almost got it right, but be careful when you prepare a sparse Jacobian or Hessian with additional constants: those constants are not ā€œtracedā€ by the sparsity detection step, so if their own structure changes, you might be in for a surprise. It is better to trace with generic constant values instead of zeros (this is similar to this discussion in the SCT docs).

As an example, can you figure out why prep1 is correct but prep2 is wrong in the example below?

using DifferentiationInterface, SparseConnectivityTracer, SparseMatrixColorings
import ForwardDiff
using LinearAlgebra, SparseArrays

cons(x) = map(abs2, x)
lag(x, μ) = dot(μ, cons(x))

backend = AutoSparse(
    AutoForwardDiff();
    sparsity_detector = TracerSparsityDetector(),
    coloring_algorithm = GreedyColoringAlgorithm(),
)

typical_x = zeros(10)
typical_μ1 = zeros(10)
typical_μ2 = spzeros(10)

prep1 = prepare_hessian(lag, backend, typical_x, Constant(typical_μ1));
prep2 = prepare_hessian(lag, backend, typical_x, Constant(typical_μ2));

API-wise, do you think there is anything DI can do to improve those workflows?

1 Like

Well, clearly prep2 will be wrong, since sparsity detection is done with typical_μ2 = spzeros(10), effectively saying ā€œĪ¼ will always be structurally zeroā€. But sparsity detection in prep1 is done with typical_μ1 = zeros(10), so also zeros, just not structural zeros, and it seems to work here. That is also what I did in my example above; I did not use spzeros. Are you saying that I should use e.g. typical_μ1 = ones(10) to guarantee no false negatives in the pattern?

API-wise, do you think there is anything DI can do to improve those workflows?

Honestly, I don’t think so. Applying AD efficiently here is just an intrinsically complex problem.

Yes, it does work here, but if you had a slightly sneakier function, like

f(x, mu) = sum(iszero(mui) ? 0 : xi for (xi, mui) in zip(x, mu))

then the zeros in your preparation vector typical_mu would deceive sparsity detection.

Ideally SCT would trace through every Constant context provided, but for arbitrarily complex objects that is not possible. So instead we keep these constants fixed during sparsity detection. This means that, as a rule, you should probably choose a typical_mu that is rather generic, like rand(n), instead of very specific like zeros(n).
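
A minimal sketch of that rule, using a variant of the sneaky function above with abs2 so that the Hessian pattern is non-trivial (function and variable names here are made up):

using DifferentiationInterface, SparseConnectivityTracer, SparseMatrixColorings
import ForwardDiff

# Every term vanishes wherever the corresponding μ entry is exactly zero,
# so tracing with μ = 0 hides the true Hessian pattern
sneaky(x, μ) = sum(iszero(μi) ? zero(xi) : abs2(xi) for (xi, μi) in zip(x, μ))

backend = AutoSparse(
    AutoForwardDiff();
    sparsity_detector = TracerSparsityDetector(),
    coloring_algorithm = GreedyColoringAlgorithm(),
)

x = zeros(5)
prep_zeros = prepare_hessian(sneaky, backend, x, Constant(zeros(5)))  # traced with μ = 0
prep_rand = prepare_hessian(sneaky, backend, x, Constant(rand(5)))    # traced with generic μ

sparsity_pattern(prep_zeros)  # empty pattern: every non-zero was missed
sparsity_pattern(prep_rand)   # diagonal pattern: correct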

That’s probably something I should stress in the DI docs

1 Like