This is in fact egregiously suboptimal!
The goal of my question was to figure out whether your handwritten implementations of Gradient and (especially) Hessian are useful. I imagine they were not completely straightforward to implement, and autodiff can take care of that for you. With the ecosystem around DifferentiationInterface.jl, you don’t even need to specify the sparsity pattern ahead of time: it is detected for you, so that only the nonzero coefficients of the Hessian get computed. But this detection is costly, so it should only be done once (through the so-called “preparation” step) and then reused for every Hessian update.
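To make the preparation idea concrete, here is a minimal, self-contained sketch on a toy function (nothing to do with your model; assumes a recent DifferentiationInterface together with SparseConnectivityTracer and SparseMatrixColorings):

```julia
import DifferentiationInterface as DI
using ForwardDiff
using SparseConnectivityTracer: TracerSparsityDetector
using SparseMatrixColorings

# Toy objective with a tridiagonal Hessian sparsity pattern
f(x) = sum(abs2, x) + sum(abs2(x[i + 1] - x[i]) for i in 1:(length(x) - 1))

sparse_backend = DI.AutoSparse(
    DI.AutoForwardDiff();
    sparsity_detector=TracerSparsityDetector(),
    coloring_algorithm=GreedyColoringAlgorithm(),
)

x = rand(10)
prep = DI.prepare_hessian(f, sparse_backend, x)     # costly: sparsity detection + coloring, done once
H1 = DI.hessian(f, prep, sparse_backend, x)         # cheap: reuses the preparation, returns a sparse matrix
H2 = DI.hessian(f, prep, sparse_backend, rand(10))  # the same preparation works for any point of the same size
```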
First, you’d have to check that DifferentiationInterface.hessian returns the same matrix (up to floating-point error) as the one you compute manually. If not, then there is a bug on at least one of the two sides and we should investigate. Once that is confirmed, the real question is whether we lose performance by replacing your manual versions with the automated ones. Note that the manual versions themselves could probably be optimized, but that’s a separate question.
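Concretely, the check could look something like this; manual_hessian and nll are placeholder names for your handwritten Hessian and the negative log-likelihood, and sparse_backend is set up as in the snippet above:

```julia
x_test = randn(n)  # some test point of the right size (n = D * T_steps in your case)
H_manual = manual_hessian(x_test)               # placeholder: your handwritten Hessian at x_test
H_ad = DI.hessian(nll, sparse_backend, x_test)  # one-shot autodiff Hessian (no preparation needed for a single check)
@assert size(H_manual) == size(H_ad)
@assert isapprox(Matrix(H_manual), Matrix(H_ad); rtol=1e-8)  # agreement up to floating-point noise
```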
So what I had in mind was benchmarking your smooth function against another one that looks like this. Since I don’t have all the dependencies, I can’t run the code, so beware of syntax errors, but it should give you the general idea. I’m not sure about x0 though, nor why it works with a matrix instead of a vector.
```julia
import DifferentiationInterface as DI
using Enzyme
using ForwardDiff
using LineSearches
using Optim
using SparseArrays: AbstractSparseMatrix
using SparseConnectivityTracer: TracerSparsityDetector
using SparseMatrixColorings
function smooth_autodiff(
    lds::LinearDynamicalSystem{S,O}, y::Matrix{T}
) where {T<:Real,S<:GaussianStateModel{T},O<:GaussianObservationModel{T}}
    backend = DI.AutoForwardDiff()  # replace by DI.AutoEnzyme() once it works
    sparse_backend = DI.AutoSparse(
        backend;
        sparsity_detector=TracerSparsityDetector(),
        coloring_algorithm=GreedyColoringAlgorithm(),
    )

    T_steps, D = size(y, 2), lds.latent_dim
    X₀ = zeros(T, D * T_steps)
    # Any vector with the right size and eltype works for the preparation;
    # vec(x0[1, :]) from your version might be the natural choice, I'm not sure.
    example_vec_x = copy(X₀)

    # Work on a flat vector (what Optim expects) and reshape inside.
    # Keep the signature loose: ForwardDiff and the sparsity tracer call this
    # with their own number types, not Vector{T}.
    function nll(vec_x::AbstractVector)
        x = reshape(vec_x, D, T_steps)
        return -loglikelihood(x, lds, y)
    end

    # Preparation (sparsity detection, coloring) is the costly part; do it once.
    gradient_prep = DI.prepare_gradient(nll, backend, example_vec_x)
    hessian_prep = DI.prepare_hessian(nll, sparse_backend, example_vec_x)

    function g!(g::Vector{T}, vec_x::Vector{T})
        return DI.gradient!(nll, g, gradient_prep, backend, vec_x)
    end
    function h!(h::AbstractSparseMatrix, vec_x::Vector{T})
        return DI.hessian!(nll, h, hessian_prep, sparse_backend, vec_x)
    end

    initial_g = DI.gradient(nll, gradient_prep, backend, example_vec_x)
    initial_H = DI.hessian(nll, hessian_prep, sparse_backend, example_vec_x)

    # Create the TwiceDifferentiable object, I guess? Passing initial_H should
    # make Optim carry a sparse Hessian buffer instead of a dense one.
    td = TwiceDifferentiable(
        nll,
        g!,
        h!,
        X₀,          # a flat vector here; why a matrix in your version?
        nll(X₀),     # initial objective value
        initial_g,   # initial gradient
        initial_H,   # initial (sparse) Hessian
    )

    res = optimize(td, X₀, Newton(linesearch=LineSearches.BackTracking()))
    # blablabla
end
```
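And for the performance comparison, something like this with BenchmarkTools (I'm writing smooth(lds, y) for your current hand-written version, adjust to the real signature):

```julia
using BenchmarkTools

@btime smooth($lds, $y)           # manual gradient + Hessian
@btime smooth_autodiff($lds, $y)  # autodiff version above
```

Note that smooth_autodiff as written redoes the preparation on every call; if the same problem size is smoothed many times, it would be worth hoisting the preparation out and reusing it across calls.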