This is in fact egregiously suboptimal!
The goal of my question was to figure out whether your handwritten implementations of Gradient and (especially) Hessian are useful. I imagine they were not completely straightforward to implement, and autodiff can take care of that for you. With the ecosystem around DifferentiationInterface.jl, you don’t even need to specify the sparsity pattern ahead of time: it is detected for you, so that only the nonzero coefficients of the Hessian get computed. But this detection is costly, so it should only be done once (through the so-called “preparation” step) and then reused for every Hessian update.
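To make the preparation idea concrete, here is a minimal, self-contained sketch on a toy function (nothing to do with your model; assumes a recent DifferentiationInterface together with SparseConnectivityTracer and SparseMatrixColorings):

```julia
import DifferentiationInterface as DI
using ForwardDiff
using SparseConnectivityTracer: TracerSparsityDetector
using SparseMatrixColorings

# Toy objective with a tridiagonal Hessian sparsity pattern
f(x) = sum(abs2, x) + sum(abs2(x[i + 1] - x[i]) for i in 1:(length(x) - 1))

sparse_backend = DI.AutoSparse(
    DI.AutoForwardDiff();
    sparsity_detector=TracerSparsityDetector(),
    coloring_algorithm=GreedyColoringAlgorithm(),
)

x = rand(10)
prep = DI.prepare_hessian(f, sparse_backend, x)     # costly: sparsity detection + coloring, done once
H1 = DI.hessian(f, prep, sparse_backend, x)         # cheap: reuses the preparation, returns a sparse matrix
H2 = DI.hessian(f, prep, sparse_backend, rand(10))  # the same preparation works for any point of the same size
```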
First, you’d have to check that DifferentiationInterface.hessian returns the same matrix (up to floating-point error) as the one you compute manually. If not, then there is a bug on at least one of the two sides and we should investigate. Once that is confirmed, the real question is whether we lose performance by replacing your manual versions with the automated ones. Note that the manual versions themselves could probably be optimized, but that’s a separate question.
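Concretely, the check could look something like this; manual_hessian and nll are placeholder names for your handwritten Hessian and the negative log-likelihood, and sparse_backend is set up as in the snippet above:

```julia
x_test = randn(n)  # some test point of the right size (n = D * T_steps in your case)
H_manual = manual_hessian(x_test)               # placeholder: your handwritten Hessian at x_test
H_ad = DI.hessian(nll, sparse_backend, x_test)  # one-shot autodiff Hessian (no preparation needed for a single check)
@assert size(H_manual) == size(H_ad)
@assert isapprox(Matrix(H_manual), Matrix(H_ad); rtol=1e-8)  # agreement up to floating-point noise
```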
So what I had in mind was benchmarking your smooth function against another one that looks like this. Since I don’t have all the dependencies, I can’t run the code, so beware of syntax errors, but it should give you the general idea. I’m not sure about x0 though, nor why it works with a matrix instead of a vector.
```julia
import DifferentiationInterface as DI
using Enzyme
using ForwardDiff
using LineSearches
using Optim
using SparseArrays: AbstractSparseMatrix
using SparseConnectivityTracer: TracerSparsityDetector
using SparseMatrixColorings
function smooth_autodiff(
    lds::LinearDynamicalSystem{S,O}, y::Matrix{T}
) where {T<:Real,S<:GaussianStateModel{T},O<:GaussianObservationModel{T}}
    backend = DI.AutoForwardDiff()  # replace by DI.AutoEnzyme() once it works
    sparse_backend = DI.AutoSparse(
        backend;
        sparsity_detector=TracerSparsityDetector(),
        coloring_algorithm=GreedyColoringAlgorithm(),
    )

    T_steps, D = size(y, 2), lds.latent_dim
    X₀ = zeros(T, D * T_steps)
    # Any vector with the right size and eltype works for the preparation;
    # vec(x0[1, :]) from your version might be the natural choice, I'm not sure.
    example_vec_x = copy(X₀)

    # Work on a flat vector (what Optim expects) and reshape inside.
    # Keep the signature loose: ForwardDiff and the sparsity tracer call this
    # with their own number types, not Vector{T}.
    function nll(vec_x::AbstractVector)
        x = reshape(vec_x, D, T_steps)
        return -loglikelihood(x, lds, y)
    end

    # Preparation (sparsity detection, coloring) is the costly part; do it once.
    gradient_prep = DI.prepare_gradient(nll, backend, example_vec_x)
    hessian_prep = DI.prepare_hessian(nll, sparse_backend, example_vec_x)

    function g!(g::Vector{T}, vec_x::Vector{T})
        return DI.gradient!(nll, g, gradient_prep, backend, vec_x)
    end
    function h!(h::AbstractSparseMatrix, vec_x::Vector{T})
        return DI.hessian!(nll, h, hessian_prep, sparse_backend, vec_x)
    end

    initial_g = DI.gradient(nll, gradient_prep, backend, example_vec_x)
    initial_H = DI.hessian(nll, hessian_prep, sparse_backend, example_vec_x)

    # Create the TwiceDifferentiable object, I guess? Passing initial_H should
    # make Optim carry a sparse Hessian buffer instead of a dense one.
    td = TwiceDifferentiable(
        nll,
        g!,
        h!,
        X₀,          # a flat vector here; why a matrix in your version?
        nll(X₀),     # initial objective value
        initial_g,   # initial gradient
        initial_H,   # initial (sparse) Hessian
    )

    res = optimize(td, X₀, Newton(linesearch=LineSearches.BackTracking()))
    # blablabla
end
```
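And for the performance comparison, something like this with BenchmarkTools (I'm writing smooth(lds, y) for your current hand-written version, adjust to the real signature):

```julia
using BenchmarkTools

@btime smooth($lds, $y)           # manual gradient + Hessian
@btime smooth_autodiff($lds, $y)  # autodiff version above
```

Note that smooth_autodiff as written redoes the preparation on every call; if the same problem size is smoothed many times, it would be worth hoisting the preparation out and reusing it across calls.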