Avoiding redundant computation when using Zygote and ForwardDiff for forward-over-reverse Hessians

I am computing the Hessian of a function using a combination of Zygote (reverse-mode AD) and ForwardDiff (forward-mode AD). An MWE of my current implementation is as follows:

using Zygote, ForwardDiff, DiffResults

# Define a simple function that takes a vector and returns a scalar
f_aux(x) = sum(x .^ 2)  # Example: sum of squares

# Gradient function
function gfun(x)
    valgrad = Zygote.withgradient(f_aux, x) # Reverse-mode autodiff
    return valgrad.grad[1]
end

# Hessian function
hessfun = x -> ForwardDiff.jacobian!(DiffResults.JacobianResult(x), gfun, x)  # Forward-mode autodiff (forward-over-reverse)

# Example input vector
x = rand(2)
hessfun(x)

This implementation ensures that I don’t recompute the gradient when computing the Hessian. However, I still have to recompute f_aux(x) separately if I need its value along with the Hessian.
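
To make the first point concrete, the DiffResults object returned by hessfun already carries the gradient alongside the Hessian, so both can be read out of the single call (for example):

res = hessfun(x)
DiffResults.value(res)     # gradient of f_aux at x (the value of gfun)
DiffResults.jacobian(res)  # Hessian of f_aux at x (the Jacobian of gfun)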

Is there a way to modify this approach so that I can efficiently retrieve both f_aux(x) and the Hessian without recomputing f_aux(x)? Ideally, I would like a solution that remains compatible with Zygote and ForwardDiff, as these are the only AD tools I have found to work with Lux.jl and CUDA.jl for GPU acceleration. In my use case, f_aux(x) is actually a neural network (NN) whose input x is low-dimensional and whose output is a scalar.

Any insights or suggestions would be greatly appreciated!

I have asked myself the same question within DifferentiationInterface.jl and I don’t yet have a satisfactory answer. The API includes a function DI.value_gradient_and_hessian which should do what you want, but only ForwardDiff.jl and ReverseDiff.jl have an optimized implementation. The generic fallback doesn’t avoid an additional call to the function. It is theoretically possible to avoid it, but the current state of the interface doesn’t make it easy.
However, note that when computing the Hessian of f : \mathbb{R}^n \to \mathbb{R} with ForwardDiff.jl over Zygote.jl, you need around n/12 gradient calls (ForwardDiff's default chunk size is capped at 12), and each of those already includes a call to the function. As soon as your dimension is moderately large, one more call isn't gonna change your life. And if your dimension is really small, you may want to use pure ForwardDiff.jl instead.
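
If you do go the pure-ForwardDiff route, a minimal sketch (using DiffResults, untested) that fills the value, gradient and Hessian in a single nested-dual sweep, with no separate call to f_aux, would look like this:

res = ForwardDiff.hessian!(DiffResults.HessianResult(x), f_aux, x)
DiffResults.value(res)     # f_aux(x)
DiffResults.gradient(res)  # gradient of f_aux at x
DiffResults.hessian(res)   # Hessian of f_aux at x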

Thanks for the answer. Indeed, the function DI.value_gradient_and_hessian does what I want, and it is not a problem for me to use pure ForwardDiff.jl since my dimension is really small. However, it seems to create issues when running on the GPU due to some scalar indexing. MWE (to be added to the previous one) below:

using DifferentiationInterface, CUDA
DifferentiationInterface.value_gradient_and_hessian(f_aux, AutoForwardDiff(), x) # works

x_gpu = CuArray(x)
hessfun(x_gpu) # works
backend = AutoForwardDiff()
DifferentiationInterface.value_gradient_and_hessian(f_aux, backend, x_gpu) # DOES NOT work

Yeah, that’s because ForwardDiff.gradient uses scalar indexing. I don’t think there is a way around that without submitting a PR to ForwardDiff for GPU support. But if your arrays are really small, can I ask what motivated the switch to GPU in the first place?
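
If you want to confirm that the problem sits in ForwardDiff rather than in DifferentiationInterface, a minimal check (reusing f_aux and x_gpu from your MWE) would be:

CUDA.allowscalar(false)             # make scalar indexing throw instead of warn
ForwardDiff.gradient(f_aux, x_gpu)  # should hit the same scalar-indexing error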

My arrays in this case are the inputs to the neural network, and they are small, but the hidden layers get much wider, which is why I benefit from the GPU. To give an example, my NN should approximate a function f:\mathbb{R}^2\to\mathbb{R}, but it can have up to 5 or 6 hidden layers with 50 neurons each. If I need to evaluate the neural network (in parallel) at 10k points in \mathbb{R}^2, then I really benefit from the GPU.
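
For reference, a rough Lux.jl sketch of the kind of model I mean (layer sizes and activations here are just illustrative, not my actual architecture, and the device handling assumes a working CUDA.jl setup):

using Lux, Random

model = Chain(Dense(2 => 50, tanh),
              Dense(50 => 50, tanh),
              Dense(50 => 50, tanh),
              Dense(50 => 50, tanh),
              Dense(50 => 1))        # f: R^2 -> R, several hidden layers of width 50

rng = Random.default_rng()
ps, st = Lux.setup(rng, model)
dev = gpu_device()                   # should pick up CUDA given the packages loaded above
ps_gpu, st_gpu = dev(ps), dev(st)

X = dev(rand(Float32, 2, 10_000))    # 10k input points in R^2, stored as columns
Y, _ = model(X, ps_gpu, st_gpu)      # batched evaluation on the GPU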