Copy pasting from Slack
Here is the argument in a nutshell. Start with an n-to-1 function. Its gradient is an n-to-n function. Reverse mode has higher overhead than forward (because of the memory operations required to delay computation), but it needs output-many calls while forward needs input-many calls. So for the first derivative, with one output and n inputs, reverse mode.
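To make the call-count asymmetry concrete, here is a minimal sketch in JAX (the library and the toy f are illustrative choices, not from the original discussion): reverse mode recovers the whole gradient of an n-to-1 function in a single backward pass, while forward mode needs one JVP per input direction.
```python
import jax
import jax.numpy as jnp

def f(x):                                  # n -> 1 scalar function
    return jnp.sum(jnp.sin(x) * x ** 2)

n = 5
x = jnp.arange(1.0, n + 1.0)

# Reverse mode: one backward pass (output-many = 1 call) gives the whole gradient.
grad_reverse = jax.grad(f)(x)

# Forward mode: one JVP per input direction (input-many = n calls).
basis = jnp.eye(n)
grad_forward = jnp.stack([jax.jvp(f, (x,), (basis[i],))[1] for i in range(n)])

print(jnp.allclose(grad_reverse, grad_forward))   # True
```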
But the gradient is n-to-n, so the next derivative should use forward mode, since the call counts are now equal and forward has the lower overhead. But forward mode carries a tangent alongside every primal, so now you are differentiating a 2n-to-2n function. So forward again. But that gives a 4n-to-4n function.
So for m derivatives, it’s a 2^(m-1) n to 2^(m-1) n function evaluation with m-1 layers of forward over a single reverse.
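Here is what those m-1 layers of forward over a single reverse look like as nested transforms, again sketched in JAX purely for illustration; each extra forward layer carries tangents alongside the primals, which is exactly the 2n, 4n, ... doubling described above.
```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

x = jnp.arange(1.0, 4.0)        # n = 3

d1 = jax.grad(f)                # 1st derivative: the single reverse pass, n-vector
d2 = jax.jacfwd(d1)             # 2nd: one forward layer over reverse, n x n Hessian
d3 = jax.jacfwd(d2)             # 3rd: another forward layer, n x n x n tensor

print(d1(x).shape, d2(x).shape, d3(x).shape)    # (3,) (3, 3) (3, 3, 3)
```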
But that is clearly suboptimal compared to numerical differentiation, because instead of the compute cost doubling with each order, each higher derivative is just one more f evaluation, so the cost is linear: roughly m times n. The only way to get down to that with AD is to do a single reverse mode over m-1 layers of Taylor mode, and only if the process optimizes everything to remove the redundant bits do you achieve the same scaling. Hence building a new AD, while doing numerical over reverse for now.
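As a rough sketch of what "numerical over reverse" means here (JAX, the toy f, and the fixed step size are all illustrative assumptions, not from the original discussion): take central finite differences of the reverse-mode gradient along a direction, so each extra derivative order costs a constant number of additional gradient evaluations instead of another doubling of the differentiated program.
```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

grad_f = jax.grad(f)            # one reverse-mode pass per call

def ddir(g, x, v, h=1e-4):
    # Central difference of g along direction v: d/dt g(x + t v) at t = 0.
    return (g(x + h * v) - g(x - h * v)) / (2 * h)

x = jnp.arange(1.0, 4.0)
v = jnp.array([1.0, 0.0, 0.0])

hvp = ddir(grad_f, x, v)                             # ~ Hessian-vector product
d3v = ddir(lambda y: ddir(grad_f, y, v), x, v)       # ~ third-order directional term

print(hvp.shape, d3v.shape)     # (3,) (3,)
```
The fixed step size is naive, and nested central differences lose accuracy quickly, which is part of the tradeoff being accepted "for now". For the Taylor-mode side, `jax.experimental.jet` is one existing implementation, mentioned only as a reference point.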