Hessian matrix of a neural network vector output

Hello,

So I have a physics-informed neural network whose output has the same shape as its input [batchsize x 64]. From this output vector I need to calculate first- and second-order derivatives, which I tried to do with Zygote's diaghessian and jacobian. The jacobian part works fine like this:

```julia
jacobian(x_grid -> modell_call(xtrain, x_sensor, t_sensor), x_grid)
```

but when I try to do the same for the Hessian I get an error. Even broadcasting the diaghessian function like this throws an error:

```julia
diaghessian.(x_sensor -> modell_call(xtrain, x_sensor, t_grid), x_grid_cpu)
```

```
ERROR: Output is an array, so the gradient is not defined. Perhaps you wanted jacobian.
Stacktrace
```
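(For context: diaghessian, like Zygote.hessian, expects the wrapped function to return a single real number, which is what the error is complaining about. A per-output-component loop along these lines should work, sketched here with a stand-in f in place of modell_call, but it is obviously not efficient:)

```julia
using Zygote

# stand-in for modell_call: a pure length-64 -> length-64 vector function
f(x) = tanh.(x) .^ 2 .+ x

x = randn(64)

# diaghessian needs a scalar-valued closure, so wrap each output component
# separately; row i of uxx then holds ∂²u_i/∂x_j² for j = 1:64
uxx = reduce(vcat, [Zygote.diaghessian(z -> f(z)[i], x)[1]' for i in 1:length(x)])
```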

Is there a way to calculate the second-order derivative ∂²u/∂x² of each output with respect to all inputs efficiently?

Thanks for all help, it is appreciated.

I know some papers do this, but I will keep repeating it: double-reverse Hessians are probably the slowest way to compute this object, and the approach comes from some insane ML idea that reverse mode is always better. You almost certainly don't want to do this if you want a fast PINN, in any language. Just look at the asymptotics and it's very clear.
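To put names on it, in Zygote terms: "double reverse" is what hessian_reverse does (Zygote over Zygote), while hessian is forward over reverse (ForwardDiff over Zygote). A quick sketch with a hypothetical scalar f, not the model from the question:

```julia
using Zygote

f(x) = sum(tanh.(x) .^ 2)   # hypothetical scalar function
x = randn(64)

H_rr = Zygote.hessian_reverse(f, x)  # reverse over reverse: the slow variant discussed here
H_fr = Zygote.hessian(f, x)          # forward over reverse: ForwardDiff over Zygote
```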

Thanks for the answer. What exactly do you mean by asymptotics?

Copy-pasting from Slack:

Here is the argument in a nutshell. Start with an n → 1 function. Its gradient is an n → n function. Reverse mode has higher overhead than forward (because of the memory operations required to delay computation), but it needs output-many calls while forward needs input-many. So for the first derivative, with 1 output and n inputs, reverse mode wins.

But the gradient is n → n, so for the next derivative you use forward mode, since it has lower overhead. That layer of forward mode turns the computation into a 2n → 2n function (primal plus tangent). Differentiating that again, forward mode again, gives a 4n → 4n function, and so on.

So for m derivatives you end up evaluating a 2^(m-1)·n → 2^(m-1)·n function, with m-1 layers of forward mode over a single reverse mode.
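To make the doubling concrete, here is what the nesting looks like at the level of ForwardDiff's dual numbers (purely an illustration of how the per-scalar payload grows, nothing specific to the model above):

```julia
using ForwardDiff: Dual

# each extra layer of forward mode wraps every working scalar in a Dual that
# carries one more partial, so the payload doubles at each nesting level
d1 = Dual(1.0, 1.0)       # 1st derivative: 2 numbers per scalar
d2 = Dual(d1, one(d1))    # 2nd derivative: 4 numbers per scalar
d3 = Dual(d2, one(d2))    # 3rd derivative: 8 numbers per scalar

sizeof.((d1, d2, d3))     # (16, 32, 64) bytes
```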

But that is clearly suboptimal compared to numerical differentiation: instead of growing the compute cost by a power of two per derivative order, finite differencing just adds one more f evaluation for each higher derivative, so the cost is roughly linear, m times n. The only way to get down to that with AD is to do a single reverse mode over (m-1)-order Taylor mode, and only if the process optimizes everything to remove the redundant bits do you achieve the same scaling. Hence we're building new AD, and doing numerical over reverse for now.
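For reference, "numerical over reverse" means something like the following: take the reverse-mode gradient as given and finite-difference it along a direction, so each Hessian-vector product costs one extra gradient call instead of another layer of AD. A minimal sketch, with a stand-in scalar f rather than the model from the question:

```julia
using Zygote, LinearAlgebra

f(x) = sum(tanh.(x) .^ 2)   # stand-in scalar function

# Hessian-vector product via a central finite difference of the reverse-mode gradient:
#   H(x) * v ≈ (∇f(x + εv) - ∇f(x - εv)) / (2ε)
# one more gradient evaluation per direction, rather than another layer of AD
function hvp_fd(f, x, v; ε = sqrt(eps(eltype(x))) * (1 + norm(x)))
    g⁺ = Zygote.gradient(f, x .+ ε .* v)[1]
    g⁻ = Zygote.gradient(f, x .- ε .* v)[1]
    (g⁺ .- g⁻) ./ (2ε)
end

x = randn(64)
v = randn(64)
Hv = hvp_fd(f, x, v)
```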