Hessian matrix of a neural network vetor output


so I have a physics informed neural network which has the same output shape as input [batchsize x 64]. With this output vector I have to calculate the first and second order derrivatives, which I tried to do with Zygote diaghessian and jacobian. The jacobian part works fine like that:

jacobian(x_grid->modell_call(xtrain,x_sensor,t_sensor), x_grid)

but when I try to do the same for the hessian i get an error. Even if I try to broadcast the diaghessian function like that it throws an error:

diaghessian.(x_sensor->modell_call(xtrain, x_sensor, t_grid), x_grid_cpu)
ERROR: Output is an array, so the gradient is not defined. Perhaps you wanted jacobian.

Is there a way to calculate the second order derrivative ∇u/xx of each output for all inputs efficiently?

Thanks for all help, it is appreciated.

I know some papers do this but I will keep repeating: double reverse Hessians is probably the slowest way to compute this object and comes from some insane ML idea that reverse mode is always better. You almost certainly don’t want to do this if you want a fast PINN, in any language. Just look at the asymptotics and it’s very clear.

Thanks for the answer. What exactly do you mean with asymptotics?

Copy pasting from Slack

Here is the argument in a nutshell. Start with and n to 1 function. The gradient is an n to n function. Reverse mode is higher overhead than forward (because of required memory operations to delay computation) but it requires output many calls and forward is input many. So first one, reverse mode

But the gradient is n to n, so forward since lower overhead. But now, that gives a 2n to 2n function. So forward. But that gives a 4n to 4n function.

So for m derivatives, it’s 2^(m-1) to 2^(m-1) function calculation with m-1 layers of forward over a single reverse.

But that is clearly suboptimal compared to numerical because instead of growing the compute cost by the power it’s just one more f eval for the higher derivative, so linear m times n cost. The only way to get down to that with AD is to do single reverse mode over m-1 Taylor mode, and if the process optimizes everything to remove the redundant bits only then do you achieve the same scaling. Hence building new AD while doing numerical over reverse for now