Okay yes, this is a gradient-of-a-gradient example, not a gradient example. The gradient of a gradient of a FastChain won’t work because the FastChain adjoint uses mutation. The adjoint could be specialized to handle this case, but since reverse-over-reverse is almost never an efficient way to calculate the second derivative (forward-over-adjoint is just better in almost all respects; see the sketch below), I’m not sure it’s a high priority to specialize this.
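For reference, here’s a minimal sketch of what forward-over-adjoint looks like with Zygote and ForwardDiff, using a made-up toy loss `f` in place of a FastChain (since the FastChain adjoint is what breaks here):

```julia
using ForwardDiff, Zygote

f(x) = sum(abs2, x)  # toy scalar loss, stand-in for a real model

# Reverse mode on the inside (one cheap gradient sweep), forward mode
# on the outside (one dual sweep per input dimension). The result is
# the Jacobian of the gradient, i.e. the Hessian.
hess = ForwardDiff.jacobian(x -> Zygote.gradient(f, x)[1], [1.0, 2.0, 3.0])
# 3×3 matrix; for this f it's just 2I
```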
(And BTW, taking a gradient of a gradient is a TensorFlow misnomer. It’s actually the Jacobian of a gradient, unless it’s a scalar function and thus the second derivative; otherwise the sizes don’t align. TensorFlow silently takes gradient = sum of the Jacobian’s rows, as I describe here: Gradient of Gradient in Zygote - #3 by ChrisRackauckas. You should really double-check whether that summation is the interpretation you wanted.)
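To see that summation concretely, here’s a small sketch (the vector function `g` is made up for illustration): for a non-scalar output, the TF-style “gradient” is the vector-Jacobian product seeded with ones, which is the Jacobian summed over the output dimension.

```julia
using Zygote

g(x) = [x[1]^2, x[1] * x[2]]   # made-up vector function: R^2 -> R^2
x0 = [3.0, 4.0]

J = Zygote.jacobian(g, x0)[1]  # the full 2×2 Jacobian

# What a TF-style `gradients` call gives you: the VJP seeded with ones,
# i.e. the Jacobian summed over the output dimension
tfstyle = Zygote.gradient(x -> sum(g(x)), x0)[1]

@assert vec(sum(J; dims = 1)) ≈ tfstyle   # [10.0, 3.0] either way
```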