Current status of nested AD

Could anyone comment on the current status of nested AD? What works? What doesn’t? And what does the future look like?

In my application, I am trying to learn a conservative vector field—that is, a vector field that is the gradient of a scalar potential. I would like to model the potential using a neural network, take it’s gradient wrt to its inputs, and use the resulting gradient as part of the loss function. The gradient of the loss function wrt the parameters would then be used to update the model. Similar uses come up quite frequently in the form of gradient penalties in GANs (e.g. WGAN-GP) and in PINNs, SciML, etc.

I have had a lot of difficulty in getting this to work. There are a lot of combinations of AD tools that one could use for the inner and outer AD. It’s quite a bit of work to try the various combinations without much sense of their likelihood of working.

I have tried a number of different combinations, including Zygote over {Zygote, ReverseDiff, ForwardDiff}. These all fail due to some internal mutation by the AD library, or an adjoint is written in a way that doesn’t support higher-order AD.

My next step is to try ReverseDiff over the other options to see if I can get anything to work. For my specefic example reverse-mode over reverse-mode is going to be most efficient (I think).

Is there any guidance about what is likely to work?

My understanding is that Diffractor is designed around these sorts of use-cases. What is its status?