Need Diffractor.jl for State-of-the-Art Deep Learning Model

I agree that for vanilla higher-order AD there are better ways than nesting reverse mode.
E.g. forward over forward, forward over reverse (especially for Hessians), and Taylor mode.
I complained to Keno about this several times, but I have now been convinced.

The case Diffractor is useful for is not directly nested AD, but when there is a function in between.
i.e. AD → some function of gradients → AD

For example, imagine that I have some black box bb(x) and I am trying to train a neural net nn(x; θ) to imitate it (where θ is my neural network's parameters).
And I not only want the outputs to match, I also want the derivatives to match.
That makes each training example much more informative, which is good if bb is expensive to run (and it quite probably is, since that is one reason to train a neural net to imitate it in the first place).

So I write my loss function as the sum of the squared differences of both bb and nn at x, and of their derivatives bb' and nn' at x:

loss(x, θ) = (bb(x) - nn(x; θ))^2 + (bb'(x) - nn'(x; θ))^2
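
As a concrete sketch, here is roughly what that looks like with a toy `bb` and a toy parametric `nn` standing in for the real black box and network, and Zygote as the reverse-mode AD (all the names here are just placeholders for illustration):

```julia
using Zygote

# Toy stand-ins: in reality bb is an expensive black box and nn a neural network.
bb(x) = sin(x)
nn(x, θ) = θ[1] * x + θ[2] * x^2

# Derivatives with respect to x, each computed by an (inner) reverse-mode call.
bb′(x) = Zygote.gradient(bb, x)[1]
nn′(x, θ) = Zygote.gradient(x -> nn(x, θ), x)[1]

# Match both the values and the derivatives at x.
loss(x, θ) = (bb(x) - nn(x, θ))^2 + (bb′(x) - nn′(x, θ))^2
```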

Now I want to compute the derivative of loss with respect to θ so I can train my neural net.
Done naively, the result is that one would call reverse-mode AD on code generated by reverse-mode AD.
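
Continuing the sketch above, the training step would look something like the following. Whether a given Zygote version actually handles this particular nesting cleanly is beside the point; the point is the shape of the call: an outer reverse-mode pass over a loss whose body already contains an inner reverse-mode pass.

```julia
using Zygote

θ = [0.5, 0.1]
x = 1.0

# Outer reverse-mode call over `loss` from the sketch above, whose body already
# contains an inner Zygote.gradient call: reverse on code generated by reverse.
∇θ = Zygote.gradient(θ -> loss(x, θ), θ)[1]
```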

Maybe there is a smart way to reformulate it to avoid calling reverse on code generated by reverse.
If so, I would like to hear about it (especially if it generalizes to other problems that are not quite in this form).
