Need Diffractor.jl for State-of-the-Art Deep Learning Model

Hi, I would like to beta test your next-generation AD, Diffractor.jl.

I am a co-author of DeepCPCFG (arXiv:2103.05908, “DeepCPCFG: Deep Learning and Context Free Grammars for End-to-End Information Extraction”).
This research was also presented as a guest lecture in MIT's Intro to Deep Learning course: MIT 6.S191: Deep CPCFG for Information Extraction - YouTube

I have experienced some memory leak issues while using Zygote.jl to implement the above ML model, which combines dynamic programming with deep neural networks. I have provided an MWE here: GitHub - deepcpcfg/zygote_memleak

and the corresponding issue is here: Memory leak on worker process · Issue #930 · FluxML/Zygote.jl · GitHub

To work around this issue, I have had to resort to Tracker.jl.

Hence my request to beta test Diffractor.jl.

@Keno @dhairyagandhi96 @darsnack


I’ve heard whispers of this Diffractor.jl in many places now, but never seen a dev/WIP repo for it. I’m anxious to try it too, should it ever come into existence. :blush:


Word has it that it will be released at JuliaCon.


I am really looking forward to it.

Can’t wait. :muscle:t2::+1:t2::white_check_mark::blush:


We are witnessing a milestone in the history of Computer Science.


:joy: Cool your jets*, ace.
Diffractor is pretty exciting, but at the end of the day it is not world changing.

It’s probably going to have many bugs – but different bugs to Zygote.

Still, I am also pretty hyped.
We are indeed going to be able to do a lot of cool things faster, especially when it comes to derivatives of functions of derivatives.

(*Pun not intended, but kinda good: jets are used for higher-order AD, which Diffractor does.)


I can see some tests for higher-order AD in Diffractor, which is good evidence of its effort in optimizing compilation. I am just wondering: what is the typical use case for higher-order reverse-mode AD?

For 2nd order, we just use forward over backward.
For 3rd order or higher, I cannot see much advantage in using reverse-mode AD, because the input dimension cannot be large.
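To make the "forward over backward" point concrete, here is a minimal numeric sketch (all names illustrative). Reverse mode gives the full gradient in one pass; a single forward-mode pass through that gradient then gives a Hessian-vector product, whatever the input dimension. For self-containment, the forward pass is approximated with a central difference rather than a real dual-number push-through (which is what ForwardDiff-over-Zygote style implementations do):

```julia
# Sketch: forward-over-reverse for Hessian-vector products.
# The forward pass is approximated with a central difference here;
# a real implementation would push forward-mode duals through
# reverse-mode AD instead of differencing.

f(x) = 0.5 * sum(x .^ 2) + x[1] * x[2]   # toy scalar objective, R^N -> R

# Hand-coded stand-in for the reverse-mode gradient of f.
grad(x) = x .+ [x[2], x[1], zeros(length(x) - 2)...]

# Hessian-vector product: directional derivative of the gradient along v.
hvp(x, v; eps=1e-6) = (grad(x .+ eps .* v) .- grad(x .- eps .* v)) ./ (2eps)

x = [1.0, 2.0, 3.0]
v = [1.0, 0.0, 0.0]
hvp(x, v)   # Hessian is [1 1 0; 1 1 0; 0 0 1], so H*v ≈ [1, 1, 0]
```

The cost is one gradient evaluation per direction `v`, which is why second order rarely needs nested reverse mode.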

Is it just for checking the compiler performance?


There is a use case for two reverse passes (but not more; the rest forward) in some applications of physics-informed neural networks, so you can see PINNs as the test case. Fun fact: Diffractor.jl was grant-funded for these weird SciML applications (the same ARPA-E program that made ModelingToolkit.jl, Symbolics.jl, and JuliaSim), but of course fast higher-order AD is something anyone could benefit from.


I agree that for vanilla higher-order AD there are better ways than nesting reverse mode,
e.g. forward over forward, forward over reverse (especially for Hessians), and Taylor mode.
I complained to Keno about this several times, but I have now been convinced.

The case Diffractor is useful for is not directly nested AD, but when there is a function in between,
i.e. AD → some function of gradients → AD.

For example, imagine that I have some black box bb(x) and I am trying to train a neural net nn(x; θ) to imitate it (where θ is my neural network's parameters).
And I want not only the outputs to match, but also the derivatives.
That makes each training example much more informative, which is good if bb is expensive to run (and it quite probably is, since that is one reason to train an nn to imitate it).

So I write my loss function as the sum of the squared differences, both of bb and nn at x, and of their derivatives bb' and nn' at x:

loss(x, θ) = (bb(x) - nn(x; θ))^2 + (bb'(x) - nn'(x; θ))^2
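To make the shape of this concrete, here is a self-contained 1D sketch of such a derivative-matching loss; `bb`, `nn`, θ, and the finite-difference stand-in for AD are all illustrative. In practice the inner derivatives would come from reverse-mode AD (Zygote/Diffractor), and the training step would then differentiate `loss` itself w.r.t. θ, which is exactly the reverse-over-reverse nesting described above:

```julia
# Sketch of a derivative-matching loss in 1D (illustrative functions).
# `d` stands in for the inner AD pass; training would then
# differentiate `loss` w.r.t. θ, nesting AD over it.

bb(x) = sin(x)                      # "expensive black box" stand-in
nn(x, θ) = θ[1] * x + θ[2] * x^3    # tiny "network": a cubic in x

# Central-difference derivative in x, standing in for reverse-mode AD.
d(f, x; eps=1e-6) = (f(x + eps) - f(x - eps)) / (2eps)

loss(x, θ) = (bb(x) - nn(x, θ))^2 + (d(bb, x) - d(z -> nn(z, θ), x))^2

θ = [1.0, -1/6]     # Taylor coefficients of sin near 0, so loss is tiny
loss(0.1, θ)
```

The derivative term means every evaluation of `loss` already contains a derivative of `nn`, so `∂loss/∂θ` is inherently a second, nested differentiation.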

Now I want to compute the derivative of loss with respect to θ so I can train my neural net.
Done naively, the result is that one would call reverse-mode AD on code generated by reverse-mode AD.

Maybe there is a smart way to reformulate it to avoid calling reverse on code generated by reverse.
If so, I would like to hear about it (especially if it generalizes to other problems that are not quite in this form).


Thanks for your example; it looks like a decent use case for nested reverse-mode AD.
It reminds me that in physics we often switch between descriptions in terms of energy, forces, and acceleration. When acceleration is included in the loss, that is 3rd-order reverse-mode AD w.r.t. the energy model. Makes sense.

If NN is N → 1 sized, then you want to do reverse mode for the x values per x in the loss function, but the n train doing reverse mode once for the theta. That’s then a reason for reverse-over-reverse. Then for the second derivative, N->1 becomes N->N after the first, in that case you then want to do forward-over-reverse-over-reverse. This is exactly the PINN reasons BTW.