Need Diffractor.jl for State-of-the-Art Deep Learning Model

Freddy_Chua · April 16, 2021, 4:37pm

Hi, I would like to beta test Diffractor.jl, your next-generation AD.

I am a co-author of DeepCPCFG (arXiv:2103.05908, "DeepCPCFG: Deep Learning and Context Free Grammars for End-to-End Information Extraction").
This research was also presented as a guest lecture in MIT's Intro to Deep Learning course (MIT 6.S191: "Deep CPCFG for Information Extraction", on YouTube).

I have run into memory leaks while using Zygote.jl to implement the above ML model, which combines dynamic programming with a deep neural network. I have provided an MWE here: https://github.com/deepcpcfg/zygote_memleak

and the issue is here https://github.com/FluxML/Zygote.jl/issues/930

To work around this issue, I have had to resort to Tracker.jl.

I am requesting to beta test Diffractor.jl.

@Keno @dhairyagandhi96 @darsnack

DoktorMike · April 16, 2021, 5:40pm

I’ve heard whispers of this Diffractor.jl in many places now but have never seen a dev/WIP repo for it. I’m anxious to try it too, should it ever come into existence.

sethaxen · July 16, 2021, 10:13am

Word has it that it will be released at JuliaCon.

Tomas_Pevny · July 16, 2021, 11:23am

I am looking forward to it so much.

DoktorMike · July 16, 2021, 6:38pm

Can’t wait.

rkube · July 26, 2021, 2:19pm

https://github.com/Keno/Diffractor.jl

Freddy_Chua · July 27, 2021, 6:47am

We are witnessing a milestone in the history of Computer Science.

oxinabox · July 27, 2021, 9:32am

“We are witnessing a milestone in the history of Computer Science.”

Cool your jets*, ace.
Diffractor is pretty exciting, but at the end of the day it is not world changing.

It’s probably going to have many bugs – but different bugs to Zygote.

Still, I am also pretty hyped.
We are indeed going to be able to do a lot of cool things faster, especially when it comes to derivatives of functions of derivatives.

(*Pun not intended, but kinda good: Jets are used for higher order AD, which Diffractor does)

1115 · July 27, 2021, 10:36am

I can see some tests of higher-order AD in Diffractor; they are good evidence of the effort put into compiler optimization. I am just wondering: what is the typical use case for higher-order reverse-mode AD?

For 2nd order, we just use forward over backward.
For 3rd order or higher, I cannot see much advantage in using reverse-mode AD, because the input dimension cannot be large.

Is it just for checking the compiler performance?

ChrisRackauckas · July 27, 2021, 10:40am

There is a use case for 2 reverse passes (but not more, the rest forward) in some applications of physics-informed neural networks, and so you can see PINNs as the test case. Fun fact, Diffractor.jl was grant funded for these weird SciML applications (same ARPA-E that made ModelingToolkit.jl, Symbolics.jl, and JuliaSim), but of course fast higher order AD is something anyone could benefit from.

oxinabox · July 27, 2021, 10:43am

I agree that for vanilla higher order AD, there are better ways than nesting reverse.
E.g. forward over forward, forward over reverse (especially for Hessians), and Taylor mode.
I complained to Keno about this several times, but I have now been convinced.

The case Diffractor is useful for is not directly nested AD but when there is a function in between.
i.e. AD → some function of gradients → AD

For example, imagine that I have some black box bb(x) and I am trying to train a neural net nn(x; θ) to imitate it (where θ is my neural network parameters).
And I not only want the output to match, I also want the derivative to match.
That will mean each training example is much more informative – which is good if bb is expensive to run (which it quite probably is, since that is one reason to train a nn to imitate it).

So I write my loss function as the sum of the squared differences both of bb and nn at x, and of their derivatives bb' and nn' at x:

loss(x, θ) = (bb(x) - nn(x; θ))^2 + (bb'(x) - nn'(x; θ))^2

Now I want to compute the derivative of loss with respect to θ so I can train my neural net.
Done naively the result is that one would call reverse mode AD on code generated by reverse mode AD.

Maybe there is a smart way to reformulate it to avoid calling reverse on code generated by reverse.
If so, I would like to hear about it. (especially if it is generalizable to other problems that are not quite in this form)
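
For concreteness, here is a minimal sketch of that loss in Julia. Everything specific in it is an assumption made for illustration: bb is a cheap stand-in (sin), nn is a hypothetical two-parameter model, θ is passed positionally rather than as a keyword, and ForwardDiff.jl is used for both levels of differentiation. That makes it forward-over-forward, not the reverse-over-reverse case discussed above; it only shows the "AD → function of gradients → AD" shape on a problem small enough that the mode does not matter.

```julia
using ForwardDiff

# Hypothetical stand-ins: a cheap "black box" and a tiny two-parameter model.
bb(x) = sin(x)
nn(x, θ) = θ[1] * tanh(θ[2] * x)

# Inner AD: derivatives with respect to the input x.
dbb(x) = ForwardDiff.derivative(bb, x)
dnn(x, θ) = ForwardDiff.derivative(z -> nn(z, θ), x)

# A function of those derivatives: match both values and slopes at x.
loss(x, θ) = (bb(x) - nn(x, θ))^2 + (dbb(x) - dnn(x, θ))^2

# Outer AD: gradient of the loss with respect to θ,
# differentiating through the inner derivative call.
θ0 = [1.0, 0.5]
x0 = 0.3
g = ForwardDiff.gradient(θ -> loss(x0, θ), θ0)
```

With a real network θ would have many entries, which is why one would want the outer pass to be reverse mode instead.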

1115 · July 27, 2021, 10:55am

Thanks for your example; it looks like a decent use case for nested reverse-mode AD.
This reminds me that in physics we often switch the description between energy, forces, and acceleration. When including acceleration in the loss, it is a 3rd order reverse mode AD w.r.t. the energy model. Makes sense.

ChrisRackauckas · July 27, 2021, 11:03am

If NN is N → 1 sized, then you want to do reverse mode for the x values per x in the loss function, but then train by doing reverse mode once for the theta. That’s then a reason for reverse-over-reverse. Then for the second derivative, N → 1 becomes N → N after the first, and in that case you want to do forward-over-reverse-over-reverse. This is exactly the PINN reasoning, BTW.
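
A small sketch of the shape change described above, assuming a hypothetical scalar function u standing in for the network and using ForwardDiff.jl throughout purely for simplicity (so it does not reproduce the reverse-mode choices, only the N → 1 to N → N step):

```julia
using ForwardDiff

# Hypothetical N -> 1 function standing in for the network output.
u(x) = sum(abs2, x) / 2

# After one derivative the map is N -> N ...
∇u(x) = ForwardDiff.gradient(u, x)

# ... so the second derivative is an N×N Jacobian of that map.
x = [1.0, 2.0, 3.0]
H = ForwardDiff.jacobian(∇u, x)
```

Since the first derivative has N outputs for N inputs, reverse mode no longer has an output-dimension advantage there, which is the motivation for switching to forward mode at that level.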
