Hello everyone! The quotes below is a reconstruction of a wonderful conversation I had with Lyndon White (@oxinabox) on the Julia Slack on February 24, 2021 (Thank you to Lyndon as well to let me quote and repost this on Discourse)! I thought it might be useful to record it in Discourse in case anyone had similar questions about some of the differences in Automatic Differentiation in Python compared to Julia.
And we can also continue the discussion on this if anyone wants to add anything else.
Kim Laberinto (He/Him)
Is this (below) an accurate view of one of the the differences with Julia’s autodiff ecosystem and Python’s autodiff frameworks?
It seems to me likes long as your code is written in normal Julia code, the Julia autodiff packages will probably work(?) at doing the autodiff in ways you want! Which leads to things like being able to backprop through other Julia packages like powerful differential equation solvers.
Whereas in Python, with Tensorflow and Pytorch, you have to build the computational graph using the objects made for that purpose with knowing in mind in advance that you want to autodiff.
If anyone has some some other significant differences, feel free to suggest! I would love to look into them
Don’t talk about TensorFlow 2.0 or Pytorch using the word Computational Graph.
While technically applicable to dynmaic, it does have strong connections to the explict graph structure TensorFlow. 1.0 used
I would say yes, in Pytorch and TensorFlow the code does need to be written with AD specifically in mind, useful functions from those or made for those frameworks.
Rather than just what ever libraries you want.
Some might work through duck-typing. Those that that have a C backend (etc) won’t. Which is most serious computational packages in python.
On the upside: as long as you stay inside that walled garden most things are well tested and used by a ton of people, so you won’t run into many sharp edges.
On the significant downside, many things are outside of that walled garden.
Including efficient ways to solve small systems.
Like things other than traditional neural networks.
(E.g. compare torchdiffeq vs DiffEqFlux. On LokaVoltra, DIffEqFlux has solved the problem in less time that it took to even run a single forward pass in torchdiffeq)
Kim Laberinto (He/Him)
Thank you for the response!! That helps me get an insight on limitations in Python and how incredible Julia is!
First I thought all of AD was magic. I thought that all of AD used Dual Numbers. Then I learned what Reverse Mode AD was and realized I’ve actually seen and used reverse mode AD in Python before. Then I realized that in Julia you can AD anything/most things!! Now it is even more magical haha
Though I don’t quite understand what you meant by computational graphs in TF2.0 and PyTorch. I had thought that all of AD had to somehow construct the graph to be able to AD through it. But I guess there are other techniques!
function my_pow(x, n) net = x for i in 1:n-1 net *= x end return net end
That can’t construct a static computational graph. (and as a rule tensorflow people mean static computational graph when they say computational graph).
Because at compile time you don’t know how many iterations of the loop will occur.
Since it depends on the variable n.
So you can’t write out the computational structure of all operations that occur
(You actually can’t construct this in TensorFlow 1.0 without using some of the limitted dynamic extensions. I think this also causes problems for symbolic differentiation for similar reasons.).
What you can do is create a tape (sometimes the word trace is used interchangably, occationally it is used slightly differently).
A record of every operation that is done.
And that is all we need.
Infact that will generally be simpler than a graph.
function double_pos_square_neg(x) y = if x > 0 2*x else # x <= 0 x^2 end return y end
Here we don’t need to generate the full graph, if the input is x=1 since we don’t care what happens in the x<0 branch.
We just need the instructions on the branch that was taken, since the other branch can’t affect our derivative since it didn’t occur.
Out tape thus doesn’t have to record it.
To do a reverse mode AD the we need to walk that tape backwards “pullingback” the derivative through each operation that was recorded.
Here is video of me explaining both forard then reverse in like 5 min
here is slides ChainRules
Here is a very small reverse mode AD using ChainRules
Operator Overloading · ChainRules
See it constructs the tape recording how to pullback each operation that was preformed on the tracked variable, then it triggers them in sequence the propagate that gradient back until it is for the right variables.
(It does get blurry, Zygote will actually work out the code for the other branch, but won’t use it)
Kim Laberinto (he/him)
Wow that explains so much! Thank you. I kept seeing the term “tape” used and just thought it meant a computational graph. But it makes so much sense to just record a tape of operations that is done and then keep “pullingback” through the tape backwards to do reverse mode AD to the variables to what you take a derivative with respect to.
if you like the tape is single path through the graph.
(or more properly i think it is something line: a topological sort of the inclusion of a single path. Is inclusion the word? Add the extra paths that have beginnings and ends on the main path)