Automatic Differentiation (AD) in Python compared to Julia and AD Basics

Hello everyone! The quotes below is a reconstruction of a wonderful conversation I had with Lyndon White (@oxinabox) on the Julia Slack on February 24, 2021 (Thank you to Lyndon as well to let me quote and repost this on Discourse)! I thought it might be useful to record it in Discourse in case anyone had similar questions about some of the differences in Automatic Differentiation in Python compared to Julia.

And we can also continue the discussion on this if anyone wants to add anything else.

Kim Laberinto (He/Him)
Is this (below) an accurate view of one of the the differences with Julia’s autodiff ecosystem and Python’s autodiff frameworks?
It seems to me likes long as your code is written in normal Julia code, the Julia autodiff packages will probably work(?) at doing the autodiff in ways you want! Which leads to things like being able to backprop through other Julia packages like powerful differential equation solvers.
Whereas in Python, with Tensorflow and Pytorch, you have to build the computational graph using the objects made for that purpose with knowing in mind in advance that you want to autodiff.
If anyone has some some other significant differences, feel free to suggest! I would love to look into them

Lyndon White
Don’t talk about TensorFlow 2.0 or Pytorch using the word Computational Graph.
While technically applicable to dynmaic, it does have strong connections to the explict graph structure TensorFlow. 1.0 used

Lyndon White
I would say yes, in Pytorch and TensorFlow the code does need to be written with AD specifically in mind, useful functions from those or made for those frameworks.
Rather than just what ever libraries you want.
Some might work through duck-typing. Those that that have a C backend (etc) won’t. Which is most serious computational packages in python.
On the upside: as long as you stay inside that walled garden most things are well tested and used by a ton of people, so you won’t run into many sharp edges.
On the significant downside, many things are outside of that walled garden.
Including efficient ways to solve small systems.
Like things other than traditional neural networks.
(E.g. compare torchdiffeq vs DiffEqFlux. On LokaVoltra, DIffEqFlux has solved the problem in less time that it took to even run a single forward pass in torchdiffeq)

Kim Laberinto (He/Him)
Thank you for the response!! That helps me get an insight on limitations in Python and how incredible Julia is!
First I thought all of AD was magic. I thought that all of AD used Dual Numbers. Then I learned what Reverse Mode AD was and realized I’ve actually seen and used reverse mode AD in Python before. Then I realized that in Julia you can AD anything/most things!! Now it is even more magical haha
Though I don’t quite understand what you meant by computational graphs in TF2.0 and PyTorch. I had thought that all of AD had to somehow construct the graph to be able to AD through it. But I guess there are other techniques!

Lyndon White
Consider:

function my_pow(x, n)
     net = x
     for i in 1:n-1
        net *= x
     end
     return net
end

That can’t construct a static computational graph. (and as a rule tensorflow people mean static computational graph when they say computational graph).
Because at compile time you don’t know how many iterations of the loop will occur.
Since it depends on the variable n.
So you can’t write out the computational structure of all operations that occur
(You actually can’t construct this in TensorFlow 1.0 without using some of the limitted dynamic extensions. I think this also causes problems for symbolic differentiation for similar reasons.).
What you can do is create a tape (sometimes the word trace is used interchangably, occationally it is used slightly differently).
A record of every operation that is done.
And that is all we need.
Infact that will generally be simpler than a graph.
Consider

function double_pos_square_neg(x)
    y = if x > 0
        2*x
    else  # x <= 0
        x^2
    end
    return y
end

Here we don’t need to generate the full graph, if the input is x=1 since we don’t care what happens in the x<0 branch.
We just need the instructions on the branch that was taken, since the other branch can’t affect our derivative since it didn’t occur.
Out tape thus doesn’t have to record it.
To do a reverse mode AD the we need to walk that tape backwards “pullingback” the derivative through each operation that was recorded.
Here is video of me explaining both forard then reverse in like 5 min
https://youtu.be/B4NfkkkJ7rs?t=119
here is slides ChainRules
Here is a very small reverse mode AD using ChainRules
Operator Overloading · ChainRules
See it constructs the tape recording how to pullback each operation that was preformed on the tracked variable, then it triggers them in sequence the propagate that gradient back until it is for the right variables.
(It does get blurry, Zygote will actually work out the code for the other branch, but won’t use it)

Kim Laberinto (he/him)
Wow that explains so much! Thank you. I kept seeing the term “tape” used and just thought it meant a computational graph. But it makes so much sense to just record a tape of operations that is done and then keep “pullingback” through the tape backwards to do reverse mode AD to the variables to what you take a derivative with respect to.

Lyndon White
if you like the tape is single path through the graph.
(or more properly i think it is something line: a topological sort of the inclusion of a single path. Is inclusion the word? Add the extra paths that have beginnings and ends on the main path)

10 Likes

In fact, this function can still be represented as a static graph with Loop operator with its own subgraph, something like:

MAIN
----
inputs: 
  (%1, %2) = (x, n)
tape:
  %3 = %1         # net = x
  %4 = Loop(LOOP_1, inputs=(%1, %3, %2))
  return %4

LOOP_1
------
inputs:
  %1, %2, %3 = (x, net, n)
tape:
  %4 = %1 * %2     # net * x
  %5 = %3 - 1      # slightly modified for simplicity
  %6 = %5 > 0      # next iteration condition
  return %4 if not %6

As far as I understand, this is approximately the Loop operator in ONNX is supposed to work.

Yeah that’s a “semi-static function”. There are loops which are truly dynamic, but that’s not one.

Many thanks for this comparison! However I do not quite understand them:

The argumentation is built on top of graphs: TF 2.0 does not need to use graphs. And in fact uses a “tape” for automatic differentiation.
So to my understanding, it seems what TF (eager), PyTorch and Julia do is the same then?

There is notably a large difference: the tape on arbitrary native logic (say loops in Python/Julia) works perfectly fine with TF/PyTorch, but the logic that can be traced and made into a static graph is strongly limited, whereas Julias JIT is way better and can handle “the whole language”.

Is that understanding correct?

Correct.

More or less correct.
Though it has little to do with the JIT.
And more to do with Julia being very flexible while also being (making overloading AD that feasible enough to run on most code) and julia being fast enough that you are not leaving julia for some compiled languages which is less flexible and can’t support it.
Plus julia’s compiler being sufficiently hack-able that source code transformation ADs (Zygote, Yota, Diffractor) can be created (and which are even more flexible)

TF Eager, PyTorch, and older Julia tools (Tracker, ReverseDiff, etc.) use a tape. Tapes are a bit slow because they have overhead in the forward pass and they cannot optimize backwards passes really well because the pass can change each time (every new value can give a new tape). Tensorflow 1.x and Jax work on quasi-static codes, i.e. codes that can be expanded to a static compute graph (not surprising they have the same limitation given they are both built heavily on the same XLA IR). This is very fast but it limits the kinds of codes that can be handled (fully dynamic codes are beyond this limitation) or at least optimize.

Newer Julia tools (Zygote, Diffractor, Enzyme, etc.) work directly on the lowered code doing a source-to-source transformation. Unlike a tape, this source can contain all branches meaning you can fully JIT compile and optimize it without much issue (it won’t change for different forward values). And unlike the XLA-based tools it can fully optimize things which include value-dependent while loops (yes, Jax can represent some of these things but notice its JIT compilation and/or reverse mode AD passes are not compatible with these features because that’s beyond the quasi-static domain they generally allow). These tools have a harder time optimizing some operations as much as XLA because of the much wider program space that they tackle though, but get to rely on the Julia compiler itself and its optimization passes which are steadily improving over time.

Are there cases which thread this needle where PyTorch, Jax, and Tensorflow wouldn’t optimize the method but the Julia tools would? I wrote a blog post detailing one such architecture: Useful Algorithms That Are Not Optimized By Jax, PyTorch, or Tensorflow - Stochastic Lifestyle . But more importantly, once you see one example it’s pretty clear what kinds of algorithms will have that property.

6 Likes

Great, many thanks for the explanations and the link to the blog post, which contains all I wanted to know!

For the cases, the most striking to me are all adaptive algorithms for numerical integration and similar; that is the main reason why we strongly consider using Julia