[blog post] Implement your own AD with Julia in ONE day

blog-post

#1

I was wondering how easy and simple can it be to implement a simple straight forward AD (reverse mode) for machine learning and quantum physics in Julia. So I tried to write my own last weekend.

And the answer is about only 200~400 lines (including doc strings), you can get an AD with basic function defined in DiffRules and broadcast support with a reasonable performance (actually it maybe the fastest by now XD).

Check it here: http://blog.rogerluo.me/2018/10/23/write-an-ad-in-one-day/


#2

How long do you think it will reach feature parity with torch? Also, based on this experience can you see yourself at some point switch to Julia entirely? What about practitioners who are not comfortable with writing their own AD? Which machine package would you advice them to learn? PyTorch or some Julia package?


#3

It will depends on how many contributors are there, I won’t implement things I totally don’t need at the moment (like multi-GPU support for conv, and RNN units). PyTorch is actually quite similar to Chainer, but PyTorch has a more active community.

I’m a physicist working on machine learning, which means sometimes people in machine learning community don’t care about what we need and we have to implement them by ourselves, e.g complex number support, it will take a quite long time (can be years) to merge new things into the main tree of a large project like PyTorch. It is painful and not actually necessary for researchers, check the issue and progress here:

I’m still working on this thing because of our legacy dependencies in the lab, however, I, personally with my lab-mates, collaborators, have switched to Julia entirely, I have built several packages that I need for research:

And more in private.

Some of them (e.g QuHamiltonian.jl) is not quite possible to implement in Python (or it would be quite hard with the ast module). Most of the python packages we write at the moment are just for public and non-pros who does not interested in coding at all. It is a nightmare comparing to Julia to bind C++ with Python, even there is pybind11.

Furthermore, Julia has the best support for tensor networks among all the languages, Python only has an i tensor wrapper. But Julia has TensorOperations.jl and another coming package of Jutho, and the author of iTensor is also writing a Julia version of it.

I’m actually writing this AD package because of a practical problem, a recent model implemented in PyTorch is too slow and I cannot use a batched trace in PyTorch, because it does not have (I don’t want to write C++ extension, and even I wrote one, it could still slower because of the python wrapper), and I cannot just use a for loop in Python, because it is slow as well, and the lattice libraries are slow as well in Python. I speed up my own model about 10x faster (on CPU) comparing to PyTorch (with almost the same syntax) in just a few days.

And I cannot just move Jutho’s TensorOperations.jl entirely to Python, meta-programming in Python comparing to what we want in TensorOperations.jl does not look possible to implement (or you will create your own DSL beneath Python like many other Python packages do).

If you are really a “practitioner”, under this situation, I believe you will choose Julia (if you don’t want to write your own, add custom operator in Zygote.jl is faster than PyTorch on CPU, and you can use mine in the future) rather than write your own PyTorch C++ extension with its C++ interface.

Being a Practitioner is not the reason to be lazy: if there is a package good enough, then use it, if there is not, then write one.

I don’t suggest to “learn” ANY machine learning package, because what you should learn is the algorithm and theory. Most machine learning package is designed to be intuitive enough that as long as you familiar with the theory, you will know how to use it. If you don’t know how to use it, it is either because the user does not actually know the theory/how this machine learning algorithm works or the package author should change their interface.

But well, if someone just say I don’t want to learn any theory, I just want to call a function and then I run a new deep learning algorithm with it. You will probably need a time machine and a black hole computer then.

I can use Flux.jl/Knet.jl/PyTorch/TensorFlow or just write from scratch as long as I find one of the approach is the fastest. I don’t actually see much difference between those packages, people are making similar interfaces with different implementation now.


#4

Thanks for the interesting and well-written blog post. AD implementation is indeed a good use case to demo the power of a language.

However, I wonder if we have too much of a good thing, as at the moment there are at least 5 reverse-mode AD libraries which are in various stages of being experimental, minimally maintained while waiting for an experimental one to be usable in production, targeting specific use cases/communities (eg ML), or catching up to 0.7/1.0 (these are not exclusive). AFAIK all of them have outstanding bugs that require some compromise or extra work on part of the user. As you have shown, Julia makes it easy to write a minimal AD library; the difficult part is maintaining one that is robust and performant for various use cases.

An outsider looking at the reverse-mode AD landscape in Julia could wonder what compels people to write yet another library for this, and whether this reflects a problem with the language.


#5

On related terms, I recently stumbled upon this manifesto for AD in Swift: https://gist.github.com/rxwei/30ba75ce092ab3b0dce4bde1fc2c9f1d

I don’t know too much about AD, but was wondering which of these ambitious points are addressable (or already addressed?) within the Julia ecosystem, or whether they are even on the radar. Or, is there anything that the Swifter’s aim to be able to, which would pose problems to Julia?

(if this is too off-topic, I can make a separate thread)


#6

Yes, there’s a lot AD package under development in Julia. And as you said, what I wanted was a simple and straight forward AD for practical use.

However, I don’t think this reflects a problem of the language, but it reflects an advantage: while struggling with learning how to add a new operator to C++, one can write a fast and usable AD with only a few lines in Julia.

The other AD packages in Julia has different goals, e.g Zygote aims to provide a source 2 source AD by extending the compiler, this is definitely better and harder to implement.

The older but more mature one: AutoGrad, which inherited its Python version, is as slow as its Python version since it is not written in a very Julian way (e.g.some of the type are not parametric, which is not suggested by performance tips), and you will need to generate derivatives by a primitive macro, which I personally does not prefer. But it is under refactoring.

And some other attempts tried to implement source 2 source AD by macros, or overloading (actually multiple dispatch via traits).

But yes, as you said, we probably want something usable and easy for the user: and that’s probably what is YAAD going to do next. Because it is tiny, it won’t be hard to fix future bugs. And because it make use of multiple dispatch, one can easily extend it with defining only one or two method. And because it tries to mimic the interface of popular package PyTorch (can’t mimic v0.4’s tensor though), it won’t be hard to switch to it, while waiting for more promising packages like Zygote and Capstan, we might just use it first.

I believe not only for AD, but also other area, one can use Julia to implement something tiny but usable.

I’ll write an ANN later, when I add more operators to YAAD.


#7

This was discussed in slack. I don’t actually think we need to change the language to adapt AD.

Those cassette based AD in Julia will directly extend the compiler to be able to mark and differentiate expression without tweaking the language. Making AD first class might bite those whose don’t actually need it.


#8

Each AD makes different compromises to flexibility and performance, changing their applicability. I think it’s nice to have these options.


#9

Do you know if there is a simple “Pros and cons” page somewhere? Otherwise the risk is that, while for the expert developer Julia is a dream language for AD as there are many options and it’s very easy to roll your own, the less technically knowledgeable developer ends up a bit confused on what to use for their library code.


#10

It is always nice to have options. However, the AD landscape is very fragmented. Only ForwardDiff.jl is robust, with the reverse mode package it is very easy to run into problems using seemingly trivial code.

To be fair, doing AD while preserving the generic code is very difficult, as it highlights all the problems of result type computation etc.


#11

Autograd has a lot of untyped stuff in its graph building types and it has a macro for defining primitives. This makes it work on pretty much everything, but the untyped parts reduces efficiency. However, on something like a neural net where the matrix multiplies take all of the time, the small amounts of dynamic dispatch won’t matter and it’s a good choice. On functions with a lot of small subfunction calls, this will be a non-trivial performance difference.

ReverseDiff and Flux are very similar. They are the reverse mode of ForwardDiff and uses types to essentially trace a computation graph. Mike and Jarrett can duke it out, but to me it seems ReverseDiff applies to more places but that has changed over time. YAAD also uses tracker types, but is a very simple implementation, but probably more similar to these two than not. However, tracker types only trace the branch that the values take. So while you can compile the computation graph and keep it with ReverseDiff, repeated applications of the gradient are only correct if it traced out something appropriate for the new value. This is a pretty fundamental limitation if you want to build a graph once and spend time optimizing/compiling it to re-use.

Zygote is source-to-source, and its paper describes how it can get a performance advantage by allowing all branches to compile and optimize at once. Capstan is via Cassette, which is essentially a form of source-to-source transformations using Cassette overdubing. Again, Mike and Jarrett are working on something the is probably more similar than different here, for similar reasons but for different applications. But Zygote already exists and Cassette/Capstan is still more of a near future thing, so :man_shrugging:. However, while tracker-based systems are easy to control (you just define a new dispatch on the type that says what the derivative is), I am not sure how customizable source-to-source is, but here’s a challenge problem that can give it an issue:

const x = Vector{Float64}(undef,4)
function f!(z,y,x)
  x .= 2.*y
  z .= sin.(x)
  nothing
end
g!(z,y) -> f!(z,y,x)
# Challenge: autodiff z = g!(y) 

I am not sure how Zygote would know how to handle the cache array, while with a type you can create a dual cache system that works with type-based AD via multiple dispatch. Capstan might be able to handle this because it’s using Cassette which is essentially a flexible and overridable source-to-source engine, but this is to be seen.

So for now, Zygote.jl is awesome if it works for your code. If not, ReverseDiff and Flux are good to go to, and ReverseDiff can store/compile the computation graph if appropriate to get similar speeds to Zygote, but you have to be careful with the application. Autograd you can easily get working on pretty much anything, but there’s a dispatch cost associated with it. Capstan and Cassette might be a beautiful system in the near future for both AD and customizing the source transformation, but it’s not here yet and I’m not sure most Julia users will actually know how to write overdubs.

For now, I always find ForwardDiff and ReverseDiff robust enough to send through big codes (entire differential equation solvers) with ease, and am waiting to see what happens with source-to-source.


#12

Hi guys, I just updated my blog with Flux’s AD, it is approximating the baseline for what I need for tr(x1 * x2)!

julia> @benchmark bench_mul_tr_flux(x1, x2)
BenchmarkTools.Trial:
  memory estimate:  30.25 KiB
  allocs estimate:  24
  --------------
  minimum time:     8.017 μs (0.00% GC)
  median time:      10.060 μs (0.00% GC)
  mean time:        14.592 μs (30.22% GC)
  maximum time:     16.378 ms (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     3

I thought Flux was using ReverseDiff directly which is not actually true, that’s why I didn’t tested in the post, because ReverseDiff is not active maintained anymore. And thanks @MikeInnes to mention Flux’s AD here. And I would be happy to help if we could make a similar separated AD package in the future.


#13

Yes, I implemented YAAD in a very similar way comparing to Flux’s AD mixed with similar conventions from PyTorch (both backends and frontends this may make PyTorch users easier to adapt). I’m just hoping we can have a separate package for Flux’s AD now!

While waiting for Capstan and Zygote, we need something to use at the moment.


#14

I tried something similar a while back (here) but stopped because the language was changing in each version. Are you willing to accept pull requests? It would be great if you could create some more issues for the plans you have in mind in the github repository.


#15

I’m still considering what we are going to do with YAAD.jl, since Flux’s Tracker actually looks more optimized. I will probably choose to mock Flux’s tracker (e.g move it out of Flux), or keep using a extreme simple AD with this reasonable performance (not the fastest now, haha).

I’ll file some issue under YAAD.jl’s repo later along with an ANN here in discourse. And I’m definitely happy to accept PRs!


#16

I have a non-tracker type, but a global tape version as well, since there is only about 200 lines, not a big thing to implement them both XD:


#17

Note that Flux has an explicit @grad rule for matrix multiplication, which should be fast, while it looks like Zygote does not (yet?): https://github.com/FluxML/Zygote.jl/blob/master/src/lib/array.jl compare to https://github.com/FluxML/Flux.jl/blob/master/src/tracker/array.jl line 327.

So perhaps it must fall back on some generic for-loop multiplication, and having a go with this Naive matrix multiplication is super slow in Julia 1.0? version gives me a slowdown of almost this magnitude.


#18

Several packages that need Flux’s AD (e.g. Omega and Turing) just depend on Flux directly. There isn’t much downside to that since there’s not much else to Flux anyway (basically just some layer definitions), so the advantage of splitting it out is relatively minimal.

That said, we will likely split it out once Zygote and Capstan are ready to be used as the default AD. But this is not going to be the case for a few months at the least.


#19

I tried to define the matrix multiplication explicitly:

Zygote.@grad LinearAlgebra.tr(x) = LinearAlgebra.tr(x), Δ-> (Δ * Matrix(I, size(x)), )
Zygote.@grad Base.:(*)(lhs::Matrix, rhs::Matrix) = lhs * rhs, grad -> (grad * transpose(rhs), transpose(lhs) * grad)

Or

Zygote.@grad Base.:(*)(lhs::Matrix, rhs::Matrix) = BLAS.gemm('N', 'N', lhs, rhs), grad -> (grad * transpose(rhs), transpose(lhs) * grad)

And it seems this does not help…


#20

That is curious, I hadn’t tried (no 1.0 on laptop). If you add in println("forward") & println("back"), does this definition get called?