[blog post] Implement your own AD with Julia in ONE day

Roger-luo · October 23, 2018, 4:21am

I was wondering how easy and simple it could be to implement a simple, straightforward AD (reverse mode) for machine learning and quantum physics in Julia. So I tried to write my own last weekend.

And the answer is only about 200~400 lines (including docstrings): with that you get an AD with the basic functions defined in DiffRules and broadcast support, with reasonable performance (actually it may be the fastest by now XD).

Check it here: http://blog.rogerluo.me/2018/10/23/write-an-ad-in-one-day/

xiaodai · October 23, 2018, 5:54am

How long do you think it will take to reach feature parity with torch? Also, based on this experience, can you see yourself switching to Julia entirely at some point? What about practitioners who are not comfortable with writing their own AD? Which machine learning package would you advise them to learn? PyTorch or some Julia package?

Roger-luo · October 23, 2018, 6:44am

It will depend on how many contributors there are; I won’t implement things I don’t need at all at the moment (like multi-GPU support for conv, and RNN units). PyTorch is actually quite similar to Chainer, but PyTorch has a more active community.

I’m a physicist working on machine learning, which means that sometimes people in the machine learning community don’t care about what we need and we have to implement it ourselves, e.g. complex number support. It can take quite a long time (possibly years) to merge new things into the main tree of a large project like PyTorch. That is painful and not actually necessary for researchers; check the issue and its progress here:

github.com/pytorch/pytorch

Integrating complex tensors

opened 11:06PM - 15 Feb 17 UTC

closed 04:25PM - 08 Feb 22 UTC

PhilippPelz

feature triaged module: complex

New description from @ezyang:

Work is in progress at https://github.com/Roger…-luo/pytorch-complex

## Organizational principles

* Complex tensor support is important to PyTorch, and we will accept patches to core which add small amounts of code to make adding complex support.
* Adding complex involves writing a lot of new kernels and code: we'd like this code to initially live out of repo, so it is easier for people to iterate quickly on them without having to go through the PyTorch main code review process. We will *NOT* commit to reviewing large new kernels in the short term, but eventually we would like all the kernels to come back to PyTorch.
* The external library will be buildable separately from PyTorch, so you will be able to maintain it as a separate repository without having to merge with PyTorch (and deal with loads of merge conflicts).
* PyTorch may occasionally make breaking changes in C++ API; if you bring these to our attention we will do our utmost to help solve these problems.
* The hooks needed for this will NOT ship with PyTorch 1.0, but they will ship with a released version of PyTorch in the not too distant future.

## How will I work on complex kernels?

Here is what the workflow will look like in the steady state.

**PyTorch will natively contain APIs for referring to the complex dtype, but they won't do anything by default.** PyTorch defines torch.complex64 and torch.complex128 referring to complex tensors. However, if you try to construct a tensor this way, by default, PyTorch will error:

```
>>> torch.zeros({2,2}, dtype=torch.complex64)
RuntimeError: complex64 not supported by PyTorch
```

@ezyang provided a patch which adds these dtypes to PyTorch. https://github.com/pytorch/pytorch/pull/11173

In the mid-term, we will merge support for basic functionality (like allocating a tensor of zeros) to be supported by PyTorch natively. A reasonable proxy for what support is “basic” is PyTorch's native support for CPU half tensors (which are extremely impoverished).

**PyTorch publishes an interface for registering an implementation of complex tensors.** The implementation inherits from the TypeDefault class (https://github.com/pytorch/pytorch/pull/11013) and will override methods on this class to define implementations of functions for which we have complex implementations. It will look something like this:

```
struct CPUComplexFloatType final : public TypeDefault {
  virtual Tensor add(const Tensor & self, const Tensor & other, Scalar alpha=1) const override {
    // Your implementation of add for complex tensors
  }
  // ...
}
```

This class will override exactly the types which are supported for complex; all other implementations are provided by TypeDefault and will error by default. There will be a canonical listing of methods supported on Type (the overall interface) as an autogenerated file that is checked into the PyTorch source repository; we'll communicate API changes by diffs to this file. In general, the methods are in one-to-one correspondence with their corresponding names in the PyTorch frontend.

In general, when you use an operation which you haven't implemented yet,

**WARNING:** We intend to refactor Type away into a new system that also supports open registration of new operations (this obviously doesn't work if you have a single superclass that defines all the methods you might possibly want to support). Thus, try not to get too tied to the particular implementation strategy of writing Type as a subclass.

**To publish new, complex only operations, you will use the C++ extension API.** The C++ extension API is documented at https://pytorch.org/tutorials/advanced/cpp_extension.html Essentially, you can write a C++ function like:

```
at::Tensor imag(at::Tensor z) {
  ...
}
```

And then the C++ extension API will generate a Python binding so that you invoke this function from Python.

**Some operations will be “easy” to integrate into PyTorch as it exists today.** For example, for implementation of binary operations, it probably makes more sense to extend add_kernel in BinaryOpsKernel.cpp so that it dispatches over complex types (and then you get it for free, because std::complex implements addition). As long as these patches are small and self-contained, we promise to merge them on a timely basis.

It should ALWAYS be possible to unblock, by just writing an override on Type instead of using existing infrastructure, and doing liberal copy pasting. But let's avoid it when it's easy!

**Autograd.** As long as you're working on operations which already have derivative formulas defined for them, you will “automatically” get autograd support, as long as you implement complex support for all the constituent functions which are invoked in the backwards implementation from derivatives.yaml.

In some cases, we may need to adjust autograd formulas so that they work for complex numbers; e.g., the gradient of 'abs' isn't 'grad . self.sign()'. In these cases, all we need to do is upstream fix of changing the autograd formula of 'abs' to 'abs_backward', which is a function that can be overridden.

For general complex valued back propagation, there are some references:

1. *Akira’s “Complex Valued Neural Networks”.*
2. https://giggleliu.github.io/2018/02/01/complex_bp.html

Generally, we won't need to modify the autograd since in most cases we only calculate the derivatives of a real-valued function (the loss).

## Work plan

Many of the necessary pieces are in place today, but they are not put together in an end-to-end way. Here is what needs to be done.

- [X] Codemod TH to not ifdef real https://github.com/pytorch/pytorch/pull/11163
- [X] Built-in support for torch.complex64 and torch.complex128 dtypes. https://github.com/pytorch/pytorch/pull/11173
- [X] An interface for registering CPUComplexType, etc., so that this implementation is invoked when you request a complex tensor with dtype=torch.complex64 or do an operation on complex tensors.
- [X] Land https://github.com/pytorch/pytorch/pull/11013
- [X] An end-to-end example, including working build system, of a separately compileable C++ program that links against libtorch and uses the aforementioned interface to implement complex tensor allocation.

Short term integration plan. These operations are “easy” to implement, and so we should mainline them in PyTorch as soon as possible.

- [X] Basic tensor factories: torch.empty, torch.zeros, torch.ones
- [ ] CPU binary operations: add, sub, mul, div #11641
- [ ] FFT
- [ ] ???

Kernel implementation:

TODO: Generate a list based on https://github.com/Roger-luo/TH/blob/master/ChangeLog.md

Other complex related tasks:

- [ ] Figure out the type promotion rules for complex tensors, and implement it in promoteTypes #11641

## Historical issue content

Original comment from @PhilippPelz

I was wondering if there is interest in incorporating complex tensors into pytorch. For CPU support there is ztorch and I have written z-cutorch ( https://github.com/PhilippPelz/z-cutorch ) a while ago. It is a fork off cutorch before the refactoring for CudaHalfTensor (don't have the hardware yet). If it's not too much work, I would like to slowly integrate it with pytorch. I am using matplotlib for plotting via fb.ptyhon and it turns out a huge pain every time I reinstall my system (compiling all the dependencies), plus it seems pytorch will work under Windows soon, which one of my experiment PCs runs on. I would also need complex gradients, so I would sooner or later touch autograd as well. While tf supports complex tensors per se, it seems many ops don't support it yet (https://github.com/tensorflow/tensorflow/issues/2255), plus it seems a bit heavyweight for my purposes.

Maybe someone could say a few words how and where to start with this, if it's a welcome idea.

I’m still working on this because of our legacy dependencies in the lab. However, I personally, along with my lab-mates and collaborators, have switched to Julia entirely. I have built several packages that I need for research:

And more in private.

Some of them (e.g. QuHamiltonian.jl) are not really possible to implement in Python (or would be quite hard with the ast module). Most of the Python packages we write at the moment are just for the public and for non-pros who are not interested in coding at all. Compared to Julia, binding C++ with Python is a nightmare, even with pybind11.

Furthermore, Julia has the best support for tensor networks among all the languages; Python only has an iTensor wrapper. Julia has TensorOperations.jl and another upcoming package from Jutho, and the author of iTensor is also writing a Julia version of it.

I’m actually writing this AD package because of a practical problem: a recent model implemented in PyTorch is too slow. I cannot use a batched trace in PyTorch because it does not have one (I don’t want to write a C++ extension, and even if I wrote one, it could still be slower because of the Python wrapper), I cannot just use a for loop in Python because it is slow as well, and the lattice libraries in Python are slow too. I made my own model about 10x faster (on CPU) compared to PyTorch (with almost the same syntax) in just a few days.
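
To make the missing operation concrete: a batched trace takes a stack of square matrices and returns the trace of each one. The following is only a plain-Julia sketch under assumed conventions (the name `batched_tr` and the n×n×B array layout are hypothetical, not taken from YAAD or PyTorch); it shows that an explicit loop is a reasonable way to write this in Julia.

```julia
# Hypothetical helper (the name and the n×n×B layout are assumptions):
# returns the trace of every n×n slice A[:, :, b].
function batched_tr(A::AbstractArray{T,3}) where {T}
    n, m, B = size(A)
    n == m || throw(DimensionMismatch("each slice must be square"))
    out = zeros(T, B)
    for b in 1:B, i in 1:n
        out[b] += A[i, i, b]   # accumulate the diagonal of slice b
    end
    return out
end

A = reshape(collect(1.0:8.0), 2, 2, 2)  # two 2×2 slices
traces = batched_tr(A)                  # one trace per slice
```

Since the trace is linear, the gradient of each entry of the result with respect to its own slice is simply the identity matrix, which is what makes this a cheap primitive to add to an AD.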

And I cannot just move Jutho’s TensorOperations.jl entirely to Python: the meta-programming that TensorOperations.jl relies on does not look possible to implement in Python (or you end up creating your own DSL beneath Python, like many other Python packages do).

If you are really a “practitioner”, in this situation I believe you will choose Julia (if you don’t want to write your own: adding a custom operator in Zygote.jl is faster than PyTorch on CPU, and you can use mine in the future) rather than write your own PyTorch C++ extension with its C++ interface.

Being a practitioner is no reason to be lazy: if there is a package that is good enough, use it; if there is not, write one.

I don’t suggest “learning” ANY machine learning package, because what you should learn is the algorithms and the theory. Most machine learning packages are designed to be intuitive enough that, as long as you are familiar with the theory, you will know how to use them. If you don’t know how to use one, it is either because you do not actually know the theory/how the machine learning algorithm works, or because the package author should change the interface.

But well, if someone just says “I don’t want to learn any theory, I just want to call a function and run a new deep learning algorithm with it”, they will probably need a time machine and a black hole computer.

I can use Flux.jl/Knet.jl/PyTorch/TensorFlow or just write things from scratch, whichever approach I find to be the fastest. I don’t actually see much difference between those packages; people are making similar interfaces with different implementations now.

Tamas_Papp · October 23, 2018, 10:56am

Thanks for the interesting and well-written blog post. AD implementation is indeed a good use case to demo the power of a language.

However, I wonder if we have too much of a good thing, as at the moment there are at least 5 reverse-mode AD libraries which are in various stages of being experimental, minimally maintained while waiting for an experimental one to be usable in production, targeting specific use cases/communities (eg ML), or catching up to 0.7/1.0 (these are not exclusive). AFAIK all of them have outstanding bugs that require some compromise or extra work on part of the user. As you have shown, Julia makes it easy to write a minimal AD library; the difficult part is maintaining one that is robust and performant for various use cases.

An outsider looking at the reverse-mode AD landscape in Julia could wonder what compels people to write yet another library for this, and whether this reflects a problem with the language.

jtackm · October 23, 2018, 11:21am

On related terms, I recently stumbled upon this manifesto for AD in Swift: https://gist.github.com/rxwei/30ba75ce092ab3b0dce4bde1fc2c9f1d

I don’t know too much about AD, but was wondering which of these ambitious points are addressable (or already addressed?) within the Julia ecosystem, or whether they are even on the radar. Or is there anything that the Swift folks aim to be able to do which would pose problems for Julia?

(if this is too off-topic, I can make a separate thread)

Roger-luo · October 23, 2018, 2:09pm

Yes, there are a lot of AD packages under development in Julia. And as you said, what I wanted was a simple and straightforward AD for practical use.

However, I don’t think this reflects a problem with the language; it reflects an advantage: while others are still struggling to learn how to add a new operator in C++, one can write a fast and usable AD with only a few lines of Julia.

The other AD packages in Julia have different goals, e.g. Zygote aims to provide source-to-source AD by extending the compiler, which is definitely better and harder to implement.

The older but more mature one, AutoGrad, which is descended from its Python version, is as slow as that version since it is not written in a very Julian way (e.g. some of the types are not parametric, which the performance tips advise against), and you need to generate derivatives with a primitive macro, which I personally do not prefer. But it is being refactored.

And some other attempts tried to implement source-to-source AD with macros, or with overloading (actually multiple dispatch via traits).

But yes, as you said, we probably want something usable and easy for the user, and that’s probably what YAAD is going to aim for next. Because it is tiny, it won’t be hard to fix future bugs. Because it makes use of multiple dispatch, one can easily extend it by defining only one or two methods. And because it tries to mimic the interface of the popular package PyTorch (it can’t mimic v0.4’s tensor though), it won’t be hard to switch to it. While waiting for more promising packages like Zygote and Capstan, we might just use it first.
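
To illustrate the "one or two methods" point, here is a minimal sketch of a tracker-style reverse-mode AD for scalars. It is not YAAD's actual code: the names `Tracked`, `leaf` and `backward!` are made up for this example, and the design is deliberately naive (the `pullback` field is abstractly typed and shared subgraphs are revisited once per path).

```julia
# Hypothetical tracked scalar; not YAAD's or Flux's actual types.
mutable struct Tracked{T<:Real}
    value::T
    grad::T
    parents::Vector{Tracked{T}}
    pullback::Function   # Δ -> tuple with one gradient per parent
end

# An input variable: no parents, nothing to propagate.
leaf(v::T) where {T<:Real} = Tracked(v, zero(T), Tracked{T}[], Δ -> ())

# Each primitive is a single method: compute the value, remember the parents,
# and say how an incoming gradient Δ is passed back to them.
Base.:+(a::Tracked{T}, b::Tracked{T}) where {T} =
    Tracked(a.value + b.value, zero(T), [a, b], Δ -> (Δ, Δ))
Base.:*(a::Tracked{T}, b::Tracked{T}) where {T} =
    Tracked(a.value * b.value, zero(T), [a, b], Δ -> (Δ * b.value, Δ * a.value))
Base.sin(a::Tracked{T}) where {T} =
    Tracked(sin(a.value), zero(T), [a], Δ -> (Δ * cos(a.value),))

# Reverse pass: accumulate Δ into this node, then recurse into the parents.
function backward!(node::Tracked, Δ = one(node.value))
    node.grad += Δ
    for (parent, g) in zip(node.parents, node.pullback(Δ))
        backward!(parent, g)
    end
    return nothing
end

x, y = leaf(2.0), leaf(3.0)
z = x * y + sin(x)
backward!(z)
(x.grad, y.grad)   # gradients of z with respect to x and y
```

Supporting another function, say `exp`, would be one more method of the same shape as the `sin` rule; that is the kind of extension by dispatch described above.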

I believe that, not only for AD but also in other areas, one can use Julia to implement something tiny but usable.

I’ll write an ANN later, when I add more operators to YAAD.

Roger-luo · October 23, 2018, 2:29pm

This was discussed on Slack. I don’t actually think we need to change the language to accommodate AD.

The Cassette-based ADs in Julia will directly extend the compiler to be able to mark and differentiate expressions without tweaking the language. Making AD first class might bite those who don’t actually need it.

ChrisRackauckas · October 23, 2018, 3:42pm

Each AD makes different compromises to flexibility and performance, changing their applicability. I think it’s nice to have these options.

piever · October 23, 2018, 4:04pm

Do you know if there is a simple “Pros and cons” page somewhere? Otherwise the risk is that, while for the expert developer Julia is a dream language for AD as there are many options and it’s very easy to roll your own, the less technically knowledgeable developer ends up a bit confused on what to use for their library code.

Tamas_Papp · October 23, 2018, 4:08pm

It is always nice to have options. However, the AD landscape is very fragmented. Only ForwardDiff.jl is robust; with the reverse-mode packages it is very easy to run into problems using seemingly trivial code.

To be fair, doing AD while preserving the generic code is very difficult, as it highlights all the problems of result type computation etc.

ChrisRackauckas · October 23, 2018, 4:37pm

Autograd has a lot of untyped stuff in its graph building types and it has a macro for defining primitives. This makes it work on pretty much everything, but the untyped parts reduce efficiency. However, on something like a neural net where the matrix multiplies take all of the time, the small amounts of dynamic dispatch won’t matter and it’s a good choice. On functions with a lot of small subfunction calls, this will be a non-trivial performance difference.
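
As a small illustration of what "untyped" means here, compare a node type with an unannotated field to a parametric one. Both structs are hypothetical and not taken from AutoGrad or any other package; the point is only what the compiler can know about the field.

```julia
# Hypothetical graph-node types, for illustration only.
struct UntypedNode
    value            # no annotation: the field type is Any
end

struct TypedNode{T}
    value::T         # the field type is fixed once T is known
end

untyped_field = fieldtype(UntypedNode, :value)         # Any
typed_field   = fieldtype(TypedNode{Float64}, :value)  # Float64

# Code that reads an Any field must look up the right method at run time
# (dynamic dispatch); with a concrete field type the call can be resolved
# at compile time.
double(n) = 2 * n.value
```

In the REPL, `@code_warntype double(UntypedNode(1.0))` versus `@code_warntype double(TypedNode(1.0))` shows the difference in what gets inferred.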

ReverseDiff and Flux are very similar. They are the reverse-mode counterparts of ForwardDiff and use types to essentially trace a computation graph. Mike and Jarrett can duke it out, but to me it seems ReverseDiff applies to more places, though that has changed over time. YAAD also uses tracker types; it is a very simple implementation, but probably more similar to these two than not. However, tracker types only trace the branch that the values take. So while you can compile the computation graph and keep it with ReverseDiff, repeated applications of the gradient are only correct if it traced out something appropriate for the new value. This is a pretty fundamental limitation if you want to build a graph once and spend time optimizing/compiling it to re-use.
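
The branch limitation can be shown with a toy tracer. This is a hypothetical scalar-only sketch (the names `Traced`, `record!`, `trace` and `replay_derivative` are invented here, and this is not how ReverseDiff stores its tapes): it records the unary operations a function executes, then replays the recording at another input.

```julia
# Toy tracer (hypothetical, scalar-only): records each unary operation
# together with its derivative while the function runs.
struct Traced
    value::Float64
    tape::Vector{Tuple{Function,Function}}   # (operation, derivative) pairs
end

record!(t::Traced, op, dop) = (push!(t.tape, (op, dop)); Traced(op(t.value), t.tape))

Base.sin(t::Traced) = record!(t, sin, cos)
Base.:-(t::Traced) = record!(t, -, v -> -one(v))
Base.:>(t::Traced, c::Real) = t.value > c   # control flow only sees the value; nothing is recorded

f(x) = x > 0 ? sin(x) : -x

# Run f once on a traced input and return the recorded tape.
function trace(f, x::Real)
    tape = Tuple{Function,Function}[]
    f(Traced(float(x), tape))
    return tape
end

# Replay a tape at a new input, accumulating the derivative by the chain rule.
function replay_derivative(tape, x::Real)
    v, d = float(x), 1.0
    for (op, dop) in tape
        d *= dop(v)
        v = op(v)
    end
    return d
end

tape = trace(f, 1.0)                   # records only the x > 0 branch (sin)
same_branch  = replay_derivative(tape, 2.0)    # cos(2.0), which is f'(2.0)
other_branch = replay_derivative(tape, -1.0)   # cos(-1.0), but f'(-1.0) is -1
```

Replaying the tape is fine as long as the new input takes the same branch; for an input on the other side of the `if`, the function has to be traced again.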

Zygote is source-to-source, and its paper describes how it can get a performance advantage by allowing all branches to compile and optimize at once. Capstan works via Cassette, which is essentially a form of source-to-source transformation using Cassette overdubbing. Again, Mike and Jarrett are working on things that are probably more similar than different here, for similar reasons but for different applications. But Zygote already exists, and Cassette/Capstan is still more of a near-future thing. However, while tracker-based systems are easy to control (you just define a new dispatch on the type that says what the derivative is), I am not sure how customizable source-to-source is. Here’s a challenge problem that can give it an issue:

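const x = Vector{Float64}(undef, 4)  # preallocated Float64 cache
function f!(z, y, x)
  x .= 2 .* y       # the intermediate result is written into the cache
  z .= sin.(x)
  nothing
end
g!(z, y) = f!(z, y, x)  # uses the global cache x
# Challenge: autodiff z with respect to y through g!(z, y)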

I am not sure how Zygote would know how to handle the cache array, while with a type you can create a dual cache system that works with type-based AD via multiple dispatch. Capstan might be able to handle this because it’s using Cassette which is essentially a flexible and overridable source-to-source engine, but this is to be seen.
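
One way to read "dual cache" is sketched below. A toy forward-mode number stands in for whatever tracked type an AD package would push through the function; `Dual`, `DualCache` and `get_cache` are hypothetical names for this example, not an existing package API. The cache holds one buffer per element type, and dispatch on the input picks the matching one. With only a `Vector{Float64}` cache, `x .= 2 .* y` would fail for AD inputs, because a `Dual` cannot be converted to `Float64`.

```julia
# Toy forward-mode number standing in for an AD package's tracked type (hypothetical).
struct Dual{T<:Real}
    val::T
    der::T
end
Base.:*(a::Real, d::Dual) = Dual(a * d.val, a * d.der)
Base.sin(d::Dual) = Dual(sin(d.val), cos(d.val) * d.der)

# Hypothetical cache holding one buffer per element type.
struct DualCache{T<:Real}
    plain::Vector{T}
    dual::Vector{Dual{T}}
end
DualCache(n::Integer) =
    DualCache(Vector{Float64}(undef, n), Vector{Dual{Float64}}(undef, n))

# Multiple dispatch selects the buffer that matches the input's element type.
get_cache(c::DualCache, ::AbstractVector{<:Real}) = c.plain
get_cache(c::DualCache, ::AbstractVector{<:Dual}) = c.dual

function f!(z, y, cache::DualCache)
    x = get_cache(cache, y)
    x .= 2 .* y
    z .= sin.(x)
    nothing
end

cache = DualCache(2)

# Ordinary evaluation uses the Float64 buffer.
yp = [0.0, 0.5]
zp = zeros(2)
f!(zp, yp, cache)

# Seeding der = 1 gives the elementwise derivative 2cos(2y) in z[i].der.
yd = [Dual(0.0, 1.0), Dual(0.5, 1.0)]
zd = similar(yd)
f!(zd, yd, cache)
```

The same function body serves both calls; only the buffer changes. Whether a source-to-source tool can be taught an equivalent trick is the open question raised above.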

So for now, Zygote.jl is awesome if it works for your code. If not, ReverseDiff and Flux are good to go to, and ReverseDiff can store/compile the computation graph if appropriate to get similar speeds to Zygote, but you have to be careful with the application. Autograd you can easily get working on pretty much anything, but there’s a dispatch cost associated with it. Capstan and Cassette might be a beautiful system in the near future for both AD and customizing the source transformation, but it’s not here yet and I’m not sure most Julia users will actually know how to write overdubs.

For now, I always find ForwardDiff and ReverseDiff robust enough to send through big codes (entire differential equation solvers) with ease, and am waiting to see what happens with source-to-source.

Roger-luo · October 23, 2018, 4:59pm

Hi guys, I just updated my blog with Flux’s AD; it is close to the baseline for what I need for tr(x1 * x2)!

julia> @benchmark bench_mul_tr_flux(x1, x2)
BenchmarkTools.Trial:
  memory estimate:  30.25 KiB
  allocs estimate:  24
  --------------
  minimum time:     8.017 μs (0.00% GC)
  median time:      10.060 μs (0.00% GC)
  mean time:        14.592 μs (30.22% GC)
  maximum time:     16.378 ms (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     3

I thought Flux was using ReverseDiff directly, which is not actually true; that’s why I didn’t test it in the post, because ReverseDiff is not actively maintained anymore. Thanks to @MikeInnes for mentioning Flux’s AD here. I would be happy to help if we could make a similar, separate AD package in the future.

Roger-luo · October 23, 2018, 5:02pm

Yes, I implemented YAAD in a way very similar to Flux’s AD, mixed with similar conventions from PyTorch (in both backend and frontend; this may make it easier for PyTorch users to adapt). I’m just hoping we can have a separate package for Flux’s AD now!

While waiting for Capstan and Zygote, we need something to use at the moment.

lakshgupta · October 23, 2018, 5:04pm

I tried something similar a while back (here) but stopped because the language was changing in each version. Are you willing to accept pull requests? It would be great if you could create some more issues for the plans you have in mind in the github repository.

Roger-luo · October 23, 2018, 5:10pm

I’m still considering what we are going to do with YAAD.jl, since Flux’s Tracker actually looks more optimized. I will probably either mimic Flux’s tracker (e.g. move it out of Flux), or keep using an extremely simple AD with this reasonable performance (not the fastest now, haha).

I’ll file some issues under YAAD.jl’s repo later, along with an ANN here on Discourse. And I’m definitely happy to accept PRs!

Roger-luo · October 23, 2018, 5:18pm

I also have a version that uses a global tape instead of tracker types; since it is only about 200 lines, it is not a big deal to implement both XD:

github.com/Roger-luo/YAAD.jl

Refactor to a tape-based AD

Roger-luo:master ← Roger-luo:tape

opened 04:18AM - 22 Oct 18 UTC

Roger-luo

+343 -104

Refactor the core part to be a tape-based AD. It will register a tracked value in the tape each time; a key in the tape will be garbage collected if there is no reference to it (the tape is a WeakKeyIdDict). However, performance decreased compared to the original implementation, e.g. for tr(x1 * x2), the total time increased by about 3 μs for two rand(30, 30) matrices.

improbable22 · October 23, 2018, 5:40pm

Note that Flux has an explicit @grad rule for matrix multiplication, which should be fast, while it looks like Zygote does not (yet?): https://github.com/FluxML/Zygote.jl/blob/master/src/lib/array.jl compare to https://github.com/FluxML/Flux.jl/blob/master/src/tracker/array.jl line 327.

So perhaps it must fall back on some generic for-loop multiplication, and having a go with the version from the thread “Naive matrix multiplication is super slow in Julia 1.0?” gives me a slowdown of almost this magnitude.

MikeInnes · October 23, 2018, 5:46pm

Several packages that need Flux’s AD (e.g. Omega and Turing) just depend on Flux directly. There isn’t much downside to that since there’s not much else to Flux anyway (basically just some layer definitions), so the advantage of splitting it out is relatively minimal.

That said, we will likely split it out once Zygote and Capstan are ready to be used as the default AD. But this is not going to be the case for a few months at the least.

Roger-luo · October 23, 2018, 5:49pm

I tried to define the matrix multiplication explicitly:

Zygote.@grad LinearAlgebra.tr(x) = LinearAlgebra.tr(x), Δ-> (Δ * Matrix(I, size(x)), )
Zygote.@grad Base.:(*)(lhs::Matrix, rhs::Matrix) = lhs * rhs, grad -> (grad * transpose(rhs), transpose(lhs) * grad)

Or

Zygote.@grad Base.:(*)(lhs::Matrix, rhs::Matrix) = BLAS.gemm('N', 'N', lhs, rhs), grad -> (grad * transpose(rhs), transpose(lhs) * grad)

And it seems this does not help…
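
For reference, the pullbacks used in the rules above can be checked by hand without any AD package. The sketch below only assumes the standard identities that the gradient of tr(X1 * X2) with respect to X1 is transpose(X2), and with respect to X2 is transpose(X1); the function names are made up for this example and are not Zygote API.

```julia
using LinearAlgebra

# Pullback of tr: the gradient of tr(x) with respect to x is the identity, scaled by Δ.
tr_pullback(x, Δ) = Δ * Matrix{eltype(x)}(I, size(x))

# Pullback of lhs * rhs: the same expressions as in the rules above.
mul_pullback(lhs, rhs, Δ) = (Δ * transpose(rhs), transpose(lhs) * Δ)

# Gradient of tr(x1 * x2) obtained by chaining the two pullbacks.
function grad_tr_mul(x1, x2)
    y = x1 * x2
    Δy = tr_pullback(y, one(eltype(y)))
    return mul_pullback(x1, x2, Δy)
end

x1 = [1.0 2.0; 3.0 4.0]
x2 = [5.0 6.0; 7.0 8.0]
g1, g2 = grad_tr_mul(x1, x2)   # expected: transpose(x2) and transpose(x1)
```

So the formulas themselves are fine; if they do not speed things up, the question is whether these definitions are actually being hit.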

improbable22 · October 23, 2018, 5:53pm

That is curious, I hadn’t tried (no 1.0 on laptop). If you add in println("forward") & println("back"), does this definition get called?
