Which autodiff to currently use for a neural network backend?

Dear community,

I’ve been trying out some of the different autodiff packages out there. I’ve run some tests and simulations on the type of problem I’m addressing: basically, neural networks with high-dimensional input and low-dimensional output.

A bit of background: I’m trying to migrate parts of my code from PyTorch into Julia, hoping to gain the speed and expressive programming power that Julia may provide. :slight_smile:

So far the candidates I have in mind, along with my comments, are the following.

  • Knet’s gradient: Ran into problems with speed while running on CPUs.
  • ReverseDiff: Couldn’t get it to work, and I don’t understand why.
  • ForwardDiff: Works fine and is easy to use, but is slower than Zygote for my problems (see the sketch after this list).
  • Zygote: Works well and seems really quite fast; it beats PyTorch in differentiation speed, but the author warns that the package is still in its infancy and not ready for production use.
  • Capstan: I really like the idea and believe that this could be really fast and nice, but so far I haven’t seen much momentum in the effort despite Cassette being released.
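
For concreteness, this is roughly the kind of comparison I ran (a minimal sketch; the toy network and dimensions are made up for illustration and are not my actual model):

```julia
using ForwardDiff, Zygote

# Toy network: high-dimensional input, scalar output.
W1, b1 = randn(32, 1000), randn(32)
W2, b2 = randn(1, 32), randn(1)
loss(x) = sum(W2 * tanh.(W1 * x .+ b1) .+ b2)

x = randn(1000)

g_fwd = ForwardDiff.gradient(loss, x)   # forward mode: cost grows with the input dimension
g_rev = Zygote.gradient(loss, x)[1]     # reverse mode: one backward pass for a scalar output
```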

All in all, each of these packages is really cool and good at what it does in its own right. For my application, though, there may be experience here in the community that I can learn from.

I need them to work well on both GPUs and CPUs, and to be fast.

Any pointers or advice from others who have played around with this?

5 Likes

Missing from your list is Flux, which seems like the obvious answer here. I assume there will be an upgrade path to Zygote at some point, once it is sufficiently mature.

I had the impression ReverseDiff wasn’t working on 0.7 yet but could be wrong / out of date.

4 Likes

Does Flux spin its own reverse-mode autodiff? If so, do you know how it performs against Zygote?

1 Like

I am pretty much in the same situation with derivative-based MCMC. I am experimenting with Zygote and using ForwardDiff as a reliable fallback (not ideal for \mathbb{R}^n \to \mathbb{R}, but surprisingly effective nevertheless; larger chunks seem to work better for me). I am aware that Zygote is WIP, but issues get fixed quickly, the maintainers seem committed, and I can always fall back to ForwardDiff.
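
For reference, a minimal sketch of what I mean by picking a larger chunk size (the target function here is a made-up stand-in for a log density):

```julia
using ForwardDiff

# Made-up stand-in for an ℝⁿ → ℝ log density.
f(x) = -0.5 * sum(abs2, x)
x = randn(100)

# Pick the chunk size explicitly instead of relying on the heuristic default.
cfg = ForwardDiff.GradientConfig(f, x, ForwardDiff.Chunk{10}())
g = ForwardDiff.gradient(f, x, cfg)
```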

4 Likes

Note that MikeInnes is the author of both Flux and Zygote (which lives in the FluxML organization).
So it may be reasonable to stick with Flux; when Zygote is ready, making that switch ought to be seamless.

Also worth pointing out that because Zygote’s API is so simple, you don’t have to dedicate much code to building around it (eg, no need to manage compiled tapes).
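
A rough sketch of the contrast (the function is arbitrary and this is not a benchmark):

```julia
using ReverseDiff, Zygote

f(x) = sum(abs2, x .- 1)
x = rand(100)

# ReverseDiff: record a tape, compile it, and reuse it for repeated gradients.
tape  = ReverseDiff.GradientTape(f, rand(length(x)))
ctape = ReverseDiff.compile(tape)
g1 = similar(x)
ReverseDiff.gradient!(g1, ctape, x)

# Zygote: no tape to manage, just one call.
g2 = Zygote.gradient(f, x)[1]
```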

1 Like

Re Flux: yes, it implements its own reverse-mode AD. For my purposes (which are n-to-1) this has proved faster than the alternatives you listed.
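
A small sketch of what that looks like (assuming the current Tracker-based Flux; field names like `W` may differ across versions, and the toy model is just for illustration):

```julia
using Flux

m = Chain(Dense(1000, 32, tanh), Dense(32, 1))
x = randn(1000)
loss(x) = sum(m(x))

# Flux's built-in reverse-mode AD (Tracker) takes gradients w.r.t. the model parameters.
gs = Flux.Tracker.gradient(() -> loss(x), Flux.params(m))
gs[m[1].W]   # gradient with respect to the first layer's weight matrix
```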

Zygote I have barely played with, but as pointed out above it’s written by the same Mike, and shares some syntax e.g. for defining custom gradients.

You may want to check out

and

I have done many tests myself and found ReverseDiff.jl to be the most versatile package for reverse-mode diff. It’s one of the few that supports my use case with nested differentiation.
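
A minimal sketch of the kind of nested (second-order) use I mean; the function here is just a placeholder:

```julia
using ReverseDiff

f(x) = sum(abs2, x) / 2
x = randn(5)

ReverseDiff.gradient(f, x)   # first order: ≈ x for this f
ReverseDiff.hessian(f, x)    # second order: ≈ identity matrix for this f
```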

1 Like

AFAIK Nabla.jl does not (yet?) work on 0.7/1.0.

Thanks everyone for your suggestions and pointers. :slight_smile:

You also have AutoGrad.jl (part of Knet.jl).
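
Its interface is quite minimal too; roughly this (a sketch assuming the classic grad API, with a placeholder function):

```julia
using AutoGrad

f(x) = sum(abs2, x) / 2
df = grad(f)     # grad returns a function that computes the gradient of f
df(randn(10))    # for this f, the gradient equals the input
```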

Just a follow-up to this topic: I am experimenting with incorporating many of the above tools into a common framework that handles AD for the gradient (primarily for use in Bayesian inference); a rough sketch of the kind of backend wrapper I mean follows the list below.

I found that

  1. When Zygote.jl works, it is amazing. But it is experimental, and the usual caveats apply.

  2. Flux.jl is pretty good for reverse-mode AD. But when it breaks, I find debugging difficult.

  3. ReverseDiff.jl is pretty reliable, except when it is missing AD rules for methods. Then these need to be defined.

  4. In general, when looking at the discussions around some of the AD packages, there is a sentiment that great AD tools are just around the corner and that maintaining or developing the existing ones is perhaps wasted effort (eg #81 in Nabla.jl, ReverseDiff.jl’s README, and some others; I won’t link all of them). While this is understandable, and projects based on eg Cassette.jl are indeed very promising, I think there would be value in keeping the existing tools working during the transition period, which may take many months, if not years.
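
For reference, the kind of wrapper I have in mind looks roughly like this (a hypothetical sketch; the names are mine and not from any existing package):

```julia
using ForwardDiff, Zygote

# One type per AD backend; the sampler only ever calls value_and_gradient.
struct ForwardDiffBackend end
struct ZygoteBackend end

value_and_gradient(::ForwardDiffBackend, f, x) = (f(x), ForwardDiff.gradient(f, x))
value_and_gradient(::ZygoteBackend, f, x)      = (f(x), Zygote.gradient(f, x)[1])

# Example with a made-up log posterior.
logpost(θ) = -0.5 * sum(abs2, θ)
value_and_gradient(ZygoteBackend(), logpost, randn(3))
```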

8 Likes