Reddit discussion: limitations of Zygote

There is a Reddit discussion of the Swift/TensorFlow project that has a couple of comments about Julia.

I am interested in this statement - to what extent is this true?

If the end goal is to “differentiate all the things”, where by things we mean all Julia packages, then Zygote might be the answer. Problem is that, from what I understand of it, it can’t really differentiate all current Julia packages unless they all adhere to a strict coding format or either the compiler is significantly altered to cope with what Zygote expects.

The Zygote docs do not seem to mention such coding format.

The discussion also says

IIRC SciML is about differential equations, not differentiable programming.

which I think may be untrue in theory but true in practice (I gather that the Flux community currently focuses mainly on differential equations, with less interest in general ML?)

I never understood what it means to “differentiate a program”.
From calc I know what it means to differentiate a differentiable function f:R^n \to R^m.
I realize we can sometimes think of a program as a function, but I don’t see how to compute its derivative.

@MikeInnes @ChrisRackauckas et al have a paper I’ve seen mentioned often, but I never understood it.

The guy who began the reddit discussion has a post on differentiable programming where he writes:

In a nutshell, differentiable programming is a programming paradigm in which your program itself can be differentiated. This allows you to set a certain objective you want to optimize, have your program automatically calculate the gradient of itself with regards to this objective, and then fine-tune itself in the direction of this gradient…

He gives an example (I’ll paraphrase)

cube(x) = x^3
cube𝛁 = gradient(of: cube)
cube(2)   // 8.0
cube𝛁(2)  // 12.0

I’m confused, this is just the derivative of a differentiable function f:R \to R.
He continues

There are no libraries or external code being used here, gradient is simply a new function that is being introduced by the S4TF team into the Swift language…
This is Swift’s big new feature. You can take arbitrary Swift code and, as long as it’s differentiable, automatically calculate its gradient. The code above has no imports or weird dependencies, it’s just plain Swift.


So either:

  1. we are “just” computing derivatives of differentiable functions (in a convenient way that works w/ all libraries in the language)
  2. I’m even more confused than when I began

It’s 1, with the key insight being how much of a pain that is.
Have you ever run the chain rule through a for loop? What about recursion? It’s not fun or easy.
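To make the "chain rule through a for loop" point concrete, here is a minimal hand-written sketch. The function name `pow_with_derivative` is made up for illustration; it computes x^n by repeated multiplication and carries the derivative along by applying the product rule at every iteration, which is the bookkeeping an AD tool does for you.

```julia
# Hypothetical example: y = x^n computed by a loop, with the derivative
# propagated by hand through each iteration.
function pow_with_derivative(x, n)
    y = 1.0    # value of x^0
    dy = 0.0   # derivative of x^0 with respect to x
    for _ in 1:n
        # The update is y_new = y * x, so by the product rule
        # dy_new = dy * x + y. Update dy first: it needs the old y.
        dy = dy * x + y
        y = y * x
    end
    return y, dy
end

y, dy = pow_with_derivative(2.0, 3)   # (8.0, 12.0), matching cube and its gradient above
```

Doing this by hand for one short loop is manageable; doing it for every loop, branch, and recursive call in a real program is the pain being described.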


It’s this one. I describe the key of differentiable programming as really just knowing that you can get derivatives from an arbitrary program here in a more example-oriented way:


Can all differentiable functions in Julia be differentiated w/ Zygote.jl?
If so, the only difference I see is that you don’t need to load a library to compute a gradient in Swift, whereas you need using Zygote in Julia…

Zygote currently can not deal with mutation, but that is WIP.
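As a rough illustration of the mutation restriction, here is a sketch with two made-up functions, `sumsq_mutating` and `sumsq_pure`, that compute the same thing. It assumes Zygote is installed and that differentiating through an in-place array write raises an error, which is how I understand the current behaviour; this may change between versions.

```julia
using Zygote

# Hypothetical example: sum of squares written with an in-place array write.
function sumsq_mutating(x)
    out = zeros(length(x))
    for i in eachindex(x)
        out[i] = x[i]^2      # setindex! on an array: this is the mutation
    end
    return sum(out)
end

# The same computation without mutation.
sumsq_pure(x) = sum(abs2, x)

x = [1.0, 2.0, 3.0]

# Zygote.gradient returns a tuple with one entry per argument.
g_pure = Zygote.gradient(sumsq_pure, x)[1]   # the gradient is 2 .* x

# Assumption: the mutating version throws under Zygote instead of
# returning a gradient.
mutation_failed = try
    Zygote.gradient(sumsq_mutating, x)
    false
catch
    true
end
```

Both functions are perfectly good Julia; only the second is in the style Zygote handles at the moment.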


IIRC, In Swift you have to mark things as @differentiable, so you do need buy-in from the package authors. I can see this being an issue when you want to throw some messaging protocols like webservers in the loop.

But I don’t really see these kinds of details as an actual big deal with differentiable programming. I think the bigger deal is what other codes are working with your differentiable programming world that you can plug together. With Julia, that includes things like stiff delay differential equation solvers, verified commercial pharmacometric simulation platforms, climate models, and full-featured robotics simulation libraries. I really think that what matters the most is ChainRules.jl and the culture of pervasive derivative rules and overloading everything for highest performance derivatives.


I think it is telling that when I google Swift the only things I can find are articles about app development.

@tbeason the narrative I keep hearing (not sure I understand):

  1. the only two languages where you can differentiate all (differentiable) functions are Julia & Swift
  2. Julia is the only one of those two w/ extensive scientific libraries.
    Swift is intended for developing apps for iOS/MacOS.

There is a beautiful example in the paper where they differentiate sin(x)

# Define sin(x) through its Taylor series
function s(x)
    t = 0.0
    sign = -1.0
    for i in 1:19
        if isodd(i)
            newterm = x^i/factorial(i)
            abs(newterm)<1e-8 && return t
            sign = -sign
            t += sign * newterm
        end
    end
    return t
end
using Zygote, ForwardDiff
ForwardDiff.derivative(s, 1.0) # Forward Mode AD
Zygote.gradient(s, 1.0) # Reverse Mode AD

#compare with
ForwardDiff.derivative(sin, 1.0) # Forward Mode AD
Zygote.gradient(sin, 1.0) # Reverse Mode AD

While both s(x) & sin(x) are differentiable functions, s(x) is written w/ standard Julia programming tools. (Perhaps this is why they call it “differentiable programming” :man_shrugging:)

I think the point is that it is much harder to differentiate functions like s(x) in R/Python etc


AD has a reasonably long history (in the context of computer programming), so in a sense it is not hard to just apply some form of AD if the only concern is feasibility. A lot of languages provide some solution to AD; the operator overloading approach should work fine with s(x) in most languages.
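For anyone wondering what "the operator overloading approach" looks like, here is a minimal dual-number sketch in plain Julia with no packages. The `Dual` type and its methods are written just for this post (this is not how ForwardDiff is implemented internally), and only the operations that s(x) actually uses are defined.

```julia
# Hypothetical minimal forward-mode AD: a value plus its derivative.
struct Dual
    val::Float64   # function value
    der::Float64   # derivative with respect to the input
end

# Overload only the operations that s(x) needs.
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:+(a::Real, b::Dual) = Dual(a + b.val, b.der)
Base.:*(a::Real, b::Dual) = Dual(a * b.val, a * b.der)
Base.:/(a::Dual, b::Real) = Dual(a.val / b, a.der / b)
Base.:^(a::Dual, n::Integer) = Dual(a.val^n, n * a.val^(n - 1) * a.der)
Base.abs(a::Dual) = Dual(abs(a.val), sign(a.val) * a.der)
Base.:<(a::Dual, b::Real) = a.val < b

# Same Taylor-series sine as in the paper example.
function s(x)
    t = 0.0
    sign = -1.0
    for i in 1:19
        if isodd(i)
            newterm = x^i / factorial(i)
            abs(newterm) < 1e-8 && return t
            sign = -sign
            t += sign * newterm
        end
    end
    return t
end

# Seed the input with derivative 1 and run the unmodified program.
d = s(Dual(1.0, 1.0))
# d.val approximates sin(1.0) and d.der approximates cos(1.0)
```

The loop, the branch, and the early return all just work because the program runs as usual on a different number type. The same trick is available in any language with operator overloading, which is why feasibility alone is not the interesting part.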

However, for practical programs, convenience (don’t need a source-to-source transformer like ADIFOR, or keep track of duals/adjoints manually), speed and memory efficiency are relevant concerns. The great thing about the approach Zygote has chosen is that it provides a code transformation solution that meshes really well with Julia’s compiler model.

However, apparently few people read the docs, which clearly state that

We’re still in beta so expect some adventures.

At this point, Zygote is for users who are willing to tolerate the fact that they need to program in a certain way to have the code transformation work. This is becoming less and less restrictive, but still remains relevant.


From the ADIFOR website:

Users wishing to use ADIFOR for educational and non-profit research or for the purpose of commercial evaluation can obtain ADIFOR at no cost by doing the following:

  1. Fill out and electronically submit a copy of the ADIFOR Request Form.
  2. Download a copy of the ADIFOR Public License (see Help with File Downloading if necessary.).
  3. Read and sign the license. Students should have their advisor sign the license.
  4. Fax the signed license to:
  • Paul Hovland at +1-630-252-5986 if you are registering at Argonne.

Aside from not being free software, I’d spend some time looking for alternatives that don’t require I print out, sign, and then fax a license agreement.


That nicely underlines my point about convenience :wink: I just linked it for historical reasons.


This is my question: what are these restrictions on how we need to program?
I think they are not described in the docs anywhere?

The only thing I have heard of is that arrays cannot be mutated (and I think even that is not in the docs?). Are there other restrictions?

As someone who “just wants to develop algorithms”, should I be using Zygote or Tracker?

Have a look at the issue page of Zygote

While this might not be a good strategy to learn how to write a program that Zygote can handle, it will show you that there are a ton of things that will trip you up, and any non-trivial program is likely to require some workarounds and quite a lot of manually defined adjoints (the standard answer to many issues is “just define the adjoint”).
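For reference, "just define the adjoint" looks roughly like the sketch below. `scaled_sum` is a made-up function that writes into a work array, so it is the kind of code Zygote would not differentiate on its own; the sketch assumes Zygote is installed and uses its `@adjoint` macro to supply the pullback by hand.

```julia
using Zygote

# Hypothetical function with an in-place write inside.
function scaled_sum(x)
    buf = similar(x)
    for i in eachindex(x)
        buf[i] = 3 * x[i]    # in-place write into a work array
    end
    return sum(buf)
end

# Hand-written adjoint: return the primal value and a pullback that maps
# the output sensitivity Δ to a tuple of sensitivities, one per argument.
# Since scaled_sum(x) = 3 * sum(x), every input gets 3 * Δ.
Zygote.@adjoint scaled_sum(x) = scaled_sum(x), Δ -> (fill(3 * Δ, size(x)),)

g = Zygote.gradient(scaled_sum, [1.0, 2.0])[1]   # [3.0, 3.0]
```

With the adjoint in place Zygote never looks inside `scaled_sum`, so the mutation no longer matters. The cost is that you are now responsible for the derivative being correct.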

A particularly uncomfortable issue with Zygote is that many bugs lead to segfaults, causing Julia to crash without always giving you much helpful info (I’m debugging one of these as I write this).

ReverseDiff.jl or Tracker.jl are probably easier to get going at the moment.


To a certain extent it is, e.g., the introduction mentions that arrays should not be mutated, but that this is WIP.

I think it is best to understand that it is work in progress: don’t use it unless you are OK with encountering limitations (not necessarily documented), corner cases, issues, making an MWE, reporting an issue, and an occasional PR; and of course defining your own adjoints when necessary. As @baggepinnen said, the issue tracker is very useful.

If you are not so adventurous, at the moment I would recommend Tracker, or even ReverseDiff or ForwardDiff, depending on your use case.

I have read

and looked at the code of

but it is not clear to me how (if) “∂P in practice” is different from Zygote (in its current state).

If the authors have time for a short explanation, it would be nice.