What lessons could Julia’s autodiff ecosystem learn from Stan and tinygrad?

This came up in the What steps should the Julia community take to bring Julia to the next level of popularity? megathread and I think it’s worth having a dedicated topic on the point. I’ll merge some existing posts into this topic from there and discussion can continue here.


As for autodiff, I really appreciate how ForwardDiff works in Julia! It felt so nice after Python (a few years ago; maybe they have this now with some magic, I don’t know) to be able to just differentiate any function you throw at it.

Yes, it’s forward mode and isn’t suited to large numbers of inputs, but for other use cases it’s straightforward, painless, and totally general.
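For anyone who hasn’t seen it, a minimal sketch of that “just differentiate anything” experience (assuming ForwardDiff.jl is installed):

```julia
using ForwardDiff

# Any pure-Julia scalar function works; no special types or annotations needed.
f(x) = sin(x)^2 + sqrt(x + 1)
df = ForwardDiff.derivative(f, 2.0)   # 2sin(2)cos(2) + 1/(2√3)

# Gradients of multivariate functions are just as painless:
g(v) = v[1]^2 * exp(v[2])
∇g = ForwardDiff.gradient(g, [1.0, 0.0])   # [2.0, 1.0]
```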


No, maybe more like 3-4 years, and with Bob Carpenter a major driver of design and implementation, and several million dollars of grant funds.

If someone plopped $5M into a group of 5 people to finish up milestones for Enzyme, you’d see a lot of progress in the next 3 years.


TinyGrad looks like an ML system targeted exclusively at NN models, not a general-purpose AD engine? If you read the source code of TinyGrad, it looks like it has hard-coded derivative rules for a relatively small set of building-blocks (and compositions thereof), with users expected to implement other rules themselves (e.g. in their examples). This is much, much easier than something like Enzyme that aims to AD arbitrary code.
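To make that concrete, here’s a toy sketch (in Julia, and emphatically not tinygrad’s actual code) of what a hard-coded-rules design boils down to: a fixed table of building blocks, each with a hand-written backward rule, and nothing outside the table is differentiable.

```julia
# Each op pairs a forward function with a hand-written backward rule
# mapping (upstream gradient, inputs...) to gradients w.r.t. each input.
struct Op
    forward::Function
    backward::Function
end

const RULES = Dict(
    :mul => Op((a, b) -> a .* b, (g, a, b) -> (g .* b, g .* a)),
    :sum => Op(a -> sum(a),      (g, a)    -> (fill(g, size(a)),)),
)

# Anything not in RULES simply cannot be differentiated; users must supply
# their own Op. That is far easier to build than AD of arbitrary code.
```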


This was inspired by a conversation with Bob about this where he cited that it:

involved about 2 person years of effort spread over 18 months. We did it on two $60K postdoc salaries and one $40K TA salary.

Right, the problem is it costs $5M and takes a group of 5 people 3 years.

1,000,000,000%. I am completely in love with ForwardDiff and it’s by far my favorite autodiff tool in Julia. The only problem is, as you mentioned, that it’s forward-mode :sweat_smile:



I think you’re arguing for a specialized reverse-mode differentiator that’s good enough to do a limited subset, because that’s going to get a lot of people 80% of the way there and it’s good enough. I understand that point of view, but I will point out that I filed a Stan issue numbered somewhere around 30 requesting Chebyshev polynomials; someone even implemented it for their own use, but it was never incorporated, and to this day you can’t do orthogonal-polynomial calculations in Stan. A more recent issue reopened a similar request.

One of the great things about Julia is that you can use the general ecosystem rather generically. I think there’s a place for a language where you can differentiate ODE solves and datetime conversions and orthogonal polynomials and splines and root finders and whatever. Julia is trying to be that language. No one else is. If people are frustrated with Julia because it’s too general, that seems misguided. Just link to Stan’s differentiable library; haven’t they decoupled that recently?
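For what it’s worth, the orthogonal-polynomial case really is painless in Julia today. A sketch, assuming ForwardDiff.jl is installed (chebyshevT below is my own helper, not a library function):

```julia
using ForwardDiff

# Chebyshev T_n via the standard three-term recurrence, written generically
# so ForwardDiff's Dual numbers flow straight through it.
function chebyshevT(n::Integer, x)
    n == 0 && return one(x)
    n == 1 && return x
    Tprev, Tcur = one(x), x
    for _ in 2:n
        Tprev, Tcur = Tcur, 2x * Tcur - Tprev
    end
    return Tcur
end

# T_3(x) = 4x^3 - 3x, so T_3'(x) = 12x^2 - 3, which is 0 at x = 0.5.
dT3 = ForwardDiff.derivative(x -> chebyshevT(3, x), 0.5)
```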


If it weren’t for this ambitiousness, why would anyone choose Julia over Python?


Does anyone keep track as to which kernels/elementary operations are supported in mainstream deep learning frameworks but not by the various AD systems in Julia? Those could in principle be addressed one-by-one by custom rules. If tinygrad is basically just doing this, surely it is not that hard to keep up to speed with that?


But those kinds of things (matrix–vector products, sums, elementwise activation, convolutions, and compositions thereof) already work in Julia AD systems too, and have for some time.


One of the issues that can come up when you try to be quite general is that you fail and then this failure is perceived as making the system unusable. This can be true even when the actual working features vastly exceed the ability of the next best thing.

Part of the issue is hardly anyone is an expert on where the bug/failure boundary is, so people can prefer a strong fence that keeps them in a small pen that they at least know there are no land mines inside. Being able to go into a bigger field but having only a rough idea where the minefield is… not so desirable.

(also note, this can vary by consumer. A business that wants to make a formulaic thing that just works is different from a researcher who wants to explore a big space for new possibilities)

How can Julia’s systems be better about notifying the user when they try to do something that isn’t going to work?


Performance, ease-of-use, the REPL, the ability to tinker with packages myself, multiple dispatch, metaprogramming…


I imagine this is a linter’s job, and the VSCode extension already warns if it detects certain predefined patterns that are error-prone.


This is just far from enough, and several of those aren’t even advantages over Python. Once you sacrifice genericity, multiple dispatch loses its lustre, and it’s no longer a compelling alternative to Python.

A new language cannot just be a bit better than the alternatives.

Is it definitively known that this is an inherent limitation of a Stan-like AD system versus a limitation of the implementing language? Put another way, would you still be unable to add support for custom orthogonal polynomial types yourself if it were rewritten in Julia?

Which brings us to this point. I’d argue this exists already, and it’s called ReverseDiff. Although most (myself included) would consider ReverseDiff general purpose instead of specialized, it has a pretty similar design philosophy to Stan’s AD and in many respects is more limited than the other extant reverse-mode ADs (no custom struct support, no GPU support, runs into ambiguities if you use fancier types, etc.). That it still remains the workhorse AD for Turing.jl shows how useful it is.
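For anyone who hasn’t used it, a sketch of what that workhorse role looks like in practice (assuming ReverseDiff.jl is installed; note the plain-array, scalar-output shape it likes):

```julia
using ReverseDiff

# Scalar-output functions of plain numeric arrays are ReverseDiff's sweet spot;
# custom structs or fancier element types are where it starts to struggle.
loss(w) = sum(abs2, w) / length(w)
g = ReverseDiff.gradient(loss, [1.0, 2.0, 3.0])   # 2w ./ length(w)
```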

…which is why I’m confused by how the usual response to “we should improve the AD ecosystem in Julia” ends up being “wait for [new library] to mature”. To be clear, the Enzyme team have no responsibility in this and I have nothing but good things to say about Enzyme or the people working on it. After all, this is not the first time we’ve been through this. Tracker.jl was originally developed to replace ReverseDiff for applications such as NNs and GPU-heavy code. Development on that died off in 2019 and it’s essentially soft-archived now. Zygote was created with the hope of replacing both, but its development fell off a cliff in 2020 and any activity now has been focused on patching bad bugs to keep the lights on. Even ReverseDiff went through a dev winter between 2018 and 2020, with recent contributions being more focused on addressing priority bugs. Meanwhile, ForwardDiff has received a fair number of correctness and performance improvements over the past couple of years. I do think @ParadaCarleton has a point in that the track record for sustained maintenance in this part of the ecosystem should give us pause.


If you just want AD that works for bog-standard neural-net architectures, à la tinygrad, you shouldn’t be using AD directly in the first place: just use an ML framework like Flux.jl or Lux.jl and stick to composing its built-in layers and cost functions.
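E.g., a minimal Flux sketch (layer and loss names as in recent Flux versions, to the best of my recollection); nothing here touches AD internals directly:

```julia
using Flux

# Built-in layers compose into a model whose AD rules are already written.
model = Chain(Dense(2 => 8, relu), Dense(8 => 1))
x = rand(Float32, 2, 16)   # 16 samples of 2 features
y = rand(Float32, 1, 16)

# Flux handles the reverse pass; we only state the objective.
grads = Flux.gradient(m -> Flux.mse(m(x), y), model)
```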

Another alternative is to create a walled garden like JAX, in which the only things that are likely to work are compositions of jax.*. Don’t use NumPy, use jax.numpy. Don’t use SciPy, use jax.scipy. Use jax.experimental.ode.odeint to integrate your ODEs, and so forth. One basic issue with this approach is that it requires massive financial resources, à la Google, to re-implement so many popular numerical libraries; unless you have $50 million to offer, telling Julia developers to “get more money” is not really actionable advice. Another basic issue is that a walled garden is still quite limiting, and the people working in Julia are more interested in taking the next leap than in merely re-inventing JAX.

You could try to tag Julia packages as e.g. “Zygote-compatible” or “Enzyme-compatible” if they work well with those AD systems. (A reasonable heuristic for the former is if they depend on ChainRulesCore.) But this kind of label could quickly get out of date and do more harm than good.
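For context, “depends on ChainRulesCore” usually means the package defines rules like this (mycube is a made-up stand-in for a package function):

```julia
using ChainRulesCore

mycube(x) = x^3

# An rrule is how a package opts into the ChainRules ecosystem; any
# ChainRules-aware AD (e.g. Zygote) picks it up automatically.
function ChainRulesCore.rrule(::typeof(mycube), x)
    y = x^3
    mycube_pullback(ȳ) = (NoTangent(), 3x^2 * ȳ)
    return y, mycube_pullback
end
```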

The good news is that AD systems usually either work or immediately give an error; it’s quite rare in my experience for them to silently give incorrect results.


Would somebody mind commenting on how Diffractor relates to all this?


Without exception, all modern ML frameworks rely on AD. There is simply no escaping it, because even “bog-standard” NN architectures contain a diverse enough set of operations that one can’t avoid building a generalized system to handle them. The only alternative is to do what libraries like llama.{cpp,jl} or SimpleChains do and write out most of what AD does by hand (e.g. manual composition of ops for gradients), but that takes us back to the pre-2015 era of ML tooling, when Lua(!) Torch still didn’t have AD and NNs barely had any mindshare.

IMO it doesn’t, because it uses very experimental tech and doesn’t advertise itself as being a reliable reverse-mode AD.


Diffractor.jl is another ongoing effort to produce a general-purpose AD system, targeting some advances over existing systems (mixed mode, higher-order derivatives, …); at this point it’s mostly a research project and not something for ordinary users.


There is one important kind of “silently wrong” result that I’ve encountered with ForwardDiff.jl, which is diffing through root finding, as I found here: confused by this root finding example · Issue #595 · JuliaDiff/ForwardDiff.jl · GitHub.

Yes, I now know about adjoint methods/ImplicitAD.jl, but this was a tricky bug to figure out as a user. I think ForwardDiff.jl might benefit from documentation explaining this kind of problem, which I suspect is more generic than root finding: it comes from code that runs iteratively (e.g., root finding) and executes for a different number of iterations on standard types than it does on Duals.
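For the record, here’s a toy sketch of the implicit-function-theorem fix (the thing ImplicitAD.jl automates): differentiate the residual at the converged root rather than pushing Duals through the iteration. f and solve below are made-up toy code, assuming ForwardDiff.jl is installed:

```julia
using ForwardDiff

# Residual whose root we want: x*(p) solves f(x, p) = 0, i.e. x* = √p.
f(x, p) = x^2 - p

# Naive fixed-iteration Newton solve; this loop's convergence behaviour is
# exactly what Duals would otherwise have to flow through.
function solve(p; iters = 20)
    x = one(p)
    for _ in 1:iters
        x -= f(x, p) / (2x)
    end
    return x
end

# Implicit function theorem: dx*/dp = -(∂f/∂p) / (∂f/∂x) at the root,
# so only the residual (not the solver) is ever differentiated.
function dsolve(p)
    x = solve(p)
    -ForwardDiff.derivative(q -> f(x, q), p) /
     ForwardDiff.derivative(z -> f(z, p), x)
end

dsolve(2.0)   # ≈ 1 / (2√2)
```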

tinygrad, which is the subject of this thread, has an extremely limited “AD” framework that only supports a small set of primitives (plus user-written VJPs). I would hardly even call it an AD by modern standards.
