Hello,
I am finally switching to DifferentiationInterface for the tests of a package of mine, and I don't think I could be happier! It lets me quickly write tests in a coherent manner and try different backends. Amazing work!
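For context, this is roughly what I mean (a minimal sketch with a placeholder function `f`; my real tests use functions from the package):

```julia
using Test
using DifferentiationInterface
using ADTypes: AutoForwardDiff, AutoZygote, AutoMooncake
import ForwardDiff, Zygote, Mooncake

# Placeholder objective; the real tests use functions from my package.
f(x) = sum(abs2, x) / 2

backends = [
    AutoForwardDiff(),
    AutoZygote(),
    AutoMooncake(; config=nothing),
]

@testset "Gradients agree across backends" begin
    x = randn(10)
    analytic = x  # ∇f(x) = x for this placeholder objective
    for backend in backends
        g = DifferentiationInterface.gradient(f, backend, x)
        @test g ≈ analytic
    end
end
```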
Now, this finally let me try something that has been on my to-do list for a long time: Mooncake.
To my surprise (maybe not to the specialists?), it just worked out of the box for my packages. It also outperforms Zygote (with a single exception, a case where I had written some rrules myself based on ChainRules).
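The comparisons I am doing look roughly like the sketch below, not an actual benchmark from my code, just the shape of it, with a toy loss standing in for the real one:

```julia
using BenchmarkTools
using DifferentiationInterface
using ADTypes: AutoZygote, AutoMooncake
import Zygote, Mooncake

# Toy stand-in for the kind of loss I actually differentiate.
const W = randn(32, 32)
loss(x) = sum(abs2, tanh.(W * x))
x = randn(32)

for backend in (AutoZygote(), AutoMooncake(; config=nothing))
    @show backend
    @btime DifferentiationInterface.gradient($loss, $backend, $x)
end
```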
For my work, most of what I do is blend together neural networks (mostly simple MLPs built with Lux), SciML for ODEs and integrals, and Turing. I have been using all of these productively, with several papers by me, my students, and collaborators.
Given this success with Mooncake, I am considering continuing to support Zygote in my code while eventually switching to Mooncake as the primary AD backend for my work. However, there are a couple of caveats.
First, Mooncake is painfully slow the first time the gradient of a function is computed. This is not a problem when I am running my actual analyses (they last hours, so a few extra minutes on top are not a problem at all), but it is painful when I am doing development and push to GitHub, as the CI is now significantly slower. Is there any suggestion on how to improve that? It's not super important, but it would be nice if anyone had ideas.
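To illustrate what I mean (a toy function purely for illustration, my real functions are much heavier), the cost is entirely in the first call:

```julia
using DifferentiationInterface
using ADTypes: AutoMooncake
import Mooncake

f(x) = sum(abs2, sin.(x))
x = randn(100)
backend = AutoMooncake(; config=nothing)

# First call: pays the full cost of deriving and compiling the reverse pass.
@time DifferentiationInterface.gradient(f, backend, x)
# Second call with the same function and argument types: fast.
@time DifferentiationInterface.gradient(f, backend, x)
```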
Second, I find that the gradient preparation mechanism significantly improves the performance of Mooncake in my use cases. Is there any way to do something similar within a Turing model?
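By the preparation mechanism I mean something like this (a sketch, assuming a recent DifferentiationInterface version where the prep object is passed between the function and the backend):

```julia
using DifferentiationInterface
using ADTypes: AutoMooncake
import Mooncake

f(x) = sum(abs2, x)
x = randn(1_000)
backend = AutoMooncake(; config=nothing)

# Preparation happens once, outside the hot loop...
prep = prepare_gradient(f, backend, x)

# ...and is reused for every subsequent gradient evaluation.
g = DifferentiationInterface.gradient(f, prep, backend, x)
```

In Turing I currently just select the backend through the sampler's `adtype` keyword (e.g. `NUTS(; adtype=AutoMooncake(; config=nothing))`, if I am not mistaken), so I don't see an obvious place to hook in a preparation step.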
Thanks in advance!