Any faster way of computing small gradients?

Definitely not the case. For type-stable, non-allocating (mutating) functions, Enzyme is already a great choice, especially for reverse-mode AD; for those cases it is by far the best reverse-mode AD available (as your benchmarks here even show). For forward-mode AD, Enzyme can apply some optimizations that ForwardDiff.jl cannot (certain loop transformations), so it is sometimes faster, but in many cases ForwardDiff.jl is close enough, and sometimes it wins by a little. The simplicity and generality of ForwardDiff.jl make it pretty unparalleled, though: it works on all kinds of janky dynamic code, which is why it’s still preferred over Enzyme.jl.
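As a rough illustration of the kind of code meant here, the sketch below differentiates a small type-stable, allocation-free loop with both packages. The function `sumsq` and the input values are made up for the example; it assumes Enzyme.jl and ForwardDiff.jl are installed and uses only `Enzyme.autodiff` with `Reverse`, `Active`, and `Duplicated`, plus `ForwardDiff.gradient`.

```julia
using Enzyme, ForwardDiff

# Hypothetical example function: type-stable, no allocations, generic in the
# element type so that ForwardDiff's dual numbers can pass through it too.
function sumsq(x::AbstractVector{T}) where {T}
    s = zero(T)
    @inbounds for i in eachindex(x)
        s += x[i] * x[i]
    end
    return s
end

x = [1.0, 2.0, 3.0]

# Reverse mode with Enzyme: the gradient is accumulated into the shadow `dx`,
# which therefore has to start at zero.
dx = zero(x)
Enzyme.autodiff(Reverse, sumsq, Active, Duplicated(x, dx))

# Forward mode with ForwardDiff for comparison.
g_fwd = ForwardDiff.gradient(sumsq, x)
```

Both `dx` and `g_fwd` should hold the same gradient, `2 .* x`. Timing the two calls (for example with BenchmarkTools.jl) on your own function is the only reliable way to see which one wins at a given input size.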

But forward-mode AD has bad scaling with respect to inputs (as Billy mentions), and so there’s always a size at which reverse-mode will do better. In this paper:

we show that’s at about 100 inputs, which is a much lower cutoff than for most other reverse-mode AD systems. You can see there that for large systems, Enzyme definitely wins.
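To make the scaling argument concrete, here is a toy forward-mode implementation in plain Julia (no packages). The `Dual` type, `quadratic`, and `forward_gradient` are hypothetical names written for this sketch, not ForwardDiff.jl internals. It seeds one input direction at a time, so the number of evaluations of the function grows linearly with the number of inputs.

```julia
# Minimal dual number: a value plus a single directional derivative.
struct Dual
    val::Float64
    der::Float64
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)

# Hypothetical test function: the sum of squares of the inputs.
function quadratic(x)
    s = x[1] * x[1]
    for i in 2:length(x)
        s = s + x[i] * x[i]
    end
    return s
end

# Forward-mode gradient: one full evaluation of `f` per input direction.
function forward_gradient(f, x::Vector{Float64})
    n = length(x)
    g = zeros(n)
    passes = 0
    for i in 1:n
        # Seed direction i with derivative 1 and all other inputs with 0.
        seeded = [Dual(x[j], j == i ? 1.0 : 0.0) for j in 1:n]
        g[i] = f(seeded).der
        passes += 1
    end
    return g, passes
end

g, passes = forward_gradient(quadratic, [1.0, 2.0, 3.0])
```

Here `passes` equals the number of inputs. ForwardDiff.jl reduces the constant by propagating several directions per pass (its chunk size), but the cost still grows with the input count, whereas reverse mode obtains the whole gradient from one forward and one backward sweep.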

Now, there are some downsides to Enzyme, mostly that (currently) your code needs to be type-stable and avoid some other parts of the Julia runtime (like the GC, currently, in some/most cases). But that just means that it’s a really good AD for a (currently) limited set of codes, and the codes it does cover are the highly optimized ones. I don’t think anyone has ever portrayed Enzyme as currently being a universal AD (though over time that is its progression, to cover more and more of the Julia runtime).


I recently had a somewhat similar case in an expected log-likelihood computation. Making the logpdf avoid branches (or defining an rrule for it), combined with avoiding the type instability of a sum over a lazy broadcast by allocating explicitly, gave a speedup of 1-2 orders of magnitude.
The solution ended up quite hacky, but maybe it can be adapted to the case here. It was suggested that defining an rrule for the lazy broadcast would have been cleaner and maybe even faster, but I did not have time to attempt writing one.
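The original code is not shown here, so the following is only a sketch of the two ideas under stated assumptions: a hypothetical exponential log-density stands in for the real logpdf, `ifelse` replaces the branch, and the broadcast is materialized into a concrete array before summing. All function names are invented for the illustration.

```julia
# Hypothetical log-density of an exponential distribution with rate λ.
# Branchy version: the ternary takes a different code path depending on x.
logpdf_branchy(λ, x) = x < 0 ? -Inf : log(λ) - λ * x

# Branch-free version: the in-support value is always computed and `ifelse`
# selects the result; `oftype` keeps both alternatives the same type.
function logpdf_branchfree(λ, x)
    y = log(λ) - λ * x
    return ifelse(x < 0, oftype(y, -Inf), y)
end

# Monte Carlo estimate of the expected log-likelihood over samples `xs`.
function expected_loglik(λ, xs)
    # Explicit allocation: materialize the broadcast into a concrete Vector
    # instead of summing over a lazy broadcast object.
    terms = logpdf_branchfree.(λ, xs)
    return sum(terms) / length(terms)
end

ell = expected_loglik(2.0, [0.5, 1.0, 1.5])
```

Whether this helps depends on the AD package and the actual density, so it is worth benchmarking both variants before adopting either.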