ReverseDiff.jl

package announcement

#1

Hi all,

I’m pleased to announce that the first version of ReverseDiff is now registered. ReverseDiff is brought to you by the same folks who develop and maintain the ForwardDiff package.

ReverseDiff implements methods to take gradients, Jacobians, Hessians, and higher-order derivatives of native Julia functions (or any callable object, really) using reverse mode automatic differentiation (AD). While performance can vary depending on the functions you evaluate, the algorithms implemented by ReverseDiff generally outperform non-AD algorithms in both speed and accuracy.
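
To give a flavor of the API, here's a minimal sketch (the function and input are made up for illustration; the call mirrors ReverseDiff's ForwardDiff-style `gradient` API):

```julia
using ReverseDiff

# a scalar-valued function of a vector input (illustrative example)
f(x) = sum(abs2, x) / 2

x = [1.0, 2.0, 3.0]

# reverse-mode gradient; for f(x) = sum(x.^2)/2, the gradient is x itself
g = ReverseDiff.gradient(f, x)
# g == [1.0, 2.0, 3.0]
```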

Why use ReverseDiff as your go-to reverse-mode AD package? Here are some reasons:

  • supports a large subset of the Julia language, including loops, recursion, and control flow
  • user-friendly API for reusing and compiling tapes
  • user-friendly performance annotations such as @forward and @skip (with more to come!)
  • compatible with ForwardDiff, enabling mixed-mode AD
  • built-in definitions leverage the benefits of ForwardDiff’s Dual numbers (e.g. SIMD, zero-overhead arithmetic)
  • a familiar differentiation API for ForwardDiff users
  • non-allocating linear algebra optimizations
  • nested differentiation
  • suitable as an execution backend for graphical machine learning libraries
  • ReverseDiff doesn’t need to record scalar indexing operations (a huge cost for many similar libraries)
  • higher-order map and broadcast optimizations
  • it’s well tested
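
For example, the tape reuse and compilation mentioned above might look like this (a sketch using ReverseDiff's `GradientTape`/`compile`/`gradient!` API; the function itself is made up):

```julia
using ReverseDiff

f(x) = sum(x .* x)

# record f's execution trace once, to a reusable tape
tape = ReverseDiff.GradientTape(f, rand(4))

# optionally compile the tape for faster repeated execution
ctape = ReverseDiff.compile(tape)

# replay the tape on new inputs without re-recording f
out = zeros(4)
ReverseDiff.gradient!(out, ctape, [1.0, 2.0, 3.0, 4.0])
# out ≈ [2.0, 4.0, 6.0, 8.0]
```

Note that replaying a recorded tape assumes the new input follows the same code path as the input used for recording (e.g. the same branches are taken).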

In the future, we aim to add GPU support, sparsity exploitation, and more optimized linear algebra derivatives to the package.

Compared to ForwardDiff, ReverseDiff’s methods are algorithmically more efficient for differentiating functions where the input dimension is larger than the output dimension. In general, the optimal choice between ForwardDiff and ReverseDiff for a given problem requires some nuance, so I’ve written up some tips to guide your choice.
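
To make the dimensionality point concrete, here's a hedged sketch of the regime where reverse mode shines: a function from R^1000 to R, where reverse mode computes the full gradient in a single backwards sweep, while forward mode needs roughly `length(x) / chunk_size` forward sweeps:

```julia
using ForwardDiff, ReverseDiff

# many inputs, one output: reverse mode's sweet spot
f(x) = sum(sin.(x))
x = rand(1000)

gr = ReverseDiff.gradient(f, x)  # one forward + one backwards sweep
gf = ForwardDiff.gradient(f, x)  # ~length(x)/chunk_size forward sweeps

# both agree with the analytic gradient, cos.(x)
```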

Internally, ReverseDiff contains facilities for recording execution traces of native Julia code to reusable, compilable instruction tapes, as well as mechanisms for propagating values “forwards” and “backwards” through these tapes. Since these tapes can be analyzed as computation graphs, my hope is that this infrastructure can eventually be rendered useful for non-AD purposes, such as performance optimization, scheduled parallel execution, and constraint programming. Feel free to reach out if you’re interested in exploring this area!


#2

Congrats @jrevels - This is exciting!


#3

This is awesome! ForwardDiff.jl is already one of my favorite parts of the Julia ecosystem, and I’m thrilled to be able to try out reverse-mode just as easily.


#4

Congrats on the announcement @jrevels! This is very cool, the benchmark results are extremely impressive, and I like the possibilities of the instruction tape work. Can’t wait to see what people build on top of this.


#5

When trying to add the package, I get

ERROR: unsatisfiable package requirements detected: no feasible version could be found for package: DiffBase

Any suggestions as to what might have gone wrong?


#6

Here’s ReverseDiff’s REQUIRE file; it seems you have a package installed somewhere that conflicts with these version requirements.

For example, ReverseDiff requires ForwardDiff 0.3.2 or higher, but a lot of popular packages (like JuMP and Optim) haven’t made the switch over yet (they’ll be switched over soon, though).

I’m not really sure why you’d see that error with DiffBase, though. The only packages I know of that currently use it are ForwardDiff and ReverseDiff, and neither of them places an upper bound on it.


#7

Hooray! For those who haven’t looked at the code, it’s really impressive how much @jrevels has managed to generalize and optimize the algorithm to support a broad base of Julia functionality.

This functionality should be really helpful to those implementing Julia machine learning algorithms going forward.


#8

“Since these tapes can be analyzed as computation graphs, my hope is that this infrastructure can eventually be rendered useful for non-AD purposes, such as performance optimization, scheduled parallel execution, and constraint programming.”

Oh, that’s really interesting, @jrevels. A lot of probabilistic programming is formulated in terms of “execution traces,” usually implemented with coroutines or continuations. Computing gradients is also really common in that context; maybe there’s a way to combine the two?


#9

@cscherrer that sounds a lot like what Cassette.jl is aiming to do.


#10

Oh interesting, I hadn’t seen Cassette.