ReverseDiff resultst do not fit theory

I have a question concerning ReverseDiff. I am doing some simple benchmarking with different Julia tools for automatic differentiation. In theory (at least what I know of automatic differentiation theory) computing a gradient should take no more than five times the time it takes to compute the objective function (with the reverse mode).

So I took the extension of the Rosenbrock function:

function f(x)
	return 100.0 * sum((x[i] - x[i - 1]^2)^2 for i=2:n) + (1.0 - x[1])^2

For all the values of n I tested, the time to compute the gradient is much more then 5 times the time it took to compute f(x). The testing I did was simple:

x = rand(n)                                                    # Where n is the right size for f
t = @elapsed f(x)
t_g = @elapsed ReverseDiff.gradient(f, x)

With n=100, t=2.224000e-06 and t_g = 9.482440e-04
With n = 500, t = 2.054000e-06 and t_g = 4.637314e-03
With n = 1000, t =3.007000e-06 and t_g = 9.125266e-03

Am I misunderstanding the theory? Or is there something in my code that’s wrong? I am really confused by the disparity between theory and real life…

Thanks for your help!

(I am new to discourse so I don’t know if it’s the right place to ask this question…)

Please don’t post twice, or at least crossref it:

I was as confused as you and received this answer:

Basically, the theory is correct but irrelevant: ReverseDiff does compute the gradient in asymptotically optimal time, but it has to store all the operations, which makes it slow. Playing around with vector operations might help.


ReverseDiff works by constructing an “instruction tape”, recording everything that your function does and then using that tape to propagate the gradient information backwards to the inputs. There’s a significant amount of overhead involved in constructing that tape, but fortunately you can do it ahead of time.


First of all, let’s establish a baseline, using BenchmarkTools.jl for more accurate timing:

julia> n = 1000

julia> function f(x)
               n = length(x)
               return 100.0 * sum((x[i] - x[i - 1]^2)^2 for i=2:n) + (1.0 - x[1])^2
f (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime f(x) setup=(x = rand(n))
  991.692 ns (5 allocations: 96 bytes)

Your function f(x) is allocating a small amount of memory. You could probably make it faster by computing the sum in a loop rather than using a generator as you’re doing now, but this is OK for now.

Just calling ReverseDiff.gradient is very expensive because it recomputes the tape every time:

julia> @btime ReverseDiff.gradient(f, x) setup=(x = rand(n))
  8.637 ms (32012 allocations: 1.00 MiB)

Instead, we should pre-record the tape and pre-allocate an output vector to hold the result:

julia> tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, rand(n)))

julia> @btime ReverseDiff.gradient!(g, $tape, x) setup=(x = rand(n); g = similar(x))
  312.478 ÎĽs (0 allocations: 0 bytes)

Computing the gradient with the pre-recorded tape is almost 30 times faster than re-building the tape every time. ReverseDiff is even smart enough to pre-allocate all the memory it needs, so this function doesn’t even allocate anything.

This still isn’t magically “no more than five times” as slow as the original f, but it’s much closer. Getting even better performance will require more than my level of expertise. But the point is that it’s extremely important to consider the overhead involved when benchmarking things like autodiff. Julia’s ReverseDiff is quite competitive in that regard.


Sorry I didn’t know what was the most appropriate place to post this issue was! Won’t happen again!

Thank you very much! That will be very useful for the continuation of my testing with automatic differentiation!

By the way, the future of autodiff in Julia is using however neither of those is ready for use yet.

1 Like

I’ll keep an eye out for those! For now ForwardDiff and ReverseDiff should suffice for me!