# ReverseDiff results do not fit theory

I have a question concerning ReverseDiff. I am doing some simple benchmarking with different Julia tools for automatic differentiation. In theory (at least what I know of automatic differentiation theory) computing a gradient should take no more than five times the time it takes to compute the objective function (with the reverse mode).

So I took the extension of the Rosenbrock function:

```
function f(x)
    n = length(x)   # use the actual input length rather than a hardcoded n
    return 100.0 * sum((x[i] - x[i - 1]^2)^2 for i = 2:n) + (1.0 - x[1])^2
end
```

For all the values of n I tested, the time to compute the gradient is much more than 5 times the time it takes to compute `f(x)`. My testing was simple:

```
using ReverseDiff

x = rand(n)                                  # where n matches the input size for f
t = @elapsed f(x)
t_g = @elapsed ReverseDiff.gradient(f, x)    # t_g, measured the same way
```

- With n = 100: t = 2.224000e-06 and t_g = 9.482440e-04
- With n = 500: t = 2.054000e-06 and t_g = 4.637314e-03
- With n = 1000: t = 3.007000e-06 and t_g = 9.125266e-03

Am I misunderstanding the theory? Or is there something in my code that’s wrong? I am really confused by the disparity between theory and real life…

(I am new to discourse so I don’t know if it’s the right place to ask this question…)

Basically, the theory is correct but irrelevant: ReverseDiff does compute the gradient in asymptotically optimal time, but it has to store all the operations, which makes it slow. Playing around with vector operations might help.
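For instance, a vectorized variant might look like this (my own sketch, not from the original post; `f_vec` is a made-up name). Broadcast and `sum` over whole arrays record far fewer tape entries than an elementwise generator:

```
# Hypothetical vectorized form of the extended Rosenbrock objective.
# sum(abs2, v) computes Σ vᵢ² without building an intermediate squared array.
f_vec(x) = 100.0 * sum(abs2, x[2:end] .- x[1:end-1] .^ 2) + (1.0 - x[1])^2
```

Whether this actually helps should be benchmarked; the slicing allocates temporaries, so it trades allocation for shorter tapes.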


ReverseDiff works by constructing an “instruction tape”, recording everything that your function does and then using that tape to propagate the gradient information backwards to the inputs. There’s a significant amount of overhead involved in constructing that tape, but fortunately you can do it ahead of time.
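As a minimal sketch of that record-then-replay workflow (my own toy example; `f_demo` is a made-up name, but `GradientTape` and `gradient!` are real ReverseDiff API names):

```
using ReverseDiff

f_demo(x) = sum(abs2, x)                 # simple objective: Σ xᵢ²

# Record the tape once, using an input of the right shape.
tape = ReverseDiff.GradientTape(f_demo, rand(3))

# Replay the tape to fill a pre-allocated gradient vector.
g = ReverseDiff.gradient!(zeros(3), tape, [1.0, 2.0, 3.0])
# ∇(Σ xᵢ²) = 2x, so g is [2.0, 4.0, 6.0]
```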

First of all, let’s establish a baseline, using BenchmarkTools.jl for more accurate timing:

```
julia> n = 1000
1000

julia> function f(x)
           n = length(x)
           return 100.0 * sum((x[i] - x[i - 1]^2)^2 for i = 2:n) + (1.0 - x[1])^2
       end
f (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime f(x) setup=(x = rand(n))
  991.692 ns (5 allocations: 96 bytes)
```

Your function `f(x)` is allocating a small amount of memory. You could probably make it faster by computing the `sum` in a loop rather than using a generator as you’re doing now, but this is OK for now.
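A loop-based variant along those lines might look like this (a sketch; `f_loop` is a hypothetical name, and I haven't benchmarked the speedup here):

```
function f_loop(x)
    n = length(x)
    s = 0.0
    @inbounds for i = 2:n
        s += (x[i] - x[i - 1]^2)^2   # accumulate in place, no generator allocation
    end
    return 100.0 * s + (1.0 - x[1])^2
end
```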

Just calling `ReverseDiff.gradient` is very expensive because it recomputes the tape every time:

```
julia> using ReverseDiff

julia> @btime ReverseDiff.gradient(f, x) setup=(x = rand(n))
  8.637 ms (32012 allocations: 1.00 MiB)
```

Instead, we should pre-record the tape and pre-allocate an output vector to hold the result:

```
julia> tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, rand(n)))
ReverseDiff.CompiledTape(f)

julia> @btime ReverseDiff.gradient!(g, $tape, x) setup=(x = rand(n); g = similar(x))
  312.478 μs (0 allocations: 0 bytes)
```

Computing the gradient with the pre-recorded tape is almost 30 times faster than re-building the tape every time. ReverseDiff is even smart enough to pre-allocate all the memory it needs, so this function doesn’t even allocate anything.

This still isn’t the “no more than five times” slowdown the theory promises, but it’s much closer. Getting even better performance is beyond my level of expertise. The point is that it’s extremely important to account for setup overhead when benchmarking things like autodiff; once the tape cost is amortized, Julia’s ReverseDiff is quite competitive.


Sorry, I didn’t know what the most appropriate place to post this was! Won’t happen again!

Thank you very much! That will be very useful for the continuation of my testing with automatic differentiation!

By the way, the future of autodiff in Julia is https://github.com/JuliaDiff/Capstan.jl, built on https://github.com/jrevels/Cassette.jl; however, neither is ready for use yet.


I’ll keep an eye out for those! For now ForwardDiff and ReverseDiff should suffice for me!