Why is ReverseDiff slow?

I am trying to use automatic differentiation to obtain the gradient of a function with scalar-valued output.
Contrary to what I had expected, reverse-mode differentiation is much slower than forward-mode differentiation. Can anyone help me understand what is going on?
I made a small test example with a test function called ff.

using ReverseDiff
using LinearAlgebra
using ForwardDiff

# define some function for testing
function ff(x)
    res = 0.0
    for i in 1:100
        for j in eachindex(x)
            res += sin(exp(x[j])-norm(x))
        end
    end
    atan(res)
end

N = 50
y = rand(N)
res = similar(y)

@time ForwardDiff.gradient!(res, ff, y)

@time ftape = ReverseDiff.GradientTape(ff, y)
@time compiled_ftape = ReverseDiff.compile(ftape)
@time ReverseDiff.gradient!(res, compiled_ftape, y)

Especially when N is large, ReverseDiff seems to take a very long time.

I’m not experienced with ReverseDiff, but I was interested in your question.

I tried to follow the advice given here.

using ReverseDiff
using LinearAlgebra
using ForwardDiff
using BenchmarkTools

 # define some function for testing
function ff_vec(x)
    res = 0.0 
    for i in 1:100
        res += sum(sin.(exp.(x) .- norm(x)))
    end 
    atan(res)
end


function ff_vec2(x)
    res = 0.0
    n = norm(x)
    for i in 1:100
        res += sum(sin.(exp.(x) .- n))
    end
    atan(res)
end



function ff(x)
    res = 0.0 
    for i in 1:100
        for j in eachindex(x)
            res += sin(exp(x[j])-norm(x))
        end 
    end 
    atan(res)
end


N = 50
y = rand(N)
res = similar(y)

ftape_vec = ReverseDiff.GradientTape(ff_vec, y)
compiled_ftape_vec = ReverseDiff.compile(ftape_vec)

ftape = ReverseDiff.GradientTape(ff, y)
compiled_ftape = ReverseDiff.compile(ftape)

    
ftape_vec2 = ReverseDiff.GradientTape(ff_vec2, y)
compiled_ftape_vec2 = ReverseDiff.compile(ftape_vec2)

Here is how the versions compare.
ReverseDiff:

julia> @btime ReverseDiff.gradient!($res, $compiled_ftape, $y);
  66.869 ms (0 allocations: 0 bytes)

julia> @btime ReverseDiff.gradient!($res, $compiled_ftape_vec, $y);
  1.289 ms (200 allocations: 96.88 KiB)

julia> @btime ReverseDiff.gradient!($res, $compiled_ftape_vec2, $y);
  444.041 μs (200 allocations: 96.88 KiB)

ForwardDiff:

julia> @btime ForwardDiff.gradient!($res, $ff, $y);
  4.586 ms (2 allocations: 5.23 KiB)

julia> @btime ForwardDiff.gradient!($res, $ff_vec, $y);
  1.107 ms (502 allocations: 2.17 MiB)

julia> @btime ForwardDiff.gradient!($res, $ff_vec2, $y);
  971.400 μs (502 allocations: 2.17 MiB)

We see that ForwardDiff gets ~4x faster with the vectorized version, while ReverseDiff gets ~60x faster. Vectorizing improves things a lot for ReverseDiff.

As I said, other people with more ReverseDiff experience can probably improve this even further.

Thanks, that helps my understanding. I took a somewhat arbitrary example function; the one I actually use is far more complex, so it may be hard to pin down how to rewrite it. If I understand correctly, the scaling of reverse-mode differentiation should be better, but the constant may be high (?).

Yes, in general f: R^N -> R^M with M << N is better suited to ReverseDiff, and vice versa.
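
To see that scaling directly, a rough sketch like the one below could be used (my own quick experiment; g is just a hypothetical vectorized scalar-output test function, similar to ff_vec2): record one tape per input size and time both modes as N grows.

using ForwardDiff, ReverseDiff, LinearAlgebra, BenchmarkTools

# scalar-output test function with N inputs (one output, so reverse mode should scale well)
g(x) = atan(sum(sin.(exp.(x) .- norm(x))))

for N in (50, 200, 800)
    y = rand(N)
    res = similar(y)
    tape = ReverseDiff.compile(ReverseDiff.GradientTape(g, y))   # record/compile once per N
    tf = @belapsed ForwardDiff.gradient!($res, $g, $y)
    tr = @belapsed ReverseDiff.gradient!($res, $tape, $y)
    println("N = $N: forward $(round(1e3*tf, digits=3)) ms, reverse $(round(1e3*tr, digits=3)) ms")
end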

Did you also check out Zygote.jl?

I did, but I am not completely sure about the status of that package. Also, it turned out to be slower than ForwardDiff in my example. The thing is that I need to compute gradients over the same tape over and over again, and that made me think ReverseDiff should be preferable.
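
For reference, the pattern I have in mind is roughly this sketch (re-using ff and N from above): record and compile the tape once, then call gradient! many times with new inputs of the same length, so the per-call cost is just the tape replay. The caveat, as far as I understand it, is that any control flow depending on the input values gets frozen at recording time.

using ReverseDiff

# record and compile once, for a fixed input length N
tape  = ReverseDiff.GradientTape(ff, rand(N))
ctape = ReverseDiff.compile(tape)

grad = zeros(N)
for k in 1:1000                                # e.g. iterations of an optimiser
    y = rand(N)                                # new input, same length
    ReverseDiff.gradient!(grad, ctape, y)      # replays the compiled tape, no re-recording
end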

Initializing `res = 0.0` might create a type instability; try `res = 0*x[1]`?

The typical pattern for this would be `res = zero(eltype(x))`.
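
Applied to the example, that would look something like this sketch (I have not re-run the benchmarks with it, so whether it changes the timings above is untested):

function ff_stable(x)
    res = zero(eltype(x))   # same element type as x: Float64, a ForwardDiff Dual, or a tracked number
    for i in 1:100
        for j in eachindex(x)
            res += sin(exp(x[j]) - norm(x))
        end
    end
    atan(res)
end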

Did you mean R^N -> R^M?

I believe so.

But the issue seems more subtle than just comparing M and N.
It is a bit annoying that, on the one hand, vectorisation is not necessary in Julia, yet it does make a difference when one wants to do reverse-mode differentiation. I am wondering whether this is specific to the Julia implementation in ReverseDiff, or a generic feature that shows up in other programming languages as well. Perhaps I lack a sufficiently deep understanding of the topic, but for the kind of function I defined I am surprised that even with an input vector of length about 200 the computation is very slow relative to ForwardDiff.

I don’t think you need to vectorize; for this example it is the norm(x) that makes a huge difference. Running your original example on my machine with N=200, I got 156 ms for ForwardDiff and a slower 782 ms for ReverseDiff, which seems consistent with your experience.

Like @roflmaostc, I moved the norm out of the loops, but kept the inner loop:

function ff2(x)
    res = 0.0 
    nmx = norm(x)
    for i in 1:100
        for j in eachindex(x)
            res += sin(exp(x[j])-nmx)
        end
    end
    atan(res)
end

@btime ForwardDiff.gradient!($res, $ff2, $y); # 9.764 ms
ftape2 = ReverseDiff.GradientTape(ff2, y)
compiled_ftape2 = ReverseDiff.compile(ftape2)
@btime ReverseDiff.gradient!($res, $compiled_ftape2, $y) # 15.827 ms

Now forward runs 16x faster and reverse 49x faster than before (though reverse is still ~50% slower than the new forward), all just because of the norm. I did not find @roflmaostc’s ff_vec2 to make much difference; it was about comparable in timing.

I am very inexperienced with this, but here is my impression. This example is quite simple, so the norm clearly outweighs the other computations. The loop body runs 100N times, and the norm is the only operation inside it that itself has N terms, so it dominates for large N. That single operation is easy to differentiate in either mode, but reverse mode carries some extra overhead. Also, sin and exp are trivial to differentiate.
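
To put rough numbers on that (my own back-of-envelope for N = 200): the loop body runs 100·N = 20,000 times, and each norm(x) call touches all N = 200 elements, so the original ff spends on the order of 100·N² = 4,000,000 operations on norms alone, against only 100·N = 20,000 calls each to sin and exp. Hoisting the norm out of the loops removes the N² term entirely.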

So I don’t think the issue is vectorization, and it is up to you to judge whether the norm here is representative of your real problem. But I believe backpropagation gains the most when there are many graph nodes with opportunities to re-use pieces of the forward pass, or gradients in the backward pass. Maybe this example is too trivial to get good re-use relative to the overhead, which was not so bad here (~50% worse performance). I don’t know enough to comment on potential gains from things like @simd, @inbounds, or other optimizations.
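
For completeness, here is what such an annotated version of ff2 could look like; whether @inbounds (or @simd) actually buys anything under ForwardDiff or ReverseDiff here is something I have not checked, so treat it purely as a sketch:

function ff2_inbounds(x)
    res = zero(eltype(x))
    nmx = norm(x)
    for i in 1:100
        @inbounds for j in eachindex(x)   # eachindex guarantees valid indices, so skipping bounds checks is safe
            res += sin(exp(x[j]) - nmx)
        end
    end
    atan(res)
end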

[EDIT:] I take it back about ff_vec2. I ran it again with N=200 and got 13.2 ms forward and 1.47 ms reverse, so it seems reverse wins once we have vectorization, the norm computed once, and larger N. I don’t know why my ff2 is faster than ff_vec2 in forward mode, but in reverse mode vectorization seems to be a big help. 🤷

ReverseDiff.jl can potentially optimise the code, which I experienced a while ago (`@benchmark` on `ReverseDiff.gradient!` gives a shorter time than the original function · Issue #102 · JuliaDiff/ReverseDiff.jl · GitHub). Maybe that’s related?
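
My guess at the mechanism: computations whose result does not depend on the tracked input are evaluated once while the tape is being recorded and end up as constants on the tape, so replaying the compiled tape can be faster than calling the original function. A hypothetical toy illustration (my own example, not taken from the issue):

using ReverseDiff

# the second term does not involve x, so it is computed once at recording time
# and stored on the tape as a constant
h(x) = sum(x) + sum(sin.(1:10_000))

x0   = rand(5)
tape = ReverseDiff.compile(ReverseDiff.GradientTape(h, x0))

# replaying the tape only re-executes the tracked part (sum(x) plus a constant),
# so it can clock in faster than evaluating h itself
ReverseDiff.gradient!(similar(x0), tape, rand(5))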

It may be related; I am not an expert in this. But it seems that in more complicated settings there are some do’s and don’ts for AD, and I have not seen much documentation on them.