Why is ReverseDiff slow?

I am trying to use automatic differentiation to obtain the gradient of a function with scalar-valued output.
Contrary to what I had expected, reverse-mode differentiation is much slower than forward-mode differentiation. Can anyone help me understand what is going on?
I made a small test example with a test function called ff.

using ReverseDiff
using LinearAlgebra
using ForwardDiff

# define some function for testing
function ff(x)
    res = 0.0
    for i in 1:100
        for j in eachindex(x)
            res += sin(exp(x[j])-norm(x))
        end
    end
    atan(res)
end

N = 50
y = rand(N)
res = similar(y)

@time ForwardDiff.gradient!(res, ff, y)

@time ftape = ReverseDiff.GradientTape(ff, y)
@time compiled_ftape = ReverseDiff.compile(ftape)
@time ReverseDiff.gradient!(res, compiled_ftape, y)

Especially when N is large, ReverseDiff seems to take a very long time.

I’m not experienced with ReverseDiff, but I was interested in your question.

I tried to follow the advice given here.

using ReverseDiff
using LinearAlgebra
using ForwardDiff
using BenchmarkTools

 # define some function for testing
function ff_vec(x)
    res = 0.0 
    for i in 1:100
        res += sum(sin.(exp.(x) .- norm(x)))
    end 
    atan(res)
end


function ff_vec2(x)
    res = 0.0
    n = norm(x)
    for i in 1:100
        res += sum(sin.(exp.(x) .- n))
    end
    atan(res)
end



function ff(x)
    res = 0.0 
    for i in 1:100
        for j in eachindex(x)
            res += sin(exp(x[j])-norm(x))
        end 
    end 
    atan(res)
end


N = 50
y = rand(N)
res = similar(y)

ftape_vec = ReverseDiff.GradientTape(ff_vec, y)
compiled_ftape_vec = ReverseDiff.compile(ftape_vec)

ftape = ReverseDiff.GradientTape(ff, y)
compiled_ftape = ReverseDiff.compile(ftape)

    
ftape_vec2 = ReverseDiff.GradientTape(ff_vec2, y)
compiled_ftape_vec2 = ReverseDiff.compile(ftape_vec2)

Here is how the versions compare.
ReverseDiff:

julia> @btime ReverseDiff.gradient!($res, $compiled_ftape, $y);
  66.869 ms (0 allocations: 0 bytes)

julia> @btime ReverseDiff.gradient!($res, $compiled_ftape_vec, $y);
  1.289 ms (200 allocations: 96.88 KiB)

julia> @btime ReverseDiff.gradient!($res, $compiled_ftape_vec2, $y);
  444.041 μs (200 allocations: 96.88 KiB)

ForwardDiff:

julia> @btime ForwardDiff.gradient!($res, $ff, $y);
  4.586 ms (2 allocations: 5.23 KiB)

julia> @btime ForwardDiff.gradient!($res, $ff_vec, $y);
  1.107 ms (502 allocations: 2.17 MiB)

julia> @btime ForwardDiff.gradient!($res, $ff_vec2, $y);
  971.400 μs (502 allocations: 2.17 MiB)

We see that ForwardDiff gets ~4x faster with the vectorized version, while ReverseDiff gets ~60x faster. Vectorizing improves things a lot for ReverseDiff.

As I said, other people with more ReverseDiff experience can probably improve this even further.

Thanks, that helps my understanding. I took a somewhat arbitrary example function; the one I actually use is far more complex, so it may be hard to pin down how to rewrite it. If I understand correctly, the scaling of reverse-mode differentiation should be better, but the constant may be high (?).

Yes, in general f: R^N -> R^M with M << N is better suited to ReverseDiff, and vice versa.
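
To see that scaling directly, a rough sketch like the one below could be used (my own quick experiment; g is just a hypothetical vectorized scalar-output test function, similar to ff_vec2): record one tape per input size and time both modes as N grows.

using ForwardDiff, ReverseDiff, LinearAlgebra, BenchmarkTools

# scalar-output test function with N inputs (one output, so reverse mode should scale well)
g(x) = atan(sum(sin.(exp.(x) .- norm(x))))

for N in (50, 200, 800)
    y = rand(N)
    res = similar(y)
    tape = ReverseDiff.compile(ReverseDiff.GradientTape(g, y))   # record/compile once per N
    tf = @belapsed ForwardDiff.gradient!($res, $g, $y)
    tr = @belapsed ReverseDiff.gradient!($res, $tape, $y)
    println("N = $N: forward $(round(1e3*tf, digits=3)) ms, reverse $(round(1e3*tr, digits=3)) ms")
end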

Did you also check out Zygote.jl?

I did, but I am not completely sure about the status of that package. Also, it turned out to be slower than ForwardDiff in my example. The thing is that I need to compute gradients over the same tape over and over again, and that made me think ReverseDiff should be preferable.
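
For reference, the pattern I have in mind is roughly this sketch (re-using ff and N from above): record and compile the tape once, then call gradient! many times with new inputs of the same length, so the per-call cost is just the tape replay. The caveat, as far as I understand it, is that any control flow depending on the input values gets frozen at recording time.

using ReverseDiff

# record and compile once, for a fixed input length N
tape  = ReverseDiff.GradientTape(ff, rand(N))
ctape = ReverseDiff.compile(tape)

grad = zeros(N)
for k in 1:1000                                # e.g. iterations of an optimiser
    y = rand(N)                                # new input, same length
    ReverseDiff.gradient!(grad, ctape, y)      # replays the compiled tape, no re-recording
end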

Initializing `res = 0.0` might create a type instability; try `res = 0*x[1]`?

The typical pattern for this would be `res = zero(eltype(x))`.
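
Applied to the example, that would look something like this sketch (I have not re-run the benchmarks with it, so whether it changes the timings above is untested):

function ff_stable(x)
    res = zero(eltype(x))   # same element type as x: Float64, a ForwardDiff Dual, or a tracked number
    for i in 1:100
        for j in eachindex(x)
            res += sin(exp(x[j]) - norm(x))
        end
    end
    atan(res)
end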

Did you mean R^N -> R^M?

I believe so.

But the issue seems more subtle than just comparing M and N.
It is a bit annoying that, on the one hand, vectorisation is not necessary in Julia, yet it does make a difference when one wants to do reverse-mode differentiation. I am wondering whether this is specific to the Julia implementation in ReverseDiff, or a generic feature that shows up in other programming languages as well. Perhaps I lack a sufficiently deep understanding of the topic, but for the kind of function I defined I am surprised that even with an input vector of length about 200 the computation is very slow relative to ForwardDiff.

I don’t think you need to vectorize; for this example it is the norm(x) that makes a huge difference. Running your original example on my machine with N=200, I got 156 ms for ForwardDiff and a slower 782 ms for ReverseDiff, which seems consistent with your experience.

Like @roflmaostc, I moved the norm out of the loops, but kept the inner loop:

function ff2(x)
    res = 0.0 
    nmx = norm(x)
    for i in 1:100
        for j in eachindex(x)
            res += sin(exp(x[j])-nmx)
        end
    end
    atan(res)
end

@btime ForwardDiff.gradient!($res, $ff2, $y); # 9.764 ms
ftape2 = ReverseDiff.GradientTape(ff2, y)
compiled_ftape2 = ReverseDiff.compile(ftape2)
@btime ReverseDiff.gradient!($res, $compiled_ftape2, $y) # 15.827 ms

Now forward runs 16x faster and reverse 49x faster than before (though reverse is still ~50% slower than the new forward), all just because of the norm. I did not find @roflmaostc’s ff_vec2 to make much difference; it was about comparable in timing.

I am very inexperienced with this, but here is my impression. This example is quite simple, so the norm clearly outweighs the other computations. The loop body runs 100N times, and the norm is the only operation inside it that itself has N terms, so it dominates for large N. That single operation is easy to differentiate in either mode, but reverse mode carries some extra overhead. Also, sin and exp are trivial to differentiate.
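
To put rough numbers on that (my own back-of-envelope for N = 200): the loop body runs 100·N = 20,000 times, and each norm(x) call touches all N = 200 elements, so the original ff spends on the order of 100·N² = 4,000,000 operations on norms alone, against only 100·N = 20,000 calls each to sin and exp. Hoisting the norm out of the loops removes the N² term entirely.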

So I don’t think the issue is vectorization, and it is up to you to judge whether the norm here is representative of your real problem. But I believe backpropagation gains the most when there are many graph nodes with opportunities to re-use pieces of the forward pass, or gradients in the backward pass. Maybe this example is too trivial to get good re-use relative to the overhead, which was not so bad here (~50% worse performance). I don’t know enough to comment on potential gains from things like @simd, @inbounds, or other optimizations.
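
For completeness, here is what such an annotated version of ff2 could look like; whether @inbounds (or @simd) actually buys anything under ForwardDiff or ReverseDiff here is something I have not checked, so treat it purely as a sketch:

function ff2_inbounds(x)
    res = zero(eltype(x))
    nmx = norm(x)
    for i in 1:100
        @inbounds for j in eachindex(x)   # eachindex guarantees valid indices, so skipping bounds checks is safe
            res += sin(exp(x[j]) - nmx)
        end
    end
    atan(res)
end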

[EDIT:] I take it back about ff_vec2. I ran it again with N=200 and got 13.2 ms forward and 1.47 ms reverse, so it seems reverse wins once we have vectorization, the norm computed once, and larger N. I don’t know why my ff2 is faster than ff_vec2 in forward mode, but in reverse mode vectorization seems to be a big help. 🤷

ReverseDiff.jl can potentially optimise the code, which I experienced a while ago (`@benchmark` on `ReverseDiff.gradient!` gives a shorter time than the original function · Issue #102 · JuliaDiff/ReverseDiff.jl · GitHub). Maybe that’s related?
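
My guess at the mechanism: computations whose result does not depend on the tracked input are evaluated once while the tape is being recorded and end up as constants on the tape, so replaying the compiled tape can be faster than calling the original function. A hypothetical toy illustration (my own example, not taken from the issue):

using ReverseDiff

# the second term does not involve x, so it is computed once at recording time
# and stored on the tape as a constant
h(x) = sum(x) + sum(sin.(1:10_000))

x0   = rand(5)
tape = ReverseDiff.compile(ReverseDiff.GradientTape(h, x0))

# replaying the tape only re-executes the tracked part (sum(x) plus a constant),
# so it can clock in faster than evaluating h itself
ReverseDiff.gradient!(similar(x0), tape, rand(5))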

It may be related; I am not an expert in this. But it seems that in more complicated settings there are some do’s and don’ts for AD, and I have not seen much documentation on them.