Automatic Differentiation Slow (Slower than Finite-Differences)

I am trying to implement a statistical estimator and am having issues with both ForwardDiff and ReverseDiff taking longer to compute gradients than would simple one-sided finite differencing. Below is a minimal example with logit_GMM being the function that I am trying to minimize. With 20 parameters, the ForwardDiff gradient evaluation time is more than 25x the function evaluation time (when optimally chunked), and ReverseDiff takes closer to 400x the evaluation time! By comparison, finite-differences takes less than 20x, and direct gradient computation takes even less. (My actual application is more complex, and direct gradient computation is much less feasible there.)

Am I doing something wrong, or is this problem just poorly suited to either forward- or reverse-mode automatic differentiation? Any thoughts on how to make my AD faster would be greatly appreciated. (Any other comments that might make my code more efficient are certainly welcome as well.)

Thank you!

using ForwardDiff, BenchmarkTools, ReverseDiff

num_i = 1_000
choice_set_size = 30
true_b = randn(20)
num_moments = 40

X=randn(num_i*choice_set_size,length(true_b))
Δm_mat = rand(num_i*choice_set_size,num_moments)
ranges = [(1+choice_set_size*(ii-1)):(choice_set_size*ii) for ii = 1:num_i]

function logit_GMM(b,X::Matrix{Float64},Δm_mat::Matrix{Float64},ranges::Vector{UnitRange{Int64}})
    eu=exp.(X*b)
    s=similar(eu)
    @inbounds for rng ∈ ranges
        @views s[rng] .= eu[rng] ./ sum(eu[rng])
    end
    EΔm = (Δm_mat' * s) ./ size(ranges,1)
    return EΔm' * EΔm
end

logit_GMM(b) = logit_GMM(b,X::Matrix{Float64},Δm_mat::Matrix{Float64},ranges::Vector{UnitRange{Int64}})

b=randn(size(true_b))
const f_tape = ReverseDiff.GradientTape(logit_GMM,b)
const compiled_f_tape = ReverseDiff.compile(f_tape)
results = similar(b)

cfg_1 = ForwardDiff.GradientConfig(logit_GMM,b,ForwardDiff.Chunk{1}())
cfg_10 = ForwardDiff.GradientConfig(logit_GMM,b,ForwardDiff.Chunk{10}())
cfg_20 = ForwardDiff.GradientConfig(logit_GMM,b,ForwardDiff.Chunk{20}())

println("Eval-times:")
@btime logit_GMM($b,$X::Matrix{Float64},$Δm_mat::Matrix{Float64},$ranges::Vector{UnitRange{Int64}})
@btime logit_GMM($b)

println("Grad-times:")
@btime ForwardDiff.gradient!($results,$logit_GMM,$b)
@btime ForwardDiff.gradient!($results,$logit_GMM,$b,$cfg_1)
@btime ForwardDiff.gradient!($results,$logit_GMM,$b,$cfg_10)
@btime ForwardDiff.gradient!($results,$logit_GMM,$b,$cfg_20)

@btime ReverseDiff.gradient!($results,$compiled_f_tape,$b)


The benchmarked times are:

Eval-times:
  1.043 ms (1008 allocations: 751.11 KiB)
  1.040 ms (1008 allocations: 751.11 KiB)
Grad-times:
  32.826 ms (2018 allocations: 15.21 MiB)
  126.079 ms (20160 allocations: 28.41 MiB)
  32.621 ms (2016 allocations: 15.21 MiB)
  26.231 ms (1009 allocations: 14.48 MiB)
  384.239 ms (0 allocations: 0 bytes)

Is the threading needed for this issue to be reproducible? If not, please remove it. Do the inputs really need to be so big that taking the gradient takes a full minute, or can the same point be made with smaller input data? The less code there is and the faster it runs, the more likely it is that someone will have time to go through it and give a good answer.

Anyway, it seems you have an R^N → R problem with large N, so reverse-mode differentiation is likely much more efficient.


@kristoffer.carlsson, thank you for the excellent suggestions. I do not have access to a computer with Julia at the moment, but I will update the code to be simpler and require less time as soon as possible.

@kristoffer.carlsson, I’ve edited the code to run faster and be as simple as possible. It’s still the case that ForwardDiff is slower than finite differences. Also, unfortunately, it appears that compiled ReverseDiff is even slower than ForwardDiff in this case! This is surprising to me given that this example is similar to a neural-net problem, which is the classic case where reverse-mode AD is supposed to perform extremely well.

What is the size of your gradient? From your sample code, it looks like it’s just 20 elements, is that right? If so, then I’d expect ForwardDiff to be faster than ReverseDiff, even though algorithmically reverse-mode AD might be preferred. ForwardDiff just has a bit less overhead and tends to be very fast for fewer than 100 input dimensions.

If you’re still seeing performance issues, the performance tips from the manual are still the best thing to read: https://docs.julialang.org/en/stable/manual/performance-tips/#man-performance-tips-1

In particular, checking your code for type-instabilities or any unnecessary allocations will help. It should nearly always be possible to make ForwardDiff as fast as or faster than numerical diffing, potentially with some code polish.
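For concreteness, here is a minimal sketch of those checks using the names from the MWE above (the specific calls are only illustrative):

using InteractiveUtils   # provides @code_warntype outside the REPL on Julia >= 0.7

@code_warntype logit_GMM(b)   # Any-typed (red) variables indicate a type instability
logit_GMM(b)                  # warm-up call so compilation is not counted below
@allocated logit_GMM(b)       # bytes allocated by a single evaluation

# fixing the chunk size up front avoids ForwardDiff's runtime chunk-size heuristic
cfg = ForwardDiff.GradientConfig(logit_GMM, b, ForwardDiff.Chunk{length(b)}())
ForwardDiff.gradient(logit_GMM, b, cfg)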

@rdeits, thanks for your response.

What is the size of your gradient?

In my actual application, I expect there to be <50 parameters and to have function evaluation times between 2 and 10 seconds (much longer matrices and a few additional model complexities). As such, you may be right that appropriately chunked forward-diff might be the best approach.

In particular, checking your code for type-instabilities or any unnecessary allocations will help.

I took a look at the performance tips section. All of my code was type-stable (according to @code_warntype), except that I hadn’t set the chunk size in ForwardDiff. Doing so and choosing the maximal chunk size does improve performance a bit, but it is still worse than finite differences. (Modified results are now included in the original post.)

I could preallocate all of the intermediate arrays (i.e. eu, s, EΔm); however, I timed some of these allocations, and they appear to scale linearly. In fact, the @btime stats suggest that in the chunk-size-20 case the allocation count is essentially the same and the memory use is just under 20x that of a single logit_GMM evaluation.
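For reference, here is a rough sketch of that preallocation idea in modern Julia syntax (logit_GMM! and the buffer names are hypothetical rearrangements of the MWE, not code from this thread); note that ForwardDiff would need separate buffers with a Dual element type, so this only covers the plain Float64 evaluation path:

using LinearAlgebra

function logit_GMM!(eu, s, EΔm, b, X, Δm_mat, ranges)
    mul!(eu, X, b)                 # in-place X*b
    eu .= exp.(eu)
    @inbounds for rng in ranges
        @views s[rng] .= eu[rng] ./ sum(eu[rng])
    end
    mul!(EΔm, Δm_mat', s)          # in-place Δm_mat'*s
    EΔm ./= length(ranges)
    return dot(EΔm, EΔm)           # same scalar as EΔm'*EΔm
end

# allocate the buffers once and reuse them across calls
eu = zeros(size(X, 1)); s = similar(eu); EΔm = zeros(size(Δm_mat, 2))
logit_GMM!(eu, s, EΔm, b, X, Δm_mat, ranges)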

It should nearly always be possible to make ForwardDiff as fast as or faster than numerical diffing, potentially with some code polish

I definitely agree that ForwardDiff should basically always outperform finite differencing. In my experience it has always outperformed finite differencing and sometimes even outperforms analytic computation, which is why I’m somewhat perplexed by its underperformance here.

Profiling with

using ProfileView
Profile.clear()
@profile @btime ForwardDiff.gradient!($results,$logit_GMM,$b) evals = 100
ProfileView.view()

shows that almost half the time is being spent doing garbage collection (unless this is somehow an artifact of the way I’m profiling), so I do think it’s worth preallocating more. Almost all of the remaining time is spent in the generic_matvecmul! call on line 18, i.e.

EΔm = (Δm_mat' * s) ./ size(ranges,1)

As this is just a linear transformation of s (Δm_mat is constant), taking the gradient of this step manually (which is trivial) may be a good solution. You could e.g. extract that part out into a function and specialize it for ForwardDiff.Dual inputs.


Sorry, the GC stuff is definitely an artifact. Here’s a better picture:

function test(n, results, b)
    for i = 1 : n
        ForwardDiff.gradient!(results, logit_GMM, b)
    end
end

using ProfileView
test(1, results, b)
Profile.clear()
@profile test(100, results, b)
ProfileView.view()

Because it’s using dual numbers, ForwardDiff ends up using a generic matrix multiply instead of BLAS, which is what gets used on the Float64s from finite differencing. Thus the limiting step in this function is computed much faster for finite differences, especially since BLAS isn’t just more optimized than the built-in generic matmul, it’s also multithreaded.
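As a standalone illustration of that BLAS-vs-generic difference (this snippet is separate from the code above and only meant as a demonstration):

using ForwardDiff, BenchmarkTools

A = rand(1_000, 1_000)
x = rand(1_000)
dx = ForwardDiff.Dual.(x, 1.0)   # same values, but each carrying a single partial

@btime $A * $x;    # Float64 input: dispatches to BLAS
@btime $A * $dx;   # Dual input: falls back to the generic matvec, typically several times slower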

This is a good motivation for a JuliaBLAS revival.

You can get around this though by chunking up your function, defining a dispatch for Dual that AD’s through most of it and then hardcodes the AD through the matmul.


@tkoolen and @ChrisRackauckas, thank you for pinning down the source of the issue!

I have been playing around with possible solutions. I did manage to chunk up my function easily, as @ChrisRackauckas suggested; however, my implementation was inelegant and had a lot of overhead (because the intermediate step holds a very large Jacobian right before it gets reduced). I think the “right” permanent solution is probably to create a function that intercepts whatever gets called by M'*v in the case of M::Matrix{Float64}, v::Vector{Dual{T,V,N}}. My quickly sketched version of such a function looks like:

function MTv(M::Matrix{Float64}, v::Vector{ForwardDiff.Dual{T,V,N}}) where {T,V,N}

    function rv_pm(v::Vector{ForwardDiff.Dual{T,V,N}}) where {T,V,N}
        real_vec = Vector{V}(length(v))
        partial_mat = Matrix{V}(length(v), N)
        Threads.@threads for dd = 1:length(v)
            @inbounds real_vec[dd] = v[dd].value
            @inbounds view(partial_mat, dd, :) .= v[dd].partials
        end
        return real_vec::Vector{Float64}, partial_mat::Matrix{Float64}
    end

    rv, pm = rv_pm(v)
    r = M'*rv
    p = M'*pm
    outvec = Vector{ForwardDiff.Dual{T,V,N}}(size(M,2))
    Threads.@threads for dd = 1:size(M,2)
        @inbounds outvec[dd] = ForwardDiff.Dual{T}(r[dd], tuple(view(p,dd,:)...))
    end
    return outvec
end

This version is more than 2x as fast as M'*v for reasonably large matrices on my machine:

using ForwardDiff, BenchmarkTools

nparams = 10;
v_len = 100_000;
out_len =30;
v = ForwardDiff.Dual.(randn(v_len), [tuple(rand(nparams)...) for dd = 1:v_len]);
M = rand(v_len, out_len);

@btime MTv($M,$v)
10.166 ms (76 allocations: 8.40 MiB)
@btime $M'*$v
26.849 ms (1 allocation: 2.75 KiB)

Of course, this code is probably nowhere near as efficient as it could be. There is certainly a lot of overhead in converting to matrices and back. If I can figure out how to write a truly efficient version of MTv, and what I would need to name it so that it naturally intercepts generic matrix multiplication for these argument types, I might try to submit a pull request to ForwardDiff. I imagine there are quite a number of people with applications that involve large matrix multiplications (e.g. statistics and data science applications) who might benefit.


Cool!

How about this:

using ForwardDiff
using LinearAlgebra

function MTv(M::Matrix{Float64}, v::Vector{ForwardDiff.Dual{T,V,N}}) where {T, V <: LinearAlgebra.BlasFloat, N}
    num_v = length(v)
    vmat = transpose(reshape(reinterpret(V, v), N + 1, num_v)) # first column: values, other columns: partials
    result = transpose(M * vmat) # first row: values, other rows: partials
    transpose(reinterpret(ForwardDiff.Dual{T, V, N}, result))
end

using Test

@testset "MTv" begin
    m = 3
    n = 4
    p = 5

    M = rand(m, n)
    v = [ForwardDiff.Dual(rand(), ForwardDiff.Partials(tuple(rand(p)...))) for _ = 1 : n]
    ret1 = MTv(M, v)
    ret2 = M * v
    @test ret1 ≈ ret2
end

(Tested on Julia master)


@tkoolen, that looks much more elegant!

Unfortunately I can’t test it because I can’t figure out how to add/use the LinearAlgebra package. Pkg.add("LinearAlgebra") throws an error that the package is unknown. Is this only available on v0.7?

If I try replacing LinearAlgebra.BlasFloat with Base.LinAlg.BlasFloat I get an ArgumentError: result shape not specified.

It’s not registered, you need to Pkg.clone it.


Yeah, there have been a bunch of changes in reinterpret. This works on both 0.6 and master:

using Compat
using ForwardDiff
const LinearAlgebra = Compat.LinearAlgebra

function MTv(M::Matrix{Float64}, v::Vector{ForwardDiff.Dual{T,V,N}}) where {T, V <: LinearAlgebra.BlasFloat, N}
    num_v = length(v)
    vmat = transpose(reshape(reinterpret(V, v), N + 1, num_v)) # first column: values, other columns: partials
    result = transpose(M * vmat) # first row: values, other rows: partials
    reinterpret(ForwardDiff.Dual{T, V, N}, reshape(result, length(result)))
end

using Compat.Test

@testset "MTv" begin
    m = 3
    n = 4
    p = 5

    M = rand(m, n)
    v = [ForwardDiff.Dual(rand(), ForwardDiff.Partials(tuple(rand(p)...))) for _ = 1 : n]
    ret1 = MTv(M, v)
    ret2 = M * v
    @test ret1 ≈ ret2
end

nparams = 10;
v_len = 100_000;
out_len =30;
v = ForwardDiff.Dual.(randn(v_len),[tuple(rand(nparams)...) for dd=1:v_len]);
M=rand(out_len, v_len);

using BenchmarkTools
@btime MTv($M, $v);
@btime $M * $v;

Results on latest master:

  6.503 ms (6 allocations: 2.88 KiB)
  14.319 ms (1 allocation: 2.75 KiB)

Results on 0.6.2:

  9.500 ms (12 allocations: 8.40 MiB)
  17.520 ms (1 allocation: 2.75 KiB)

Edit: I noticed that the difference is much smaller (but still significant) with -O3.


No, I was using LinearAlgebra from stdlib (the new name for Base.LinAlg).

@tkoolen, thanks. Looks great. One quick point: in my application, at least, I was doing M'*v, which can be done even more concisely and runs much faster, because it avoids all of the transposing (A'*B = (B'*A)') and does the multiplication along the natural column-wise indices of both matrices:

function MTv(M::Matrix{Float64}, v::Vector{ForwardDiff.Dual{T,V,N}}) where {T, V <: LinearAlgebra.BlasFloat, N}
    num_v = length(v)
    result = reshape(reinterpret(V, v), N + 1, num_v) * M # first row: values, other rows: partials
    reinterpret(ForwardDiff.Dual{T, V, N}, vec(result))
end


using Compat.Test

@testset "MTv" begin
    m = 3
    n = 4
    p = 5

    M = rand(n, m)
    v = [ForwardDiff.Dual(rand(), ForwardDiff.Partials(tuple(rand(p)...))) for _ = 1 : n]
    ret1 = MTv(M, v)
    ret2 = M' * v
    @test ret1 ≈ ret2
end

If you really wanted to one-line it:

MTv(M::Matrix{Float64}, v::Vector{ForwardDiff.Dual{T,V,N}}) where {T, V <: LinearAlgebra.BlasFloat, N}=reinterpret(ForwardDiff.Dual{T, V, N}, vec(reshape(reinterpret(V, v), N + 1, length(v))*M))

This runs ~5x faster than M'*v on my computer!

Nice! But be aware that transpose and adjoint are going to be lazy in 0.7:

julia> typeof(transpose(rand(3, 3)))
LinearAlgebra.Transpose{Float64,Array{Float64,2}}

julia> typeof(rand(3, 3)')
LinearAlgebra.Adjoint{Float64,Array{Float64,2}}

so you might need additional methods.

@tkoolen, does lazy evaluation mean that both of our methods will have identical performance in v0.7, because the compiler will be smart enough to reorient the transposing so that it does the matrix multiplication efficiently?

I’m not sure what you mean by ‘reorient the transposing’, but it just means that e.g. Δm_mat' now produces a very cheap Adjoint ‘wrapper’ around Δm_mat that can then be used for dispatch. For matrix multiplication into a preallocated destination vector, this makes it so that instead of having A_mul_B!, At_mul_B!, A_mul_Bt!, Ac_mul_B! and so on, there’s now only mul! with various overloads.

This makes it a lot easier to implement the optimization in ForwardDiff; you’ll only have to overload mul! (which is called by *) with the type of M something like Union{Matrix{S}, Transpose{S,Matrix{S}}, Adjoint{S,Matrix{S}}} where S <: LinearAlgebra.BlasFloat (or perhaps better, StridedMatrix{<:LinearAlgebra.BlasFloat}).
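For concreteness, here is a rough sketch of what such a mul! overload might look like on Julia >= 0.7 for the Δm_mat' * s pattern from this thread (this is not part of ForwardDiff, and the exact signature is only illustrative):

using ForwardDiff, LinearAlgebra
using ForwardDiff: Dual

# Sketch only: intercept mul! (which * calls) for (adjoint Float64 matrix) * (vector of Duals);
# unpack the duals into a plain matrix so the product runs through BLAS, then repack.
function LinearAlgebra.mul!(y::Vector{Dual{T,V,N}},
                            At::Adjoint{V,Matrix{V}},
                            x::Vector{Dual{T,V,N}}) where {T, V<:LinearAlgebra.BlasFloat, N}
    M = parent(At)                                               # At == M'
    xmat = Matrix(reshape(reinterpret(V, x), N + 1, length(x)))  # column i holds [value; partials...] of x[i]
    ymat = xmat * M                                              # (N+1) × size(M, 2), a plain BLAS matmul
    copyto!(y, reinterpret(Dual{T,V,N}, vec(ymat)))
    return y
end

Covering plain Matrix and Transpose-wrapped arguments would just mean adding the analogous methods (or a Union in the signature, as suggested above).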
