 # Automatic Differentiation Slow (Slower than Finite-Differences)

I am trying to implement a statistical estimator and am having issues with both ForwardDiff and ReverseDiff taking longer to compute gradients than would simple one-sided finite differencing. Below is a minimal example with `logit_GMM` being the function that I am trying to minimize. With 20 parameters, the ForwardDiff gradient evaluation time is more than 25x the function evaluation time (when optimally chunked), and ReverseDiff takes closer to 400x the evaluation time! By comparison, finite-differences takes less than 20x, and direct gradient computation takes even less. (My actual application is more complex, and direct gradient computation is much less feasible there.)

Am I doing something wrong, or is this problem just poorly-designed for either forward or reverse-mode automatic differentiation? Any thoughts on how to make my AD faster would be greatly appreciated. (Any other comments that might making my code more efficient are certainly welcomed as well.)

Thank you!

``````using ForwardDiff, BenchmarkTools, ReverseDiff

num_i = 1_000
choice_set_size = 30
true_b = randn(20)
num_moments = 40

X=randn(num_i*choice_set_size,length(true_b))
Δm_mat = rand(num_i*choice_set_size,num_moments)
ranges = [colon(1+choice_set_size*(ii-1), choice_set_size*ii) for ii=1:num_i]

function logit_GMM(b,X::Matrix{Float64},Δm_mat::Matrix{Float64},ranges::Vector{UnitRange{Int64}})
eu=exp.(X*b)
s=similar(eu)
@inbounds for rng ∈ ranges
@views s[rng] .= eu[rng] ./ sum(eu[rng])
end
EΔm = (Δm_mat' * s) ./ size(ranges,1)
return EΔm' * EΔm
end

logit_GMM(b) = logit_GMM(b,X::Matrix{Float64},Δm_mat::Matrix{Float64},ranges::Vector{UnitRange{Int64}})

b=randn(size(true_b))
results = similar(b)

println("Eval-times:")
@btime logit_GMM(\$b,\$X::Matrix{Float64},\$Δm_mat::Matrix{Float64},\$ranges::Vector{UnitRange{Int64}})
@btime logit_GMM(\$b)

``````

The benchmarked times are:

``````Eval-times:
1.043 ms (1008 allocations: 751.11 KiB)
1.040 ms (1008 allocations: 751.11 KiB)
32.826 ms (2018 allocations: 15.21 MiB)
126.079 ms (20160 allocations: 28.41 MiB)
32.621 ms (2016 allocations: 15.21 MiB)
26.231 ms (1009 allocations: 14.48 MiB)
384.239 ms (0 allocations: 0 bytes)
``````
1 Like

Is the threading needed for this issue to be reproducible? If not, please remove it. Does the inputs really need to be so big to that the time to take the gradient a full minute to execute or can the same point be achieved with smaller input data? The less code and the faster it takes to run, the more likely it is that someone will have time to go through it and give a good answer.

Anyway, it seems you have a R^N -> R with a large N problem, so reverse mode differentiation is likely much more efficient.

5 Likes

@kristoffer.carlsson, thank you for the excellent suggestions. I do not have access to a computer with Julia the moment, but I will update the code to be simpler and require less time as soon as possible.

@kristoffer.carlsson, I’ve edited the code to run faster and be as simple as possible. It’s still the case that ForwardDiff is slower than AD. Also, unfortunately it appears that compiled ReverseDiff is even slower than ForwardDiff in this case! This is surprising to me given that this example is similar to a neural-net problem, which is the classic case where reverse-mode AD is supposed to perform extremely well.

What is the size of your gradient? From your sample code, it looks like it’s just 20 elements, is that right? If so, then I’d expect ForwardDiff to be faster than ReverseDiff, even though algorithmically reverse-mode AD might be preferred. ForwardDiff just has a bit less overhead and tends to be very fast for < 100 output dimensions.

If you’re still seeing performance issues, the performance tips from the manual are still the best thing to read: https://docs.julialang.org/en/stable/manual/performance-tips/#man-performance-tips-1

In particular, checking your code for type-instabilities or any unnecessary allocations will help. It should nearly always be possible to make ForwardDiff as fast as or faster than numerical diffing, potentially with some code polish.

In my actual application, I expect there to be <50 parameters and to have function evaluation times between 2 and 10 seconds (much longer matrices and a few additional model complexities). As such, you may be right that appropriately chunked forward-diff might be the best approach.

In particular, checking your code for type-instabilities or any unnecessary allocations will help.

I took a look at the performance tips section. All of my code was type-stable (according to `@code_warntype`) excepting the fact that I hadn’t set the chunk-size in ForwardDiff. Doing so and choosing the maximal chunk-size does improve performance a bit, but it is still worse than finite differences. (Modified results are now included in the original post.)

I can preallocate all of the intermediate arrays (i.e. `eu`, `s`, `EΔm`), however I timed some of these allocations, and they appear to scale linearly. In fact, the `@btime` stats suggest that in the chunk-size 20 case, the allocations are the same and the memory use is actually just under 20x the `logit_GMM` evaluation time.

It should nearly always be possible to make ForwardDiff as fast as or faster than numerical diffing, potentially with some code polish

I definitely agree that ForwardDiff should basically always outperform finite differencing. In my experience it has always outperformed finite differencing and sometimes even outperforms analytic computation, which is why I’m somewhat perplexed by its underperformance here.

Profiling with

``````using ProfileView
Profile.clear()
@profile @btime ForwardDiff.gradient!(\$results,\$logit_GMM,\$b) evals = 100
ProfileView.view()
``````

shows that almost half the time is being spent doing garbage collection (unless this is somehow an artifact of the way I’m profiling), so I do think it’s worth preallocating more. Almost all of the remaining time is spent in the `generic_matvecmul!` call on line 18, i.e.

``````EΔm = (Δm_mat' * s) ./ size(ranges,1)
``````

As this is just a linear transformation of `s` (`Δm_mat` is constant), just taking the gradient of this step manually (trivial) may be a good solution. You could e.g. extract that part out into a function and specialize it for `ForwardDiff.Dual` inputs. 1 Like

Sorry, the GC stuff is definitely an artifact. Here’s a better picture:

``````function test(n, results, b)
for i = 1 : n
end
end

using ProfileView
test(1, results, b)
Profile.clear()
@profile test(100, results, b)
ProfileView.view()
``````
1 Like

Because it’s using dual numbers, ForwardDiff ends up using generic matrix multiply instead of BLAS which is used on the Float64s from finite differencing. Thus the limiting step in this function is computed much faster for finite differences, especially since BLAS isn’t just more optimized than the built-in generic matmuls but it’s also multithreaded.

This is a good motivation for a JuliaBLAS revival.

You can get around this though by chunking up your function, defining a dispatch for Dual that AD’s through most of it and then hardcodes the AD through the matmul.

9 Likes

@tkoolen and @ChrisRackauckas, thank you for pinning down the source of the issue!

I have been playing around with possible solutions. I did manage to chunk up my function easily, as @ChrisRackauckas suggested, however my implementation was inelegant and had a lot of overhead (because the intermediate step holds a very large Jacobian right before it gets reduced). I think the “right” permanent solution is probably to create a function that intercepts whatever gets called by `M'*v` in the case of `M::Matrix{Float64}`, `v::Vector{Dual{T,V,N}}`. My quickly sketched version of such a function looks like:

``````function MTv(M::Matrix{Float64},v::Vector{ForwardDiff.Dual{T,V,N}}) where {T,V,N}

function rv_pm(v::Vector{ForwardDiff.Dual{T,V,N}}) where {T,V,N}
real_vec=Vector{V}(length(v))
partial_mat = Matrix{V}(length(v),N)
@inbounds real_vec[dd]=v[dd].value
@inbounds view(partial_mat,dd,:) .= v[dd].partials
end
return real_vec ::Vector{Float64}, partial_mat ::Matrix{Float64}
end

rv, pm = rv_pm(v)
r = M'*rv
p = M'*pm
outvec = Vector{ForwardDiff.Dual{T,V,N}}(size(M,2))
@inbounds outvec[dd] = ForwardDiff.Dual{T}(r[dd],tuple(view(p,dd,:)...))
end
return outvec
end
``````

This version is more than 2x as fast as `M'*v` for reasonably large matrices on my machine:

``````using ForwardDiff, BenchmarkTools

nparams = 10;
v_len = 100_000;
out_len =30;
dv = ForwardDiff.Dual.(randn(v_len),[tuple(rand(nparams)...) for dd=1:v_len]);
M=rand(v_len,out_len);

@btime MTv(\$M,\$v)
10.166 ms (76 allocations: 8.40 MiB)
@btime \$M'*\$v
26.849 ms (1 allocation: 2.75 KiB)
``````

Of course, this code is probably nowhere near as efficient as it could be. There is certainly a lot of overhead to converting to matrices and back. If I can figure out how to write a truly efficient version of `MTv` and what I need to name it to naturally intercept generic matrix multiplication when correctly-typed, I might try to submit a pull request to ForwardDiff. I imagine there are quite a number of people with applications that include large matrix multiplication (e.g. statistics and data science applications) that might benefit.

2 Likes

Cool!

``````using ForwardDiff
using LinearAlgebra

function MTv(M::Matrix{Float64}, v::Vector{ForwardDiff.Dual{T,V,N}}) where {T, V <: LinearAlgebra.BlasFloat, N}
num_v = length(v)
vmat = transpose(reshape(reinterpret(V, v), N + 1, num_v)) # first column: values, other columns: partials
result = transpose(M * vmat) # first row: values, other rows: partials
transpose(reinterpret(ForwardDiff.Dual{T, V, N}, result))
end

using Test

@testset "MTv" begin
m = 3
n = 4
p = 5

M = rand(m, n)
v = [ForwardDiff.Dual(rand(), ForwardDiff.Partials(tuple(rand(p)...))) for _ = 1 : n]
ret1 = MTv(M, v)
ret2 = M * v
@test ret1 ≈ ret2
end
``````

(Tested on Julia master)

2 Likes

@tkoolen, that looks much more elegant!

Unfortunately I can’t test it because I can’t figure out how to use add/use LinearAlgebra package. `Pkg.add("LinearAlgebra")` throws an error that the package is unknown. Is this only available on v.7?

If I try replacing `LinearAlgebra.BlasFloat` with `Base.LinAlg.BlasFloat` I get a `ArgumentError: result shape not specified`.

It’s not registered, you need to `Pkg.clone` it.

1 Like

Yeah, there have been a bunch of changes in `reinterpret`. This works on both 0.6 and master:

``````using Compat
using ForwardDiff
const LinearAlgebra = Compat.LinearAlgebra

function MTv(M::Matrix{Float64}, v::Vector{ForwardDiff.Dual{T,V,N}}) where {T, V <: LinearAlgebra.BlasFloat, N}
num_v = length(v)
vmat = transpose(reshape(reinterpret(V, v), N + 1, num_v)) # first column: values, other columns: partials
result = transpose(M * vmat) # first row: values, other rows: partials
reinterpret(ForwardDiff.Dual{T, V, N}, reshape(result, length(result)))
end

using Compat.Test

@testset "MTv" begin
m = 3
n = 4
p = 5

M = rand(m, n)
v = [ForwardDiff.Dual(rand(), ForwardDiff.Partials(tuple(rand(p)...))) for _ = 1 : n]
ret1 = MTv(M, v)
ret2 = M * v
@test ret1 ≈ ret2
end

nparams = 10;
v_len = 100_000;
out_len =30;
v = ForwardDiff.Dual.(randn(v_len),[tuple(rand(nparams)...) for dd=1:v_len]);
M=rand(out_len, v_len);

using BenchmarkTools
@btime MTv(\$M, \$v);
@btime \$M * \$v;
``````

Results on latest master:

``````  6.503 ms (6 allocations: 2.88 KiB)
14.319 ms (1 allocation: 2.75 KiB)
``````

Results on 0.6.2:

``````  9.500 ms (12 allocations: 8.40 MiB)
17.520 ms (1 allocation: 2.75 KiB)
``````

Edit: I noticed that the difference is much smaller (but still significant) with `-O3`.

1 Like

No, I was using `LinearAlgebra` from stdlib (the new name for `Base.LinAlg`).

@tkoolen, Thanks. Looks great. One quick point: in my application, at least, I was doing `M' *v`, which can be done even more concisely and runs much faster because it avoids all of the transposing (`A'*B = (B'*A)'`) and does the multiplication along the natural column-wise indices for both matrices:

``````MTv(M::Matrix{Float64}, v::Vector{ForwardDiff.Dual{T,V,N}}) where {T, V <: LinearAlgebra.BlasFloat, N}
num_v = length(v)
result = reshape(reinterpret(V, v), N + 1, num_v)*M # first column: values, other columns: partials
reinterpret(ForwardDiff.Dual{T, V, N}, vec(result))
end

using Compat.Test

@testset "MTv" begin
m = 3
n = 4
p = 5

M = rand(n, m)
v = [ForwardDiff.Dual(rand(), ForwardDiff.Partials(tuple(rand(p)...))) for _ = 1 : n]
ret1 = MTv(M, v)
ret2 = M' * v
@test ret1 ≈ ret2
end
``````

If you really wanted to one-line it:

``````MTv(M::Matrix{Float64}, v::Vector{ForwardDiff.Dual{T,V,N}}) where {T, V <: LinearAlgebra.BlasFloat, N}=reinterpret(ForwardDiff.Dual{T, V, N}, vec(reshape(reinterpret(V, v), N + 1, length(v))*M))
``````

This runs ~5x faster than `M'*v` on my computer!

Nice! But be aware that `transpose` and `adjoint` are going to be lazy in 0.7:

``````julia> typeof(transpose(rand(3, 3)))
LinearAlgebra.Transpose{Float64,Array{Float64,2}}

julia> typeof(rand(3, 3)')
I’m not sure what you mean by ‘reorient the transposing’, but it just means that e.g. `Δm_mat'` now produces a very cheap `Adjoint` ‘wrapper’ around `Δm_mat` that can then be used for dispatch. For matrix multiplication into a preallocated destination vector, this makes it so that instead of having `A_mul_B!`, `At_mul_B!`, `A_mul_Bt!`, `Ac_mul_B!` and so on, there’s now only `mul!` with various overloads.
This makes it a lot easier to implement the optimization in ForwardDiff; you’ll only have to overload `mul!` (which is called by `*`) with the type of `M` something like `Union{Matrix{S}, Transpose{Matrix{S}}, Adjoint{Matrix{S}}} where S <: LinearAlgebra.BlasFloat` (or perhaps better, `StridedMatrix{<:LinearAlgebra.BlasFloat}`).