Comparison of automatic differentiation tools from 2016 still accurate?

I was made aware of a comprehensive comparison of AD tools at a talk I attended yesterday. The best reference I could find online was these slides: http://www.autodiff.org/ad16/Oral/Srajer_Benchmark.pdf

Does anyone know if these benchmarks are representative of the current state of ForwardDiff? Or are these timings potentially due to the Julia implementation written by the student who did these comparisons?

They show potential for running-time improvements of several orders of magnitude for ForwardDiff compared to, for example, ADiMat in MATLAB. I've included a plot from the linked slides:

The timing for the objective (no AD) in Julia is one to two orders of magnitude larger than the C version, and is slower than the Julia vectorized version, so I'd be suspicious of the AD timings as well. The different scaling (N^2 vs N) seems consistent with forward vs reverse AD, though: a forward-mode gradient needs on the order of N sweeps of the objective, while reverse mode needs only a constant number.

Perhaps it would help to see the source code, do we know where it is?

1 Like

Yes, I thought that was quite suspicious as well.

I Googled around a bit, but couldn't find it. I can email the author (Srajer) and Fitzgibbon, who presented the plots at the talk I attended, and ask if it is available.
See the post below (Comparison of automatic differentiation tools from 2016 still accurate? - #5 by anriseth)

Actually, I was just bad at Googling (how do I cancel a post that is "pending"?).

There is a GitHub repo with all the code.

Welcome @anriseth. Unfortunately you can't cancel those posts in the new queue, but fortunately you won't hit them anymore! I approved both these posts since there was other info in that first one, too… but now you have some more privileges (including the ability to edit). Let me know if you have any questions.

1 Like

They timed code that you should never write.

https://github.com/awf/autodiff/blob/b5db1d62c2c320978b23d01a795f46c3b501fac7/Julia/Tests/gmm_F.jl#L69-L77

That doesn't use the in-place gradient!, and it doesn't cache the GradientConfig. And what did they write for the objective function?

https://github.com/awf/autodiff/blob/b5db1d62c2c320978b23d01a795f46c3b501fac7/Julia/Tests/gmm_F.jl#L20-L28

@inbounds function gmm_objective(alphas,means,icf,x,wishart::Wishart)
  d = size(x,1)
  n = size(x,2)
  CONSTANT = -n*d*0.5*log(2 * pi)

  sum_qs = sum(icf[1:d,:],1)
  slse = sum(sum_qs)
  Qs = [get_Q(d,icf[:,ik]) for ik in 1:k]
  main_term = zeros(eltype(alphas),1,k)

  slse = 0.
  for ix=1:n
    for ik=1:k
      main_term[ik] = -0.5*sumabs2(Qs[ik] * (x[:,ix] - means[:,ik]))
    end
    slse += logsumexp(alphas + sum_qs + main_term)
  end

  CONSTANT + slse - n*logsumexp(alphas) + log_wishart_prior(wishart, sum_qs, Qs, icf)
end

(It's funny that they added @inbounds to make it look like they tried…).

It looks like they don't make the C++ code allocate vectors at every step (someone can correct me if I'm wrong there), whereas the ForwardDiff version allocates its workspace and the result on every call, allocates temporaries in each vectorized statement, doesn't fuse broadcasts in the innermost loop (though this looks like it predates broadcast fusion, so it would've needed to be unrolled into loops), etc. This is far from efficient code, and the issues are so clear that nobody would realistically make these mistakes in high-performance code, so it's not representative of real-world use. The cached-config pattern is sketched below.
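
For reference, the pattern being referred to (build the GradientConfig once, write into a preallocated gradient) looks roughly like this. This is a minimal sketch with a placeholder objective; obj here is not the benchmark's gmm_objective:

using ForwardDiff

# Placeholder objective standing in for the real one, with all parameters
# packed into a single vector.
obj(x) = sum(abs2, x) + log(sum(exp, x))

x   = rand(1000)
out = similar(x)

# Build the dual-number workspace once, outside the timing loop...
cfg = ForwardDiff.GradientConfig(obj, x)

# ...and reuse it, writing the gradient into a preallocated buffer.
ForwardDiff.gradient!(out, obj, x, cfg)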

6 Likes

I wouldn't be so severe: it looks like they made a serious effort at comparing all those languages and libraries, which is painstaking work. This comes with the usual caveat of benchmarks: the languages the author is most familiar with are likely to rank more highly.

For this particular use case, even if you cook up the perfect Julia implementation, by pure scaling it's going to rank somewhere near finite differences (C++), i.e. not competitive. It would be interesting to see some Julia reverse-mode tools in that plot (with the objective written both in "vector" and in "loop" style, the latter presumably carrying more of an overhead).
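
For reference, a reverse-mode entry could be measured with ReverseDiff.jl along these lines. This is only a sketch, assuming an objective with no value-dependent branching (a requirement for compiled tapes); the objective below is a placeholder:

using ReverseDiff

obj(x) = sum(abs2.(x .- 1.0))   # placeholder objective

x   = rand(1000)
out = similar(x)

# Record the operations once into a tape, then compile it for reuse.
tape  = ReverseDiff.GradientTape(obj, x)
ctape = ReverseDiff.compile(tape)

# Reuse the compiled tape for every new input of the same shape.
ReverseDiff.gradient!(out, ctape, x)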

1 Like

I agree with that.

I'm not sure I agree with that. I took their code:

using BenchmarkTools

function logsumexp(x)
  mx = maximum(x)
  log(sum(exp.(x .- mx))) + mx
end


function f()
  x = rand(100,100)
  n = 100; k= 100
  Qs = [rand(100) 1:100]
  main_term = Vector{Float64}(100)
  means = rand(100,100)
  slse = 0.0
  alphas = 0.0
  sum_qs = 0.0

  @inbounds for ix=1:n
    for ik=1:k
      main_term[ik] = -0.5*sum(abs2,Qs[ik] * (x[:,ix] - means[:,ik]))
    end
    slse += logsumexp(alphas + sum_qs + main_term)
  end
end

@btime f()

4.644 ms (60100 allocations: 34.57 MiB)

and 5-10 minutes later:

function logsumexp(x)
  mx = maximum(x)
  @. x = exp(x - mx)
  log(sum(x)) + mx
end

function h(x,Qs,tmp,main_term,means)

  n = 100; k= 100

  slse = 0.0
  alphas = 0.0
  sum_qs = 0.0

  @inbounds for ix=1:n
    for ik=1:k
      for j in 1:100
        tmp[j] = Qs[ik] * (x[j,ix] - means[j,ik])
      end
      main_term[ik] = -0.5*sum(abs2,tmp)
    end
    for i in eachindex(main_term)
      tmp[i] = alphas + sum_qs + main_term[i]
    end
    slse += logsumexp(tmp)
  end
end

x = rand(100,100)
Qs = rand(100)
tmp = zeros(100)
main_term = Vector{Float64}(100)
means = rand(100,100)

@btime h(x,Qs,tmp,main_term,means)

855.973 Ī¼s (0 allocations: 0 bytes)

That's a 5x difference just sitting there (and yes, that first one is not what they called the vectorized code either). Then there's the GradientConfig caching in ForwardDiff, which we know from package writing is easily another 5x or so (of course, this all depends on vector sizes). So it should be sitting right on that C++ line. I'd have to question everything in that plot by about an order of magnitude.

The problem with that is that I'm pretty sure it completely changes the interpretation. I think that if these were all done efficiently, then for large enough vectors/matrices similar algorithms would line up in similar spots with very little difference due to the actual language details (simply because vectorization is fine once the arrays get large enough and you manage the temporaries well). That interpretation falls within their >1 order of magnitude error bars, and you can see it in how the languages they know all line up right next to each other, with the outliers being F#, Python, and Julia.

2 Likes

Perhaps someone could consider making a PR to the benchmark repo?

That said, the general problem with this kind of benchmark is that it tends to become obsolete very quickly. Hopefully Julia will get even better AD in the medium term.

1 Like

Thanks for posting! I attended this conference, but I think I missed this talk.

This is actually decent, IMO. As a forward-mode AD tool, ForwardDiff's target domain is (very) low-dimensional gradients, and Jacobians where the input dimension is less than or comparable to the output dimension. For those cases, these benchmarks seem to indicate that ForwardDiff performs comparably to the C/C++ tools.
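
To illustrate that regime, here is a small sketch (the map below is made up, not from the benchmark): a function from R^2 to R^3 has more outputs than inputs, so forward mode recovers the whole Jacobian cheaply (here both inputs fit in a single dual-number chunk).

using ForwardDiff

# R^2 -> R^3: more outputs than inputs, forward mode's sweet spot.
g(x) = [x[1] * x[2], sin(x[1]), exp(x[2])]

ForwardDiff.jacobian(g, [1.0, 2.0])   # 3×2 Jacobian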

The gradient's scaling with dimension is obviously going to be bad for forward-mode tools, so I'm not sure what the point of benchmarking them all together was, but I guess it's academically interesting to compare ForwardDiff's scaling with Ceres' on the same plot as the reverse-mode tools.

It would've been interesting to see a plot of AD performance relative to the objective function run time alongside the absolute timings presented here.

As other folks have stated, the ForwardDiff benchmark implementation does leave some performance on the table. It seems feasible that ForwardDiff could beat Ceres with a little elbow grease :slight_smile:

The real goal, though, is to get Capstan on that chart…

2 Likes

How's Capstan doing?

Ignoring the very important fact that Cassette itself still requires a bit of Base work, I'd say that Capstan is a working prototype: all the fundamental mechanisms are in place (live variable storage, tape storage, interception mechanism, etc.), but things like the user-facing API and the reverse-mode primitive catalog only barely exist. Such items should be relatively easy to implement later, though, since their design is mostly complete at this point. I'd say the bulk of the (non-Base/non-Cassette) work from here will be implementing various performance optimizations, API utilities, and tests/examples/docs.

Capstan's simple neural net example works if you can find the magical Base + Cassette commits on which things currently run.

As far as a timeline goes, the (very) tentative idea is that Capstan will have an initial release around the same time as Cassette, and Cassette will have a release once Base can support it, and Base's support for Cassette is on the v1.1 milestone, so… stay tuned!

10 Likes

Looking forward to using Capstan! In the meantime, thanks for your maintenance work on the other AD packages, I could not imagine doing without them.

4 Likes

Posting it here instead of opening a new topic: Bob Carpenter, one of the main architects of the AD framework of Stan, has an interesting writeup of their new reverse-mode AD strategy using continuations:

I wonder how it compares to existing and planned tools (AFAICT ReverseDiff.jl uses a tape). At first glance, it should be easy to implement and optimization-friendly in Julia.

1 Like

After going through some of the maths of dual numbers, I realized that AD using dual numbers is SLOWER even than finite differences! (Of course, AD is more accurate.) Am I wrong? Does anyone know of extremely fast AD using dual numbers?

1 Like

It is hard to answer the question without a minimal working example.

1 Like

Number of parameters? Did you preallocate the configuration and diffresult?
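
For reference, preallocating both looks roughly like this. A minimal sketch, assuming a placeholder function and input size:

using ForwardDiff, DiffResults

g(x) = sum(abs2, x)                          # placeholder objective
x = rand(10)

res = DiffResults.GradientResult(x)          # holds the value and the gradient
cfg = ForwardDiff.GradientConfig(g, x)       # dual-number workspace, built once

res = ForwardDiff.gradient!(res, g, x, cfg)  # no per-call setup allocations
DiffResults.value(res), DiffResults.gradient(res)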

using BenchmarkTools, Compat
import ForwardDiff
import DualNumbers

function float64func(x::Float64)::Float64
    2.0 * (x + 1.0) + 3.0 * (x * x)
end

function dualfunc(x::DualNumbers.Dual{Float64})::DualNumbers.Dual{Float64}
    2.0 * (x + 1.0) + 3.0 * (x * x)
end

function forwarddifffunc(x)
    x1 = x[1]
    2.0 * (x1 + 1.0) + 3.0 * (x1 * x1)
end

"""
julia> float64func(5.0)
87.0

julia> dualfunc(DualNumbers.dual(5.0, 1.0) )
87.0 + 32.0ɛ

julia> forwarddifffunc([5.0])
87.0
"""

function finitedifferencegrad(x::Float64)::Float64
    delta = 0.001

    (float64func(x + delta) - float64func(x) ) / delta
end

function dualgrad(x::Float64)::Float64
    DualNumbers.dualpart(dualfunc(DualNumbers.dual(x, 1.0) ) )
end

"""
julia> finitedifferencegrad(5.0)
32.00300000000311

julia> dualgrad(5.0)
32.0

julia> ForwardDiff.gradient(forwarddifffunc, [5.0])
1-element Array{Float64,1}:
 32.0

julia> @benchmark finitedifferencegrad(5.0)
BenchmarkTools.Trial:
 memory estimate:  0 bytes
 allocs estimate:  0
 --------------
 minimum time:     1.564 ns (0.00% GC)
 median time:      1.744 ns (0.00% GC)
 mean time:        1.678 ns (0.00% GC)
 maximum time:     18.642 ns (0.00% GC)
 --------------
 samples:          10000
 evals/sample:     1000

julia> @benchmark dualgrad(5.0)
BenchmarkTools.Trial:
 memory estimate:  0 bytes
 allocs estimate:  0
 --------------
 minimum time:     3.623 ns (0.00% GC)
 median time:      3.730 ns (0.00% GC)
 mean time:        3.933 ns (0.00% GC)
 maximum time:     22.680 ns (0.00% GC)
 --------------
 samples:          10000
 evals/sample:     1000

julia> @benchmark ForwardDiff.gradient(forwarddifffunc, [5.0])
BenchmarkTools.Trial:
 memory estimate:  352 bytes
 allocs estimate:  5
 --------------
 minimum time:     6.336 Ī¼s (0.00% GC)
 median time:      7.132 Ī¼s (0.00% GC)
 mean time:        7.168 Ī¼s (0.00% GC)
 maximum time:     27.631 Ī¼s (0.00% GC)
 --------------
 samples:          10000
 evals/sample:     5
"""

In this simple example, the direct gradient via dual-number maths is 2.34x as costly as finite differences.
Using ForwardDiff's AD is even worse: it's 4271.75x!

Actually, working through the maths with dual numbers looks like this:
if x = 5.0 + 1ϵ,
2.0 * (x + 1.0) + 3.0 * (x * x) becomes:
2.0 * (5.0 + 1ϵ + 1.0) + 3.0 * ((5.0 + 1ϵ) * (5.0 + 1ϵ))
= 12.0 + 2.0ϵ + 3.0 * (25.0 + 10.0ϵ + ϵ²)
= 12.0 + 2.0ϵ + 3.0 * (25.0 + 10.0ϵ + 0.0) (as ϵ² is defined to be 0.0)
= 87.0 + 32.0ϵ

So we can easily see that it's even more time-consuming than finite differences.
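
For what it's worth, much of the ForwardDiff gap above likely comes from the measurement itself (a fresh [5.0] vector plus a fresh config built inside every ForwardDiff.gradient call) rather than from the dual-number arithmetic. A sketch of a more direct measurement for a scalar function, using ForwardDiff.derivative (timings will of course differ by machine):

using BenchmarkTools
import ForwardDiff

poly(x) = 2.0 * (x + 1.0) + 3.0 * (x * x)   # same polynomial as above

# Scalar-in/scalar-out, so use the scalar entry point instead of building a
# 1-element vector and a GradientConfig on every call.
ForwardDiff.derivative(poly, 5.0)           # 32.0

# Interpolate the argument so @btime doesn't also measure global-variable overhead.
x0 = 5.0
@btime ForwardDiff.derivative(poly, $x0)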

That's really cool!
I've been wanting to experiment more with autodiff, so there's an excuse.

Here's a Julia translation of Bob Carpenter's opening post:

import FunctionWrappers: FunctionWrapper

const adj_t = Vector{Float64}
const chain_t = FunctionWrapper{Cvoid,Tuple{adj_t}}
const stack_ = chain_t[]

const next_idx_ = Ref{Int}(0)
@inline next_idx() = next_idx_[] += 1

struct var{T,int}
    val::T
    idx::int
end
var(v::var) = var(v.val, v.idx)
var(val) = var(val, next_idx())
var() = var(NaN, next_idx())

@inline function Base.:+(x1::var, x2::var)
    y = var(x1.val + x2.val)
    let x1i = x1.idx, x2i = x2.idx, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x1i] += adj[yid]
            adj[x2i] += adj[yid]
            nothing
        end))
    end
    y
end
@inline function Base.:+(x1::var, x2)
    y = var(x1.val + x2)
    let x1i = x1.idx, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x1i] += adj[yid]
            nothing
        end))
    end
    y
end
@inline function Base.:+(x1, x2::var)
    y = var(x1 + x2.val)
    let x2i = x2.idx, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x2i] += adj[yid]
            nothing
        end))
    end
    y
end

@inline function Base.:*(x1::var, x2::var)
    y = var(x1.val * x2.val)
    let x1v = x1.val, x1i = x1.idx, x2v = x2.val, x2i = x2.idx, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x1i] += x2v * adj[yid]
            adj[x2i] += x1v * adj[yid]
            nothing
        end))
    end
    y
end
@inline function Base.:*(x1::var, x2)
    y = var(x1.val * x2)
    let x1i = x1.idx, x2v = x2, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x1i] += x2v * adj[yid]
            nothing
        end))
    end
    y
end
@inline function Base.:*(x1, x2::var)
    y = var(x1 * x2.val)
    let x1v = x1, x2i = x2.idx, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x2i] += x1v * adj[yid]
            nothing
        end))
    end
    y
end

@inline function Base.exp(x::var)
    y = var(exp(x.val))
    let xid = x.idx, yval = y.val, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[xid] += yval * adj[yid]
            nothing
        end))
    end
    y
end

function grad(y::var, x::Vector{<:var})
    adj = zeros(y.idx)
    adj[y.idx] = 1
    for i ∈ length(stack_):-1:1
        stack_[i](adj)
    end
    dy_dx = Vector{Float64}(undef, length(x))
    for i ∈ eachindex(x)
        dy_dx[i] = adj[x[i].idx]
    end
    dy_dx
end

I wasn't sure whether I should go with push! and looping over stack_ backwards, or insert!(stack_, 1, ...) and then a simple for fun! ∈ stack_ loop. I ended up going with the former.

This yields:

julia> resize!(stack_, 0);

julia> next_idx_[] = 0;

julia> x1 = var(10.3);

julia> x2 = var(-1.1);

julia> x = [x1, x2];

julia> y = x1 * x2 * 2 + 7
(x1.idx, x2.idx, y.idx) = (1, 2, 3)
(x1.idx, y.idx) = (3, 4)
(x1.idx, y.idx) = (4, 5)
var{Float64,Int64}(-15.660000000000004, 5)

julia> dy_dx = grad(y, x)
adj = [-2.2, 20.6, 2.0, 1.0, 1.0]
2-element Array{Float64,1}:
 -2.2
 20.6

I assume (like the C++ version) that this would be very slow. The functions won't inline, vectorize, etc., so there will be a lot of overhead from repeated little calls.

Anyway, I saw that Cassette now has version 1.1 and documentation:
https://github.com/jrevels/Cassette.jl

So definitely looking forward to Capstan!
Any updates?
Also saw that Jarrett Revels is listed among the JuliaCon speakers.
I'm guessing we'll see that talk on YouTube in the next few days?

1 Like