Comparison of automatic differentiation tools from 2016 still accurate?

differentiation

#1

I was made aware of a comprehensive comparison of AD tools at a talk I attended yesterday. The best reference I could find online was these slides: http://www.autodiff.org/ad16/Oral/Srajer_Benchmark.pdf

Does anyone know whether these benchmarks are representative of the current state of ForwardDiff? Or are these timings potentially an artifact of the Julia implementation written by the student who did the comparisons?

They show that there is potential for improvements of several orders of magnitude in running time for ForwardDiff compared to, for example, ADiMat in MATLAB. I’ve included a plot from the linked slides:


#2

The timing for the objective (no AD) in Julia is one to two orders of magnitude larger than the C version, and is slower than the Julia vectorized version, so I’d be suspicious of the AD timings. The different scaling (N^2 vs. N) does seem consistent with forward- vs. reverse-mode AD, though.
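That scaling is easy to see directly: forward mode seeds one dual-number perturbation per input dimension (in chunks), so the number of objective evaluations grows linearly with N, whereas reverse mode replays a single recorded pass. A quick sketch with ForwardDiff (the instrumented objective `f` and the chunk size are mine, purely for illustration):

```julia
using ForwardDiff

const CALLS = Ref(0)  # count how many times the objective runs

function f(x)
    CALLS[] += 1
    sum(abs2, x)
end

x = rand(20)

# Force a chunk size of 4: the gradient needs one dual-number pass
# per chunk, i.e. roughly N/chunk evaluations of f -- linear in N.
cfg = ForwardDiff.GradientConfig(f, x, ForwardDiff.Chunk{4}())
g = ForwardDiff.gradient(f, x, cfg)

CALLS[]  # one evaluation per 4-wide chunk of the 20 inputs
```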


#3

Perhaps it would help to see the source code, do we know where it is?


#4

Yes, I thought that was quite suspicious as well.

I Googled around a bit, but couldn’t find it. I can email the author (Srajer) and Fitzgibbon, who presented the plots at the talk I attended, and ask if it is available.
See the post below.


#5

Actually, I was just bad at Googling (how do I cancel a post that is “pending”?).

There is a Github repo with all the code.


#6

Welcome @anriseth. Unfortunately you can’t cancel those posts in the new-user queue, but fortunately you won’t hit it anymore! I approved both these posts since there was other info in that first one, too… but now you have some more privileges (including the ability to edit). Let me know if you have any questions.


#7

They timed code that you should never write.

It doesn’t use the in-place `gradient!`, and it doesn’t cache the `GradientConfig`. And what did they write for the objective function?

@inbounds function gmm_objective(alphas,means,icf,x,wishart::Wishart)
  d = size(x,1)
  n = size(x,2)
  k = size(means,2)  # k is used below but was missing from this excerpt
  CONSTANT = -n*d*0.5*log(2 * pi)

  sum_qs = sum(icf[1:d,:],1)
  slse = sum(sum_qs)
  Qs = [get_Q(d,icf[:,ik]) for ik in 1:k]
  main_term = zeros(eltype(alphas),1,k)

  slse = 0.
  for ix=1:n
    for ik=1:k
      main_term[ik] = -0.5*sumabs2(Qs[ik] * (x[:,ix] - means[:,ik]))
    end
    slse += logsumexp(alphas + sum_qs + main_term)
  end

  CONSTANT + slse - n*logsumexp(alphas) + log_wishart_prior(wishart, sum_qs, Qs, icf)
end

(It’s funny that they added @inbounds to make it look like they tried…).

It looks like they don’t make the C++ codes allocate vectors on every step (someone can correct me if I’m wrong there), whereas the ForwardDiff version allocates its workspace and the result on every call, allocates temporaries in each vectorized statement, doesn’t fuse broadcasts in the innermost loop (though this looks like it predates broadcast fusion, so back then the fix would have been to unroll into explicit loops), etc. This is far from efficient code, and the issues are clear enough that nobody would realistically make these mistakes in high-performance code… so it’s not representative of a benchmark on real use.
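For concreteness, the cached, in-place pattern being described looks roughly like this (with a stand-in objective rather than their GMM code):

```julia
using ForwardDiff

f(x) = sum(abs2, x)   # stand-in objective, not the GMM one
x = rand(100)
out = similar(x)

# Build the workspace once, outside any benchmark or optimization loop...
cfg = ForwardDiff.GradientConfig(f, x)

# ...then reuse it with the in-place gradient!, so no workspace or
# result vector is allocated per call.
ForwardDiff.gradient!(out, f, x, cfg)
```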


#8

I wouldn’t be so severe: it looks like they made a serious effort at comparing all those languages and libraries, which is painstaking work. This comes with the usual caveats of benchmarks: the languages the author is most familiar with are likely to be ranked more highly.

For this particular use case, even if you cook up the perfect Julia implementation, by pure scaling it’s going to rank somewhere near finite differences (C++), which is not competitive. It would be interesting to see some Julia reverse-mode tools in that plot (with the objective written both in “vector” and in “loop” style, the latter presumably carrying more of an overhead).
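For instance, with ReverseDiff the usual pattern is to record and compile a tape once, then replay it per input (a sketch with a stand-in objective; this assumes the objective has no value-dependent branches, which a compiled tape can’t follow):

```julia
using ReverseDiff

f(x) = sum(abs2, x)   # stand-in objective
x = rand(1000)
out = similar(x)

# Record the operations once into a tape and compile it; each replay
# then costs a small constant multiple of one objective evaluation,
# independent of the input dimension.
tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, x))
ReverseDiff.gradient!(out, tape, x)
```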


#9

I agree with that.

I’m not sure I agree with that. I took their code:

using BenchmarkTools

function logsumexp(x)
  mx = maximum(x)
  log(sum(exp.(x .- mx))) + mx
end


function f()
  x = rand(100,100)
  n = 100; k = 100
  Qs = rand(100)
  main_term = Vector{Float64}(undef, 100)
  means = rand(100,100)
  slse = 0.0
  alphas = 0.0
  sum_qs = 0.0

  @inbounds for ix=1:n
    for ik=1:k
      main_term[ik] = -0.5*sum(abs2, Qs[ik] * (x[:,ix] - means[:,ik]))
    end
    slse += logsumexp(alphas .+ sum_qs .+ main_term)
  end
  slse
end

@btime f()

4.644 ms (60100 allocations: 34.57 MiB)

and 5-10 minutes later:

function logsumexp(x)
  mx = maximum(x)
  @. x = exp(x - mx)
  log(sum(x)) + mx
end

function h(x,Qs,tmp,main_term,means)

  n = 100; k= 100

  slse = 0.0
  alphas = 0.0
  sum_qs = 0.0

  @inbounds for ix=1:n
    for ik=1:k
      for j in 1:100
        tmp[j] = Qs[ik] * (x[j,ix] - means[j,ik])
      end
      main_term[ik] = -0.5*sum(abs2,tmp)
    end
    for i in eachindex(main_term)
      tmp[i] = alphas + sum_qs + main_term[i]
    end
    slse += logsumexp(tmp)
  end
  slse
end

x = rand(100,100)
Qs = rand(100)
tmp = zeros(100)
main_term = Vector{Float64}(undef, 100)
means = rand(100,100)

@btime h($x,$Qs,$tmp,$main_term,$means)

855.973 μs (0 allocations: 0 bytes)

That’s a 5x difference just sitting there (and yes, that first one is not what they called the vectorized code either). Then there’s the `GradientConfig` caching in ForwardDiff, which we know from package-writing experience is easily another 5x or so (of course, this all depends on vector sizes). So it should be sitting right on that C++ line. I’d question everything by about an order of magnitude or so.

The problem is that I’m pretty sure this completely changes the interpretation. If these were all done efficiently, then for large enough vectors/matrices, similar algorithms would line up in similar spots with very little difference due to the actual language details (simply because vectorization is fine once arrays get large enough and you manage the temporaries well). That interpretation falls within their >1 order of magnitude error bars, and you can see it in how the languages they know all line up right next to each other, with the outliers being F#, Python, and Julia.


#10

Perhaps someone could consider making a PR to the benchmark repo?

That said, the general problem with these kinds of benchmarks is that they tend to become obsolete very quickly. Hopefully Julia will get even better AD in the medium term.


#11

Thanks for posting! I had attended this conference, but think I missed this talk.

This is actually decent, IMO. As a forward-mode AD tool, ForwardDiff’s target domain is (very) low-dimensional gradients, and Jacobians where the input dimension is at most roughly equal to the output dimension. For those cases, these benchmarks seem to indicate that ForwardDiff performs comparably to C/C++ tools.
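A map from few inputs to many outputs is squarely in that domain: the whole Jacobian comes out of very few dual-number passes of the function. A small sketch (the map `g` here is made up for illustration):

```julia
using ForwardDiff

# R^2 -> R^100: two inputs, many outputs -- forward mode's sweet spot.
ts = range(0, 1, length=100)
g(p) = [p[1] * sin(p[2] * t) for t in ts]

J = ForwardDiff.jacobian(g, [2.0, 3.0])
size(J)  # (100, 2): one column per input dimension
```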

The gradient scaling with dimension is obviously going to be bad for forward-mode tools, so I’m not sure what the point in benchmarking them all together was, but I guess it’s academically interesting to compare ForwardDiff’s scaling with Ceres’ on the same plot as reverse-mode tools.

It would’ve been interesting to see a plot of AD performance relative to the objective function run time alongside the absolute timings presented here.

As other folks have stated, the ForwardDiff benchmark implementation does leave some performance on the table. It seems feasible that ForwardDiff could beat Ceres with a little elbow grease :)

The real goal, though, is to get Capstan on that chart…


#12

How’s Capstan doing?


#13

Ignoring the very important fact that Cassette itself still requires a bit of Base work, I’d say that Capstan is a working prototype - all the fundamental mechanisms are in place (live variable storage, tape storage, interception mechanism, etc.), but things like the user-facing API and the reverse-mode primitive catalog only barely exist. Such items should be relatively easy to implement later, though, since their design is mostly complete at this point. I’d say the bulk of the (non-Base/non-Cassette) work from here will be implementing various performance optimizations, API utilities, and tests/examples/docs.

Capstan’s simple neural net example works if you can find the magical Base + Cassette commits on which things currently run.

As far as a timeline goes, the (very) tentative idea is that Capstan will have an initial release around the same time as Cassette, and Cassette will have a release once Base can support it, and Base’s support for Cassette is on the v1.1 milestone, so…stay tuned!


#14

Looking forward to using Capstan! In the meantime, thanks for your maintenance work on the other AD packages, I could not imagine doing without them.