Comparison of automatic differentiation tools from 2016 still accurate?

differentiation

#1

I was made aware of a comprehensive comparison of AD tools at a talk I attended yesterday. The best reference I could find online was these slides: http://www.autodiff.org/ad16/Oral/Srajer_Benchmark.pdf

Does anyone know whether these benchmarks are representative of the current state of ForwardDiff? Or are these timings potentially an artifact of the Julia implementation written by the student who did the comparisons?

They show that there is potential for improvements of several orders of magnitude in running time for ForwardDiff compared to, for example, ADiMat in MATLAB. I’ve included a plot from the linked slides:


#2

The timing for the objective (no AD) in Julia is one to two orders of magnitude larger than the C version, and is slower than the Julia vectorized version, so I’d be suspicious of the AD timings. The different scaling (N^2 vs. N) does seem consistent with forward- vs. reverse-mode AD, though.
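That scaling is easy to see directly: forward mode seeds one dual-number perturbation per input dimension (in chunks), so the number of objective evaluations grows linearly with N, whereas reverse mode replays a single recorded pass. A quick sketch with ForwardDiff (the instrumented objective `f` and the chunk size are mine, purely for illustration):

```julia
using ForwardDiff

const CALLS = Ref(0)  # count how many times the objective runs

function f(x)
    CALLS[] += 1
    sum(abs2, x)
end

x = rand(20)

# Force a chunk size of 4: the gradient needs one dual-number pass
# per chunk, i.e. roughly N/chunk evaluations of f -- linear in N.
cfg = ForwardDiff.GradientConfig(f, x, ForwardDiff.Chunk{4}())
g = ForwardDiff.gradient(f, x, cfg)

CALLS[]  # one evaluation per 4-wide chunk of the 20 inputs
```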


#3

Perhaps it would help to see the source code, do we know where it is?


#4

Yes, I thought that was quite suspicious as well.

I Googled around a bit, but couldn’t find it. I can email the author (Srajer) and Fitzgibbon, who presented the plots at the talk I attended, and ask if it is available.
See the post below.


#5

Actually, I was just bad at Googling (how do I cancel a post that is “pending”?).

There is a Github repo with all the code.


#6

Welcome @anriseth. Unfortunately you can’t cancel those posts in the new-user queue, but fortunately you won’t hit it anymore! I approved both these posts since there was other info in that first one, too… but now you have some more privileges (including the ability to edit). Let me know if you have any questions.


#7

They timed code that you should never write.

It doesn’t use the in-place `gradient!`, and it doesn’t cache the `GradientConfig`. And what did they write for the objective function?

@inbounds function gmm_objective(alphas,means,icf,x,wishart::Wishart)
  d = size(x,1)
  n = size(x,2)
  k = size(means,2)  # k is used below but was missing from this excerpt
  CONSTANT = -n*d*0.5*log(2 * pi)

  sum_qs = sum(icf[1:d,:],1)
  slse = sum(sum_qs)
  Qs = [get_Q(d,icf[:,ik]) for ik in 1:k]
  main_term = zeros(eltype(alphas),1,k)

  slse = 0.
  for ix=1:n
    for ik=1:k
      main_term[ik] = -0.5*sumabs2(Qs[ik] * (x[:,ix] - means[:,ik]))
    end
    slse += logsumexp(alphas + sum_qs + main_term)
  end

  CONSTANT + slse - n*logsumexp(alphas) + log_wishart_prior(wishart, sum_qs, Qs, icf)
end

(It’s funny that they added @inbounds to make it look like they tried…).

It looks like they don’t make the C++ codes allocate vectors on every step (someone can correct me if I’m wrong there), whereas the ForwardDiff version allocates its workspace and the result on every call, allocates temporaries in each vectorized statement, doesn’t fuse broadcasts in the innermost loop (though this looks like it predates broadcast fusion, so back then the fix would have been to unroll into explicit loops), etc. This is far from efficient code, and the issues are clear enough that nobody would realistically make these mistakes in high-performance code… so it’s not representative of a benchmark on real use.
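For concreteness, the cached, in-place pattern being described looks roughly like this (with a stand-in objective rather than their GMM code):

```julia
using ForwardDiff

f(x) = sum(abs2, x)   # stand-in objective, not the GMM one
x = rand(100)
out = similar(x)

# Build the workspace once, outside any benchmark or optimization loop...
cfg = ForwardDiff.GradientConfig(f, x)

# ...then reuse it with the in-place gradient!, so no workspace or
# result vector is allocated per call.
ForwardDiff.gradient!(out, f, x, cfg)
```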


#8

I wouldn’t be so severe: it looks like they made a serious effort at comparing all those languages and libraries, which is painstaking work. This comes with the usual caveats of benchmarks: the languages the author is most familiar with are likely to be ranked more highly.

For this particular use case, even if you cook up the perfect Julia implementation, by pure scaling it’s going to rank somewhere near finite differences (C++), which is not competitive. It would be interesting to see some Julia reverse-mode tools in that plot (with the objective written both in “vector” and in “loop” style, the latter presumably carrying more of an overhead).
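For instance, with ReverseDiff the usual pattern is to record and compile a tape once, then replay it per input (a sketch with a stand-in objective; this assumes the objective has no value-dependent branches, which a compiled tape can’t follow):

```julia
using ReverseDiff

f(x) = sum(abs2, x)   # stand-in objective
x = rand(1000)
out = similar(x)

# Record the operations once into a tape and compile it; each replay
# then costs a small constant multiple of one objective evaluation,
# independent of the input dimension.
tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, x))
ReverseDiff.gradient!(out, tape, x)
```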


#9

I agree with that.

I’m not sure I agree with that. I took their code:

using BenchmarkTools

function logsumexp(x)
  mx = maximum(x)
  log(sum(exp.(x .- mx))) + mx
end


function f()
  x = rand(100,100)
  n = 100; k = 100
  Qs = rand(100)
  main_term = Vector{Float64}(undef, 100)
  means = rand(100,100)
  slse = 0.0
  alphas = 0.0
  sum_qs = 0.0

  @inbounds for ix=1:n
    for ik=1:k
      main_term[ik] = -0.5*sum(abs2, Qs[ik] * (x[:,ix] - means[:,ik]))
    end
    slse += logsumexp(alphas .+ sum_qs .+ main_term)
  end
  slse
end

@btime f()

4.644 ms (60100 allocations: 34.57 MiB)

and 5-10 minutes later:

function logsumexp(x)
  mx = maximum(x)
  @. x = exp(x - mx)
  log(sum(x)) + mx
end

function h(x,Qs,tmp,main_term,means)

  n = 100; k= 100

  slse = 0.0
  alphas = 0.0
  sum_qs = 0.0

  @inbounds for ix=1:n
    for ik=1:k
      for j in 1:100
        tmp[j] = Qs[ik] * (x[j,ix] - means[j,ik])
      end
      main_term[ik] = -0.5*sum(abs2,tmp)
    end
    for i in eachindex(main_term)
      tmp[i] = alphas + sum_qs + main_term[i]
    end
    slse += logsumexp(tmp)
  end
  slse
end

x = rand(100,100)
Qs = rand(100)
tmp = zeros(100)
main_term = Vector{Float64}(undef, 100)
means = rand(100,100)

@btime h($x,$Qs,$tmp,$main_term,$means)

855.973 μs (0 allocations: 0 bytes)

That’s a 5x difference just sitting there (and yes, that first one is not what they called the vectorized code either). Then there’s the `GradientConfig` caching in ForwardDiff, which we know from package-writing experience is easily another 5x or so (of course, this all depends on vector sizes). So it should be sitting right on that C++ line. I’d question everything by about an order of magnitude or so.

The problem is that I’m pretty sure this completely changes the interpretation. If these were all done efficiently, then for large enough vectors/matrices, similar algorithms would line up in similar spots with very little difference due to the actual language details (simply because vectorization is fine once arrays get large enough and you manage the temporaries well). That interpretation falls within their >1 order of magnitude error bars, and you can see it in how the languages they know all line up right next to each other, with the outliers being F#, Python, and Julia.


#10

Perhaps someone could consider making a PR to the benchmark repo?

That said, the general problem with these kinds of benchmarks is that they tend to become obsolete very quickly. Hopefully Julia will get even better AD in the medium term.


#11

Thanks for posting! I had attended this conference, but think I missed this talk.

This is actually decent, IMO. As a forward-mode AD tool, ForwardDiff’s target domain is (very) low-dimensional gradients, and Jacobians where the input dimension is at most roughly equal to the output dimension. For those cases, these benchmarks seem to indicate that ForwardDiff performs comparably to C/C++ tools.
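A map from few inputs to many outputs is squarely in that domain: the whole Jacobian comes out of very few dual-number passes of the function. A small sketch (the map `g` here is made up for illustration):

```julia
using ForwardDiff

# R^2 -> R^100: two inputs, many outputs -- forward mode's sweet spot.
ts = range(0, 1, length=100)
g(p) = [p[1] * sin(p[2] * t) for t in ts]

J = ForwardDiff.jacobian(g, [2.0, 3.0])
size(J)  # (100, 2): one column per input dimension
```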

The gradient scaling with dimension is obviously going to be bad for forward-mode tools, so I’m not sure what the point in benchmarking them all together was, but I guess it’s academically interesting to compare ForwardDiff’s scaling with Ceres’ on the same plot as reverse-mode tools.

It would’ve been interesting to see a plot of AD performance relative to the objective function run time alongside the absolute timings presented here.

As other folks have stated, the ForwardDiff benchmark implementation does leave some performance on the table. It seems feasible that ForwardDiff could beat Ceres with a little elbow grease :)

The real goal, though, is to get Capstan on that chart…


#12

How’s Capstan doing?


#13

Ignoring the very important fact that Cassette itself still requires a bit of Base work, I’d say that Capstan is a working prototype - all the fundamental mechanisms are in place (live variable storage, tape storage, interception mechanism, etc.), but things like the user-facing API and the reverse-mode primitive catalog only barely exist. Such items should be relatively easy to implement later, though, since their design is mostly complete at this point. I’d say the bulk of the (non-Base/non-Cassette) work from here will be implementing various performance optimizations, API utilities, and tests/examples/docs.

Capstan’s simple neural net example works if you can find the magical Base + Cassette commits on which things currently run.

As far as a timeline goes, the (very) tentative idea is that Capstan will have an initial release around the same time as Cassette, and Cassette will have a release once Base can support it, and Base’s support for Cassette is on the v1.1 milestone, so…stay tuned!


#14

Looking forward to using Capstan! In the meantime, thanks for your maintenance work on the other AD packages, I could not imagine doing without them.