Comparison of automatic differentiation tools from 2016 still accurate?

Actually, I was just bad at Googling (how do I cancel a post that is “pending”?).

There is a GitHub repo with all the code.

Welcome @anriseth. Unfortunately you can’t cancel those posts in the new queue, but fortunately you won’t hit them anymore! I approved both of these posts since there was other info in that first one, too… but now you have some more privileges (including the ability to edit). Let me know if you have any questions.

1 Like

They timed code that you should never write.

https://github.com/awf/autodiff/blob/b5db1d62c2c320978b23d01a795f46c3b501fac7/Julia/Tests/gmm_F.jl#L69-L77

That doesn’t use the in-place gradient! function, and it doesn’t cache the GradientConfig. And what did they write for the objective function?

https://github.com/awf/autodiff/blob/b5db1d62c2c320978b23d01a795f46c3b501fac7/Julia/Tests/gmm_F.jl#L20-L28

@inbounds function gmm_objective(alphas,means,icf,x,wishart::Wishart)
  d = size(x,1)
  n = size(x,2)
  k = size(means,2)
  CONSTANT = -n*d*0.5*log(2 * pi)

  sum_qs = sum(icf[1:d,:],1)
  slse = sum(sum_qs)
  Qs = [get_Q(d,icf[:,ik]) for ik in 1:k]
  main_term = zeros(eltype(alphas),1,k)

  slse = 0.
  for ix=1:n
    for ik=1:k
      main_term[ik] = -0.5*sumabs2(Qs[ik] * (x[:,ix] - means[:,ik]))
    end
    slse += logsumexp(alphas + sum_qs + main_term)
  end

  CONSTANT + slse - n*logsumexp(alphas) + log_wishart_prior(wishart, sum_qs, Qs, icf)
end

(It’s funny that they added @inbounds to make it look like they tried…).

It looks like they don’t make the C++ codes allocate vectors at every step (someone can correct me if I’m wrong there), whereas the ForwardDiff version allocates its workspace and the result on every call, allocates temporaries in each vectorized statement, and doesn’t fuse broadcasts in the innermost loop (though this looks like it predates broadcast fusion, so it would’ve needed to be unrolled into loops). This is far from efficient code, and the issues are clear enough that nobody would realistically make these mistakes in high-performance code… so it’s not representative as a benchmark of real use.
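For reference, here is a minimal sketch of the pattern being described, using a stand-in objective (not the benchmark’s GMM code): build the GradientConfig once and reuse a preallocated result with the in-place gradient!.

using ForwardDiff

# Stand-in objective for illustration; the benchmark would close gmm_objective over its data.
obj(θ) = sum(abs2, θ)
θ = rand(100)

# Construct the config and the output buffer once, outside anything being timed.
cfg = ForwardDiff.GradientConfig(obj, θ)
g = similar(θ)

# Each gradient evaluation then reuses both; no per-call workspace or result allocation.
ForwardDiff.gradient!(g, obj, θ, cfg)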

6 Likes

I wouldn’t be so severe: it looks like they did try to make a serious effort at comparing all those languages and libraries, which is painstaking work. This comes with the usual caveats of benchmarks: the languages the author is most familiar with are likely to be ranked more highly.

For this particular use case, even if you cook up the perfect Julia implementation, by pure scaling it’s going to rank somewhere near finite differences (C++), i.e. not competitive. It would be interesting to see some Julia reverse-mode tools in that plot (with the objective written both in “vector” and in “loop” style, the latter presumably carrying more overhead).
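As a rough sketch (assuming ReverseDiff.jl and a stand-in objective, not actual benchmark code), a reverse-mode entry could record and compile a tape once and reuse it, which works here because the GMM objective’s control flow doesn’t depend on the parameter values:

import ReverseDiff

# Stand-in objective for illustration; a real entry would tape the GMM objective
# over its flattened parameter vector.
obj(θ) = sum(abs2, θ)
θ = rand(1000)

# Record the operations once, compile the tape, then reuse it for every gradient call.
tape  = ReverseDiff.GradientTape(obj, θ)
ctape = ReverseDiff.compile(tape)
g = similar(θ)
ReverseDiff.gradient!(g, ctape, θ)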

1 Like

I agree with that.

I’m not sure I agree with that. I took their code:

using BenchmarkTools

function logsumexp(x)
  mx = maximum(x)
  log(sum(exp.(x .- mx))) + mx
end


function f()
  x = rand(100,100)
  n = 100; k = 100
  Qs = rand(100)
  main_term = Vector{Float64}(undef, 100)
  means = rand(100,100)
  slse = 0.0
  alphas = 0.0
  sum_qs = 0.0

  @inbounds for ix=1:n
    for ik=1:k
      main_term[ik] = -0.5*sum(abs2,Qs[ik] * (x[:,ix] - means[:,ik]))
    end
    slse += logsumexp(alphas .+ sum_qs .+ main_term)
  end
end

@btime f()

4.644 ms (60100 allocations: 34.57 MiB)

and 5-10 minutes later:

function logsumexp(x)
  mx = maximum(x)
  @. x = exp(x - mx)
  log(sum(x)) + mx
end

function h(x,Qs,tmp,main_term,means)

  n = 100; k = 100

  slse = 0.0
  alphas = 0.0
  sum_qs = 0.0

  @inbounds for ix=1:n
    for ik=1:k
      for j in 1:100
        tmp[j] = Qs[ik] * (x[j,ix] - means[j,ik])
      end
      main_term[ik] = -0.5*sum(abs2,tmp)
    end
    for i in eachindex(main_term)
      tmp[i] = alphas + sum_qs + main_term[i]
    end
    slse += logsumexp(tmp)
  end
end

x = rand(100,100)
Qs = rand(100)
tmp = zeros(100)
main_term = Vector{Float64}(undef, 100)
means = rand(100,100)

@btime h(x,Qs,tmp,main_term,means)

855.973 μs (0 allocations: 0 bytes)

That’s a 5x difference just sitting there (and yes, that first one is not what they called the vectorized code either). Then there’s caching the GradientConfig in ForwardDiff, which we know from package writing is easily worth another 5x or so (of course, this all depends on vector sizes). So it should be sitting right on that C++ line. I’d question everything in that plot by about an order of magnitude.

The problem with that is that I’m pretty sure it completely changes the interpretation. I think that if these were all done efficiently, then for large enough vectors/matrices similar algorithms would line up in similar spots, with very little underlying difference due to the actual language details (simply because vectorization is fine once you get large enough and manage the temporaries well). That interpretation falls within their >1 order of magnitude error bars, and you can see it in how the languages they know all line up right next to each other, with the outliers being F#, Python, and Julia.

2 Likes

Perhaps someone could consider making a PR to the benchmark repo?

That said, the general problem with these kinds of benchmarks is that they tend to become obsolete very quickly. Hopefully Julia will get even better AD in the medium term.

1 Like

Thanks for posting! I had attended this conference, but think I missed this talk.

This is actually decent, IMO. As a forward-mode AD tool, ForwardDiff’s target domain is (very) low-dimensional gradients, and Jacobians where the input dimension is at most roughly equal to the output dimension. For those cases, these benchmarks seem to indicate that ForwardDiff performs comparably to C/C++ tools.

The gradient’s scaling with dimension is obviously going to be bad for forward-mode tools, so I’m not sure what the point of benchmarking them all together was, but I guess it’s academically interesting to compare ForwardDiff’s scaling with Ceres’ on the same plot as the reverse-mode tools.

It would’ve been interesting to see a plot of AD performance relative to the objective function run time alongside the absolute timings presented here.

As other folks have stated, the ForwardDiff benchmark implementation does leave some performance on the table. Seems feasible that ForwardDiff could beat Ceres with a little elbow-grease :slight_smile:

The real goal, though, is to get Capstan on that chart…

2 Likes

How’s Capstan doing?

Ignoring the very important fact that Cassette itself still requires a bit of Base work, I’d say that Capstan is a working prototype - all the fundamental mechanisms are in place (live variable storage, tape storage, interception mechanism, etc.), but things like the user-facing API and the reverse-mode primitive catalog only barely exist. Such items should be relatively easy to implement later, though, since their design is mostly complete at this point. I’d say the bulk of the (non-Base/non-Cassette) work from here will be implementing various performance optimizations, API utilities, and tests/examples/docs.

Capstan’s simple neural net example works if you can find the magical Base + Cassette commits on which things currently run.

As far as a timeline goes, the (very) tentative idea is that Capstan will have an initial release around the same time as Cassette, and Cassette will have a release once Base can support it, and Base’s support for Cassette is on the v1.1 milestone, so…stay tuned!

10 Likes

Looking forward to using Capstan! In the meantime, thanks for your maintenance work on the other AD packages, I could not imagine doing without them.

4 Likes

Posting it here instead of opening a new topic: Bob Carpenter, one of the main architects of the AD framework of Stan, has an interesting writeup of their new reverse-mode AD strategy using continuations:

I wonder how it compares to existing and planned tools (AFAICT ReverseDiff.jl uses a tape). At first glance, it should be easy to implement and optimization-friendly in Julia.

1 Like

After going through some of the math of dual numbers, I realized that AD using dual numbers is SLOWER even than finite differences! (Of course, AD is more accurate.) Am I wrong? Does anyone know of extremely fast AD using dual numbers?

1 Like

It is hard to answer the question without a minimal working example.

1 Like

Number of parameters? Did you preallocate the configuration and diffresult?

using BenchmarkTools, Compat
import ForwardDiff
import DualNumbers

function float64func(x::Float64)::Float64
    2.0 * (x + 1.0) + 3.0 * (x * x)
end

function dualfunc(x::DualNumbers.Dual{Float64})::DualNumbers.Dual{Float64}
    2.0 * (x + 1.0) + 3.0 * (x * x)
end

function forwarddifffunc(x)
    x1 = x[1]
    2.0 * (x1 + 1.0) + 3.0 * (x1 * x1)
end

"""
julia> float64func(5.0)
87.0

julia> dualfunc(DualNumbers.dual(5.0, 1.0) )
87.0 + 32.0ɛ

julia> forwarddifffunc([5.0])
87.0
"""

function finitedifferencegrad(x::Float64)::Float64
    delta = 0.001

    (float64func(x + delta) - float64func(x) ) / delta
end

function dualgrad(x::Float64)::Float64
    DualNumbers.dualpart(dualfunc(DualNumbers.dual(x, 1.0) ) )
end

"""
julia> finitedifferencegrad(5.0)
32.00300000000311

julia> dualgrad(5.0)
32.0

julia> ForwardDiff.gradient(forwarddifffunc, [5.0])
1-element Array{Float64,1}:
 32.0

julia> @benchmark finitedifferencegrad(5.0)
BenchmarkTools.Trial:
 memory estimate:  0 bytes
 allocs estimate:  0
 --------------
 minimum time:     1.564 ns (0.00% GC)
 median time:      1.744 ns (0.00% GC)
 mean time:        1.678 ns (0.00% GC)
 maximum time:     18.642 ns (0.00% GC)
 --------------
 samples:          10000
 evals/sample:     1000

julia> @benchmark dualgrad(5.0)
BenchmarkTools.Trial:
 memory estimate:  0 bytes
 allocs estimate:  0
 --------------
 minimum time:     3.623 ns (0.00% GC)
 median time:      3.730 ns (0.00% GC)
 mean time:        3.933 ns (0.00% GC)
 maximum time:     22.680 ns (0.00% GC)
 --------------
 samples:          10000
 evals/sample:     1000

julia> @benchmark ForwardDiff.gradient(forwarddifffunc, [5.0])
BenchmarkTools.Trial:
 memory estimate:  352 bytes
 allocs estimate:  5
 --------------
 minimum time:     6.336 μs (0.00% GC)
 median time:      7.132 μs (0.00% GC)
 mean time:        7.168 μs (0.00% GC)
 maximum time:     27.631 μs (0.00% GC)
 --------------
 samples:          10000
 evals/sample:     5
"""

In this simple example, the direct gradient via dual-number math is 2.34x as costly as finite differences.
Using ForwardDiff for AD is even worse: 4271.75x!!!

Actually, the math with dual numbers works like this:
if x = 5.0 + 1ϵ,
2.0 * (x + 1.0) + 3.0 * (x * x) becomes:
2.0 * (5.0 + 1ϵ + 1.0) + 3.0 * ((5.0 + 1ϵ) * (5.0 + 1ϵ))
= 12.0 + 2.0ϵ + 3.0 * (25.0 + 10.0ϵ + ϵ²)
= 12.0 + 2.0ϵ + 3.0 * (25.0 + 10.0ϵ + 0.0) (since ϵ² is defined to be 0.0)
= 87.0 + 32.0ϵ

So we can easily see that it’s even more time-consuming than finite differences.

That’s really cool!
I’ve been wanting to experiment more with autodiff, so there’s an excuse.

Here’s a Julia translation of Bob Carpenter’s opening post:

import FunctionWrappers: FunctionWrapper

const adj_t = Vector{Float64}
const chain_t = FunctionWrapper{Cvoid,Tuple{adj_t}}
const stack_ = chain_t[]

const next_idx_ = Ref{Int}(0)
@inline next_idx() = next_idx_[] += 1

struct var{T,int}
    val::T
    idx::int
end
var(v::var) = var(v.val, v.idx)
var(val) = var(val, next_idx())
var() = var(NaN, next_idx())

@inline function Base.:+(x1::var, x2::var)
    y = var(x1.val + x2.val)
    let x1i = x1.idx, x2i = x2.idx, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x1i] += adj[yid]
            adj[x2i] += adj[yid]
            nothing
        end))
    end
    y
end
@inline function Base.:+(x1::var, x2)
    y = var(x1.val + x2)
    let x1i = x1.idx, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x1i] += adj[yid]
            nothing
        end))
    end
    y
end
@inline function Base.:+(x1, x2::var)
    y = var(x1 + x2.val)
    let x2i = x2.idx, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x2i] += adj[yid]
            nothing
        end))
    end
    y
end

@inline function Base.:*(x1::var, x2::var)
    y = var(x1.val * x2.val)
    let x1v = x1.val, x1i = x1.idx, x2v = x2.val, x2i = x2.idx, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x1i] += x2v * adj[yid]
            adj[x2i] += x1v * adj[yid]
            nothing
        end))
    end
    y
end
@inline function Base.:*(x1::var, x2)
    y = var(x1.val * x2)
    let x1i = x1.idx, x2v = x2, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x1i] += x2v * adj[yid]
            nothing
        end))
    end
    y
end
@inline function Base.:*(x1, x2::var)
    y = var(x1 * x2.val)
    let x1v = x1, x2i = x2.idx, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[x2i] += x1v * adj[yid]
            nothing
        end))
    end
    y
end

@inline function Base.exp(x::var)
    y = var(exp(x.val))
    let xid = x.idx, yval = y.val, yid = y.idx
        push!(stack_, chain_t(adj -> begin
            adj[xid] += yval * adj[yid]
            nothing
        end))
    end
    y
end

function grad(y::var, x::Vector{<:var})
    adj = zeros(y.idx)
    adj[y.idx] = 1
    for i ∈ length(stack_):-1:1
        stack_[i](adj)
    end
    dy_dx = Vector{Float64}(undef, length(x))
    for i ∈ eachindex(x)
        dy_dx[i] = adj[x[i].idx]
    end
    dy_dx
end

I wasn’t sure whether I should go with push! and looping over stack_ backwards, or insert!(stack_, 1, ...) and then a simple forward loop over stack_. I ended up going with the former.

This yields:

julia> resize!(stack_, 0);

julia> next_idx_[] = 0;

julia> x1 = var(10.3);

julia> x2 = var(-1.1);

julia> x = [x1, x2];

julia> y = x1 * x2 * 2 + 7
(x1.idx, x2.idx, y.idx) = (1, 2, 3)
(x1.idx, y.idx) = (3, 4)
(x1.idx, y.idx) = (4, 5)
var{Float64,Int64}(-15.660000000000004, 5)

julia> dy_dx = grad(y, x)
adj = [-2.2, 20.6, 2.0, 1.0, 1.0]
2-element Array{Float64,1}:
 -2.2
 20.6

I assume (like the C++ version) that this would be very slow. The functions won’t inline, vectorize, etc., so there will be a lot of overhead from repeated little calls.

Anyway, I saw that Cassette now has version 1.1 and documentation:
https://github.com/jrevels/Cassette.jl

So definitely looking forward to Capstan!
Any updates?
Also saw that Jarrett Revels is listed among the JuliaCon speakers.
I’m guessing we’ll see that talk on YouTube in the next few days?

1 Like

I can’t speak to the performance issue at the moment, but why are you defining all these different functions? Just define

myfunc(x) = 2 * (x + 1) + 3x^2

and run the autodifferentiation on that.

Edit: Try to run this:

using BenchmarkTools
using DualNumbers   # for dualpart and ε
import ForwardDiff

foo(x) = 2 * (x + 1) + 3x^2
fdiff(f, x, d=0.001) = (f(x + d) - f(x)) / d

v = 5.0
@btime fdiff(foo, $v)
@btime dualpart(foo($v + ε))  # v + ε is equivalent to Dual(v, 1)
@btime ForwardDiff.derivative(foo, $v)

I’m having some problems with ForwardDiff on version 0.7, but I expect some significant speedups on 0.7/1.0 when it comes to compiling away wrapper-types, etc.

In this very simple case, though, I doubt that you can expect speedups from ForwardDiff over finite differencing; I’m not sure in exactly what situations that should even happen. The great advantage of autodiff is accuracy at close to finite-difference cost.
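For instance, a quick illustrative check on the function above shows the accuracy side of that trade-off:

import ForwardDiff

foo(x) = 2 * (x + 1) + 3x^2
fdiff(f, x, d=0.001) = (f(x + d) - f(x)) / d

fdiff(foo, 5.0) - 32.0                   # ≈ 0.003: forward-difference truncation error
ForwardDiff.derivative(foo, 5.0) - 32.0  # 0.0: the dual-number result is exact here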

1 Like

Actually, all the talks at JuliaCon were livestreamed this year! So the videos were available immediately. Everything is already on YouTube :smiley:

And they used screen capture instead of shaky-cam for the slides, so the whole experience was much improved over previous JuliaCons.

1 Like

Defining all the different functions is just for clarity…

The point is: doing math with dual numbers at least doubles the number of operations compared to plain floats. E.g.
3.0 * (x + ε) has two operations: 3.0 * x and 3.0 * ε.

And for more complicated operations the cost can be even higher. Consider:
(x + ε) * (x + ε) becomes the operations x * x, x * ε, x * ε, ε * ε, plus the additions… i.e. 5 operations compared to the 1 operation in x * x.

So, for almost all math operations on dual numbers, the time cost is more than doubled.
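For concreteness, here is roughly what a minimal dual type does per primitive (an illustrative sketch, not the actual DualNumbers.jl or ForwardDiff implementation):

# A toy dual number a + b*ε with ε^2 = 0; the derivative rides along in `du`.
struct MyDual
    re::Float64   # value
    du::Float64   # derivative (dual) part
end

Base.:+(x::MyDual, y::MyDual) = MyDual(x.re + y.re, x.du + y.du)
Base.:+(x::MyDual, y::Real)   = MyDual(x.re + y, x.du)
Base.:*(x::Real, y::MyDual)   = MyDual(x * y.re, x * y.du)
# Product rule; the ε² term is simply never formed.
Base.:*(x::MyDual, y::MyDual) = MyDual(x.re * y.re, x.re * y.du + x.du * y.re)

x = MyDual(5.0, 1.0)
2.0 * (x + 1.0) + 3.0 * (x * x)   # MyDual(87.0, 32.0), matching the hand calculation above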

Now, for finite differences, the time cost is simply two evaluations of f, one at x + delta and one at x, plus a subtraction and a division… i.e. basically twice the cost of f(x).

So I would say that AD with dual numbers is almost certainly much slower than finite differences…

A solution I could imagine is a “symbolic dual number”. For example, for the function
myfunc(x) = 2 * (x + 1) + 3x^2

if we could symbolically evaluate the dual part of it:
dualpart(2.0 * (x + ε) + 3.0 * (x + ε) ^ 2)
= dualpart(2.0x + 2.0ε + 3.0 * (x^2 + 2xε + ε*ε) )
= dualpart(2.0x + 2.0ε + 3.0(x^2) + (6.0 * x)ε + 0.0)
= dualpart(3.0(x^2) + 2.0x + (2.0 + (6.0 * x) )ε)
= 2.0 + (6.0 * x)

and let g(x) = 2.0 + (6.0 * x)

then we could evaluate g(x) for different values of x, e.g. g(5.0), g(5.1), etc., very quickly
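As an illustrative check, using ForwardDiff.derivative on the scalar function (rather than gradient on a 1-element vector), the AD result already agrees with that symbolically extracted g:

import ForwardDiff

myfunc(x) = 2 * (x + 1) + 3x^2
g(x) = 2.0 + 6.0 * x   # the symbolically extracted dual part, i.e. the derivative

ForwardDiff.derivative(myfunc, 5.0)   # 32.0
g(5.0)                                # 32.0
g(5.1)                                # 32.6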