Does anyone know if these benchmarks are representative of the current state of ForwardDiff? Or are these timings potentially due to the Julia implementation written by the student who did these comparisons?

They show that there is potential for improvements in running time of several orders of magnitude for ForwardDiff compared to, for example, ADiMat in MATLAB. I've included a plot from the linked slides:

The timing for the objective (no AD) in julia is one to two orders of magnitude larger than the C version, and is slower than the julia vectorized version, so I'd be suspicious of the timings for AD. The different scaling (N^2 vs N) seems consistent with forward vs reverse AD, though.

Welcome @anriseth. Unfortunately you can't cancel those posts in the new queue, but fortunately you won't hit them anymore! I approved both these posts since there was other info in that first one, too… but now you have some more privileges (including the ability to edit). Let me know if you have any questions.

That doesn't use the in-place gradient! and it doesn't cache the GradientConfig. And what did they write for the objective function?

```julia
@inbounds function gmm_objective(alphas, means, icf, x, wishart::Wishart)
    d = size(x, 1)
    n = size(x, 2)
    CONSTANT = -n * d * 0.5 * log(2 * pi)
    sum_qs = sum(icf[1:d, :], 1)
    slse = sum(sum_qs)
    Qs = [get_Q(d, icf[:, ik]) for ik in 1:k]
    main_term = zeros(eltype(alphas), 1, k)
    slse = 0.
    for ix = 1:n
        for ik = 1:k
            main_term[ik] = -0.5 * sumabs2(Qs[ik] * (x[:, ix] - means[:, ik]))
        end
        slse += logsumexp(alphas + sum_qs + main_term)
    end
    CONSTANT + slse - n * logsumexp(alphas) + log_wishart_prior(wishart, sum_qs, Qs, icf)
end
```

(It's funny that they added @inbounds to make it look like they tried…).
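For reference, here's a minimal sketch of the in-place, config-cached ForwardDiff call pattern being referred to (the objective here is a placeholder, not the actual GMM code):

```julia
using ForwardDiff

# Placeholder objective standing in for gmm_objective
f(x) = sum(abs2, x)

x = rand(100)
out = similar(x)

# Build the GradientConfig once, up front...
cfg = ForwardDiff.GradientConfig(f, x)

# ...then reuse it with the in-place gradient! so the dual-number
# workspace and the result vector aren't re-allocated on every call.
ForwardDiff.gradient!(out, f, x, cfg)
```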

It looks like they don't make the C++ code allocate vectors at every step (someone can correct me if I'm wrong there), whereas ForwardDiff is allocating its workspace and the result each time, allocating temporaries in each vectorized statement, not fusing broadcasts in the innermost loop (though this looks like it predates broadcast fusion, so it would have needed to be unrolled into loops), etc. This is far from efficient code, and the issues are so clear that nobody would realistically make these mistakes in high-performance code… so it's not representative of a benchmark on real-world use.

I wouldn't be so severe: it looks like they did make a serious effort at comparing all those languages and libraries, which is painstaking work. This comes with the usual caveat of benchmarks: the languages the author is most familiar with are likely to be ranked more highly.

For this particular use case, even if you cook up the perfect julia implementation, by pure scaling it's going to rank somewhere near finite differences (C++), not competitive. It would be interesting to see some julia reverse-mode AD tools in that plot (with the objective written both in "vector" and in "loop" style, the latter presumably carrying more of an overhead).

I'm not sure I agree with that. I took their code:

```julia
using BenchmarkTools

function logsumexp(x)
    mx = maximum(x)
    log(sum(exp.(x .- mx))) + mx
end

function f()
    x = rand(100, 100)
    n = 100; k = 100
    Qs = [rand(100) 1:100]
    main_term = Vector{Float64}(100)
    means = rand(100, 100)
    slse = 0.0
    alphas = 0.0
    sum_qs = 0.0
    @inbounds for ix = 1:n
        for ik = 1:k
            main_term[ik] = -0.5 * sum(abs2, Qs[ik] * (x[:, ix] - means[:, ik]))
        end
        slse += logsumexp(alphas + sum_qs + main_term)
    end
end

@btime f()
4.644 ms (60100 allocations: 34.57 MiB)
```

and 5-10 minutes later:

```julia
function logsumexp(x)
    mx = maximum(x)
    @. x = exp(x - mx)
    log(sum(x)) + mx
end

function h(x, Qs, tmp, main_term, means)
    n = 100; k = 100
    slse = 0.0
    alphas = 0.0
    sum_qs = 0.0
    @inbounds for ix = 1:n
        for ik = 1:k
            for j in 1:100
                tmp[j] = Qs[ik] * (x[j, ix] - means[j, ik])
            end
            main_term[ik] = -0.5 * sum(abs2, tmp)
        end
        for i in eachindex(main_term)
            tmp[i] = alphas + sum_qs + main_term[i]
        end
        slse += logsumexp(tmp)
    end
end

x = rand(100, 100)
Qs = rand(100)
tmp = zeros(100)
main_term = Vector{Float64}(100)
means = rand(100, 100)
@btime h(x, Qs, tmp, main_term, means)
855.973 μs (0 allocations: 0 bytes)
```

That's a 5x difference just sitting there (and yes, that first one is not what they called the vectorized code either). Then there's the gradient-config caching in ForwardDiff, which we know from package writing is easily another 5x or so (of course, this all depends on vector sizes). So it should be sitting right on that C++ line. I'd question everything by about an order of magnitude or so.
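As a rough illustration of the caching point (a sketch with a toy objective, not the benchmark's actual code), compare the allocating call against the cached, in-place one:

```julia
using ForwardDiff, BenchmarkTools

g(x) = sum(abs2, x)  # toy objective, stands in for the real one

x = rand(100)
out = similar(x)
cfg = ForwardDiff.GradientConfig(g, x)

# Builds a fresh config (dual workspace) and result vector every call
@btime ForwardDiff.gradient($g, $x)

# Reuses the cached config and writes into a preallocated output
@btime ForwardDiff.gradient!($out, $g, $x, $cfg)
```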

The problem with that is that I'm pretty sure it completely changes the interpretation. I think that if these were all done efficiently, then for large enough vectors/matrices similar algorithms would line up in similar spots with very little underlying difference due to the actual language details (simply because vectorization is fine once you get large enough, provided you manage the temporaries well). That interpretation falls within their >1-order-of-magnitude error bars, and you can see it in how the languages they know all line up right next to each other, with the outliers being F#, Python, and Julia.

Perhaps someone could consider making a PR to the benchmark repo?

That said, the general problem with these kinds of benchmarks is that they tend to become obsolete very quickly. Hopefully Julia will get even better AD in the medium term.

Thanks for posting! I had attended this conference, but I think I missed this talk.

This is actually decent, IMO. As a forward-mode AD tool, ForwardDiff's target domain is (very) low-dimensional gradients, and Jacobians where the input dimension is close to or smaller than the output dimension. For those cases, these benchmarks seem to indicate that ForwardDiff performs comparably to C/C++ tools.

The gradient's scaling with dimension is obviously going to be bad for forward-mode tools, so I'm not sure what the point of benchmarking them all together was, but I guess it's academically interesting to compare ForwardDiff's scaling with Ceres' on the same plot as reverse-mode tools.

It would've been interesting to see a plot of AD performance relative to the objective function's run time alongside the absolute timings presented here.

As other folks have stated, the ForwardDiff benchmark implementation does leave some performance on the table. It seems feasible that ForwardDiff could beat Ceres with a little elbow grease.

The real goal, though, is to get Capstan on that chart…

Ignoring the very important fact that Cassette itself still requires a bit of Base work, I'd say that Capstan is a working prototype: all the fundamental mechanisms are in place (live variable storage, tape storage, the interception mechanism, etc.), but things like the user-facing API and the reverse-mode primitive catalog only barely exist. Such items should be relatively easy to implement later, though, since their design is mostly complete at this point. I'd say the bulk of the (non-Base/non-Cassette) work from here will be implementing various performance optimizations, API utilities, and tests/examples/docs.

Capstan's simple neural net example works if you can find the magical Base + Cassette commits on which things currently run.

As far as a timeline goes, the (very) tentative idea is that Capstan will have an initial release around the same time as Cassette, Cassette will have a release once Base can support it, and Base's support for Cassette is on the v1.1 milestone, so… stay tuned!