Comparison of automatic differentiation tools from 2016 still accurate?


I can't speak to the performance issue at the moment, but why are you defining all these different functions? Just define

myfunc(x) = 2 * (x + 1) + 3x^2

and run the autodifferentiation on that.

Edit: Try to run this:

using ForwardDiff, BenchmarkTools  # for ForwardDiff.derivative and @btime

foo(x) = 2 * (x + 1) + 3x^2
fdiff(f, x, d=0.001) = (f(x + d) - f(x)) / d

v = 5.0
@btime fdiff(foo, $v)
@btime dualpart(foo($v + ε))  # v + ε is equivalent to Dual(v, 1)
@btime ForwardDiff.derivative(foo, $v)
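
The dualpart and ε used above presumably come from a hand-rolled dual type that isn't shown in the thread; a minimal stand-in that makes the snippet runnable could look like this (the names are illustrative, not taken from any package):

struct Dual
    re::Float64   # value part
    du::Float64   # ε (derivative) part
end

# Only the operations needed by foo above.
Base.:+(a::Dual, b::Dual) = Dual(a.re + b.re, a.du + b.du)
Base.:+(a::Dual, b::Real) = Dual(a.re + b, a.du)
Base.:+(a::Real, b::Dual) = b + a
Base.:*(a::Real, b::Dual) = Dual(a * b.re, a * b.du)
Base.:^(a::Dual, n::Integer) = Dual(a.re^n, n * a.re^(n - 1) * a.du)  # power rule

const ε = Dual(0.0, 1.0)
dualpart(d::Dual) = d.du

dualpart(foo(5.0 + ε))  # 32.0, the exact derivative of foo at 5.0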

I'm having some problems with ForwardDiff on version 0.7, but I expect some significant speedups on 0.7/1.0 when it comes to compiling away wrapper types, etc.

In this very simple case, though, I doubt that you can expect speedups when using ForwardDiff over finite differences. I'm not sure exactly in which situations that should even happen. The great advantage of autodiff is accuracy with almost finite-difference performance.
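
To make the accuracy point concrete, a quick sketch (the exact derivative of foo is 6x + 2, i.e. 32 at x = 5):

using ForwardDiff

foo(x) = 2 * (x + 1) + 3x^2

exact = 6 * 5.0 + 2                          # 32.0
fd    = (foo(5.0 + 1e-3) - foo(5.0)) / 1e-3  # ≈ 32.003, error on the order of the step size
ad    = ForwardDiff.derivative(foo, 5.0)     # 32.0, exact to machine precision

abs(fd - exact), abs(ad - exact)             # (≈ 3.0e-3, 0.0)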


Actually, all the talks at JuliaCon were livestreamed this year! So the videos were available immediately. Everything is already on YouTube :smiley:

And they used screen capture instead of shaky-cam for the slides, so the whole experience was much improved over previous JuliaCons.


defining all the different functions is just for clarity …

the point is: doing maths with dual numbers at least doubles the operations compared to plain floats, e.g.
3.0 * (x + ε) involves two operations: 3.0 * x, and 3.0 * ε.

And for more complicated operations the cost could be even higher; consider:
(x + ε) * (x + ε) becomes the operations x * x, x * ε, x * ε, ε * ε, plus the additions … i.e. 5 operations compared to the 1 operation in x * x

so, for almost all maths operations on dual numbers, the time cost is more than doubled.

now, for finite differences, the time cost is simply two evaluations of f (one at x, one at x + delta) plus a subtraction and a division … i.e. basically a constant 2 times the cost of f(x).

so I would say that AD via dual numbers is almost certainly much slower than finite differences …

a solution I could imagine is a "symbolic dual number". For example, for the function
myfunc(x) = 2 * (x + 1) + 3x^2

if we could symbolically evaluate the dual part of it:
dualpart(2.0 * (x + ε) + 3.0 * (x + ε) ^ 2)
= dualpart(2.0x + 2.0ε + 3.0 * (x^2 + 2xε + ε*ε))
= dualpart(2.0x + 2.0ε + 3.0(x^2) + (6.0 * x)ε + 0.0)
= dualpart(3.0(x^2) + 2.0x + (2.0 + (6.0 * x))ε)
= 2.0 + (6.0 * x)

and let g(x) = 2.0 + (6.0 * x)

then we could evaluate g(x) for different values of x, e.g. g(5.0), g(5.1), etc., very quickly
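
In code, the point is just that the derivative function is generated once and is then an ordinary, cheap function:

g(x) = 2.0 + 6.0 * x   # hand-derived derivative of myfunc, produced once

g(5.0)   # 32.0
g(5.1)   # ≈ 32.6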

OK, but I think it instead ended up a lot less clear; it was very confusing to read through at first. In general, I recommend that you use type annotations very carefully, and only specify types when you really need them. Generic code is considered idiomatic in Julia, and is normally clearer, shorter and more powerful.

I'm not prepared to go into a discussion on the relative performance of dual numbers vs finite differences. I don't know enough about it, but I don't really expect dual numbers to be faster in the simplest case.

I just want to note that I don't think (x + ε) * (x + ε) expands to that many operations: ε*ε is not calculated, for example, since it is already known to be zero. You probably just do x*x, x*ε and ε*x. Here's a quick benchmark that may not be too useful:

v = 5.0
d = dual(v, 1)
0.6.3> @btime $v * $v
  1.457 ns (0 allocations: 0 bytes)
0.6.3> @btime $d * $d
  1.745 ns (0 allocations: 0 bytes)
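
For reference, the multiplication rule a typical dual-number implementation uses drops the ε·ε term at definition time, so it is never computed. A generic sketch (the type and its name are illustrative):

# (a + bε)(c + dε) = ac + (ad + bc)ε, since ε² = 0 by definition:
# three multiplications and one addition, not five operations.
struct MyDual{T<:Real}
    val::T   # real part
    der::T   # ε part
end

Base.:*(x::MyDual, y::MyDual) =
    MyDual(x.val * y.val, x.val * y.der + x.der * y.val)

MyDual(5.0, 1.0) * MyDual(5.0, 1.0)   # MyDual{Float64}(25.0, 10.0)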

BTW, the big reason your ForwardDiff code is so slow is that you wrap your value inside a vector and then access that, instead of just using scalar math. That has a gigantic performance cost.


do you see any hope that we could do something like:

gradfunc = ForwardDiff.symbolicgradient(f)

i.e. generates a function rather than the numeric gradient at a specific x?

the generation of gradfunc may be costly (as I imagine it would use something like a "symbolic dual number"), but that cost is paid only once. Afterwards, we could evaluate it very quickly.

This works for a simple function, but quickly becomes enormously difficult. Imagine doing this while combining the effects of several complicated functions with loops, if-statements, special type constructs and so on!

I think this may be related to the source-transformation approach to autodiff. You could try to google that.
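
As an aside: if the goal is just a reusable derivative function, rather than a symbolic expression, a closure over ForwardDiff already gives you that; it gets compiled once per input type and is cheap to call afterwards. A sketch:

using ForwardDiff

make_derivative(f) = x -> ForwardDiff.derivative(f, x)  # not symbolic, but reusable

myfunc(x) = 2 * (x + 1) + 3x^2
dmyfunc = make_derivative(myfunc)

dmyfunc(5.0)   # 32.0
dmyfunc(5.1)   # ≈ 32.6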

could you show me how to use ForwardDiff.gradient() passing a function that accepts a scalar as argument?

I got the following error if f() accepts a scalar rather than a vector:

julia> function forwarddifffunc(x)
           2.0 * (x + 1.0) + 3.0 * (x * x1)
       end
forwarddifffunc (generic function with 1 method)

julia> ForwardDiff.gradient(forwarddifffunc, 5.0)
ERROR: MethodError: no method matching gradient(::#forwarddifffunc, ::Float64)
You may have intended to import Base.gradient
Closest candidates are:
  gradient(::Any, ::StaticArrays.SArray) at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/ForwardDiff/src/gradient.jl:42
  gradient(::Any, ::StaticArrays.SArray, ::ForwardDiff.GradientConfig) at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/ForwardDiff/src/gradient.jl:43
  gradient(::Any, ::StaticArrays.SArray, ::ForwardDiff.GradientConfig, ::Val) at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/ForwardDiff/src/gradient.jl:44
  ...
Stacktrace:
 [1] macro expansion at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/Atom/src/repl.jl:118 [inlined]
 [2] anonymous at ./<missing>:?

This is what you're looking for, still in beta though:
https://github.com/FluxML/Zygote.jl
With the JuliaCon video at Flux: The Elegant Machine Learning Library | Mike Innes - YouTube


wow amazing, thx !

If you want symbolic derivatives, there's no need to involve dual numbers at all:

using XGrad
xdiff(:(2 * (x + 1) + 3x^2); x=1.0)

which gives

quote
    tmp372 = 3
    tmp368 = 2
    tmp369 = 1
    tmp370 = x + tmp369
    tmp371 = tmp368 * tmp370
    tmp374 = x ^ tmp368
    tmp375 = tmp372 * tmp374
    tmp376 = tmp371 + tmp375
    dtmp376!dtmp371 = 1.0
    dtmp376!dx__2 = tmp368 * dtmp376!dtmp371
    dtmp376!dtmp374 = tmp372 * dtmp376!dtmp371
    tmp380 = tmp368 - tmp369
    tmp381 = x ^ tmp380
    dtmp376!dx__1 = tmp368 * tmp381 * dtmp376!dtmp374
    dtmp376!dx = dtmp376!dx__1 .+ dtmp376!dx__2
    tmp385 = (tmp376, dtmp376!dx)
end

The code is optimized for tensors, so it may look a bit ugly, but it should give you pretty good performance.
(I think Calculus.jl used to give beautiful expressions for scalar derivatives, but I can't see it in their README anymore.)

The problem with the symbolic approach is that it works really badly with things like conditionals, loops, recursion and, more generally, dynamic computation graphs (e.g. XGrad requires the code to be a fixed list of correct algebraic expressions). It's fine in most cases, but if you encounter a task with a dynamic graph, even if the graph is fixed for a specific execution, you have to fall back to (more) pure AD.

In (more) pure AD you have a choice of forward-mode AD, which relies on dual numbers and works well for functions R → R^n, and reverse-mode AD, which is usually implemented using some kind of tracked arrays or, in the case of future Cassette-based solutions, function call interception, and which is designed for R^n → R problems.
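
A minimal illustration of the two modes on an R^n → R function (assuming ForwardDiff.jl and ReverseDiff.jl are installed):

using ForwardDiff, ReverseDiff

f(x) = sum(abs2, x) / 2          # gradient is simply x itself

x = [1.0, 2.0, 3.0]
ForwardDiff.gradient(f, x)       # [1.0, 2.0, 3.0], dual-number sweeps over the inputs
ReverseDiff.gradient(f, x)       # [1.0, 2.0, 3.0], one recorded forward pass + one backward pass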

Note that reverse-mode AD may also be fully dynamic and construct the graph on each run (I think AutoGrad.jl does this), or statically compile derivatives for performance after the first run, making it closer to the symbolic approach (ReverseDiff.jl and Capstan.jl do this). (I'm currently exploring the possibility of combining and comparing both approaches in a single package, Yota.jl.)

As a bottom line: dual numbers aren't the only approach to AD :slight_smile:


in this simple example, the direct gradient via dual-number maths is 2.34x more time-costly than finite differences.
Using ForwardDiff AD is even worse: it's 4271.75x !!!

This is a fairly ridiculous comparison: you're comparing apples to oranges here by pitting a very limited hand-rolled scalar AD against an interface that's documented to allocate because it's expected to be used with far more substantial inputs. Furthermore, you changed the target function between differentiation implementations.

Here, why not use this for comparison (also remember to interpolate variables correctly when using BenchmarkTools):

julia> myfunc(x) = 2.0 * (x + 1.0) + 3.0 * (x * x)
myfunc (generic function with 1 method)

julia> @btime ForwardDiff.derivative(myfunc, $(5.0))
  1.525 ns (0 allocations: 0 bytes)
32.0

And even if you did want to change the target function to forwarddifffunc:

julia> using StaticArrays

julia> @btime ForwardDiff.gradient(forwarddifffunc, $(SVector(5.0)))
  1.530 ns (0 allocations: 0 bytes)
1-element SArray{Tuple{1},Float64,1,1}:
 32.0

Also note that, for scalar differentiation, AD is generally faster than finite differences (regardless of the AD implementation, as long as it's decent and in a language that supports fast scalar ops).


Use ForwardDiff.derivative, not gradient.

There's also a mistake in your function: x1 is undefined.

Finally, I suggest using integer literals, since they will help you preserve input types better:

foo(x) = 2 * (x + 1) + 3x^2  # use x^2 not x * x, it's more readable.

yeah! a lot of good stuff!

anyway, can anyone suggest a good website/PDF/doc that reviews the various AD technologies?

You can check out these links:
http://www.juliadiff.org/
http://www.autodiff.org/


Numerical Optimization by Nocedal and Wright has a very nice intro chapter on forward, reverse, and mixed, and why you would use one or the other.

@book{nocedal2006numerical,
  title={Numerical Optimization},
  edition={2},
  author={Nocedal, Jorge and Wright, Stephen J.},
  year={2006},
  publisher={Springer}
}


thanks!


This turns out to be fixable, actually. The symbolic approach effectively operates on a single block of SSA-form IR, and generalising that IR to contain control flow is not too difficult. I'll put up more of a writeup on this at some point.

generalising that IR to contain control flow is not too difficult.

Is it? I'm curious how you would preserve looping and SSA, e.g. in:

function foo(x)
    for i=1:100 
        x = x + 1
    end
    return x
end

You either break SSA form for x, unroll the loop, or make a subgraph in which x is only assigned once. Breaking SSA form essentially makes the function non-differentiable, AFAICT. Unrolling the loop only works when the number of iterations is constant, which isn't true in the general case.

The only meaningful option I found is making a subgraph, but that brings another set of questions: how to incorporate it into the main graph, how to map variables inside the subgraph to variables before and after the loop, what the derivative of a loop/subgraph should look like, etc.

All of these made me switch to a much simpler approach and just support separately dynamic graphs with possible performance penalty and static graphs with all the optimizations, but it will be cool if you derive correct derivatives for general-purpose loops and conditions.