Comparison of automatic differentiation tools from 2016 still accurate?


I can't speak to the performance issue at the moment, but why are you defining all these different functions? Just define

myfunc(x) = 2 * (x + 1) + 3x^2

and run the autodifferentiation on that.

Edit: Try to run this:

using ForwardDiff, BenchmarkTools  # for ForwardDiff.derivative and @btime

foo(x) = 2 * (x + 1) + 3x^2
fdiff(f, x, d=0.001) = (f(x + d) - f(x)) / d

v = 5.0
@btime fdiff(foo, $v)
@btime dualpart(foo($v + ε))  # v + ε is equivalent to Dual(v, 1)
@btime ForwardDiff.derivative(foo, $v)
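
The dualpart and ε used above presumably come from a hand-rolled dual type that isn't shown in the thread; a minimal stand-in that makes the snippet runnable could look like this (the names are illustrative, not taken from any package):

struct Dual
    re::Float64   # value part
    du::Float64   # ε (derivative) part
end

# Only the operations needed by foo above.
Base.:+(a::Dual, b::Dual) = Dual(a.re + b.re, a.du + b.du)
Base.:+(a::Dual, b::Real) = Dual(a.re + b, a.du)
Base.:+(a::Real, b::Dual) = b + a
Base.:*(a::Real, b::Dual) = Dual(a * b.re, a * b.du)
Base.:^(a::Dual, n::Integer) = Dual(a.re^n, n * a.re^(n - 1) * a.du)  # power rule

const ε = Dual(0.0, 1.0)
dualpart(d::Dual) = d.du

dualpart(foo(5.0 + ε))  # 32.0, the exact derivative of foo at 5.0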

I'm having some problems with ForwardDiff on version 0.7, but I expect some significant speedups on 0.7/1.0 when it comes to compiling away wrapper types, etc.

In this very simple case, though, I doubt that you can expect speedups when using ForwardDiff over finite differences. I'm not sure exactly in which situations that should even happen. The great advantage of autodiff is accuracy with almost finite-difference performance.
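
To make the accuracy point concrete, a quick sketch (the exact derivative of foo is 6x + 2, i.e. 32 at x = 5):

using ForwardDiff

foo(x) = 2 * (x + 1) + 3x^2

exact = 6 * 5.0 + 2                          # 32.0
fd    = (foo(5.0 + 1e-3) - foo(5.0)) / 1e-3  # ≈ 32.003, error on the order of the step size
ad    = ForwardDiff.derivative(foo, 5.0)     # 32.0, exact to machine precision

abs(fd - exact), abs(ad - exact)             # (≈ 3.0e-3, 0.0)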


Actually, all the talks at JuliaCon were livestreamed this year! So the videos were available immediately. Everything is already on YouTube :smiley:

And they used screen capture instead of shaky-cam for the slides, so the whole experience was much improved over previous JuliaCons.


defining all the different functions is just for clarity …

the point is: doing maths with dual numbers at least doubles the operations compared to plain floats, e.g.
3.0 * (x + ε) involves two operations: 3.0 * x, and 3.0 * ε.

And for more complicated operations the cost could be even higher; consider:
(x + ε) * (x + ε) becomes the operations x * x, x * ε, x * ε, ε * ε, plus the additions … i.e. 5 operations compared to the 1 operation in x * x

so, for almost all maths operations on dual numbers, the time cost is more than doubled.

now, for finite differences, the time cost is simply two evaluations of f (one at x, one at x + delta) plus a subtraction and a division … i.e. basically a constant 2 times the cost of f(x).

so I would say that AD via dual numbers is almost certainly much slower than finite differences …

a solution I could imagine is a "symbolic dual number". For example, for the function
myfunc(x) = 2 * (x + 1) + 3x^2

if we could symbolically evaluate the dual part of it:
dualpart(2.0 * (x + ε) + 3.0 * (x + ε) ^ 2)
= dualpart(2.0x + 2.0ε + 3.0 * (x^2 + 2xε + ε*ε))
= dualpart(2.0x + 2.0ε + 3.0(x^2) + (6.0 * x)ε + 0.0)
= dualpart(3.0(x^2) + 2.0x + (2.0 + (6.0 * x))ε)
= 2.0 + (6.0 * x)

and let g(x) = 2.0 + (6.0 * x)

then we could evaluate g(x) for different values of x, e.g. g(5.0), g(5.1), etc., very quickly
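
In code, the point is just that the derivative function is generated once and is then an ordinary, cheap function:

g(x) = 2.0 + 6.0 * x   # hand-derived derivative of myfunc, produced once

g(5.0)   # 32.0
g(5.1)   # ≈ 32.6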

OK, but I think it instead ended up a lot less clear; it was very confusing to read through at first. In general, I recommend that you use type annotations very carefully, and only specify types when you really need them. Generic code is considered idiomatic in Julia, and is normally clearer, shorter and more powerful.

I'm not prepared to go into a discussion on the relative performance of dual numbers vs finite differences. I don't know enough about it, but I don't really expect dual numbers to be faster in the simplest case.

I just want to note that I don't think (x + ε) * (x + ε) expands to that many operations: ε*ε is not calculated, for example, since it is already known to be zero. You probably just do x*x, x*ε and ε*x. Here's a quick benchmark that may not be too useful:

v = 5.0
d = dual(v, 1)
0.6.3> @btime $v * $v
  1.457 ns (0 allocations: 0 bytes)
0.6.3> @btime $d * $d
  1.745 ns (0 allocations: 0 bytes)
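
For reference, the multiplication rule a typical dual-number implementation uses drops the ε·ε term at definition time, so it is never computed. A generic sketch (the type and its name are illustrative):

# (a + bε)(c + dε) = ac + (ad + bc)ε, since ε² = 0 by definition:
# three multiplications and one addition, not five operations.
struct MyDual{T<:Real}
    val::T   # real part
    der::T   # ε part
end

Base.:*(x::MyDual, y::MyDual) =
    MyDual(x.val * y.val, x.val * y.der + x.der * y.val)

MyDual(5.0, 1.0) * MyDual(5.0, 1.0)   # MyDual{Float64}(25.0, 10.0)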

BTW, the big reason your ForwardDiff code is so slow is that you wrap your value inside a vector and then access that, instead of just using scalar math. That has a gigantic performance cost.


do you see any hope that we could do something like:

gradfunc = ForwardDiff.symbolicgradient(f)

i.e. generates a function rather than the numeric gradient at a specific x?

the generation of gradfunc may be costly (as I imagine it would use something like a "symbolic dual number"), but that cost is paid only once. Afterwards, we could evaluate it very quickly.

This works for a simple function, but quickly becomes enormously difficult. Imagine doing this while combining the effects of several complicated functions with loops, if-statements, special type constructs and so on!

I think this may be related to the source-transformation approach to autodiff. You could try to google that.
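
As an aside: if the goal is just a reusable derivative function, rather than a symbolic expression, a closure over ForwardDiff already gives you that; it gets compiled once per input type and is cheap to call afterwards. A sketch:

using ForwardDiff

make_derivative(f) = x -> ForwardDiff.derivative(f, x)  # not symbolic, but reusable

myfunc(x) = 2 * (x + 1) + 3x^2
dmyfunc = make_derivative(myfunc)

dmyfunc(5.0)   # 32.0
dmyfunc(5.1)   # ≈ 32.6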

could you show me how to use ForwardDiff.gradient() passing a function that accepts a scalar as argument?

I got the following error if f() accepts a scalar rather than a vector:

julia> function forwarddifffunc(x)
           2.0 * (x + 1.0) + 3.0 * (x * x1)
       end
forwarddifffunc (generic function with 1 method)

julia> ForwardDiff.gradient(forwarddifffunc, 5.0)
ERROR: MethodError: no method matching gradient(::#forwarddifffunc, ::Float64)
You may have intended to import Base.gradient
Closest candidates are:
  gradient(::Any, ::StaticArrays.SArray) at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/ForwardDiff/src/gradient.jl:42
  gradient(::Any, ::StaticArrays.SArray, ::ForwardDiff.GradientConfig) at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/ForwardDiff/src/gradient.jl:43
  gradient(::Any, ::StaticArrays.SArray, ::ForwardDiff.GradientConfig, ::Val) at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/ForwardDiff/src/gradient.jl:44
  ...
Stacktrace:
 [1] macro expansion at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/Atom/src/repl.jl:118 [inlined]
 [2] anonymous at ./<missing>:?

This is what you're looking for, still in beta though:
https://github.com/FluxML/Zygote.jl
With the JuliaCon video at Flux: The Elegant Machine Learning Library | Mike Innes - YouTube


wow amazing, thx !

If you want symbolic derivatives, there's no need to involve dual numbers at all:

using XGrad
xdiff(:(2 * (x + 1) + 3x^2); x=1.0)

which gives

quote
    tmp372 = 3
    tmp368 = 2
    tmp369 = 1
    tmp370 = x + tmp369
    tmp371 = tmp368 * tmp370
    tmp374 = x ^ tmp368
    tmp375 = tmp372 * tmp374
    tmp376 = tmp371 + tmp375
    dtmp376!dtmp371 = 1.0
    dtmp376!dx__2 = tmp368 * dtmp376!dtmp371
    dtmp376!dtmp374 = tmp372 * dtmp376!dtmp371
    tmp380 = tmp368 - tmp369
    tmp381 = x ^ tmp380
    dtmp376!dx__1 = tmp368 * tmp381 * dtmp376!dtmp374
    dtmp376!dx = dtmp376!dx__1 .+ dtmp376!dx__2
    tmp385 = (tmp376, dtmp376!dx)
end

The code is optimized for tensors, so it may look a bit ugly, but it should give you pretty good performance.
(I think Calculus.jl used to give beautiful expressions for scalar derivatives, but I can't see it in their README anymore.)

The problem with the symbolic approach is that it works really badly with things like conditionals, loops, recursion and, more generally, dynamic computation graphs (e.g. XGrad requires the code to be a fixed list of correct algebraic expressions). It's fine in most cases, but if you encounter a task with a dynamic graph, even if the graph is fixed for a specific execution, you have to fall back to (more) pure AD.

In (more) pure AD you have a choice of forward-mode AD, which relies on dual numbers and works well for functions R → R^n, and reverse-mode AD, which is usually implemented using some kind of tracked arrays or, in the case of future Cassette-based solutions, function call interception, and which is designed for R^n → R problems.
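
A minimal illustration of the two modes on an R^n → R function (assuming ForwardDiff.jl and ReverseDiff.jl are installed):

using ForwardDiff, ReverseDiff

f(x) = sum(abs2, x) / 2          # gradient is simply x itself

x = [1.0, 2.0, 3.0]
ForwardDiff.gradient(f, x)       # [1.0, 2.0, 3.0], dual-number sweeps over the inputs
ReverseDiff.gradient(f, x)       # [1.0, 2.0, 3.0], one recorded forward pass + one backward pass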

Note that reverse-mode AD may also be fully dynamic and construct the graph on each run (I think AutoGrad.jl does this), or statically compile derivatives for performance after the first run, making it closer to the symbolic approach (ReverseDiff.jl and Capstan.jl do this). (I'm currently exploring the possibility of combining and comparing both approaches in a single package, Yota.jl.)

As a bottom line: dual numbers aren't the only approach to AD :slight_smile:


in this simple example, the direct gradient via dual-number maths is 2.34x more time-costly than finite differences.
Using ForwardDiff AD is even worse: it's 4271.75x !!!

This is a fairly ridiculous comparison: you're comparing apples to oranges here by pitting a very limited hand-rolled scalar AD against an interface that's documented to allocate because it's expected to be used with far more substantial inputs. Furthermore, you changed the target function between differentiation implementations.

Here, why not use this for comparison (also remember to interpolate variables correctly when using BenchmarkTools):

julia> myfunc(x) = 2.0 * (x + 1.0) + 3.0 * (x * x)
myfunc (generic function with 1 method)

julia> @btime ForwardDiff.derivative(myfunc, $(5.0))
  1.525 ns (0 allocations: 0 bytes)
32.0

And even if you did want to change the target function to forwarddifffunc:

julia> using StaticArrays

julia> @btime ForwardDiff.gradient(forwarddifffunc, $(SVector(5.0)))
  1.530 ns (0 allocations: 0 bytes)
1-element SArray{Tuple{1},Float64,1,1}:
 32.0

Also note that, for scalar differentiation, AD is generally faster than finite differences (regardless of the AD implementation, as long as it's decent and in a language that supports fast scalar ops).


Use ForwardDiff.derivative, not gradient.

There's also a mistake in your function: x1 is undefined.

Finally, I suggest using integer literals, since they will help you preserve input types better:

foo(x) = 2 * (x + 1) + 3x^2  # use x^2 not x * x, it's more readable.

yeah! a lot of good stuff!

anyway, can anyone suggest a good website/PDF/doc that reviews the various AD technologies?

You can check out these links:
http://www.juliadiff.org/
http://www.autodiff.org/


Numerical Optimization by Nocedal and Wright has a very nice intro chapter on forward, reverse, and mixed, and why you would use one or the other.

@book{nocedal2006numerical,
  title={Numerical Optimization},
  edition={2},
  author={Nocedal, Jorge and Wright, Stephen J.},
  year={2006},
  publisher={Springer}
}


thanks!


This turns out to be fixable, actually. The symbolic approach effectively operates on a single block of SSA-form IR, and generalising that IR to contain control flow is not too difficult. I'll put up more of a writeup on this at some point.

generalising that IR to contain control flow is not too difficult.

Is it? I'm curious how you would preserve looping and SSA, e.g. in:

function foo(x)
    for i=1:100 
        x = x + 1
    end
    return x
end

You either break SSA form for x, unroll the loop, or make a subgraph in which x is only assigned once. Breaking SSA form essentially makes the function non-differentiable, AFAICT. Unrolling the loop only works when the number of iterations is constant, which isn't true in the general case.

The only meaningful option I found is making a subgraph, but that brings another set of questions: how to incorporate it into the main graph, how to map variables inside the subgraph to variables before and after the loop, what the derivative of a loop/subgraph should look like, etc.

All of these made me switch to a much simpler approach and just support separately dynamic graphs with possible performance penalty and static graphs with all the optimizations, but it will be cool if you derive correct derivatives for general-purpose loops and conditions.