I can't speak to the performance issue at the moment, but why are you defining all these different functions? Just define
myfunc(x) = 2 * (x + 1) + 3x^2
and run the autodifferentiation on that.
Edit: Try to run this:
using BenchmarkTools, ForwardDiff

foo(x) = 2 * (x + 1) + 3x^2
fdiff(f, x, d=0.001) = (f(x + d) - f(x)) / d   # forward finite difference
v = 5.0
@btime fdiff(foo, $v)
@btime dualpart(foo($v + ε))   # assumes your hand-rolled dual; v + ε is equivalent to Dual(v, 1)
@btime ForwardDiff.derivative(foo, $v)
I'm having some problems with ForwardDiff on version 0.7, but I expect some significant speedups on 0.7/1.0 when it comes to compiling away wrapper types, etc.
In this very simple case, though, I doubt that you can expect speedups when using ForwardDiff over finite differences. I'm not sure exactly in what situations that should even happen. The great advantage of autodiff is accuracy at almost finite-diff performance.
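To make the accuracy point concrete, using the foo and fdiff defined in the edit above (output values are approximate):
fdiff(foo, 5.0)                    # ≈ 32.003, truncation error on the order of d
ForwardDiff.derivative(foo, 5.0)   # 32.0, exact up to floating-point rounding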
Actually, all the talks at JuliaCon were livestreamed this year! So the videos were available immediately. Everything is already on YouTube.
And they used screen capture instead of shaky-cam for the slides, so the whole experience was much improved over previous JuliaCons.
defining all the different functions is just for clarity …
the point is: doing maths with dual numbers at least doubles the operations compared to plain floats, e.g.
3.0 * (x + ε) has two operations: 3.0 * x and 3.0 * ε.
And for more complicated operations, the cost could be even higher; consider:
(x + ε) * (x + ε) becomes the operations x * x, x * ε, x * ε, ε * ε, and (+), i.e. 5 operations compared to the 1 operation in x * x.
so, for almost all maths operations on dual numbers, the time cost is more than doubled.
now, for finite differences, the time cost is simply about 2 × O(f): one evaluation of f at x + delta and one at x, plus a subtraction and a division, i.e. basically a constant 2 times the cost of f(x).
so I would say that AD by dual numbers is almost certainly much slower than finite differences …
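To make the operation counts concrete, here is a minimal hand-rolled dual type. MyDual is just an illustrative name (ForwardDiff defines its own Dual type), and the comments note the float operations each overload performs; a typical implementation drops ε * ε since it is always zero.
struct MyDual
    re::Float64   # value part
    du::Float64   # derivative (dual) part
end

Base.:+(a::MyDual, b::MyDual) = MyDual(a.re + b.re, a.du + b.du)                # 2 adds
Base.:*(a::MyDual, b::MyDual) = MyDual(a.re * b.re, a.re * b.du + a.du * b.re)  # 3 muls, 1 add
Base.:+(a::Number, b::MyDual) = MyDual(a + b.re, b.du)
Base.:*(a::Number, b::MyDual) = MyDual(a * b.re, a * b.du)

x = MyDual(5.0, 1.0)                       # represents 5 + ε
2 * (x + MyDual(1.0, 0.0)) + 3 * (x * x)   # MyDual(87.0, 32.0): the dual part is f'(5)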
a solution I could imagine is a "symbolic dual number". For example, for the function
myfunc(x) = 2 * (x + 1) + 3x^2
if we could symbolically evaluate the dual part of it:
dualpart(2.0 * (x + ε) + 3.0 * (x + ε) ^ 2)
= dualpart(2.0x + 2.0ε + 3.0 * (x^2 + 2xε + ε*ε))
= dualpart(2.0x + 2.0ε + 3.0(x^2) + (6.0 * x)ε + 0.0)
= dualpart(3.0(x^2) + 2.0x + (2.0 + (6.0 * x))ε)
= 2.0 + (6.0 * x)
and let g(x) = 2.0 + (6.0 * x)
then we could evaluate g(x) for different values of x, e.g. g(5.0), g(5.1), etc., very quickly.
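Here is a rough sketch of that "symbolic dual number" idea, assuming only +, binary *, and constant integer ^ applied to a single variable; deriv_expr and derivative_function are hypothetical names, not part of any package.
# Recursively build the derivative expression of a Julia Expr with respect to `var`.
deriv_expr(ex::Number, var::Symbol) = 0
deriv_expr(ex::Symbol, var::Symbol) = ex == var ? 1 : 0
function deriv_expr(ex::Expr, var::Symbol)
    op, args = ex.args[1], ex.args[2:end]
    if op == :+
        return Expr(:call, :+, map(a -> deriv_expr(a, var), args)...)
    elseif op == :*
        a, b = args                     # product rule (binary * only)
        return :( $(deriv_expr(a, var)) * $b + $a * $(deriv_expr(b, var)) )
    elseif op == :^
        a, n = args                     # power rule (constant exponent only)
        return :( $n * $a ^ ($n - 1) * $(deriv_expr(a, var)) )
    else
        error("unsupported operation: $op")
    end
end

# Build the derivative function once (eval defines it at global scope),
# then evaluate it cheaply as often as needed.
derivative_function(ex, var) = eval(:( $var -> $(deriv_expr(ex, var)) ))

g = derivative_function(:(2 * (x + 1) + 3x^2), :x)
g(5.0)   # 32.0
g(5.1)   # ≈ 32.6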
OK, but I think it instead ended up a lot less clear; it was very confusing to read through at first. In general, I recommend that you use type annotations very carefully, and only specify types when you really need them. Generic code is considered idiomatic in Julia, and is normally clearer, shorter, and more powerful.
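As a hypothetical illustration of that point (myfunc is the generic definition from above; myfunc_strict is made up for the example):
using ForwardDiff

myfunc_strict(x::Float64) = 2.0 * (x + 1.0) + 3.0x^2   # restricted to Float64

ForwardDiff.derivative(myfunc, 5.0)            # 32.0: generic code accepts Dual numbers
# ForwardDiff.derivative(myfunc_strict, 5.0)   # MethodError: a Dual is not a Float64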
I'm not prepared to go into a discussion on the relative performance of dual numbers vs finite differences. I don't know enough about it, but I don't really expect them to be faster in the simplest case.
I just want to note that I don't think (x + ε) * (x + ε) expands to that many operations: ε * ε is not calculated, for example, since it is already known to be zero. You probably just do x * x, x * ε, and ε * x. Quick benchmark that may not be too useful:
v = 5.0
d = ForwardDiff.Dual(v, 1.0)   # the dual number v + ε
0.6.3> @btime $v * $v
1.457 ns (0 allocations: 0 bytes)
0.6.3> @btime $d * $d
1.745 ns (0 allocations: 0 bytes)
BTW, the big reason your ForwardDiff code is so slow is that you wrap your value inside a vector and then access that, instead of just using scalar math. That has a gigantic performance cost.
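A sketch of the difference (assuming foo from above and ForwardDiff loaded):
using ForwardDiff

ForwardDiff.derivative(foo, 5.0)              # scalar path: returns 32.0, no allocation
ForwardDiff.gradient(x -> foo(x[1]), [5.0])   # value wrapped in a Vector: allocates, returns [32.0]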
do you see any hope that we could do something like:
gradfunc = ForwardDiff.symbolicgradient(f)
i.e. generates a function rather than the numeric gradient at a specific x?
the generation of gradfunc may be costly (as I imagine it would use something like the "symbolic dual number" above), but that cost is paid only once. Afterwards, we could evaluate it very quickly.
This works for a simple function, but quickly becomes enormously difficult. Imagine doing this while combining the effects of several complicated functions with loops, if-statements, special type constructs and so on!
I think this may be related to the source-transformation approach to autodiff. You could try to google that.
could you show me how to use ForwardDiff.gradient() passing a function that accepts a scalar as argument?
I got the following error if f() accepts a scalar rather than a vector:
julia> function forwarddifffunc(x)
2.0 * (x + 1.0) + 3.0 * (x * x1)
end
forwarddifffunc (generic function with 1 method)
julia> ForwardDiff.gradient(forwarddifffunc, 5.0)
ERROR: MethodError: no method matching gradient(::#forwarddifffunc, ::Float64)
You may have intended to import Base.gradient
Closest candidates are:
gradient(::Any, ::StaticArrays.SArray) at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/ForwardDiff/src/gradient.jl:42
gradient(::Any, ::StaticArrays.SArray, ::ForwardDiff.GradientConfig) at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/ForwardDiff/src/gradient.jl:43
gradient(::Any, ::StaticArrays.SArray, ::ForwardDiff.GradientConfig, ::Val) at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/ForwardDiff/src/gradient.jl:44
...
Stacktrace:
[1] macro expansion at /Applications/JuliaPro-0.6.3.1.app/Contents/Resources/pkgs-0.6.3.1/v0.6/Atom/src/repl.jl:118 [inlined]
[2] anonymous at ./<missing>:?
This is what you're looking for, still in beta though:
https://github.com/FluxML/Zygote.jl
With the JuliaCon video: "Flux: The Elegant Machine Learning Library" by Mike Innes (YouTube).
wow amazing, thx !
If you want symbolic derivatives, there's no need to involve dual numbers at all:
using XGrad
xdiff(:(2 * (x + 1) + 3x^2); x=1.0)
which gives
quote
    tmp372 = 3
    tmp368 = 2
    tmp369 = 1
    tmp370 = x + tmp369
    tmp371 = tmp368 * tmp370
    tmp374 = x ^ tmp368
    tmp375 = tmp372 * tmp374
    tmp376 = tmp371 + tmp375
    dtmp376!dtmp371 = 1.0
    dtmp376!dx__2 = tmp368 * dtmp376!dtmp371
    dtmp376!dtmp374 = tmp372 * dtmp376!dtmp371
    tmp380 = tmp368 - tmp369
    tmp381 = x ^ tmp380
    dtmp376!dx__1 = tmp368 * tmp381 * dtmp376!dtmp374
    dtmp376!dx = dtmp376!dx__1 .+ dtmp376!dx__2
    tmp385 = (tmp376, dtmp376!dx)
end
The code is optimized for tensors, so it may look a bit ugly, but it should give you pretty good performance.
(I think Calculus.jl used to give beautiful expressions for scalar derivatives, but I can't see it in their README anymore.)
The problem with the symbolic approach is that it works really badly with things like conditions, loops, recursion and, more generally, dynamic computation graphs (e.g. XGrad requires the code to be a fixed list of algebraic expressions). It's fine in most cases, but if you encounter a task with a dynamic graph, even one that is fixed for a specific execution, you have to fall back to (more) pure AD.
In (more) pure AD you have a choice of forward-mode AD, which relies on dual numbers and works well for functions R → R^n, and reverse-mode AD, which is usually implemented using some kind of tracked arrays or, in the case of future Cassette-based solutions, function-call interception, and which is designed for R^n → R problems.
Note that reverse-mode AD may also be fully dynamic and construct the graph on each run (I think AutoGrad.jl does this), or statically compile derivatives for performance after the first run, which makes it closer to the symbolic approach (ReverseDiff.jl and Capstan.jl do this). (I'm currently exploring the possibility of combining and comparing both approaches in a single package, Yota.jl.)
As a bottom line: dual numbers aren't the only approach to AD.
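For the same toy function, the approaches look roughly like this (a sketch; assumes ForwardDiff, ReverseDiff, and the still-beta Zygote are installed):
using ForwardDiff, ReverseDiff, Zygote

f(x) = 2 * (x + 1) + 3x^2

ForwardDiff.derivative(f, 5.0)             # forward mode via dual numbers       -> 32.0
ReverseDiff.gradient(v -> f(v[1]), [5.0])  # reverse mode via tracked arrays     -> [32.0]
Zygote.gradient(f, 5.0)                    # reverse mode, source transformation -> (32.0,)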
in this simple example, taking the gradient directly with dual-number maths costs 2.34x the time of finite differences.
using ForwardDiff AD is even worse: it's 4271.75x!!!
This is a fairly ridiculous comparison: you're comparing apples to oranges here by pitting a very limited hand-rolled scalar AD against an interface that's documented to allocate because it's expected to be used with far more substantial inputs. Furthermore, you changed the target function between differentiation implementations.
Here, why not use this for comparison (also remember to interpolate variables correctly when using BenchmarkTools):
julia> myfunc(x) = 2.0 * (x + 1.0) + 3.0 * (x * x)
myfunc (generic function with 1 method)
julia> @btime ForwardDiff.derivative(myfunc, $(5.0))
1.525 ns (0 allocations: 0 bytes)
32.0
And even if you did want to change the target function to forwarddifffunc:
julia> using StaticArrays
julia> @btime ForwardDiff.gradient(forwarddifffunc, $(SVector(5.0)))
1.530 ns (0 allocations: 0 bytes)
1-element SArray{Tuple{1},Float64,1,1}:
32.0
Also note that, for scalar differentiation, AD is generally faster than finite differences (regardless of the AD implementation, as long as it's decent and in a language that supports fast scalar ops).
Use ForwardDiff.derivative, not gradient.
There's also a mistake in your function: x1 is undefined.
Finally, I suggest using integer literals, since they will help you preserve input types better:
foo(x) = 2 * (x + 1) + 3x^2 # use x^2 not x * x, it's more readable.
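A quick illustration of the type-preservation point (the name foo_float is made up for the example):
foo_float(x) = 2.0 * (x + 1.0) + 3.0 * x^2   # Float64 literals

typeof(foo(1.0f0))         # Float32: the integer literals preserve the input type
typeof(foo_float(1.0f0))   # Float64: the float literals promote the result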
yeah! a lot of good stuff!
anyway, any good websites/PDFs/docs you'd suggest that review the various AD techniques?
You can check out these links:
http://www.juliadiff.org/
http://www.autodiff.org/
Numerical Optimization by Nocedal and Wright has a very nice intro chapter on forward, reverse, and mixed, and why you would use one or the other.
@misc{nocedal2006numerical,
  title={Numerical Optimization},
  edition={2},
  author={Nocedal, Jorge and Wright, Stephen J},
  year={2006},
  publisher={Springer}
}
thanks!
This turns out to be fixable, actually. The symbolic approach effectively operates on a single block of SSA-form IR, and generalising that IR to contain control flow is not too difficult. Will put up more of a writeup on this at some point.
generalising that IR to contain control flow is not too difficult.
Is it? I'm curious how you would preserve looping and SSA, e.g. in:
function foo(x)
    for i = 1:100
        x = x + 1
    end
    return x
end
You either break SSA form for x, or unroll the loop, or make a subgraph in which x is only assigned once. Breaking SSA form essentially makes the function non-differentiable, AFAICT. Unrolling the loop only works when the number of iterations is constant, which isn't true in the general case.
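For instance (an illustrative example), a loop whose trip count depends on the input cannot be unrolled ahead of time:
function double_until(x)
    while x < 100      # number of iterations depends on x
        x = 2x
    end
    return x
end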
The only meaningful option I found is making a subgraph, but it brings another set of questions: how to incorporate it into the main graph, how to map variables inside the subgraph to variables before and after the loop, what the derivative of a loop/subgraph should look like, etc.
All of this made me switch to a much simpler approach: support dynamic graphs (with a possible performance penalty) and static graphs (with all the optimizations) separately. But it will be cool if you can derive correct derivatives for general-purpose loops and conditionals.