OK, now a summary of what I have learned. It seems that ForwardDiff is, in a sense, too smart: when it sees a new closure it does a lot of up-front work (specializing and compiling for that particular function type) that can speed up many subsequent evaluations of the same or a similar function. But if your goal is to compute Hessians of functions defined via closures that change significantly from call to call, that overhead is too much.
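To see this per-function cost in isolation, here is a minimal sketch (g1 and g2 are throwaway names, not anything from my module): two mathematically equivalent but syntactically distinct functions each pay the first-call compilation price, because ForwardDiff specializes on the function's type.

```julia
using ForwardDiff

g1(x) = sum(abs2, x)
g2(x) = sum(abs2, x) + 0.0   # equivalent, but a different function type

@time ForwardDiff.hessian(g1, rand(4))   # slow: compiles the Hessian for g1
@time ForwardDiff.hessian(g1, rand(4))   # fast: already specialized
@time ForwardDiff.hessian(g2, rand(4))   # slow again: compiles anew for g2
```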
In the end, I have ended up with a mixed solution:

1.- I use ForwardDiff to define the first and second derivatives of the elementary functions. (Note that these are compiled once, and ForwardDiff is very efficient at these scalar computations.)
2.- Then I define my own poor man's automatic differentiation. The main differences with ForwardDiff are that it basically only works for Float64, and that I define custom operations between dual numbers and Float64 numbers without promoting the Float64 numbers to dual numbers. For example (see also the sketch after this code block):
```julia
Base.:+(h1::hyperd, h2::hyperd) = hyperd(h1.v + h2.v, h1.d1 + h2.d1, h1.d2 + h2.d2, h1.dd + h2.dd)
Base.:+(h1::hyperd, h2::Number) = hyperd(h1.v + h2, h1.d1, h1.d2, h1.dd)
Base.:+(h1::Number, h2::hyperd) = hyperd(h1 + h2.v, h2.d1, h2.d2, h2.dd)
```
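For completeness, here is a sketch of how the remaining pieces fit together under the same convention (the actual ADerrors code may well differ): multiplication follows the hyper-dual product rule, and an elementary function like sin gets its first and second derivatives from ForwardDiff on plain scalars, which is exactly the role described in point 1 above.

```julia
import ForwardDiff

# Hyper-dual product rule: with e1^2 = e2^2 = 0,
# (a + a1*e1 + a2*e2 + a12*e1*e2)*(b + b1*e1 + b2*e2 + b12*e1*e2)
# has value a*b, first-order parts a*b1 + a1*b and a*b2 + a2*b,
# and cross term a*b12 + a1*b2 + a2*b1 + a12*b.
Base.:*(h1::hyperd, h2::hyperd) =
    hyperd(h1.v*h2.v,
           h1.v*h2.d1 + h1.d1*h2.v,
           h1.v*h2.d2 + h1.d2*h2.v,
           h1.v*h2.dd + h1.d1*h2.d2 + h1.d2*h2.d1 + h1.dd*h2.v)

# Second-order chain rule for a scalar elementary function f:
#   d1 -> f'(v)*d1,  d2 -> f'(v)*d2,  dd -> f'(v)*dd + f''(v)*d1*d2.
# ForwardDiff supplies f' and f'' on plain Float64s; these methods are
# compiled once and reused no matter how often the outer closure changes.
function Base.sin(h::hyperd)
    fp  = ForwardDiff.derivative(sin, h.v)                                  # cos(h.v)
    fpp = ForwardDiff.derivative(v -> ForwardDiff.derivative(sin, v), h.v)  # -sin(h.v)
    return hyperd(sin(h.v), fp*h.d1, fp*h.d2, fp*h.dd + fpp*h.d1*h.d2)
end
```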
In my admittedly poor understanding of all this, the promotion to Dual numbers is at the core of these large allocation counts and first-execution times.
This is part of a module that I am writing for something completely unrelated to AD, but the (undocumented) relevant code is here:

My poor man's implementation of the Hessian:

My hyperd type is similar to ForwardDiff's Dual, but with a fixed chunk size of 1. This avoids allocations at the cost of more evaluations of the function:
```julia
mutable struct hyperd
    v::Float64   # value f(x)
    d1::Float64  # first derivative along direction i
    d2::Float64  # first derivative along direction j
    dd::Float64  # mixed second derivative along directions i and j
end
```
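Because the chunk size is fixed at 1, each evaluation of the function on hyperd-seeded inputs produces exactly one Hessian entry, so the full matrix needs on the order of n(n+1)/2 evaluations. A minimal sketch of that idea (hess_entry and the seeding convention are hypothetical names of mine, not the actual ADerrors internals):

```julia
# Hypothetical sketch, not the actual ADerrors implementation.
# Seed direction i in the d1 slot and direction j in the d2 slot; after
# evaluating f, the dd field holds the mixed derivative d^2f/(dx_i dx_j).
function hess_entry(f, x::Vector{Float64}, i::Int, j::Int)
    xd = [hyperd(x[k], k == i ? 1.0 : 0.0, k == j ? 1.0 : 0.0, 0.0)
          for k in eachindex(x)]
    return f(xd).dd
end
```

This requires f to accept a Vector{hyperd}, which is the same kind of genericity that ForwardDiff demands from user code.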
Now some results for the basic code:

```julia
using ForwardDiff
import ADerrors

function do_hess(cs::Function, x::Vector{Float64}, y::Vector{Float64})
    n = length(x)
    m = length(y)
    xav = zeros(Float64, n+m)
    for i in 1:n
        xav[i] = x[i]
    end
    for i in n+1:n+m
        xav[i] = y[i-n]
    end

    # Using views (instead of allocating slices) helped a lot!
    function fx(x0)
        x1 = @view x0[1:n]
        x2 = @view x0[n+1:end]
        return cs(x1, x2)
    end

    hess = zeros(n+m, n+m)
    print("ForwardDiff timing: ")
    @time ForwardDiff.hessian!(hess, fx, xav)
    print("ForwardDiff timing: ")
    @time ForwardDiff.hessian!(hess, fx, xav)

    println("Custom hessian")
    print("hyperd_hessian timing: ")
    @time ADerrors.hyperd_hessian!(hess, fx, xav)
    print("hyperd_hessian timing: ")
    @time ADerrors.hyperd_hessian!(hess, fx, xav)
    println("Exit")

    return hess
end

function defs(f, x, w, n, m)
    chisq(p, d) = sum((d .- f(x, p)).^2 .* w)
    return chisq
end

x = rand(10)
w = rand(10).^2
@. lmfcn(x, p) = p[1] + p[2]*x

function lm2(x, p)
    res = Vector{eltype(p)}(undef, length(x))
    for i in eachindex(res)
        res[i] = p[1] + p[2]*x[i] + 1.0/x[i]
    end
    return res
end

chisq = defs(lm2, x, w, 2, 10)
do_hess(chisq, [1.0, 1.0], [1.0*n for n in 1:10])

println("Change closure")
x = rand(10)
w = rand(10).^2
chisq = defs(lmfcn, x, w, 2, 10)
do_hess(chisq, [1.0, 1.0], [1.0*n for n in 1:10])

println("Change closure")
x = rand(10)
w = rand(10).^2
@. lmfcn(x, p) = p[1]/x + p[2]*x^2
chisq = defs(lmfcn, x, w, 2, 10)
do_hess(chisq, [1.0, 1.0], [1.0*n for n in 1:10])
```
RESULTS:

```
ForwardDiff timing: 5.681562 seconds (5.28 M allocations: 235.013 MiB, 2.66% gc time)
ForwardDiff timing: 0.000039 seconds (11 allocations: 61.453 KiB)
Custom hessian
hyperd_hessian timing: 0.000080 seconds (4.15 k allocations: 211.578 KiB)
hyperd_hessian timing: 0.000076 seconds (4.15 k allocations: 211.578 KiB)
Exit
Change closure
ForwardDiff timing: 4.629693 seconds (4.51 M allocations: 187.885 MiB, 0.94% gc time)
ForwardDiff timing: 0.000031 seconds (11 allocations: 61.453 KiB)
Custom hessian
hyperd_hessian timing: 0.162594 seconds (111.11 k allocations: 5.882 MiB, 13.89% gc time)
hyperd_hessian timing: 0.000361 seconds (7.27 k allocations: 306.641 KiB)
Exit
Change closure
ForwardDiff timing: 1.905636 seconds (1.76 M allocations: 72.483 MiB, 6.92% gc time)
ForwardDiff timing: 0.000031 seconds (11 allocations: 61.453 KiB)
Custom hessian
hyperd_hessian timing: 0.036707 seconds (57.31 k allocations: 2.639 MiB)
hyperd_hessian timing: 0.000437 seconds (9.37 k allocations: 366.359 KiB)
Exit
```
So yes, my implementation takes about 2-10 times longer than ForwardDiff once the basic functions have been compiled, and this seems to hold whether or not you change the closure/function. The only thing my implementation needs in order to perform well is that the basic operations (sin, cos, products of hyperd, etc.) are precompiled.

On the other hand, once ForwardDiff has specialized on a particular function, it is much faster than my implementation. But the overhead of this tuning is huge: with roughly 5 s of first-call compilation against a per-call advantage of roughly 40 µs, it takes on the order of 10^5 Hessian evaluations of the same closure just to break even!
Final Question

To the experts/developers of ForwardDiff: would it not be more efficient to hard-code the operations between dual numbers and ordinary numbers, without promotion? (Although maybe I am just not understanding the ForwardDiff code very well...)
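To make the question concrete, here is the schematic difference I have in mind (purely illustrative; promote_add and direct_add are made-up names, and this is not ForwardDiff's actual source):

```julia
# Assumes the hyperd struct and the hyperd + hyperd method from above.

# Alternative 1 (promotion-style, schematic): lift the scalar to a dual
# number with zero derivative parts, then reuse the dual + dual method.
promote_add(h::hyperd, x::Number) = h + hyperd(x, 0.0, 0.0, 0.0)

# Alternative 2 (hard-coded, what my implementation does): touch only the
# fields that actually change; no temporary dual number is created.
direct_add(h::hyperd, x::Number) = hyperd(h.v + x, h.d1, h.d2, h.dd)
```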
One last thing: Many thanks for the help!