Optim.jl OnceDifferentiable Baby Steps


#1

Automatic differentiation seems a little like magic. ?OnceDifferentiable is not very informative. Experimenting with it, I want to see, e.g., what happens if my function uses abs(). As always, I want to start with something simple:

julia> using Optim

julia> myfun(x::Vector{Float64})= (x[1]-2)^2 + (x[2]-3)^2 + (x[1]*x[2])^2;

julia> od = OnceDifferentiable(myfun, [ 0.0, 0.0 ]; autodiff = :forward);

I hope my two questions now are simple:

  • Why does an automatic differentiator need an initial_x value, at all? Is it to understand the dimensionality of the function (here 2)? Would it be better, then, to have an int argument requesting the length?

  • How can I use the resulting od object to request the gradient at a particular location? I can use od.f([2.0, 3.0]) to evaluate the function, but how do I obtain the gradient at the point [2.0, 3.0]?

Advice appreciated.

PS: Does Optim.DifferentiableFunction still exist?


#2

You will need to remove the Float64 type annotation in myfun to use ForwardDiff. Then the following should work.

julia> using Optim

julia> myfun(x)= (x[1]-2)^2 + (x[2]-3)^2 + (x[1]*x[2])^2;

julia> od = OnceDifferentiable(myfun, [0.0, 0.0]; autodiff = :forward);

julia> od.f([2.,2.])
17.0

julia> grad = zeros(2);

julia> od.df(grad,[2.,2.])
2-element Array{Float64,1}:
 16.0
 14.0

julia> grad
2-element Array{Float64,1}:
 16.0
 14.0

No.

I will leave the first question to others, since I haven’t read the auto-diff source code.


#3

Consider playing around with https://github.com/JuliaDiff/ForwardDiff.jl directly to get a feel for how it works first.

Automatic differentiation sure feels magical, but you will see that it is more chain rule than magic. See http://www.juliadiff.org/ForwardDiff.jl/latest/dev/how_it_works.html for a very simple explanation, which hopefully also explains why you need to “seed” the calculation at a particular point.


#4

thanks everyone. I think my guess is right that the initial value is used to determine dimensionality. it makes even less sense in od.df, where the second argument already supplies it, but then that function is not meant to be called directly by an end user anyway.

I am looking at ForwardDiff. what had confused me is that I had thought it would not be able to decompile the function being passed in, but now I realize that it overloads all the known elementary julia math functions to propagate derivatives alongside values, so when the function (not a string containing an unevaluated function) is passed in, it can apply the chain rule operation by operation. clever.
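For intuition, that mechanism can be sketched with a hand-rolled dual number. This is a simplified illustration only; ForwardDiff's actual Dual type is far more general (vectors, nesting, etc.):

```julia
# A minimal dual number: carries a value and a derivative (the "seed").
struct Dual
    val::Float64   # f(x)
    der::Float64   # f'(x)
end

# Overload elementary operations with their chain rules.
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:-(a::Dual, b::Real) = Dual(a.val - b, a.der)

# Differentiate f(x) = (x - 2)^2 at x = 3 by seeding der = 1.
f(x) = (x - 2) * (x - 2)
d = f(Dual(3.0, 1.0))
(d.val, d.der)   # value 1.0, derivative 2.0
```

Seeding der = 1.0 at a concrete x is exactly why a particular point is needed: the derivative is computed numerically alongside the value, never symbolically.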

I think there is a small math mistake with abs (and kinks) in it:

julia> f(x)= sum(abs.(x))
f (generic function with 1 method)

julia> ForwardDiff.gradient(f, [0.0])
1-element Array{Float64,1}:
 1.0

should probably give NaN or Inf. But this is minor. thanks everyone.

Clarification: Julia has -0.0 and +0.0. In this case, it recognizes -0.0. maybe this is desired behavior??


#5

The gradient of the abs function at 0 is not defined, so it is expected that you know the consequences of asking for a derivative at a point where it does not exist. In many optimization problems where the objective is not smooth, however, it suffices to return any value in the subgradient set, which for abs at 0 is [-1, 1]. This means that many algorithms for smooth optimization will also work for many non-smooth optimization problems.
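As an illustration, a hand-picked subgradient for abs could look like this. This is just one convention (0 at the kink), not ForwardDiff's behavior; any value in [-1, 1] at 0 is an equally valid choice:

```julia
# Subgradient of abs: sign(x) away from the kink, 0 at x == 0 by convention.
subgrad_abs(x) = x > 0 ? 1.0 : (x < 0 ? -1.0 : 0.0)

subgrad_abs(2.0)   # 1.0
subgrad_abs(-3.0)  # -1.0
subgrad_abs(0.0)   # 0.0, but any value in [-1, 1] would do
```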


#6

od.df and od.fdf each take a preallocated gradient vector and an input. They update the contents of the gradient vector with the gradient, and od.fdf also returns the value of f.
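For example, with the myfun from earlier in the thread, od.fdf could be called like this (field names as in the Optim version discussed here; they may differ across releases):

```julia
using Optim

myfun(x) = (x[1] - 2)^2 + (x[2] - 3)^2 + (x[1] * x[2])^2
od = OnceDifferentiable(myfun, [0.0, 0.0]; autodiff = :forward)

G = zeros(2)                  # preallocated gradient buffer
fx = od.fdf(G, [2.0, 2.0])    # fills G in place and returns f([2.0, 2.0])
# fx == 17.0; G now holds [16.0, 14.0], matching od.df above
```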

When creating a OnceDifferentiable object, the constructors use both the length and the element type of x.

Use fieldnames(typeof(od)) to see all the fields.


#7

The initial_x input is not related to automatic differentiation. The OnceDifferentiable struct in Optim / NLSolversBase is used as a helper to access the objective function and gradient (see my answer below). In addition, OnceDifferentiable is a caching device that stores the last x-values at which the function and gradients were evaluated. initial_x is there to “initiate” this caching.

The autodiff = :forward keyword tells the OnceDifferentiable constructor to evaluate gradients of the objective with ForwardDiff; otherwise it will use finite differences. You can also supply the gradient function yourself (there are some examples at NLSolversBase.jl).
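For instance, a hand-written gradient can be supplied via the three-argument constructor (a sketch; the same constructor form appears later in this thread):

```julia
using Optim  # re-exports OnceDifferentiable from NLSolversBase

f(x) = (x[1] - 2)^2 + (x[2] - 3)^2
function g!(G, x)            # must write into G in place
    G[1] = 2 * (x[1] - 2)
    G[2] = 2 * (x[2] - 3)
end

od = OnceDifferentiable(f, g!, [0.0, 0.0])
res = optimize(od, [0.0, 0.0], BFGS())
# Optim.minimizer(res) should be approximately [2.0, 3.0]
```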

NLSolversBase provides an API to evaluate OnceDifferentiable-objects. Look here


#8

thanks, anriseth. How is this NLSolversBase interface used? I have been trying a simple example, and it’s not obvious to me. Take the function above.

julia> using Optim, NLSolversBase

julia> myfun(x)= (x[1]-2)^2 + (x[2]-3)^2 + ((x[1]-2)*(x[2]-3))^2;

julia> od = OnceDifferentiable(myfun, [ 3.0, 3.0 ]; autodiff = :forward);

julia> NLSolversBase.gradient(od, [3.0, 3.0])   ## works as expected.
2-element Array{Float64,1}:
 2.0
 0.0

julia> NLSolversBase.gradient(od, [3.0, 3.0])   ## wth?? for max confusion?
2-element Array{Float64,1}:
 NaN
 NaN

julia> td = TwiceDifferentiable(myfun, [ 3.0, 3.0 ]; autodiff = :forward);

julia> NLSolversBase.gradient(td, [3.0, 3.0])   ## huh?
2-element Array{Float64,1}:
 2.25826e-314
 0.0

julia> NLSolversBase.hessian(td, [3.0,3.0])
ERROR: MethodError: no method matching hessian(::NLSolversBase.TwiceDifferentiable{Float64,Array{Float64,1},Array{Float64,2},Array{Float64,1}}, ::Array{Float64,1})
Closest candidates are:

And what are the admissible keywords to autodiff? I see :forward. Are there others?

PS:

julia> using Optim

julia> myfun(x)= (x[1]-2)^2 + (x[2]-3)^2 + ((x[1]-2)*(x[2]-3))^2;

julia> td = TwiceDifferentiable(myfun, [ 3, 3 ]; autodiff = :forward);
ERROR: InexactError()

The vector should probably be promoted to Float64 automatically.


#9

I spoke too soon regarding initial_x. It is used to pass type information etc. to ForwardDiff and DiffEqDiffTools in order to create caches for differentiation.

Regarding NLSolversBase, what you want to do is the following:

x0 = [3.0,3.0]
od = OnceDifferentiable(myfun, x0; autodiff=:forward)

gradient!(od, x0)
gval = gradient(od)

You seem to have discovered a bug, and I’ll report it. You are unlikely to ever need to call gradient(od, x), however, and the approach above is more likely to be what you want.
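The same mutate-then-read pattern works for the objective value, assuming NLSolversBase's value!/value accessors, which mirror gradient!/gradient:

```julia
using Optim, NLSolversBase

myfun(x) = (x[1] - 2)^2 + (x[2] - 3)^2 + ((x[1] - 2) * (x[2] - 3))^2
x0 = [3.0, 3.0]
od = OnceDifferentiable(myfun, x0; autodiff = :forward)

value!(od, x0)        # evaluate and cache f(x0)
fval = value(od)      # read the cached value: 1.0 here

gradient!(od, x0)     # evaluate and cache the gradient at x0
gval = gradient(od)   # [2.0, 0.0]
```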

You are welcome to create an issue or submit a PR :wink:


#10

Yeah, I admit it was a mistake to remove the autodiff keyword; then again, constructing a OnceDifferentiable directly is probably not something most users are meant to do.

I don’t think it should, but sure, we can take that discussion again if you open an issue.


#11

Gentlemen, can I beg you to report it if some of my code has stumbled over bugs? See, I am not certain what my code's output should be. The docs are not ready yet, so I am trying to learn Optim.jl by experimentation, and I just post whenever I find an inconsistency or get stuck.

I am trying to write a “starter” chapter about Optim at http://julia.cookbook.tips/doku.php?id=minimasn , so I hope that as I get clearer over time, you will get fewer dumb questions from me and others.

The following may or may not be the same bug or a different one; here it bites on ordinary optimize use.

myfun(x)= (x[1]-2)^2 + (x[2]-3)^2 + ((x[1]-2)*(x[2]-3))^2;
myfungradients(x)= [ 2*(1 + (x[2]-3)^2)*(x[1]-2 ),  2*(1 + (x[1]-2)^2)*(x[2]-3) ]
myfunhessian(x)= [ [ 2 + 2*(x[2]-3)^2, 4*(x[1]-2)*(x[2]-3) ], [ 4*(x[1]-2)*(x[2]-3) , 2 + 2*(x[1]-2)^2 ] ]
 
function f(x) myfun(x) end#function
function g!(G,x) G=myfungradients(x) end#function
function fg!(G,x) g!(G,x); f(x) end#function  ## fg! can be used to share calculations between f and g!
function h!(H,x) H=myfunhessian(x) end#function
 
using Optim
z2= zeros(2)
nd = NonDifferentiable( f, z2 )
od = OnceDifferentiable( f, g!, z2 )
td = TwiceDifferentiable( f, g!, h!, z2 )
 
println( ( td.f( [2.0, 2.0] ), td.df( [NaN,NaN], [2.0, 2.0] ), td.h( [NaN,NaN], [2.0, 2.0] ) ) )
 
println( ( od.f( [2.0, 2.0] ), od.df( [NaN,NaN], [2.0, 2.0] ) ) )
 
optimize( td, z2, BFGS())  ## works
 
optimize( od, z2, BFGS())  ## error

I will post another question in another thread.


#12

This is not a recommended style since it allocates memory for the output.

The idea of g! is that it should modify the object that G refers to in-place. What you are doing here is simply rebinding the local G name to the array output by myfungradients(x), but outside the function, the passed object was not modified. See this for more details http://www.johnmyleswhite.com/notebook/2014/09/06/values-vs-bindings-the-map-is-not-the-territory/. When you modify the object that G binds to in-place, you are not allocating additional memory so that’s efficient.
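The difference can be seen in a few lines (a minimal illustration of rebinding versus in-place mutation):

```julia
function rebind!(G, x)
    G = 2 .* x        # rebinds the *local* name G; caller's array untouched
end

function mutate!(G, x)
    G .= 2 .* x       # writes into the caller's array in place
end

G = zeros(2)
rebind!(G, [1.0, 2.0]); println(G)  # still [0.0, 0.0]
mutate!(G, [1.0, 2.0]); println(G)  # now [2.0, 4.0]
```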

I am actually surprised any of them works at all!

See these 2 examples:

  1. http://julianlsolvers.github.io/Optim.jl/stable/algo/cg/
  2. http://julianlsolvers.github.io/Optim.jl/stable/algo/autodiff/

#13

the worst part is when bad stuff sometimes works or almost works. if it always bombs badly, that's great.

I understand now. yes, I knew the difference, but it had not dawned on me that this was essential in this context, too. this is all making sense to me now. not just madness. madness with method. I now understand why the API for g! and h! is as it is.

alas, logically, this is somewhat painful. I won’t be the last person to screw this up. is there some clever way for Julia to construct it, rather than pushing the burden onto the unsuspecting end user?

then again, this is probably not worth too much bother. the user may as well use automatic differentiation. symbolic provision may be on the way out.

thanks a lot mohamed.


#14

EDIT: I have made hundreds of edits here, and only just realised why your example fails. Your function g!(G,x) does not write the values of the gradient into G; rather, it rebinds G to another vector. This is a subtlety of how Julia works, rather than of Optim.

G = zeros(2)
g!(G, zeros(2))
G != myfungradients(zeros(2)) # You have not actually updated `G` in the outer scope.

OLD:
That is a bit weird. It is helpful if you provide us with the error message as well as the versions of Optim et al that you are using.

Note that the following works

using Optim

f(x)= (x[1]-2)^2 + (x[2]-3)^2 + ((x[1]-2)*(x[2]-3))^2
function g!(out,x)
    out[1] = 2*(1 + (x[2]-3)^2)*(x[1]-2)
    out[2] = 2*(1 + (x[1]-2)^2)*(x[2]-3)
end
function h!(out, x)
    out[1,1] =  2 + 2*(x[2]-3)^2
    out[1,2] = 4*(x[1]-2)*(x[2]-3)
    out[2,2] =  2 + 2*(x[1]-2)^2
    out[2,1] = out[1,2]
end

z2= zeros(2)
od = OnceDifferentiable( f, g!, z2 )
td = TwiceDifferentiable( f, g!, h!, z2 )


optimize(f, g!, z2, BFGS())
optimize(td, z2, BFGS())  
optimize(od, z2, BFGS())  

#15

sorry. indeed, mohamed pointed it out. for others, here is a correct example:

myfun(x)= (x[1]-2)^2 + (x[2]-3)^2 + ((x[1]-2)*(x[2]-3))^2;
myfungradients(x)= [ 2*(1 + (x[2]-3)^2)*(x[1]-2 ),  2*(1 + (x[1]-2)^2)*(x[2]-3) ]
myfunhessian(x)= reshape( [ 2+2*(x[2]-3)^2, 4*(x[1]-2)*(x[2]-3), 4*(x[1]-2)*(x[2]-3) , 2+2*(x[1]-2)^2 ], 2, 2 )

function f(x) myfun(x) end#function
function g!(G,x) Gcp=myfungradients(x); G[1]=Gcp[1]; G[2]=Gcp[2]; G; end#function  ## must not rebind G to a new array!
function fg!(G,x) g!(G,x); f(x) end#function  ## fg! can be used to share calculations between f and g!
function h!(H,x) Hcp=myfunhessian(x); for i=1:length(x),j=1:length(x) ; H[i,j]= Hcp[i,j]; end; end#function

using Optim
z2= zeros(2)
nd = NonDifferentiable( f, z2 )
od = OnceDifferentiable( f, g!, z2 )
td = TwiceDifferentiable( f, g!, h!, z2 )

println( ( td.f( [2.0, 2.0] ), td.df( [NaN,NaN], [2.0, 2.0] ), td.h( [NaN NaN;NaN NaN], [2.0, 2.0] ) ) )
println( ( od.f( [2.0, 2.0] ), od.df( [NaN,NaN], [2.0, 2.0] ) ) )

optimize( td, z2, BFGS())
optimize( od, z2, BFGS())

this can be sped up by avoiding the Gcp and Hcp copies, but at least it is correct.


#16

If you’re interested, you can see my latest PR here https://github.com/JuliaNLSolvers/Optim.jl/pull/577

It basically adds keywords that remove two of the annoyances you've seen. The first is an autodiff keyword at the optimize level, so you don't have to worry about OnceDifferentiable and so on; the other is an inplace keyword, also at the optimize level, that allows you to pass your f(x), g(x), and h(x) style functions by setting inplace = false. Note that before users can use this, the PR first has to be merged (I still need more tests and to change the documentation), then we have to tag a new version, and then it has to be merged in METADATA.jl, so it might be a little while before it goes “live”.
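Based on that description, the usage would look roughly like this (a sketch only; exact keyword names could still change before the PR is merged and released):

```julia
using Optim

f(x) = (x[1] - 2)^2 + (x[2] - 3)^2
g(x) = [2 * (x[1] - 2), 2 * (x[2] - 3)]   # out-of-place gradient

# autodiff at the optimize level, no OnceDifferentiable needed:
optimize(f, zeros(2), BFGS(); autodiff = :forward)

# out-of-place f(x)/g(x) style functions via the inplace keyword:
optimize(f, g, zeros(2), BFGS(); inplace = false)
```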


#17

thanks. this sounds great. regards, /iaw


#18

Note that, in Julia 0.6, you can assign into an existing array with a “dot”: G .= myfungradients(x).
https://docs.julialang.org/en/stable/manual/functions/#man-vectorized-1
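Applied to the g! from the example above, this collapses the element-by-element copy into one line with the same in-place behavior:

```julia
function g!(G, x)
    G .= myfungradients(x)  # broadcast-assign into the existing array
end
```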


#19

In DiffWrappers.jl, I provide the following interface (used by DynamicHMC.jl):

  1. the user codes a \mathbb{R}^n \to \mathbb{R} function f, i.e. one that takes an <:AbstractVector and returns a <:Real,

  2. Then

    ForwardGradientWrapper(f, n)
    

    can be used to generate a callable that returns values and derivatives using a DiffResult object. The dimension can also be queried by length.

I think it would be great to share a common interface (and implementation, using forward- and later reverse-mode AD) for this pattern that is independent of packages.
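Without DiffWrappers, the value-plus-gradient-in-one-pass pattern can be sketched directly with ForwardDiff and DiffResults (assuming those packages are available; ForwardGradientWrapper presumably builds on something similar):

```julia
using ForwardDiff, DiffResults

f(x) = sum(abs2, x .- 1)     # an R^n -> R example

x = [2.0, 3.0]
result = DiffResults.GradientResult(x)   # holds both value and gradient
ForwardDiff.gradient!(result, f, x)      # one pass fills both

DiffResults.value(result)     # 5.0
DiffResults.gradient(result)  # [2.0, 4.0]
```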


#20

that’s great. thanks for pointing this out.