Is there something like ScipyOptimizer in Julia?

Hi all,

Recently, I’m turning to use Julia and Flux.jl to treat some differential equations in computational physics.
I find it convenient and efficient indeed.

However, I wonder if there is any high-order optimization method can be used to treat these kind of problems with good differentiability more efficiently, just like tf.contrib.opt.ScipyOptimizer in Tensorflow. I wonder is there anything similar can be found in Julia ecosystem?

Thank you and best,
Hsiao

I have heard of some examples using BFGS on neural networks in Flux. @MikeInnes or @oxinabox is there code on this? I’d like to try it too :slight_smile:.

Flux provides routines to compute the loss function and its gradient, so given those you could throw them at any of the nonlinear optimization packages in Julia that exploit gradients and assume some smoothness (NLopt.jl, Optim.jl, …).

1 Like

not from me.
I have used BFGS in Optim to train NNs without using Flux.

1 Like

It’s a little hard to do though since Flux relies on using implicit parameters. We could set it up with Optim using DiffEqFlux.structure and DiffEqFlux.restructure internal functions, but it’s not something that will work out of the box because of the parameter interface.

Not sure if I understand the problem here, but if the problem is that Optim or any of the other packages expect the parameters of the problem and the gradient to take a specific form/type (i.e. AbstractArray) and this is incompatible with how Flux stores the model, you might want to check out:

It’s work in progress, and not thoroughly tested yet, but this is supposed to provide optimization algorithms that work with any type of variables, provided you specify via keyword arguments how to update the variables and so forth. The README.md constitutes the full documentation to get you started.

No, this will still have the issue because https://github.com/Jutho/OptimKit.jl/blob/master/src/lbfgs.jl#L13 that has the second argument x. Flux.jl has an interface so that way x is supposed to be known from fg itself: the neural network holds its implicit parameters that it knows how to update by references given via Params. That’s why I suggested restructure and destructure. I can probably write an example up soon.

@vavrines the mixed neural DEs example gives a hint for how to do it: https://github.com/JuliaDiffEq/DiffEqFlux.jl#mixed-neural-des . Essentially the idea is that you need to grab the parameters and manually update the parameters of the neural network. Basically, the secret sauce is that you can use this library function to grab all of the parameters from a neural network like:

ann = Chain(Dense(2,10,tanh), Dense(10,1))
p1 = Flux.data(DiffEqFlux.destructure(ann))

So now here’s a full example. Let’s optimize an ODE which is only partially defined by a neural network. We parameterize the initial condition, the neural network, and the partially defined ODE as follows:

using DiffEqFlux, OrdinaryDiffEq, Flux, Optim

tspan = (0.0f0,25.0f0)
save_t = 0.0:0.1:25.0
ann = Chain(Dense(2,10,tanh), Dense(10,1))
p1 = Flux.data(DiffEqFlux.destructure(ann))
p2 = Float32[-2.0,1.1]
p = [p1;p2;u0]

u0 = Float32[0.8; 0.8]
function dudt_(du,u,p,t)
    x, y = u
    du[1] = DiffEqFlux.restructure(ann,p[1:41])(u)[1]
    du[2] = p[end-1]*y + p[end]*x
end

Notice that we will need p for later since p is the vector of all parameters involved. Now we can define a stub ODEProblem which we remake with parameters as part of the loss function:

prob = ODEProblem(dudt_,u0,tspan)

function predict(p)
  _prob = remake(prob,u0=p[end-1:end],p=p[1:43])
  Array(solve(_prob,Tsit5(),saveat=save_t))
end
loss_numerical(p) = sum(abs2,x-1 for x in predict(p))

Each new p gets split up into the u0 part and the p part, and then in the ODE the part that’s left reparameterizes the neural network and the rest is for the part of the ODE that is known. Then I just made up a silly loss function that is the L2 loss against 1. Optim.jl can then optimize this using numerical differencing:

result = optimize(loss_numerical, p, BFGS())

In some sense we are done because that’s all that’s necessary, but we can speed that up by utilizing the adjoint to get the gradients. Using the OnceDifferentiable only_fg! form from NLSolversBase, we can define a function which computes the loss and the gradient simultaneously. The gradient is given by the adjoint sensitivity analysis defined in the DiffEq docs: http://docs.juliadiffeq.org/latest/analysis/sensitivity.html#Adjoint-Sensitivity-Analysis-1

The code is thus as follows:

using DiffEqSensitivity
dg(out,u,p,t,i) = (out.=(1.0.-u)./2)
function fg!(F, G, x)
    _prob = remake(prob,u0=p[end-1:end],p=p[1:43])
    sol = solve(_prob,Tsit5(),saveat=save_t)
    if !(G == nothing)
        res = adjoint_sensitivities(sol,Tsit5(),dg,save_t,sensealg=SensitivityAlg(backsolve=true))
    end
    if !(F == nothing)
        return sum(abs2,x-1 for x in Array(sol))
    end
end
loss_adjoint = OnceDifferentiable(Optim.only_fg!(fg!), p)
result = optimize(loss_adjoint, p, BFGS())

This funny model I choose seems to go unstable when using BFGS but is fine with Flux Tracker. Is it good to use? Who knows: that’s probably a good research topic. Anyways, for reference, the same thing using Flux optimizers directly is:

u0 = param(Float32[0.8; 0.8])
tspan = (0.0f0,25.0f0)

ann = Chain(Dense(2,10,tanh), Dense(10,1))

p1 = Flux.data(DiffEqFlux.destructure(ann))
p2 = Float32[-2.0,1.1]
p3 = param([p1;p2])
ps = Flux.params(p3,u0)

function dudt_(du,u,p,t)
    x, y = u
    du[1] = DiffEqFlux.restructure(ann,p[1:41])(u)[1]
    du[2] = p[end-1]*y + p[end]*x
end
prob = ODEProblem(dudt_,u0,tspan,p3)
diffeq_adjoint(p3,prob,Tsit5(),u0=u0,abstol=1e-8,reltol=1e-6)

function predict_adjoint()
  diffeq_adjoint(p3,prob,Tsit5(),u0=u0,saveat=0.0:0.1:25.0)
end
loss_adjoint() = sum(abs2,x-1 for x in predict_adjoint())
loss_adjoint()

data = Iterators.repeated((), 10)
opt = ADAM(0.1)
cb = function ()
  display(loss_adjoint())
  #display(plot(solve(remake(prob,p=Flux.data(p3),u0=Flux.data(u0)),Tsit5(),saveat=0.1),ylim=(0,6)))
end

# Display the ODE with the current parameter values.
cb()

Flux.train!(loss_adjoint, ps, data, opt, cb = cb)
5 Likes

Thx Chris :slight_smile:
I will try to follow the strategy you mentioned.

Here is a utility package to facilitate using Optims (or most other solvers) (L)BFGS to train flux models

2 Likes

Thank you baggepinnen! I’ll try it.:smiley: