Use ForwardDiff instead of Zygote with Flux?

I’m embedding a Flux model into a physical simulation but Zygote doesn’t seem to be able to handle taking the gradient. I got ForwardDiff.gradient working with the code yesterday. Can I use ForwardDiff as the backend instead?

Yes. If you use GalacticOptim.jl, you can build the OptimizationFunction with AutoForwardDiff(), then solve it with Flux optimizers and it'll use ForwardDiff under the hood.

1 Like

Any examples of this?

Replacing f(x) with Zygote.forwarddiff(f, x) will also let you use ForwardDiff for one function call, while everything outside that uses Zygote. I’m not sure from your description whether this would solve your problem.
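For example, a minimal sketch of that wrapping (here `physics_step` is a made-up stand-in for a mutating simulation step, not anything from your code):

```julia
using Zygote

# A function that mutates an array internally — differentiating it with
# plain Zygote would throw "Mutating arrays is not supported".
function physics_step(x)
    buf = similar(x)
    for i in eachindex(x)
        buf[i] = x[i]^2   # in-place write
    end
    return sum(buf)
end

# Wrapping just this call in Zygote.forwarddiff makes ForwardDiff handle it,
# while everything outside the wrapper still differentiates with Zygote.
g = gradient(x -> Zygote.forwarddiff(physics_step, x), [1.0, 2.0, 3.0])
# g[1] is the gradient of sum(x.^2), i.e. 2x
```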

1 Like

You may already be aware of this, but there is a reason forward-mode AD is not used with NNs.
Forward-mode AD scales with the number of inputs (parameters), whereas reverse-mode AD scales with the number of outputs.

NN training has 1 output – the loss.
And generally hundreds to thousands of parameters – the weights and biases.

Forward mode does have a much lower overhead than reverse,
but the crossover point is usually somewhere between 5 and 200 parameters.
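To see both modes computing the same gradient of a single scalar output (a sketch; the loss and sizes are invented for illustration):

```julia
using ForwardDiff, Zygote

n = 200                        # number of "parameters"
W = randn(5, n)
loss(p) = sum(abs2, W * p)     # one scalar output

p0 = randn(n)

# Forward mode: cost grows with the n inputs (ForwardDiff sweeps them in chunks)
gf = ForwardDiff.gradient(loss, p0)

# Reverse mode: one backward sweep regardless of n, since there is one output
gr = Zygote.gradient(loss, p0)[1]

# Both agree on the answer; the difference is how the cost scales with n
```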

If you have thousands of parameters in your NN then using ForwardDiff is going to be incredibly slow.
You might want to look into another AD, like Enzyme.
(Or perhaps Yota with Avalon rather than Flux; or Tracker with TrackerFlux.jl)
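Enzyme in particular can differentiate mutating code directly, which may be relevant here. A rough sketch (check the Enzyme.jl docs for the current `autodiff` calling convention; `physics_loss` is invented for illustration):

```julia
using Enzyme

# A mutating function of the kind Zygote rejects
function physics_loss(x, buf)
    for i in eachindex(x)
        buf[i] = x[i]^2        # in-place write
    end
    return sum(buf)
end

x    = [1.0, 2.0, 3.0]
dx   = zero(x)                 # gradient w.r.t. x accumulates here
buf  = zeros(3)
dbuf = zero(buf)

# Reverse-mode AD straight through the mutation; Active marks the scalar return
Enzyme.autodiff(Reverse, physics_loss, Active,
                Duplicated(x, dx), Duplicated(buf, dbuf))
# dx now holds d(sum(x.^2))/dx = 2x
```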

3 Likes

Very helpful suggestions!

My research tends to lean more heavily on the physics-understanding side, and only uses ML for small closures. For example, this introductory toe-in-the-water example should only need a few dozen parameters. So forward mode is fine, but it still may not be the best plan moving forward.

Plus Flux has all those nice built-in models and optimizers - and I really need to stop reinventing the wheel on things like that. Unfortunately, mutating arrays is how all the physics updates are done in my simulation, which is why Zygote isn’t happy with it. I saw some work-around suggestions, but they didn’t seem worth it.

I will look into other packages you mentioned. Thanks again.

2 Likes

I hate to press, but I’m having trouble figuring out how to use your Zygote.forwarddiff suggestion with the implicit parameters used in Flux models. For example:

```julia
using Flux, Zygote

model = Dense(2, 2)
loss(x) = sum(abs2, model(x))
data = rand(2, 10)
loss(data)
gs = gradient(() -> loss(data), params(model))
for p in params(model)
    Flux.update!(Descent(), p, gs[p])
end
loss(data)
```

This works (since I’m not mutating arrays in loss(data)) but how can I switch to forwarddiff without losing the Flux model and optimizer?

Did you look at the tutorials?

```julia
using GalacticOptim, Flux

rosenbrock(x, p) = (p[1] - x[1])^2 + p[2] * (x[2] - x[1]^2)^2
x0 = zeros(2)
_p = [1.0, 100.0]

f = OptimizationFunction(rosenbrock, GalacticOptim.AutoForwardDiff())
l1 = rosenbrock(x0, _p)
prob = OptimizationProblem(f, x0, _p)
sol = solve(prob, ADAM(), maxiters = 100)
@show sol.u
@show sol.minimum
```

There you go, ForwardDiff with Flux optimizers. And then you can switch to any other AD, and any other optimizers.

1 Like

I did see that, but the tutorial defines an objective function with explicit parameters, while I am trying to use a Flux model, which keeps its parameters inside the type. I can see why implicit parameters are handy for multi-layer models in Flux, but I can't see how to pull them out to hand over to ForwardDiff.

For example:

```julia
model = Dense(2, 2)
loss(x) = sum(abs2, model(x))
data = rand(2, 10)

f = OptimizationFunction(loss, GalacticOptim.AutoForwardDiff())
prob = OptimizationProblem(f, params(model), data)
sol = solve(prob, ADAM(), maxiters = 100)
```

Throws the error `ERROR: MethodError: no method matching typemax(::Type{Any})`.

```julia
using GalacticOptim, Flux

model = Dense(2, 2)
data = rand(2, 10)

# Flatten the implicit parameters into a vector; `re` rebuilds the model
p, re = Flux.destructure(model)
loss(p, _) = sum(abs2, re(p)(data))

f = OptimizationFunction(loss, GalacticOptim.AutoForwardDiff())
prob = OptimizationProblem(f, p, data)
sol = solve(prob, ADAM(), maxiters = 100)
```

1 Like

Awesome! Flux.destructure was the missing link.
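For completeness, the same destructure trick also works without GalacticOptim — a sketch that calls ForwardDiff.gradient directly on the flat parameter vector and feeds the result to a Flux optimizer (the step size and iteration count are arbitrary choices for illustration):

```julia
using Flux, ForwardDiff

model = Dense(2, 2)
data = rand(2, 10)

# Flatten the implicit parameters into a vector; `re` rebuilds the model
p, re = Flux.destructure(model)
loss(p) = sum(abs2, re(p)(data))

l0 = loss(p)
opt = Descent(0.01)
for _ in 1:100
    g = ForwardDiff.gradient(loss, p)   # forward-mode gradient of the flat vector
    Flux.update!(opt, p, g)             # in-place Flux optimizer step
end
loss(p)   # should be smaller than l0
```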