How to get the results and gradients when using ForwardDiff.jl

I did some research and found that to get both the value of a function and its gradient I can use the API from DiffResults. But what if the function returns a tuple (instead of a scalar), where the first element is the loss, and I also want the second and third results? How should I do it?

In Zygote.jl, I could do something like:

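using Lux, Zygote, Statistics  # pullback comes from Zygote, mean from Statistics

# loss function for a Lux model; also returns the prediction and the updated state
function core(S, dcf, ps, model, st)
    value_if_wait, st = model(S, ps, st)
    mse = mean(abs2, value_if_wait .- dcf)
    return mse, value_if_wait, st
end
(loss, value_if_wait, st), back = pullback(p -> core(S_i, dcf, p, model, st), ps)
# seed only the loss; the other two outputs do not contribute to the gradient
gs = back((one(loss), nothing, nothing))[1]
# and then do the update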

Not being familiar with the pkg… I’d see what it’s currently returning by looking in the source code, or return it under one variable and print it. If it is returning them, you should be able to match the return to the variable assignment. If not, you will need to add additional returns.

Hope that helps.

No worries. I just saw the Slack discussion here and wanted to give it a try. My thinking is that Lux seems easier to use with other AD libraries: I only need to flatten the parameters to call the gradient function and reconstruct them when I want to call the model. For Flux, the params are stored in the struct, so a customized walk is needed.

Assume you have a mutating function func!(y, u) that writes its result into y.
Prepare a result buffer (once):

diffresult=DiffResults.JacobianResult(u0)
y=zero(u0)
cfg = ForwardDiff.JacobianConfig(func!,y, u0) 

Then you should be able to call (many times)

ForwardDiff.jacobian!(diffresult, func!,y,u,cfg)

and access DiffResults.value(diffresult) and DiffResults.jacobian(diffresult) without allocations.
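As a self-contained sketch of the recipe above: `square!` is a hypothetical in-place function invented for illustration, and the buffer is built with the two-argument `DiffResults.JacobianResult(y, u0)` so that it would also fit a function whose output length differs from its input length.

```julia
using ForwardDiff, DiffResults

# Hypothetical in-place function: writes u[i]^2 into y[i].
function square!(y, u)
    y .= u .^ 2
    return nothing
end

# One-time setup: output buffer, result buffer and config.
u0 = [1.0, 2.0, 3.0]
y = zero(u0)
diffresult = DiffResults.JacobianResult(y, u0)
cfg = ForwardDiff.JacobianConfig(square!, y, u0)

# Repeated calls reuse the same buffers.
u = [2.0, 3.0, 4.0]
diffresult = ForwardDiff.jacobian!(diffresult, square!, y, u, cfg)

val = DiffResults.value(diffresult)     # function value at u
jac = DiffResults.jacobian(diffresult)  # Jacobian at u
```

Reassigning `diffresult` from the return value is a precaution: it does no harm for array-backed results and is needed if the result type is immutable.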

Not sure if there is another way for your case, but I think you can wrap this around your core() function.
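One possible way to do that wrapping, as a rough sketch rather than an established pattern: differentiate a closure that returns only the loss, and pass the extra outputs out through a `Ref`. Here `toy_core`, `loss_only` and `aux` are hypothetical names standing in for `core()` and its auxiliary results, and the data is made up.

```julia
using ForwardDiff, DiffResults

# Hypothetical stand-in for core(): returns (loss, auxiliary output).
function toy_core(p, x, target)
    pred = p[1] .* x .+ p[2]
    loss = sum(abs2, pred .- target) / length(target)
    return loss, pred
end

x = [1.0, 2.0, 3.0]
target = [2.0, 4.0, 6.0]
p0 = [1.0, 0.0]

# Side channel for the non-scalar output.
aux = Ref{Any}(nothing)

function loss_only(p)
    loss, pred = toy_core(p, x, target)
    # Inside ForwardDiff the prediction holds Dual numbers; keep plain values only.
    aux[] = ForwardDiff.value.(pred)
    return loss
end

result = DiffResults.GradientResult(p0)
result = ForwardDiff.gradient!(result, loss_only, p0)

loss = DiffResults.value(result)     # scalar loss
grad = DiffResults.gradient(result)  # gradient with respect to p0
pred = aux[]                         # auxiliary output from the same evaluation
```

ForwardDiff may call the closure several times when the input is split into chunks; each call stores the same primal values in `aux`, so the last write is the one you read.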

It is easy to use a Flux model with a flat vector of parameters, the Optimisers docs have an example using ForwardDiff, and the Flux docs have one with a Hessian.

@mcabbott Because I was looking for something with ForwardDiff I tested the example from Optimisers and compared the output with Zygote, unfortunately they are not the same :cry: , take a look at the loss output. Am I doing something wrong?

Here are both versions.

  • With ForwardDiff and Optimisers
using ForwardDiff  # an example of a package which only likes one array
using Flux
using Random
using Optimisers
Random.seed!(123)

model = Chain(  # much smaller model example, as ForwardDiff is a slow algorithm here
          Conv((3, 3), 3 => 5, pad=1, bias=false), 
          BatchNorm(5, relu), 
          Conv((3, 3), 5 => 3, stride=16),
        )
image = rand(Float32, 224, 224, 3, 1);
@show sum(model(image));

loss(m, x) = sum(m(x))

rule = Optimisers.Adam(0.001f0,  (0.9f0, 0.999f0), 1.1920929f-7)

flat, re = Flux.destructure(model)
st = Optimisers.setup(rule, flat)  # state is just one Leaf now

∇flat = ForwardDiff.gradient(flat) do v
    loss(re(v), image) # re(v), rebuild a new object like model
end

st, flat = Optimisers.update(st, flat, ∇flat)
@show loss(re(flat),image);

with this output:

sum(model(image)) = -0.33076355f0
loss(re(flat), image) = -7.7023053f0

  • And here the one with Zygote.
using Flux
using Random
Random.seed!(123)

model = Chain(  # much smaller model example, as ForwardDiff is a slow algorithm here
          Conv((3, 3), 3 => 5, pad=1, bias=false), 
          BatchNorm(5, relu), 
          Conv((3, 3), 5 => 3, stride=16),
        )
image = rand(Float32, 224, 224, 3, 1);
@show sum(model(image));

loss(m, x) = sum(m(x))

opt = Flux.Adam(0.001f0,  (0.9f0, 0.999f0), 1.1920929f-7)
θ = Flux.params(model)
grads = Flux.gradient(θ) do 
    loss(model, image)
end

Flux.update!(opt, θ, grads)
@show loss(model, image);

with this output:

sum(model(image)) = -0.33076355f0
loss(model, image) = -5.064876f0

That’s no good, can you make an issue?

I think the core is that BatchNorm has a test/train-mode change, which doesn’t happen with ForwardDiff. Commenting out that layer leads to identical results.
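To illustrate how much the mode matters, here is a minimal sketch that switches a standalone `BatchNorm` layer by hand with `Flux.testmode!` and `Flux.trainmode!`. The layer size and input are made up, and this only shows the two behaviours side by side; it is not a fix for the discrepancy above.

```julia
using Flux

bn = BatchNorm(2)
x = Float32[1 2 3; 10 20 30]   # 2 features × 3 samples

# Test mode: normalise with the stored running statistics (initially mean 0, variance 1).
Flux.testmode!(bn)
y_test = bn(x)

# Train mode: normalise with the statistics of the current batch.
Flux.trainmode!(bn)
y_train = bn(x)

# In test mode a fresh layer is close to the identity;
# in train mode every feature row is centred around zero.
@show y_test
@show y_train
```

If the layer runs in different modes under the two AD packages, the gradients and therefore the updated losses will differ even with identical parameters.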

I see. OK, here is the report: Output from ForwardDiff and Optimisers is different from the one given by Zygote · Issue #117 · FluxML/Optimisers.jl · GitHub

This seems to be specific to this particular loss function and/or model.

Consider this example, on the other hand:

using Flux, ForwardDiff, Random
Random.seed!(123)
mlp = Chain(Dense(20, 16, relu), Dense(16,8,relu), Dense(8,1,σ))
ps, re = Flux.destructure(mlp) 
xs = randn(20,50)
ys = mapslices(x->exp.(sin.(sum(x))), xs, dims=1)

bar(p) = Flux.mse(re(p)(xs), ys)

d1 = ForwardDiff.gradient(bar, ps)
d2 = Flux.gradient(bar, ps)[1]

d1 ≈ d2 # true

Use the @edit macro to get to the source code next time, and if you are walking a struct, I’ve found ComponentArrays to be a good alternative to structs. Hope that helps for the future.