How to get the results and gradients when using ForwardDiff.jl

Frankiewaang · November 20, 2022, 6:23am

I did some research and found that if I want both the value of a function and its gradient, I can use the API from DiffResults. But what if the function returns a tuple (instead of a scalar), where the first element is the loss and I also want the second and third results? How should I do that?

In Zygote.jl, I could do something like:

using Zygote, Statistics  # pullback and mean

# loss function for Lux model
function core(S, dcf, ps, model, st)
    value_if_wait, st = model(S, ps, st)
    mse = mean(abs2, value_if_wait .- dcf)
    return mse, value_if_wait, st
end
(loss, value_if_wait, st), back = pullback(p -> core(S_i, dcf, p, model, st), ps)
# seed only the loss; the other two outputs get no gradient
gs = back((one(loss), nothing, nothing))[1]
# and then do the update

colleybrb · November 21, 2022, 2:42pm

Not being familiar with the package… I’d check what it currently returns by looking in the source code, or assign the return value to one variable and print it. If it is returning what you need, you should be able to match the returned values to the variable assignment. If not, you will need to add additional return values.

Hope that helps.

Frankiewaang · November 21, 2022, 4:19pm

No worries. I just saw the Slack discussion here and wanted to give it a try. My thinking is that it seems easier for Lux to work with other AD libraries: I only need to flatten the parameters to call the gradient function and reconstruct them when I want to call the model. For Flux, the params are stored in the struct, so a customized walk is needed.
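To make the flatten-and-reconstruct idea concrete, here is a minimal sketch that uses only ForwardDiff. The names `flatten`, `reconstruct`, `toy_model`, and `toy_loss` are hypothetical helpers written for this illustration, and the NamedTuple `ps` merely stands in for a Lux-style parameter container; none of this is Lux's own API.

```julia
using ForwardDiff

# Hypothetical parameter container, standing in for a Lux-style NamedTuple.
ps = (weight = [1.0 2.0; 3.0 4.0], bias = [0.5, -0.5])

# Flatten every array into one vector (column-major order).
flatten(nt) = reduce(vcat, map(vec, values(nt)))

# Rebuild a NamedTuple with the template's names and shapes from a flat vector.
function reconstruct(v, template)
    offset = 0
    arrays = []
    for a in values(template)
        push!(arrays, reshape(v[offset+1:offset+length(a)], size(a)))
        offset += length(a)
    end
    return NamedTuple{keys(template)}(Tuple(arrays))
end

# Toy model and loss, for illustration only.
toy_model(x, p) = p.weight * x .+ p.bias
toy_loss(v) = sum(abs2, toy_model([1.0, 1.0], reconstruct(v, ps)))

flat = flatten(ps)
grad = ForwardDiff.gradient(toy_loss, flat)  # gradient w.r.t. the flat vector
grad_nt = reconstruct(grad, ps)              # same layout as ps
```

Because `reconstruct` only slices and reshapes, it also works when `v` holds ForwardDiff dual numbers, which is what allows the gradient call to pass through it.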

j-fu · November 21, 2022, 4:23pm

Assume you have a mutating function func!(y, u) which writes the result into y.
Prepare a result buffer (once):

using ForwardDiff, DiffResults

diffresult = DiffResults.JacobianResult(u0)  # holds both value and Jacobian
y = zero(u0)                                 # output buffer for func!
cfg = ForwardDiff.JacobianConfig(func!, y, u0)

Then you should be able to call (many times)

ForwardDiff.jacobian!(diffresult, func!, y, u, cfg)

and access DiffResults.value(diffresult) and DiffResults.jacobian(diffresult) without allocations.
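For concreteness, here is a small runnable sketch of the recipe above. The body of `func!` and the values in `u0` are made up for illustration; only the ForwardDiff and DiffResults calls come from the post.

```julia
using ForwardDiff, DiffResults

# Hypothetical mutating function: y[i] = u[i]^2 + u[1]
function func!(y, u)
    for i in eachindex(u)
        y[i] = u[i]^2 + u[1]
    end
    return nothing
end

u0 = [1.0, 2.0, 3.0]

# Prepared once
diffresult = DiffResults.JacobianResult(u0)
y = zero(u0)
cfg = ForwardDiff.JacobianConfig(func!, y, u0)

# Called as often as needed; one call fills both value and Jacobian
ForwardDiff.jacobian!(diffresult, func!, y, u0, cfg)
val = DiffResults.value(diffresult)     # function value at u0
jac = DiffResults.jacobian(diffresult)  # 3x3 Jacobian at u0
```

Note that `JacobianResult(u0)` assumes the output has the same length as the input, as it does here.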

Not sure if there is another way for your case, but I think you can wrap this around your core() function.
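One possible way to do that wrapping for a scalar loss with extra outputs is sketched below, using `DiffResults.GradientResult` and a closure that stashes the extra output in a `Ref`. The names `toy_core` and `value_grad_aux` are hypothetical, and `toy_core` is only a stand-in for the original `core()`; this is a sketch under those assumptions, not an established pattern from the ForwardDiff docs.

```julia
using ForwardDiff, DiffResults

# Hypothetical stand-in for core(): returns (loss, prediction).
function toy_core(p, x, y)
    pred = p[1] .* x .+ p[2]
    loss = sum(abs2, pred .- y) / length(y)
    return loss, pred
end

function value_grad_aux(p, x, y)
    aux = Ref{Any}(nothing)
    # Differentiate only the loss; stash the extra output on the side.
    f = q -> begin
        loss, pred = toy_core(q, x, y)
        aux[] = ForwardDiff.value.(pred)  # strip the dual parts
        loss
    end
    result = DiffResults.GradientResult(p)
    result = ForwardDiff.gradient!(result, f, p)
    return DiffResults.value(result), DiffResults.gradient(result), aux[]
end

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
p = [1.0, 0.0]
loss, grad, pred = value_grad_aux(p, x, y)
```

For long parameter vectors ForwardDiff evaluates the closure once per chunk, so `aux` is overwritten several times with the same values. That is harmless here, but the closure should not have other side effects.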

mcabbott · November 21, 2022, 4:27pm

It is easy to use a Flux model with a flat vector of parameters, the Optimisers docs have an example using ForwardDiff, and the Flux docs have one with a Hessian.

lazarusA · November 21, 2022, 9:50pm

@mcabbott Because I was looking for something with ForwardDiff, I tested the example from Optimisers and compared the output with Zygote. Unfortunately they are not the same; take a look at the loss output. Am I doing something wrong?

Here are both versions.

With ForwardDiff and Optimisers

using ForwardDiff  # an example of a package which only likes one array
using Flux
using Random
using Optimisers
Random.seed!(123)

model = Chain(  # much smaller model example, as ForwardDiff is a slow algorithm here
          Conv((3, 3), 3 => 5, pad=1, bias=false), 
          BatchNorm(5, relu), 
          Conv((3, 3), 5 => 3, stride=16),
        )
image = rand(Float32, 224, 224, 3, 1);
@show sum(model(image));

loss(m, x) = sum(m(x))

rule = Optimisers.Adam(0.001f0,  (0.9f0, 0.999f0), 1.1920929f-7)

flat, re = Flux.destructure(model)
st = Optimisers.setup(rule, flat)  # state is just one Leaf now

∇flat = ForwardDiff.gradient(flat) do v
    loss(re(v), image) # re(v), rebuild a new object like model
end

st, flat = Optimisers.update(st, flat, ∇flat)
@show loss(re(flat),image);

sum(model(image)) = -0.33076355f0
loss(re(flat), image) = -7.7023053f0

And here is the one with Zygote.

using Flux
using Random
Random.seed!(123)

model = Chain(  # much smaller model example, as ForwardDiff is a slow algorithm here
          Conv((3, 3), 3 => 5, pad=1, bias=false), 
          BatchNorm(5, relu), 
          Conv((3, 3), 5 => 3, stride=16),
        )
image = rand(Float32, 224, 224, 3, 1);
@show sum(model(image));

loss(m, x) = sum(m(x))

opt = Flux.Adam(0.001f0,  (0.9f0, 0.999f0), 1.1920929f-7)
θ = Flux.params(model)
grads = Flux.gradient(θ) do 
    loss(model, image)
end

Flux.update!(opt, θ, grads)
@show loss(model, image);

with this output:

sum(model(image)) = -0.33076355f0
loss(model, image) = -5.064876f0

mcabbott · November 21, 2022, 9:57pm

That’s no good, can you make an issue?

I think the core is that BatchNorm has a test/train-mode change, which doesn’t happen with ForwardDiff. Commenting out that layer leads to identical results.

lazarusA · November 21, 2022, 10:16pm

I see. OK, here is the report: ForwardDiff + destructure is different from Zygote, on a model with BatchNorm · Issue #2122 · FluxML/Flux.jl · GitHub

tchebycheff · November 23, 2022, 1:49am

This seems to be specific to this particular loss function and/or model.

Consider this example, on the other hand:

using Flux, ForwardDiff, Random
Random.seed!(123)
mlp = Chain(Dense(20, 16, relu), Dense(16,8,relu), Dense(8,1,σ))
ps, re = Flux.destructure(mlp) 
xs = randn(20,50)
ys = mapslices(x->exp.(sin.(sum(x))), xs, dims=1)

bar(p) = Flux.mse(re(p)(xs), ys)

d1 = ForwardDiff.gradient(bar, ps)
d2 = Flux.gradient(bar, ps)[1]

d1 ≈ d2 # true

colleybrb · November 25, 2022, 4:21pm

Use the @edit macro to get to the source code next time, and if you are walking a struct, I’ve found ComponentArrays to be a good alternative to structs. Hope that helps in the future.
