Different results between Zygote, ForwardDiff, and ReverseDiff

vaglino · September 13, 2020, 8:26am

Hi, I’m quite new to Julia and trying to understand automatic differentiation.

I have a loss function for a model, and I’m trying to get the gradient and hessian w.r.t parameters fitted to the model. In doing so, comparing Zygote, ForwardDiff and ReverseDiff, I noticed that the values for the gradient and hessian are very different depending on which method I am using.
Is this to be expected or am I missing something?

function loss(p,data,model,dens,cons,t)
  Σ_sol = run_model(model,p,data,dens,cons,t)
  sum(abs2, (Σ_sol - data)) #, Σ_sol
end

loss_in = (p) -> loss(p,dataset,model,dens',[],t)
loss(p,dataset,model,dens',[],t)
 # rates are fitted parameters
grad_zyg = Zygote.gradient(loss_in,rates)
grad_for = ForwardDiff.gradient(loss_in,rates)
grad_rev = ReverseDiff.gradient(loss_in,rates)
hes_for = ForwardDiff.hessian(loss_in,rates)
hes_rev = ReverseDiff.hessian(loss_in,rates)

Thanks!

Tamas_Papp · September 13, 2020, 12:48pm

Normally, they should agree, so this can be a bug. An MWE replicating this would be helpful.

ChrisRackauckas · September 13, 2020, 1:38pm

To what tolerance are you doing your computation?

vaglino · September 13, 2020, 11:39pm

Hi, here is a mwe. It seems that zygote and ReverseDiff agree both on gradient and hessian, whereas ForwardDiff is different.

using DifferentialEquations, Flux, LinearAlgebra, DiffEqFlux, DiffEqSensitivity
using ForwardDiff
using Zygote
using ReverseDiff

function bimolecular!(du,u,p,t,dens,cons)
    # unpack rates and constants
    nᵣ = u[1]
    k₁,k₋₁  = p
    mᵣ,mₗ,A = dens
    # model
    du[1] = dnᵣ = A*k₁*mᵣ*mₗ - k₋₁*nᵣ

end

function run_model(model,p,data,densities,constants,t) # version of run function with multiple models

  Σ_sol_stack = zeros(1, size(data,2))

  for i in 1:size(data,1)
      # run model with given densities
      densities_i = densities[i,:]
      f = (du,u,p,t) -> model(du,u,10 .^ p,t,densities_i,constants)
      tmp_prob = ODEProblem(f,u₀,tspan,p)
      tmp_sol = solve(tmp_prob,Vern7(),saveat=t, abstol=1e-8,reltol=1e-8)
      # stack Σ of solution across n species
      Σ_sol = sum(Array(tmp_sol),dims=1)
      Σ_sol_stack = vcat(Σ_sol_stack,Σ_sol)
  end
  return Σ_sol_stack[2:end,:]
end


function loss(p,data,model,dens,cons,t)
  Σ_sol = run_model(model,p,data,dens,cons,t)
  sum(abs2, (Σ_sol - data)) #, Σ_sol
end

dataset = [  0.25  0.0618754
  0.25  0.040822
  0.5   0.127833
  0.5   0.198451
  1.0   0.274437
  1.0   0.223144
  2.0   0.579818
  2.0   0.653926
  4.0   0.693147
  4.0   0.776529
  6.0   0.820981
  6.0   0.776529
  8.0   0.653926
  8.0   0.776529
 16.0   0.820981
 16.0   0.733969]

t = dataset[:,1]
n = dataset[:,2]

densities = [25.0, 38.0, 1.0]
tspan = (0,maximum(t)+1)
u₀ = [0.0]
rates = [ -3.367837470456765, -0.2863777340019116]


loss_new = (p) -> loss(p,n',bimolecular!,densities',[],t)
loss_new(rates)

grad_zyg = Zygote.gradient(loss_new,rates)
grad_for = ForwardDiff.gradient(loss_new,rates)
grad_rev = ReverseDiff.gradient(loss_new,rates)
hes_for = ForwardDiff.hessian(loss_new,rates)
hes_rev = ReverseDiff.hessian(loss_new,rates)
hes_zyg = Zygote.hessian(loss_new,rates)

ChrisRackauckas · September 14, 2020, 4:22am

I mentioned last time that there was an issue with uniqueness of times in the way that your model was originally posed, and you made that same issue again. If you fix that it’s fine:

using DifferentialEquations, Flux, LinearAlgebra, DiffEqFlux, DiffEqSensitivity
using ForwardDiff, FiniteDiff
using Zygote
using ReverseDiff

function bimolecular!(du,u,p,t)
    # unpack rates and constants
    nᵣ = u[1]
    k₁,k₋₁,mᵣ,mₗ,A  = p
    # model
    du[1] = dnᵣ = A*k₁*mᵣ*mₗ - k₋₁*nᵣ

end

function run_model(p,data,densities,t) # version of run function with multiple models

  Σ_sol_stack = zeros(1, size(data,2))

  for i in 1:size(data,1)
      # run model with given densities
      p_i = [10 .^ p;densities[i,:]]
      tmp_prob = ODEProblem(bimolecular!,u₀,tspan,p_i)
      tmp_sol = solve(tmp_prob,Vern7(),saveat=t, abstol=1e-14,reltol=1e-14)
      # stack Σ of solution across n species
      Σ_sol = sum(Array(tmp_sol),dims=1)
      Σ_sol_stack = vcat(Σ_sol_stack,Σ_sol)
  end
  return Σ_sol_stack[2:end,:]
end


function loss(p,data,dens,t)
  Σ_sol = run_model(p,data,dens,t)
  sum(abs2, (Σ_sol - data)) #, Σ_sol
end

dataset = [  0.25  0.0618754
  0.25  0.040822
  0.5   0.127833
  0.5   0.198451
  1.0   0.274437
  1.0   0.223144
  2.0   0.579818
  2.0   0.653926
  4.0   0.693147
  4.0   0.776529
  6.0   0.820981
  6.0   0.776529
  8.0   0.653926
  8.0   0.776529
 16.0   0.820981
 16.0   0.733969]

t = unique(dataset[:,1])
n = zeros(size(dataset,1)÷2)
for i in 1:length(n)
  n[i÷2 + 1] += dataset[i,2]
end

densities = [25.0, 38.0, 1.0]
tspan = (0,maximum(t)+1)
u₀ = [0.0]
rates = [ -3.367837470456765, -0.2863777340019116]


loss_new = (p) -> loss(p,n',densities',t)
loss_new(rates)

# Works
grad_zyg = Zygote.gradient(loss_new,rates)
grad_for = ForwardDiff.gradient(loss_new,rates)
grad_rev = ReverseDiff.gradient(loss_new,rates)
grad_rev = FiniteDiff.finite_difference_gradient(loss_new,rates)

# Also works
hes_for = ForwardDiff.hessian(loss_new,rates)
hes_zyg = Zygote.hessian(loss_new,rates)

I will make an issue to handle the first formulation better, but there’s essentially no reason to do it: it’s always going to be less tested (I don’t think I’ve ever seen non-unique saveat in the thousands of codes I’ve seen in the last 5 years), and it’s going to be less efficient (since it’s hitting callbacks and saving multiple times in a way that’s unnecessary. So I’ll try to make that safer but even then… don’t do that haha.

vaglino · September 14, 2020, 5:53am

So would I need to save at each time point only once, and then in my loss function I would compare each saveat to multiple observables?

ChrisRackauckas · September 14, 2020, 5:54am

That would be best. I opened https://github.com/SciML/DiffEqSensitivity.jl/issues/335 to make this safer but would recommend just dropping via unique before saveat.

vaglino · October 11, 2020, 10:16pm

So, I have finally implemented the loss function with unique time points, but now the zygote hessian does not work anymore.

I have a function that takes in the solution, the experimental time points (with repeats), and the unique time points, and finds the differences at each experimental time point:

function find_diffs(tmp_sol,t,unique_t,data) # solutions use unique time points,
# need to evaluate the solution at each experimental time point as well as
# calculating the difference between fit and experiment

    # Σ_sol = sum(Array(tmp_sol),dims=1)
    Σ_sol = sum(tmp_sol,dims=1)
    Δy = Zygote.Buffer(t,length(t))
    for (i, tᵢ) in enumerate(t)
        # ind = unique_t .== tᵢ
        ind = isequal.(unique_t,tᵢ)
        yᵢ_exp = Σ_sol[ind][1]
        yᵢ_obs = data[i]
        Δyᵢ = yᵢ_obs .- yᵢ_exp
        Δy[i] = Δyᵢ
    end
    return copy(Δy)
end

TypeError: in typeassert, expected Float64, got a value of type ForwardDiff.Dual{Nothing,ForwardDiff.Dual{Nothing,Float64,2},2}

The issue appears to be in the statement Δy[i] = Δyᵢ
as the Δy is a Float, whereas the Δyᵢ is a ForwardDiff.Dual type, since Σ_sol is a ForwardDiff.Dual type as well.

ChrisRackauckas · October 11, 2020, 10:59pm

@dhairyagandhi96 is Zygote.Buffer not using similar to match the eltype?

dhairyagandhi96 · October 12, 2020, 6:04am

It indeed uses similar but does not enforce the eltype by default. It follows the same API as similar, so you could pass in T as Buffer(xs, T, sz) if needed.

ChrisRackauckas · October 12, 2020, 6:10am

But the API of similar is to use the eltype of the array by default?

dhairyagandhi96 · October 12, 2020, 6:33am

Let me clarify, we use similar and splat the arguments after the array. So yes, it will check eltype via similar, but does not default to Buffer(xs, T). You should see the correct eltype, as you would with similar.

Topic		Replies	Views
Zero gradients with Zygote vs correct gradients with ReverseDiff using DiffEqFlux Machine Learning zygote , reversediff , diffeqflux	4	1403	January 24, 2022
Zygote dozens* of times slower than manually written function Performance zygote , forwarddiff	17	1777	April 21, 2022
Compute gradient of gradient norm using zygote New to Julia zygote	17	2028	August 26, 2022
Ambiguous method bug in ReverseDiff General Usage	6	194	May 9, 2023
Wrong gradient from Zygote? Numerics question , zygote	13	727	November 14, 2023

Different results between Zygote, ForwardDiff, and ReverseDiff

Related topics