I am having a strange problem with ForwardDiff that is internally used in Optim. The minimum returned by Optim is not the same value as the value of the objective function when the minimizer returned by Optim is used, see the following MWE.

using Julia
using DifferentialEquations
function f(du,u,p,t)
du[1] = dx = p[1]*u[1] - u[1]*u[2]
du[2] = dy = -3*u[2] + u[1]*u[2]
end
u0 = [1.0;1.0]
tspan = (0.0,10.0)
p = [1.5]
prob = ODEProblem(f,u0,tspan,p)
tstops = range(0.,stop=10.,length=10)
sol = solve(prob,Tsit5(),saveat=tstops)
randomized = VectorOfArray([(sol(t[i]) + .01randn(2)) for i in 1:length(tstops)])
data = convert(Array,randomized)
function least_squares(x)
_prob = remake(prob, u0=convert.(eltype(x),prob.u0),p=x)
sol = solve(_prob,Tsit5(),saveat=tstops)
sum((hcat(sol.u...) .- data).^2)
end
result = optimize(least_squares, [5.], Newton(),autodiff=:forward)
result.minimum # returns 275.97
least_squares(result.minimizer) # returns 276.68

A further example taken from the docs evaluating the gradient via ForwardDiff and Zygote.
They return different results. The difference is not necessarily negligible. I was expecting the same result as they both use AD. Is this related to dual numbers as well, and which gradient is more to trust?

using DifferentialEquations, Flux, Optim, DiffEqFlux, DiffEqSensitivity, Plots
function lotka_volterra!(du, u, p, t)
x, y = u
α, β, δ, γ = p
du[1] = dx = α*x - β*x*y
du[2] = dy = -δ*y + γ*x*y
end
u0 = [1.0, 1.0]
tspan = (0.0, 10.0)
tsteps = 0.0:0.1:10.0
p = [1.5, 1.0, 3.0, 1.0]
prob = ODEProblem(lotka_volterra!, u0, tspan, p)
function loss(p)
sol = solve(prob, Tsit5(), p=p, saveat = tsteps)
loss = sum(abs2, sol.-1)
return loss, sol
end
using Zygote, ForwardDiff
Zygote.gradient(x->loss(x)[1],p) # returns [60.889672029001524, -788.8395410840595, 878.197470730419, -2937.299618283181]
ForwardDiff.gradient(x->loss(x)[1],p) # returns [50.90171758717276, -785.0077963154041, 879.7576203004646, -2947.695409344049]