How to troubleshoot ForwardDiff

I’ve been trying to write a loss function for a PDE that I’m solving in order to identify parameters and I would like to use the gradient of my loss function in parameter optimization. Unfortunately, ForwardDiff returns a zero vector regardless of what the parameter values are. Numerical gradient determination is very slow, so I would prefer not to use that if I can avoid it.

Is there a way to get ForwardDiff to return where it’s having issues, or do I need to kind of guess and check potential problem areas?


Do you have an if statement that can cause your loss to return something that does not depend on the parameters?

What do you mean by “numerically”? Querying the gradient with finite differences or with ForwardDiff has a similar computational complexity

I have a couple layers of error checking in case of solver errors as well as a parameter rescaling system.

For the parameter rescaling, at function compilation, I generate a list of parameters which I want to fit on a log scale, and use an if statement to either take exp10() of the input parameter or leave it as is before using it to remake my problem. I have tried bypassing this code but still obtained a zero vector.

In terms of the problem error handling, first I use a try/catch around the solution generation, and then after a solution return from the solver, I use the solution retcode to determine whether the solve was successful. In both cases, infinite loss is returned

    sol = solve(my_problem)
    loss = my_loss_function(sol)
catch e
    loss = Inf

and my loss function contains

if sol.retcode != :Success
    loss = Inf
    loss = rest_of_loss_code      

where the actual loss code is basically just a convoluted version of sum of squares since I can only obtain experimental data for one boundary of my PDE system.

Removing the first checkpoint does not change the result.

The second checkpoint is recommended here:

While I am not using DiffEqParamEstim, I could not think of a better way to ensure I do not unintentionally optimize for an improperly generated solution.

However, removing that check revealed what may be the issue. My loss code begins by manipulating the continuous solution into a vector of discrete average values in order to match my experimental data collection strategy. It would seem that this approach, which relies on the code below, is incompatible with forward diff.

for j = 1:maxfracs
    CV = startCV[j]
    push!(FractionData, mean(sol(CV2time(CV):CV2time(CV+fracvol[j]), 10.0, dv = c(t,x))))

In particular, retrieval of the vector of solution values by sol(t, x, dv=dv) may be causing the issue. I will update once I figure out what specifically is causing the error.

Thank you for the tip to check if/else statements!

By “numerically”, I mean using FiniteDiff.finite_difference_gradient(error_startup,default_params) or grad(central_fdm(5,1),error_startup,default_params).

These two return different results, and the fifth order central fdm result which I trust more takes approximately a minute to complete vs 0.3s for my error function. Given that I have 14 parameters, this makes sense, but I am hoping that a proper ForwardDiff gradient will be a bit faster.

Given what I now know about the ForwardDiff result, it seems like it was not actually “doing” anything, which may have been why the time to return something was so low. I will need to fix that issue before I figure out how long each gradient evaluation actually takes to run.

Update: It seems like the PDE solve is failing, which causes the problem with solution retrieval.

This is strange, because error_(default_params) returns a reasonable value and when I @show sol.retcode it shows sol.retcode = SciMLBase.ReturnCode.Success

When I include @show sol.retcode in the error function passed through ForwardDiff, it returns sol.retcode = SciMLBase.ReturnCode.DtLessThanMin.

This explains the zero vector error at least.

1 Like

I’m curious about the edit. I was going to take a look and then the code was deleted. Should I not?

I found a mistake in my example code that fixed the error and made the toy model solve properly. I am still having the issue with my real system, but I have not been able to come up with a MWE yet. I’m still working on it though!

So far I’ve ruled out:

  • Vector equations
  • Averages and interpolations in loss function
  • Multiple solves (experiments) per loss function evaluation
  • Interpolations used as boundary conditions
  • Solver choice
  • Use of push!() to construct vectors
  • Loss calculation split into multiple functions
  • Solver tolerance (lowered to very low value with no change)
  • PDAE system using mass matrix
  • Actually stiff set of equations
  • jac = true and sparse = true
  • Data pulled from spreadsheet
  • Chebyshev discretization

I still need to test the following features present in my real system:

  • PDAE system using mass matrix (my current guess)
  • Data pulled from spreadsheet (seems unlikely)
  • Actually stiff set of equations (maybe)
  • Chebyshev discretization (seems unlikely)

I’m not sure what else I should test, but if none of these gives a real example, I may get new ideas from the results I’ve gotten so far.

I do know that it has to do with the PDE solve before any actual loss calculations are performed since I can tell that the solver halts immediately since the only value present in sol.t is 0.0.

Updated with newly tested items.

Got a working example! It seems to have something to do with this equation.

 0.0 ~ 0.1 * u(t,x)*max(0,(u(t,x) - v(t,x)))^r - (u(t,x) - v(t,x)) * ((1.1+sin(t))^r)]

and in particular, this term max(0,(u(t,x) - v(t,x)))^r

using DifferentialEquations, ModelingToolkit, MethodOfLines, DomainSets, ForwardDiff, FiniteDiff, StatsBase
# Parameters, variables, and derivatives

@parameters t, x, p, q, r
@variables u(..), v(..)
#@variables u[1:n_comp](..)
Dt = Differential(t)
Dx = Differential(x)
Dxx = Differential(x)^2
params = Symbolics.scalarize(reduce(vcat,[p => 1.5, q => 1.2, r .=> 1.3]))
# 1D PDE and boundary conditions
eqs  = [Dt(u(t, x)) ~ -q*Dx(u(t,x)) + p * Dxx(u(t, x)),
        0.0 ~ 0.1 * u(t,x)*max(0,(u(t,x) - v(t,x)))^r - (u(t,x) - v(t,x)) * ((1.1+sin(t))^r)]
bcs = [u(0, x) ~ 0.0,
        v(0,x) ~ 0.0,
        u(t, 0) ~ 0.5*exp((-(t^2))/0.01),
        v(t,0) ~ 0.5*exp((-(t^2))/0.01),
        Dx(u(t,0)) ~ 0.0,
        Dx(v(t,0)) ~ 0.0]

# Space and time domains
domains = [t ∈ Interval(0.0, 1.0),
        x ∈ Interval(0.0, 1.0)]

# PDE system

@named pdesys = PDESystem(eqs, bcs, domains, [t,x], [u(t,x), v(t,x)], params)

dx = 0.1
# Method of lines discretization
order = 2
discretization = MOLFiniteDifference([x=>dx], t; approx_order = order)
# Convert the PDE problem into an ODE problem
prob = discretize(pdesys,discretization)
# Solve ODE problem
sol = solve(prob, KenCarp47(), saveat=0.02);
function lossfun(sol)
        output = []
        for i = 1:10
                push!(output,mean(sol((i-1)/10:i/10, 1.0, dv = u(t,x))))
        outputvec = stack(output)'
        diffs = sum(abs.(outputvec.-0))
        return diffs
function errortest(param_vec)
    loss = []
    newprob = remake(prob, p = param_vec)
            sol = solve(newprob, saveat = 0.1, KenCarp47(), reltol=10^-8, abstol=10^-8)
            if sol.retcode != :Success
                    println("anomalous solution generated")
                    loss = Inf
                    diffs = lossfun(sol)
                    loss = diffs
            println("solve failed")
            push!(loss, Inf)
    return loss

default_params = [1.5 1.2 1.3]
#This works

#This does not

#This works

Try running with ForwardDiff in NaNSafe mode.

NaNSafe mode did not seem to fix things here. I got the same warnings as before.

 Warning: dt(2.220446049250313e-16) <= dtmin(2.220446049250313e-16) at t=0.00019963280237591814, and step error estimate = 0.001174877026175027. Aborting. There is either an error in your model specification or the true solution is unstable.
└ @ SciMLBase .julia\packages\SciMLBase\qp2gL\src\integrator_interface.jl:599
┌ Warning: Solution has length 1 in dimension t. Interpolation will not be possible for variable u(t, x). Solution return code is DtLessThanMin.
└ @ MethodOfLines .julia\packages\MethodOfLines\dLkkv\src\interface\solution\solution_utils.jl:17
┌ Warning: Solution has length 1 in dimension t. Interpolation will not be possible for variable v(t, x). Solution return code is DtLessThanMin.
└ @ MethodOfLines .julia\packages\MethodOfLines\dLkkv\src\interface\solution\solution_utils.jl:17
anomalous solution generated
1×3 Matrix{Float64}:
 0.0  0.0  0.0

solve(newprob, KenCarp47(), saveat = 0.1, reltol=10^-8, abstol=10^-8) ?

What about another solver like Rodas5P() or FBDF?

With the toy model, Rodas5P gives the same error as before, but FBDF is able to successfully return a gradient vector.

Unfortunately, my real system has crazy oscillations when using FBDF, so I am unable to take advantage of this solution.

I’ve tried most of the solvers on the list below, and only KenCarp47 (which isn’t on the list) avoids the oscillatory behavior.

Is the mass matrix removed by structural simplification? Check prob.f.mass_matrix

It does not seem to be removed. I’m getting an 18x18 matrix where the first nine diagonal elements are 1 for the toy model, and a larger version of the same thing for my real system.

A singular mass matrix shouldn’t work with KenCarp47 and it should be throwing in that case?

I’m not sure what’s going on then, because it solves the problem without issues, it just doesn’t work with ForwardDiff.

@xtalax could you take a look?

I have (finally) seen this, I will take a look soon

1 Like