Long training time using sciml_train on universal DAE

I’m hoping to gather some advice on how to speed up the following example, which attempts to train a NN to learn part of the dynamics of a DAE system.

The first part below simply solves the true DAE in order to generate data to compare to during training. The DAE is formulated as a mass matrix ODE with 4 differential variables and 6 algebraic variables.

using DifferentialEquations, DiffEqFlux, OrdinaryDiffEq, Flux, Optim, Plots,
      NLsolve, Sundials, OrdinaryDiffEq, Statistics, DiffEqSensitivity

#Time range and density of points
tspan = (0.0f0,3.0f0)
datasize = 100;
ts = Float32.(collect(range(tspan[1],tspan[2], length=datasize)))

#Constants
p = [0.5, 0.5]

#Calculate the true solution
function ODE_true(dx, x, p, t)
    #unpack dynamic and algebraic states
    δ₁,ω₁,δ₂,ω₂,pe₁,pe₂,V₁,θ₁,V₂,θ₂ = x
    #unpack constants
    Eq₁,Xq₁,H₁,D₁,Eq₂,Xq₂,H₂,D₂,V₀,θ₀,Xl₁₂,Xl₁₀,Xl₂₀,Ωb =  [
       #Gen 1 constants
       0.9,    #Eq_1
       0.25,   #Xq_1
       5.0,    #H_1
       2.0,    #D_1
       #Gen 1 constants
       0.9,    #Eq_2 =
       0.25,   #Xq_2 =
       5.0,    #H_2
       2.0,    #D_2 =
       #IB constants:
       1.0,    #V0
       0.0,    #θ0
       #Network constants
       0.1,   #Xl_12
       0.1,   #Xl_10
       0.1,   #Xl_20
       2*pi*60.0 # Ωb
    ]

    pm₁ = p[1]
    pm₂ = p[2]
    #Compute dq parameters for gen 1
    vd₁ = V₁ * sin(δ₁ - θ₁)
    vq₁ = V₁ * cos(δ₁ - θ₁)
    id₁= (1.0 / Xq₁) * (Eq₁ - vq₁)
    iq₁ = (1.0 / Xq₁) * (Eq₁ - vq₁)
    #Compute dq parameters for gen 2
    vd₂ = V₂* sin(δ₂ - θ₂)
    vq₂ = V₂ * cos(δ₂ - θ₂)
    id₂ = (1.0 / Xq₂) * (Eq₂ - vq₂)
    iq₂ = (1.0 / Xq₂) * (Eq₂ - vq₂)
    #Compute ODEs
    dx[1] =  (ω₁ - 1.0)
    dx[2] = (pm₁ - pe₁ - D₁ * (ω₁ - 1.0))
    dx[3] =  (ω₂ - 1.0)
    dx[4]=   (pm₂ - pe₂ - D₂ * (ω₂ - 1.0))
    #generator output power equations (2)
    dx[5] = vd₁ * id₁ + vq₁ * iq₁ - pe₁
    dx[6] = vd₂ * id₂ + vq₂ * iq₂ - pe₂
    #line flow equations (bus 1)
    dx[7] = vd₁ * id₁ + vq₁ * iq₁ - (V₁ * V₀ / Xl₁₀) * sin(θ₁ - θ₀) - (V₁ * V₂ / Xl₁₂) * sin(θ₁ - θ₂)
    dx[8] = vq₁ * id₁ - vd₁ * iq₁ - V₁^2/Xl₁₀ + (V₁ * V₀ / Xl₁₀) * cos(θ₁ - θ₀) - V₁^2/Xl₁₂ + (V₁ * V₂ / Xl₁₂) * cos(θ₁ - θ₂)
    #line flow equations (bus 2)
    dx[9] = vd₂ * id₂ + vq₂ * iq₂ - (V₂ * V₁ / Xl₁₂) * sin(θ₂ - θ₁) -  (V₂ * V₀ / Xl₂₀) * sin(θ₂ - θ₀)
    dx[10] = vq₂ * id₂ - vd₂ * iq₂ - V₂^2/Xl₁₂ + (V₂ * V₁ / Xl₁₂) * cos(θ₂ - θ₁) - V₂^2/Xl₂₀ + (V₂ * V₀ / Xl₂₀) * cos(θ₂ - θ₀)
    nothing
end

#Mass matrix
M = zeros(10,10)
M[1,1] = 1/(2*pi*60.0)   #1/Ωb
M[2,2] =  2*5.0    #2*H1
M[3,3] = 1/(2*pi*60.0)    #1/Ωb
M[4,4] =  2*5.0    #2*H1

trueODEfunc = ODEFunction(ODE_true, mass_matrix = M)
p = [0.5,0.5]
# Solve for initial condition
f = (dx, x) -> trueODEfunc(dx, x, p, 0.0)
initial_guess = [0.2,1.0,0.2,1.0,0.5,0.5,1.0,0.0,1.0,0.0]
init_sol = nlsolve(f, initial_guess, ftol = 1e-10)
x0 = init_sol.zero

#introduce disturbance to get a dyanmic response
p[1]=0.65

prob = ODEProblem(trueODEfunc,x0,tspan)
ode_sol = solve(prob, p = p, Rodas5(autodiff=false),saveat=ts, dtmax=0.02)
ode_data = Array(ode_sol)

The section below defines the new DAE which now includes the NN and trains using sciml_train:

n_neurons = 5
#Small NN has 16 parameters and takes ~2.2 seconds for 1 training iteration
ann_gen1 = FastChain(FastDense(1,n_neurons,tanh),FastDense(n_neurons,1))
#ann_gen1 = FastChain(FastDense(1,n_neurons,tanh),FastDense(n_neurons,n_neurons),FastDense(n_neurons,1))
θ = initial_params(ann_gen1)

function UODE(dx,x,p,t)
    #unpack dynamic and algebraic states
    δ₁,ω₁,δ₂,ω₂,pe₁,pe₂,V₁,θ₁,V₂,θ₂ = x
    #unpack constants
    Eq₁,Xq₁,H₁,D₁,Eq₂,Xq₂,H₂,D₂,V₀,θ₀,Xl₁₂,Xl₁₀,Xl₂₀,Ωb =  [
       #Gen 1 constants
       0.9,    #Eq_1
       0.25,   #Xq_1
       5.0,    #H_1
       2.0,    #D_1
       #Gen 1 constants
       0.9,    #Eq_2 =
       0.25,   #Xq_2 =
       5.0,    #H_2
       2.0,    #D_2 =
       #IB constants:
       1.0,    #V0
       0.0,    #θ0
       #Network constants
       0.1,   #Xl_12
       0.1,   #Xl_10
       0.1,   #Xl_20
       2*pi*60.0 # Ωb
    ]
    pm₁ = 0.65
    pm₂ = 0.5

    #Compute dq parameters for gen 1
    vd₁ = V₁ * sin(δ₁ - θ₁)
    vq₁ = V₁ * cos(δ₁ - θ₁)
    id₁= (1.0 / Xq₁) * (Eq₁ - vq₁)
    iq₁ = (1.0 / Xq₁) * (Eq₁ - vq₁)
    #Compute dq parameters for gen 2
    vd₂ = V₂* sin(δ₂ - θ₂)
    vq₂ = V₂ * cos(δ₂ - θ₂)
    id₂ = (1.0 / Xq₂) * (Eq₂ - vq₂)
    iq₂ = (1.0 / Xq₂) * (Eq₂ - vq₂)
    #Compute ODEs
    dx[1] = ann_gen1([t],p)[1]
    dx[2] = (pm₁ - pe₁ - D₁ * (ω₁ - 1.0))
    dx[3] =  (ω₂ - 1.0)
    dx[4]=   (pm₂ - pe₂ - D₂ * (ω₂ - 1.0))
    #generator output power equations (2)
    dx[5] = vd₁ * id₁ + vq₁ * iq₁ - pe₁
    dx[6] = vd₂ * id₂ + vq₂ * iq₂ - pe₂
    #line flow equations (bus 1)
    dx[7] = vd₁ * id₁ + vq₁ * iq₁ - (V₁ * V₀ / Xl₁₀) * sin(θ₁ - θ₀) - (V₁ * V₂ / Xl₁₂) * sin(θ₁ - θ₂)
    dx[8] = vq₁ * id₁ - vd₁ * iq₁ - V₁^2/Xl₁₀ + (V₁ * V₀ / Xl₁₀) * cos(θ₁ - θ₀) - V₁^2/Xl₁₂ + (V₁ * V₂ / Xl₁₂) * cos(θ₁ - θ₂)
    #line flow equations (bus 2)
    dx[9] = vd₂ * id₂ + vq₂ * iq₂ - (V₂ * V₁ / Xl₁₂) * sin(θ₂ - θ₁) -  (V₂ * V₀ / Xl₂₀) * sin(θ₂ - θ₀)
    dx[10] = vq₂ * id₂ - vd₂ * iq₂ - V₂^2/Xl₁₂ + (V₂ * V₁ / Xl₁₂) * cos(θ₂ - θ₁) - V₂^2/Xl₂₀ + (V₂ * V₀ / Xl₂₀) * cos(θ₂ - θ₀)
    nothing
end

UODEfunc = ODEFunction(UODE, mass_matrix = M)
prob_univ = ODEProblem(UODEfunc,x0,tspan,θ)

function predict_adjoint(θ)
   # Array(solve(prob_univ,Rodas5(autodiff=false),p=θ,saveat=ts,sensealg=InterpolatingAdjoint()))
    Array(solve(prob_univ,Rodas5(autodiff=false),p=θ,saveat=ts,sensealg=ForwardDiffSensitivity()))      #Best for under 100 parameters
end

pred = @time predict_adjoint(θ)

function loss_adjoint(θ)
 pred = predict_adjoint(θ)
 loss = sum(abs2, pred- ode_data)
 return loss, pred
end

@time predict_adjoint(θ)
l = @time loss_adjoint(θ)

cb = function (θ,l,pred; doplot=true)
    display(l)
    if doplot
        p = plot(solve(remake(prob_univ,p=θ),Rodas5(autodiff= false ),saveat=ts), ylims= (0,1.5), color=:red, title=("Blue = True Solution, Red = Universal ODE prediction"))
        plot!(p,ode_sol, color = :blue)
        display(p)
    end
    return false
end

@time cb(θ,l,pred)
loss1 = loss_adjoint(θ)


res1 = @time DiffEqFlux.sciml_train(loss_adjoint, θ, BFGS(initial_stepnorm=0.01), cb = cb, maxiters = 2)

When I increase the size of the NN in this example, the training becomes very slow. I’m struggling to tease out which combination of options (solver, NN size, autodiff, sensealg, etc.) makes sense for my particular (stiff) DAE and how these options will affect the training time? Is there any general advice for how to start optimizing these trade offs?

How large of a neural network did you start using? You’re right about at the end of the effectiveness of forward mode AD, so ForwardDiffSensitivity() is probably not a good choice. sensealg = InterpolatingAdjoint(autojacvec=ReverseDiffVJP(true)) is almost definitely the right choice here. This is described in https://diffeqflux.sciml.ai/dev/ControllingAdjoints/ , where any larger side neural network and you’re definitely in that 100 parameter territory.

Now, one of the interesting things here is that although Rodas5 is good forwards here, it is not the right method for the reverse pass because the size of the DAE in reverse is much larger than the forward pass. RadauIIA5 might do better. ROS34PW1a could also be a good idea, though someone would have to tune the Jacobian reuse a bit for it to truly be optimal.

Neural network size should be “the smallest that works”. That keeps the computational cost down while increasing the generalization.

For autodiff=false, that’s just the forward-mode automatic differentiation of the Jacobian for the stiff ODE solver. It can help avoid numerical errors but it doesn’t make a major difference in compute time. Here’s some examples:

So while I do want to fix that quirk, it’s not a major factor.

Hi @mbossart. I am facing similar issues. So I am wondering if you were able to solve it. Can you provide an update here? Thanks.

What’s your example?

@ChrisRackauckas In my example, I am trying to build a UDE for a system consisting of 5 ODEs with complex, time-varying parameters. The code runs but the loss values barely change with number of iterations.

What are the gradients you’re getting? Did you manually normalize them?

@ChrisRackauckas How do I check the gradients? I have normalized the dependent variables in the ODEs

Just use Zygote.gradient.

Thanks. I will do so.

@ChrisRackauckas I checked the gradients of the loss function with respect to the parameters of the neural network before training and after training. After training is when the loss values begin to saturate at a particular value and this usually happens in less than 100 iterations. In the first case, the gradients are in the range of 10^(-7) to 10^(-5), and in the second case the gradients are in the range of 10^(-11) to 10^(-11). I now understand that with these small gradients the parameters are not being updated, and is the reason for the loss values getting saturated. Could you please share any tips on how to avoid this and get the solver to obtain the right solution?

@ChrisRackauckas I observed another thing in my results. The parameters of the neural network when initialized are in the range of (-1, 1). But after 100 iterations of training they become very small i.e. to the order of 10^(-7). And the learning rate is 0.1.

That’s indicative of a local minimum. What does the local minimum look like?

What does the minimum look like?

Very poor. The loss value barely changes from the 1st iteration.

All of them are relatively the same size? Did you try multiple shooting approaches or follow Strategies to Avoid Local Minima · DiffEqFlux.jl ?

I tried the iterative growing of fits approach, and the use of allow_f_increases=true. Neither were helpful. To simplify the problem, I just used two data points. One as the initial condition and the other as the state of the system at the next time step. Even for a simple problem of this size, the solver keeps getting stuck at the local minima. I checked my model numerous times to see if there is a mistake in the equations but I did not find any. The solver is able to solve the equations.

Yeah alright. I may need to take a deeper look here. I mention at the end of [2103.15341] Stiff Neural Ordinary Differential Equations that DAEs can sometimes (but not always) exhibit stranger behavior than ODEs, and I have seen an example of this before, but I’m not quite sure what it is… it’s not a simple thing though and more of a longer term research project.

Sorry. I forgot to mention. My problem consists of multiple ODEs but not a DAE.

@saadfaizan Do you have a MWE that you can share? I have been working recently as well to avoid local minima in my training problem. My problem is in fact a DAE. I’m surprised that with just two data points you are still having issues with a local minima. I similarly have not had much luck with the iterative growing of fits approach. Might be worthwhile to have a call and discuss strategies.

In my experience these problems are very sensitive to the initial profile of the network. You could try to precondition it with a wisely guessed profile or a set of random ones adequate for the problem (like distinct scaled polynomials) and start the real training afterwards. Similar to a multistart method but with multiple initial guesses for the NN(t,x) map.