Parameter tuning in neural ODE

Hello,

I’m trying to reproduce chemical reaction equations using neural ODE system. I calculated time-series of mass fraction of chemical species and temperature using other ODE solver, and exported the solution data to CSV file. Then, the solution data in CSV file was imported to the neural ODE program as training data. The neural ODE program I used is described as follow. The program is based on the code in the website (https://julialang.org/blog/2019/01/fluxdiffeq).

using Flux, DiffEqFlux, DifferentialEquations, Plots, CSV

# Read CSV file
flamedata = CSV.read("./results.csv")
flamedata2 = flamedata[2:2:6002,:]

u0 = Float32[0.; 0.11189834407236525]

datasize = 3001
tspan = (0.0f0,0.0003f0)
#tspan = (0.0f0,1.0f0)

t = range(tspan[1],tspan[2],length=datasize)

ode_data = Matrix(flamedata2[[:1,:4]])
ode_data = transpose(ode_data)
ode_data = convert(Array{Float32}, ode_data)
ode_data[1,:] = tanh.(ode_data[1,:]*100)

dudt = Chain(
			 #x -> -tanh.(x*10),
			 Dense(2,100,relu),
			 Dense(100,100,relu),
			 Dense(100,100,relu),
			 #Dense(30,30,relu),
			 #Dense(100,200,relu),
             Dense(100,2)
			 )

ps = Flux.params(dudt)
n_ode = x->neural_ode(dudt,x,tspan,BS3(),saveat=t,dtmin=1.0E-14,maxiters=1e10,reltol=1e-7,abstol=1e-9)

function predict_n_ode()
  n_ode(u0)
end
loss_n_ode() = sum(abs2,ode_data .- predict_n_ode())

data = Iterators.repeated((), 100000)
opt = ADAM(0.001, (0.9, 0.999))
cb = function () #callback function to observe training
  display(loss_n_ode())
  # plot current prediction against data
  cur_pred = Flux.data(predict_n_ode())
  pl1 = plot(t,ode_data[1,:],label="data1",lw=2)
  plot!(pl1,t,cur_pred[1,:],label="prediction1",lw=2)
  plot!(pl1,t,ode_data[2,:],label="data2",lw=2)
  plot!(pl1,t,cur_pred[2,:],label="prediction2",lw=2)
  gui(plot(pl1))
end

# Display the ODE with the initial parameter values.
cb()

Flux.train!(loss_n_ode, ps, data, opt, cb = cb)

The attached figure shows comparison between original ODE solution and neural ODE solution. “data2” is mass fraction of hydrogen and “data1” is just reference data to make solution set at each time steps unique combination. The line of “prediction2” was getting closer to “data2” line in the simulation. However, the solution from the neural ODE system could not reproduce sharp gradient change at 1.6E-4 seconds.

Could you give me advice for how to tune parameters in the neural ODE program to get more closer solution?

Hi!

I’m not an ML / NN expert, so your mileage may vary with my advice (and I am speaking under correction), but since no one has replied yet, I thought I’d make a few suggestions that many would consider “low-hanging fruit”.

  1. In your original ODE that you solved, does it have any time-dependent function (i.e, a function that depends on time, such as an event?) If that’s the case, I’m in the same boat as you – the topic is here. However, if that’s not the case, there’s still a few things you can try:
  2. Try a lower learning rate for ADAM once you get to the point you’re at now. I’ve had quite a bit of success by doing the initial learning pass with the default learning rate, and after that initial convergence I can often eek out a substantial amount of improvement by dropping the learning rate by a factor or 10 or so. This will take a bit of experimentation.
  3. Use a different activation function. I’ve found that swish and \sigma work better than relu and tanh in my personal ODEs that I’ve tried to solve.
  4. Try a different model architecture. From personal experience, bigger isn’t always better. Your model is pretty large for an ODE (at least from what I’ve seen). How complicated is the original system model? That should help inform how expressive your model needs to be. Also consider playing around with different layer sizes. 32 --> 16 --> 8 --> 2, etc. This has achieved good results in certain cases for me, where a model with more parameters struggled somewhat to train.

Hope this helps!

BS3 at low tolerance is going to be slow. I’d suggest a different integrator. I may be able to look into this on the weekend.

Dear Michael,

Thank you for giving me some advice for getting better solution.

  1. In your original ODE that you solved, does it have any time-dependent function (i.e, a function that depends on time, such as an event?) If that’s the case, I’m in the same boat as you – the topic is here.

In my case, there is no time-dependent function as shown in your case.

  1. Try a lower learning rate for ADAM once you get to the point you’re at now. I’ve had quite a bit of success by doing the initial learning pass with the default learning rate, and after that initial convergence I can often eek out a substantial amount of improvement by dropping the learning rate by a factor or 10 or so. This will take a bit of experimentation.

I thought the function for dropping learning rate have been included by “ADAM(0.001, (0.9, 0.999))” in my code. I will try to modify the code for varying learning rate according to loss value or iteration numbers.

  1. Use a different activation function. I’ve found that swish and σ work better than relu and tanh in my personal ODEs that I’ve tried to solve.

I will try to use the swish as activation function.

  1. Try a different model architecture. From personal experience, bigger isn’t always better. Your model is pretty large for an ODE (at least from what I’ve seen). How complicated is the original system model? That should help inform how expressive your model needs to be. Also consider playing around with different layer sizes. 32 --> 16 --> 8 --> 2, etc. This has achieved good results in certain cases for me, where a model with more parameters struggled somewhat to train.

I have already reduced model size, since the original ODE solution contains time-series of mass fraction for many chemical species. I used only mass fraction of hydrogen as solution for learning in the neural ODE system. But I will try to reduce number of data in the mass fraction of hydrogen, and run the neural ODE code.
Does the different layer size mean layer structure described as following?

Chain(
			 #x -> -tanh.(x*10),
			 Dense(2,32,swish),
			 Dense(32,16,swish),
			 Dense(16,8,swish),
			 Dense(100,2)
)

Dear Chris,

Thank you for your reply.

I will try other integrator. Should I use CVODE as integrator, since the integrator is usually used in numerical simulation in combustion (chemical reaction) area?

Not necessarily. CVODE(_BDF) is great for stiff systems. CVODE_Adams is quite slow so I wouldn’t recommend it. A lot of neural ODEs that we’ve encountered so far have been non-stiff, so something like Tsit5, Vern7 or Vern9 does well. If it’s stiff, then CVODE_BDF, Rosenbrock23, Rodas5, TRBDF2, or KenCarp4 are good choices to try.

Dear Chris,

The ODE for chemical reaction I would like to solve is stiff. Therefore, I used CVODE_BDF as integrator described as below.

using Sundials

n_ode = x->neural_ode(dudt,x,tspan,CVODE_BDF(),saveat=t,dtmin=1.0E-14,maxiters=1e10,reltol=1e-7,abstol=1e-9)

I got following error message after running the code. Could you tell me correct description for CVODE_BDF integrator?

Error: LoadError: MethodError: no method matching Sundials.Nvector(::Array[Float32,1])

Oh, the CVODE integrator won’t work with Flux since it’s a C++ code compiled to use Float64’s, while Flux.jl relies on Float32 computations. Replace it with one of the other methods for stiff equations in the list, but turn off autodiff, i.e. Rodas5(autodiff=false) or TRBDF2(autodiff=false).

If you’re going to be doing the neural ODE exercise at JuliaCon then we will be explaining that in a lot more detail.

Yes, except the final layer Dense(100,2) will throw a dimension error (since the inputs of one layer must match the outputs of the previous layer).

I was referring to reducing the size of the Neural ODE (by reducing the number of parameters), rather than the original ODE model itself.

Also note that my layer-size suggestions were just suggestions — you may find that those particular sizes do not work well for your problem. But at the very least, since there are fewer parameters, it’ll be a lot faster to train, and it’ll become clearer if you require a more complex model.

1 Like

Yup. @Vaibhavdixit02 saw this yesterday :stuck_out_tongue:

2 Likes

Dear Chris,

I tried to run the neural ODE code with Rodas5(autodiff=false) and TRBDF2(autodiff=false). However, computational time for learning iteration of Rodas5 or TRBDF2 case became much longer than BS3() case. Is the following usage of the integrator correct?

  • Rodas5 case
n_ode = x->neural_ode(dudt,x,tspan,Rodas5(autodiff=false),saveat=t,dtmin=1.0E-14,maxiters=1e10,reltol=1e-7,abstol=1e-9)
  • TRBDF2 case
n_ode = x->neural_ode(dudt,x,tspan,TRBDF2(autodiff=false),saveat=t,dtmin=1.0E-14,maxiters=1e10,reltol=1e-7,abstol=1e-9)

Unfortunately, I will not attend JuliaCon this year.

That’s a telltale sign that the neural ODE is not currently stiff. It may train to become stiff later in the problem, but it almost certainly doesn’t start out stiff. I guess means using something like AutoTsit5(Rodas5(autodiff=false)) is a good thing to do, since it can be hard to know what the state of the current ODE is (especially during the training process).

@YingboMa probably is interested in that.

Dear Michael,

Thank you for your reply.

I made mistype in the layer definition. Sorry.

I’m testing neural ODE simulation with the reduced size layers. The simulation speed become much faster than before. I will modify the layer parameters if I need to reduce error between prediction of neural ODE and solution of original ODE.

I should also add that picking the right layers is crucial as well. A student of ours fit the Lotka-Volterra with a size 15 sine layer IIRC, so if you pick something that captures the dynamics the problem is much easier.

1 Like

Dear Chris,

Thank you for your reply.

The integrator, AutoTsit5(Rodas5(autodiff=false)), worked well, and simulation speed became much faster than before.
I will play around with layer parameters to find out characteristics of the problem in this weekend.

Dear Michael and Chris,

Thank you for giving me the useful advises. I revised the neural ODE code as following, and got much better solution than before. Thank you again!

using Flux, DiffEqFlux, DifferentialEquations, Plots, CSV

# Read CSV file
flamedata = CSV.read("./results.csv")
flamedata2 = flamedata[2:2:6002,:]

u0 = Float32[0.; 0.11189834407236525]

datasize = 3001
tspan = (0.0f0,0.0003f0)

t = range(tspan[1],tspan[2],length=datasize)

ode_data = Matrix(flamedata2[[:1,:4]])
ode_data = transpose(ode_data)
ode_data = convert(Array{Float32}, ode_data)
ode_data[1,:] = tanh.(ode_data[1,:]*100)

dudt = Chain(	 Dense(2,32,swish),
                 Dense(32,16,swish),
                 Dense(16,8,swish),
                 Dense(8,2)
			 )

ps = Flux.params(dudt)
n_ode = x->neural_ode(dudt,x,tspan,AutoTsit5(Rodas5(autodiff=false)),saveat=t,dtmin=1.0E-14,maxiters=1e10,reltol=1e-7,abstol=1e-9)

function predict_n_ode()
  n_ode(u0)
end
loss_n_ode() = sum(abs2,ode_data .- predict_n_ode())

data = Iterators.repeated((), 100000)
opt = ADAM(0.05, (0.9, 0.999))
cb = function () #callback function to observe training
  display(loss_n_ode())
  # plot current prediction against data
  cur_pred = Flux.data(predict_n_ode())
  
  pl1 = plot(t,ode_data[1,:],label="data1",lw=2)
  plot!(pl1,t,cur_pred[1,:],label="prediction1",lw=2)
  plot!(pl1,t,ode_data[2,:],label="data2",lw=2)
  plot!(pl1,t,cur_pred[2,:],label="prediction2",lw=2)
  gui(plot(pl1))
end

# Display the ODE with the initial parameter values.
cb()

Flux.train!(loss_n_ode, ps, data, opt, cb = cb)

I faced simulation instability with error message described as below in the neural ODE case. I will try to run the case with reduced learning rate.

Warning: Instability detected. Aborting
└└ @ DiffEqBase C:\Users\j_fukui\.julia\packages\DiffEqBase\6ewdP\src\integrator_interface.jl:162
ERROR: LoadError: First function call produced NaNs. Exiting.