Fitting neural ODE to periodic time series

SebastianCallh · June 29, 2020, 6:21pm

Hello,

I am trying to fit a neural ODE adapted from an example in the docs to the Mauna Loa dataset. but am running in to some problems. Fitting the model to too many time steps at once causes it to underfit dramatically

and training it on only the few first observations still does not allow it to fit to the data.

The code used is attached below, where t and u are vectors of timesteps and observed carbon dioxide levels.

  tspan = (minimum(t), maximum(t))
  dudt2 = FastChain(FastDense(1, 32, tanh),
		             FastDense(32, 1))

  node = NeuralODE(dudt2, tspan, Vern7(), saveat = t, 
	                abstol = 1e-6, reltol = 1e-6)

  u0 = u[1:1]
  predict(p) = node(u0, p)
  loss(p) = begin
    û = predict(p)
    l = Flux.mse(û, u)
    return l, û
  end

  result = DiffEqFlux.sciml_train(loss, node.p, ADAM(.05),
			                        maxiters = 200)
  l, û = loss(result.minimizer)
  plt = scatter(t, u, label = "Data")
  scatter!(plt, t, reshape(û, :), label = "Prediction")

I have tried tweaking the capacity of the neural network to no avail. Anyone with suggestions on how to proceed with this? Thank you!

ChrisRackauckas · June 29, 2020, 7:38pm

Yes it’s possible to hit some weird local minima. Minibatching helps a lot, though I’ve found the best thing is simply to do a growing timespan like is shown in this tutorial:

https://diffeqflux.sciml.ai/dev/examples/local_minima/

SebastianCallh · June 29, 2020, 9:24pm

Thank you for your response Chris,

Yes that is indeed the tutorial I have based my code off. I considered three observations to be a pretty small starting amount, but have also tried with only two observations. Unfortunately it still doesn’t fit.

This leads me to believe I have misunderstood something about the ODE-part of things. I did try the Tsit5 solver in addition to Vern7, however, I am not sure how to diagnose what could be improved with the implementation.
I will experiment with mini batching, but it is weird to me that the model does not fit to a simple setting with only two or three observations.

EDIT: I tried to fit a model several hundred observations and the models simply fits to it’s overall mean. So there is most likely something wrong with the implementation. I’ll keep tinkering.

ChrisRackauckas · June 29, 2020, 11:23pm

û = predict(p) is a DiffEq solution object, i.e. an array of arrays. Is u an array or an array of arrays? The u0 = u[1:1] suggests to me it might be an array, so the cost function you’ve written might not have been the one you’ve intended. Try using û = Array(predict(p))?

SebastianCallh · June 30, 2020, 7:21am

Thank you Chris, you are quite right! If we let predict(p) = reshape(Array(node(u0, p)), :) (u is a flattened array, and so û have to be flattened as well) it does give a more reasonable behaviour for many samples

but the model refuses to fit nicely when using the growing time span approach. Here I fit the model to 2, 3, 4, 5 observations, and as seen it fits nicely to the first three observations, but does not really keep up after that.

The model does work for simpler problems though (denser observations, smoother function). Here is a fit also using growing time span:

So I wonder if I am overestimating the power of neural ODEs. Do you think the (relatively) rapid change in the observations pose a problem to this type of model?

ChrisRackauckas · June 30, 2020, 11:01am

They should be fine. Did you try using NewtonTrustRegion() and such?

https://diffeqflux.sciml.ai/dev/examples/second_order_adjoints/

In this case you generally get a stiff ODE, so the same ideas apply as normal, i.e. don’t use an explicit RK method instead use a method for stiff equations like Rodas5, possibly use lower tolerances, use an optimization method that’s robust to ill-conditioned Hessians like a Newton-based method, etc.

Your example with 4 observations should be a fairly routine case, so I’m actually surprised you’d need anything to make it work: are you sure you didn’t just need to run it for more iterations or something? How big of a neural network did you use? The only real difficulty with the method is that there’s more than a few knobs to tweak.

SebastianCallh · June 30, 2020, 8:00pm

NewtonTrustRegion() gives me the error TypeError: in setfield!, expected Float64, got ForwardDiff.Dual{Nothing,Float64,12}, so I tried Optim.KrylovTrustRegion() which worked, but it didn’t gave a similar fit.

Running Rodas5 crashes with a message starting with TypeError: in typeassert, expected ForwardDiff.Dual{...} where I have omitted an absolute monster of a type signature.

I’ve tried to scale the network to hidden layers with up to 150 units. Currently using

      dudt2 = FastChain(FastDense(1, 32, swish),
			FastDense(32, 64, swish),
			FastDense(64, 32, swish),
			FastDense(32, 1))

which is significantly larger than the one in the docs example.

are you sure you didn’t just need to run it for more iterations or something?

I’m pretty sure the optimization has found a minima. Here is the loss curve and predictions when fitting the model for the fourth time, with four observations.

I am also surprised by the challenge in this. But then again, I am new to these models so I don’t fully know what to expect.

ChrisRackauckas · July 1, 2020, 5:20pm

Could you share this example? I’d like to take a deeper look at it. This shouldn’t be very difficult.

SebastianCallh · July 1, 2020, 6:39pm

Sure. I think the below snippet should do it.

  using DiffEqFlux, OrdinaryDiffEq, Flux, Optim, CUDA, Plots
  CUDA.allowscalar(false)

  function node_model(t)
      dudt2 = FastChain(FastDense(1, 100, swish),
			FastDense(100, 1))

      node = NeuralODE(dudt2, (minimum(t), maximum(t)), 
		       Tsit5(), saveat = t, 
		       abstol = 1e-9, reltol = 1e-9)
  end

  function plot_pred(t, u, û)
      plt = scatter(reshape(t,:), reshape(u,:), 
		    label = "Data")
      scatter!(plt, reshape(t,:), reshape(û,:), 
	       label = "Prediction")
  end

  function train(node, t, u, p = nothing; maxiters, lr)
      u0 = u[1:1]
      predict(p) = Array(node(u0, p))

      loss(p) = begin
	  û = predict(p)
	  l = Flux.mse(û, u)
	  return l, û
      end

      losses = []
      cb(p, l, û) = begin
	  push!(losses, l)
	  false
      end

      p = p == nothing ? node.p : p
      res = DiffEqFlux.sciml_train(loss, p, ADAMW(lr),
				   maxiters = maxiters,
				   cb = cb)

      p_pred = plot_pred(t, u, predict(res.minimizer))
      plot!(p_pred, legend = (.11, .9))
      p_loss = plot(losses, label="Loss")  
      display(plot(p_pred, p_loss, layout=(2, 1)))
      return res.minimizer
  end

 # the ten first observations after normalization
  t = [
      0.0
      0.0047485508630894695
      0.009340678874206868
      0.018675771218000677
      0.02341873555066439
      0.028161699883315395
      0.03750237875753497
      0.042094506768665066
      0.04683747110131607
      0.051580435433979784
  ]

  u = [
      -1.388342887543077
      -1.3271350629309493
      -1.3250365089442477
      -1.3827467435785388
      -1.4152743303724125
      -1.4754328779911903
      -1.4712357700177872
      -1.4243680643147854
      -1.3925399955164801
      -1.3607119267181729
  ]

  # train on the k first observations
  k = 4
  node = node_model(t[1:k])
  p = train(node, t[1:k], u[1:k],
	    maxiters = 2000, lr = .02);

I trained the model on this kaggle dataset real quick and the model fit beautifully (plot below). So it looks like the problem is specific to the implementation or the data. Intuitively the above problems seems easier since it is one dimensional, while the one below is four dimensional. But maybe this intuition is flawed.

ChrisRackauckas · July 2, 2020, 12:48pm

Ahh, @SebastianCallh, I got it. In your example, the issue is that there is no ODE that can be the solution. Remember that for ODE solutions, their flow lines can never overlap in phase space. Since you have a 1-dimensional solution, if at a given point the derivative is positive, then it cannot be negative. Remember, this neural ODE is defined as u' = f(u), so not all possible functions will exist. So at u=1.35, the derivative is either positive or negative, and this enforces monotonicity so that time series is not able to be fit.

This suggests the correct strategies though. One thing you can do is use u'=f(u,t), but this has the downside of being time dependent and thus won’t predict well. The right way to do this is to use Augmented Neural ODEs. See the tutorial in the documentation for information on how to use it:

https://diffeqflux.sciml.ai/dev/examples/augmented_neural_ode/

Here I show how I would use augmented neural ODEs on your example:

using DiffEqFlux, OrdinaryDiffEq, Flux, Optim, Plots

function node_model(t)
  dudt2 = FastChain(FastDense(2, 100, swish),
		FastDense(100, 2))

  node = AugmentedNDELayer(NeuralODE(dudt2, (minimum(t), maximum(t)),
	       Tsit5(), saveat = t,
	       abstol = 1e-12, reltol = 1e-12), 1)
end

function plot_pred(t, u, û)
  plt = scatter(reshape(t,:), reshape(u,:),
	    label = "Data")
  scatter!(plt, reshape(t,:), reshape(û,:),
       label = "Prediction")
end

function train(node, t, u, p = nothing; maxiters, optimizer, kwargs...)
  u0 = u[1:1]
  predict(p) = Array(node(u0, p))

  loss(p) = begin
	  û = predict(p)[1,:]
	  l = sum(abs2,vec(û)-vec(u))
	  return l, û
  end

  losses = []
  cb(p, l, û) = begin
	  @show l
	  push!(losses, l)
	  false
  end

  p = p == nothing ? node.p : p
  res = DiffEqFlux.sciml_train(loss, p, optimizer;
			   maxiters = maxiters,
			   cb = cb,
			   kwargs...)

  p_pred = plot_pred(t, u, predict(res.minimizer)[1,:])
  plot!(p_pred, legend = (.11, .9))
  p_loss = plot(losses, label="Loss")
  display(plot(p_pred, p_loss, layout=(2, 1)))
  return res.minimizer
end

# the ten first observations after normalization
t = [
  0.0
  0.0047485508630894695
  0.009340678874206868
  0.018675771218000677
  0.02341873555066439
  0.028161699883315395
  0.03750237875753497
  0.042094506768665066
  0.04683747110131607
  0.051580435433979784
]

u = [
  -1.388342887543077
  -1.3271350629309493
  -1.3250365089442477
  -1.3827467435785388
  -1.4152743303724125
  -1.4754328779911903
  -1.4712357700177872
  -1.4243680643147854
  -1.3925399955164801
  -1.3607119267181729
]

# train on the k first observations
k = 3
node = node_model(t[1:k])
p = train(node, t[1:k], u[1:k],
    maxiters = 10, optimizer = ADAM(.1));

p2 = train(node, t[1:k], u[1:k], p,
    maxiters = 1000, optimizer = BFGS(initial_stepnorm = 0.01));

k = 4
node = node_model(t[1:k])
p3 = train(node, t[1:k], u[1:k], p2,
    maxiters = 1000, optimizer = BFGS(initial_stepnorm = 0.01));

with a resulting loss:

l = 2.6409047003451676e-8
l = 2.2702601565538227e-9
l = 1.3131485517490487e-9
l = 6.922936097395577e-12
l = 1.1102737720276462e-13
l = 4.0216421072999674e-17

that is essentially zero.

Hope this makes a lot more sense now!

SebastianCallh · July 3, 2020, 8:55am

Oh but of course! Thank you, that makes perfect sense!

Perhaps you could shine a little light on when ANODES should be preferred over regular NODES in general? I have the impression that they are just straight up better loss-wise, but I guess you also lose some interpretability by introducing new dimensions. Additionally, if we were modeling a physical process I suppose introducing additional dimensions would be silly since we would get a model that didn’t obey the laws of physics.

ChrisRackauckas · July 3, 2020, 9:12am

Exactly. If you need to use a neural ODE as a universal approximator itself, ANODEs are how you guarantee that. However, I don’t see that as very useful. If you’re using it to model physical systems, then you want to limit the possible functions that can be approximated to those that are physically possible, and so you wouldn’t want to do augmentation. ANODEs is more for training MNIST classifiers with neural ODEs and not related to the physical case.

SebastianCallh · July 3, 2020, 10:20am

Excellent. Thank you very much for helping sove this riddle!

00krishna · March 1, 2022, 6:14pm

@ChrisRackauckas this thread is quite interesting, but I wanted to clarify some of your comments–so that I can learn myself :). You were saying that the NODE is autonomous and so the flow lines cannot cross. I get what you mean there, but I was not sure in which image you saw the flow lines crossing? My guess is that you saw this image and it looks like the system is both increasing and decreasing at the same time.

phaseimg

But I think this image is a bit deceptive–or just the plot artifacts obscure that there is a clear trend here with two time scales. Seems like the plot points/markers are a bit too big, and so it is hard to see the trend. Below is a different image of the same mauna loa dataset.

If I am missing something, please correct me. When I look at the second image, it looks like a sawtooth sharp decline in the curve after the peak. The peaks and bottoms are really sharp, so that could perhaps make the flow lines change very very quickly between peaks–and that could make the optimization routine have trouble fitting a straight NeuralODE model versus the ANODE. But then wouldn’t a stiff ode solver work better–as I believe you mentioned in a different post.

ChrisRackauckas · March 1, 2022, 7:44pm

There is no one dimensional ODE that can have the solution of the Mauna Loa plot. If you don’t see why, try to build one.

00krishna · March 1, 2022, 8:35pm

Ahh yes, I get it now. I looked back at chapter 2 of Strogatz, and that makes sense now. No oscillations in 1 dimension. I was mis-remembering the diagrams from that chapter, but I am straight now. Thanks.

Topic		Replies	Views
Neural ODE fitting really slow Machine Learning diffeq , flux , sciml	13	1927	September 11, 2020
Neural ODE with Evolutionary Parameters Machine Learning diffeq , flux , neural-network	7	1309	August 31, 2021
Running neural ODE on measurement Machine Learning flux , machine-learning , diffeqflux	7	598	June 13, 2023
DiffEqFlux with time as additional input to Neural ODE Machine Learning first-steps	7	2177	July 21, 2019
Parameter tuning in neural ODE Machine Learning diffeq , flux	15	1982	July 21, 2019

Fitting neural ODE to periodic time series

Related topics