Fitting neural ODE to periodic time series

Hello,

I am trying to fit a neural ODE adapted from an example in the docs to the Mauna Loa dataset. but am running in to some problems. Fitting the model to too many time steps at once causes it to underfit dramatically

and training it on only the few first observations still does not allow it to fit to the data.

The code used is attached below, where t and u are vectors of timesteps and observed carbon dioxide levels.

  tspan = (minimum(t), maximum(t))
  dudt2 = FastChain(FastDense(1, 32, tanh),
		             FastDense(32, 1))

  node = NeuralODE(dudt2, tspan, Vern7(), saveat = t, 
	                abstol = 1e-6, reltol = 1e-6)

  u0 = u[1:1]
  predict(p) = node(u0, p)
  loss(p) = begin
    û = predict(p)
    l = Flux.mse(û, u)
    return l, û
  end

  result = DiffEqFlux.sciml_train(loss, node.p, ADAM(.05),
			                        maxiters = 200)
  l, û = loss(result.minimizer)
  plt = scatter(t, u, label = "Data")
  scatter!(plt, t, reshape(û, :), label = "Prediction")

I have tried tweaking the capacity of the neural network to no avail. Anyone with suggestions on how to proceed with this? Thank you!

Yes it’s possible to hit some weird local minima. Minibatching helps a lot, though I’ve found the best thing is simply to do a growing timespan like is shown in this tutorial:

https://diffeqflux.sciml.ai/dev/examples/local_minima/

Thank you for your response Chris,

Yes that is indeed the tutorial I have based my code off. I considered three observations to be a pretty small starting amount, but have also tried with only two observations. Unfortunately it still doesn’t fit.

This leads me to believe I have misunderstood something about the ODE-part of things. I did try the Tsit5 solver in addition to Vern7, however, I am not sure how to diagnose what could be improved with the implementation.
I will experiment with mini batching, but it is weird to me that the model does not fit to a simple setting with only two or three observations.

EDIT: I tried to fit a model several hundred observations and the models simply fits to it’s overall mean. So there is most likely something wrong with the implementation. I’ll keep tinkering.

û = predict(p) is a DiffEq solution object, i.e. an array of arrays. Is u an array or an array of arrays? The u0 = u[1:1] suggests to me it might be an array, so the cost function you’ve written might not have been the one you’ve intended. Try using û = Array(predict(p))?

Thank you Chris, you are quite right! If we let predict(p) = reshape(Array(node(u0, p)), :) (u is a flattened array, and so have to be flattened as well) it does give a more reasonable behaviour for many samples

but the model refuses to fit nicely when using the growing time span approach. Here I fit the model to 2, 3, 4, 5 observations, and as seen it fits nicely to the first three observations, but does not really keep up after that.




The model does work for simpler problems though (denser observations, smoother function). Here is a fit also using growing time span:

So I wonder if I am overestimating the power of neural ODEs. Do you think the (relatively) rapid change in the observations pose a problem to this type of model?

They should be fine. Did you try using NewtonTrustRegion() and such?

https://diffeqflux.sciml.ai/dev/examples/second_order_adjoints/

In this case you generally get a stiff ODE, so the same ideas apply as normal, i.e. don’t use an explicit RK method instead use a method for stiff equations like Rodas5, possibly use lower tolerances, use an optimization method that’s robust to ill-conditioned Hessians like a Newton-based method, etc.

Your example with 4 observations should be a fairly routine case, so I’m actually surprised you’d need anything to make it work: are you sure you didn’t just need to run it for more iterations or something? How big of a neural network did you use? The only real difficulty with the method is that there’s more than a few knobs to tweak.

NewtonTrustRegion() gives me the error TypeError: in setfield!, expected Float64, got ForwardDiff.Dual{Nothing,Float64,12}, so I tried Optim.KrylovTrustRegion() which worked, but it didn’t gave a similar fit.

Running Rodas5 crashes with a message starting with TypeError: in typeassert, expected ForwardDiff.Dual{...} where I have omitted an absolute monster of a type signature.

I’ve tried to scale the network to hidden layers with up to 150 units. Currently using

      dudt2 = FastChain(FastDense(1, 32, swish),
			FastDense(32, 64, swish),
			FastDense(64, 32, swish),
			FastDense(32, 1))

which is significantly larger than the one in the docs example.

are you sure you didn’t just need to run it for more iterations or something?

I’m pretty sure the optimization has found a minima. Here is the loss curve and predictions when fitting the model for the fourth time, with four observations.

I am also surprised by the challenge in this. But then again, I am new to these models so I don’t fully know what to expect.

Could you share this example? I’d like to take a deeper look at it. This shouldn’t be very difficult.

Sure. I think the below snippet should do it.

  using DiffEqFlux, OrdinaryDiffEq, Flux, Optim, CUDA, Plots
  CUDA.allowscalar(false)

  function node_model(t)
      dudt2 = FastChain(FastDense(1, 100, swish),
			FastDense(100, 1))

      node = NeuralODE(dudt2, (minimum(t), maximum(t)), 
		       Tsit5(), saveat = t, 
		       abstol = 1e-9, reltol = 1e-9)
  end

  function plot_pred(t, u, û)
      plt = scatter(reshape(t,:), reshape(u,:), 
		    label = "Data")
      scatter!(plt, reshape(t,:), reshape(û,:), 
	       label = "Prediction")
  end

  function train(node, t, u, p = nothing; maxiters, lr)
      u0 = u[1:1]
      predict(p) = Array(node(u0, p))

      loss(p) = begin
	  û = predict(p)
	  l = Flux.mse(û, u)
	  return l, û
      end

      losses = []
      cb(p, l, û) = begin
	  push!(losses, l)
	  false
      end

      p = p == nothing ? node.p : p
      res = DiffEqFlux.sciml_train(loss, p, ADAMW(lr),
				   maxiters = maxiters,
				   cb = cb)

      p_pred = plot_pred(t, u, predict(res.minimizer))
      plot!(p_pred, legend = (.11, .9))
      p_loss = plot(losses, label="Loss")  
      display(plot(p_pred, p_loss, layout=(2, 1)))
      return res.minimizer
  end

 # the ten first observations after normalization
  t = [
      0.0
      0.0047485508630894695
      0.009340678874206868
      0.018675771218000677
      0.02341873555066439
      0.028161699883315395
      0.03750237875753497
      0.042094506768665066
      0.04683747110131607
      0.051580435433979784
  ]

  u = [
      -1.388342887543077
      -1.3271350629309493
      -1.3250365089442477
      -1.3827467435785388
      -1.4152743303724125
      -1.4754328779911903
      -1.4712357700177872
      -1.4243680643147854
      -1.3925399955164801
      -1.3607119267181729
  ]

  # train on the k first observations
  k = 4
  node = node_model(t[1:k])
  p = train(node, t[1:k], u[1:k],
	    maxiters = 2000, lr = .02);

I trained the model on this kaggle dataset real quick and the model fit beautifully (plot below). So it looks like the problem is specific to the implementation or the data. Intuitively the above problems seems easier since it is one dimensional, while the one below is four dimensional. But maybe this intuition is flawed.

Ahh, @SebastianCallh, I got it. In your example, the issue is that there is no ODE that can be the solution. Remember that for ODE solutions, their flow lines can never overlap in phase space. Since you have a 1-dimensional solution, if at a given point the derivative is positive, then it cannot be negative. Remember, this neural ODE is defined as u' = f(u), so not all possible functions will exist. So at u=1.35, the derivative is either positive or negative, and this enforces monotonicity so that time series is not able to be fit.

This suggests the correct strategies though. One thing you can do is use u'=f(u,t), but this has the downside of being time dependent and thus won’t predict well. The right way to do this is to use Augmented Neural ODEs. See the tutorial in the documentation for information on how to use it:

https://diffeqflux.sciml.ai/dev/examples/augmented_neural_ode/

Here I show how I would use augmented neural ODEs on your example:

using DiffEqFlux, OrdinaryDiffEq, Flux, Optim, Plots

function node_model(t)
  dudt2 = FastChain(FastDense(2, 100, swish),
		FastDense(100, 2))

  node = AugmentedNDELayer(NeuralODE(dudt2, (minimum(t), maximum(t)),
	       Tsit5(), saveat = t,
	       abstol = 1e-12, reltol = 1e-12), 1)
end

function plot_pred(t, u, û)
  plt = scatter(reshape(t,:), reshape(u,:),
	    label = "Data")
  scatter!(plt, reshape(t,:), reshape(û,:),
       label = "Prediction")
end

function train(node, t, u, p = nothing; maxiters, optimizer, kwargs...)
  u0 = u[1:1]
  predict(p) = Array(node(u0, p))

  loss(p) = begin
	  û = predict(p)[1,:]
	  l = sum(abs2,vec(û)-vec(u))
	  return l, û
  end

  losses = []
  cb(p, l, û) = begin
	  @show l
	  push!(losses, l)
	  false
  end

  p = p == nothing ? node.p : p
  res = DiffEqFlux.sciml_train(loss, p, optimizer;
			   maxiters = maxiters,
			   cb = cb,
			   kwargs...)

  p_pred = plot_pred(t, u, predict(res.minimizer)[1,:])
  plot!(p_pred, legend = (.11, .9))
  p_loss = plot(losses, label="Loss")
  display(plot(p_pred, p_loss, layout=(2, 1)))
  return res.minimizer
end

# the ten first observations after normalization
t = [
  0.0
  0.0047485508630894695
  0.009340678874206868
  0.018675771218000677
  0.02341873555066439
  0.028161699883315395
  0.03750237875753497
  0.042094506768665066
  0.04683747110131607
  0.051580435433979784
]

u = [
  -1.388342887543077
  -1.3271350629309493
  -1.3250365089442477
  -1.3827467435785388
  -1.4152743303724125
  -1.4754328779911903
  -1.4712357700177872
  -1.4243680643147854
  -1.3925399955164801
  -1.3607119267181729
]

# train on the k first observations
k = 3
node = node_model(t[1:k])
p = train(node, t[1:k], u[1:k],
    maxiters = 10, optimizer = ADAM(.1));

p2 = train(node, t[1:k], u[1:k], p,
    maxiters = 1000, optimizer = BFGS(initial_stepnorm = 0.01));

k = 4
node = node_model(t[1:k])
p3 = train(node, t[1:k], u[1:k], p2,
    maxiters = 1000, optimizer = BFGS(initial_stepnorm = 0.01));

with a resulting loss:

l = 2.6409047003451676e-8
l = 2.2702601565538227e-9
l = 1.3131485517490487e-9
l = 6.922936097395577e-12
l = 1.1102737720276462e-13
l = 4.0216421072999674e-17

that is essentially zero.

Hope this makes a lot more sense now!

Oh but of course! Thank you, that makes perfect sense!

Perhaps you could shine a little light on when ANODES should be preferred over regular NODES in general? I have the impression that they are just straight up better loss-wise, but I guess you also lose some interpretability by introducing new dimensions. Additionally, if we were modeling a physical process I suppose introducing additional dimensions would be silly since we would get a model that didn’t obey the laws of physics.

Exactly. If you need to use a neural ODE as a universal approximator itself, ANODEs are how you guarantee that. However, I don’t see that as very useful. If you’re using it to model physical systems, then you want to limit the possible functions that can be approximated to those that are physically possible, and so you wouldn’t want to do augmentation. ANODEs is more for training MNIST classifiers with neural ODEs and not related to the physical case.

Excellent. Thank you very much for helping sove this riddle!