Simple Regression using DiffEqFlux

I am trying to work out a simple application of NeuralODE for regression problems, based on the following tutorial.

The core idea here (as per my understanding) is that a regression problem can be modelled as

y = \int_{t_0}^{t_1} F(x)\,dt

where the integral is evaluated by NeuralODE(nn, (t0, t1), Tsit5()), with nn and x being the neural network and the independent variable.
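
Spelling out my reading of that formulation (my own restatement, not taken from the tutorial): the input x is used as the initial state, the network defines the dynamics, and the prediction is the state at the end of the integration interval,

h(t_0) = x, \qquad \frac{dh}{dt} = f_{\theta}\bigl(h(t)\bigr), \qquad y \approx h(t_1) = x + \int_{t_0}^{t_1} f_{\theta}\bigl(h(t)\bigr)\,dt.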

Given below is my attempt to fit a simple parabola with a neural ODE based on a 4-layer deep neural network (inspired by the MNIST example).

using DiffEqFlux
using Flux

x = collect(-10:1.0:10)
y_true = 2.0 .* x.^2 .+ 10.0

nn = Chain(
    Dense(1,2,relu),
    Dense(2,2,relu),
    Dense(2,2,relu),
    Dense(2,1)
    )

n_ode = NeuralODE(nn,
    (0.0, 1.0),
    Tsit5(),
    save_everystep = false,
    save_start = false)   # return only the state at the final time t = 1

dataset = Flux.Data.DataLoader((x,y_true),batchsize=1)

model = Chain(
    (x) -> x,           # identity layer (no-op)
    n_ode,              # integrate nn from t = 0 to t = 1 with x as the initial state
    (x) -> Array(x)     # convert the ODE solution object to an Array
    )

function loss(x,y)
    return Flux.Losses.mse(model(x),y)
end


opt = Flux.Optimise.ADAMW(0.01)

function cb()
    l = 0.0
    for i = 1:length(x)
        l += (y_true[i] - model([x[i]])[1])^2
    end
    println("Loss: $l")
    # @save "NonLinearModel.bson" model    
    return false
end

Flux.@epochs 1000 Flux.train!(loss,Flux.params(n_ode.p),dataset,opt,cb=cb)

But my program saturates around a loss value of ~75000 and never learns. Any help or comments in this regard are welcome. Thank you


Also, as per the Flux.train! example here, to use the Flux.train! function we need to destructure the neural network and call the ODEProblem function. But would that not call the DifferentialEquations ODEProblem, thus generating a huge back-propagation graph for each iteration of the ODE solve? Where would it utilize the augmented dynamics of the NeuralODE? (I apologize in advance if any of the above is obviously wrong; I still haven't completely figured out NeuralODEs!)
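
For reference, this is roughly the pattern I am referring to (my own untested sketch of that example; the names dudt and predict are mine):

using Flux, DifferentialEquations

nn2 = Chain(Dense(1, 16, tanh), Dense(16, 1))
p, re = Flux.destructure(nn2)      # flatten the parameters; re(p) rebuilds the network
dudt(u, p, t) = re(p)(u)           # the ODE right-hand side is just the network
prob = ODEProblem(dudt, zeros(1), (0.0, 1.0), p)

# prediction = final state of the ODE started from the input value
predict(xi) = Array(solve(prob, Tsit5(), u0 = [xi], p = p,
                          save_everystep = false, save_start = false))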


Julia version = 1.6.1
DiffEqFlux = 1.41
Flux = 0.12.6

This isn’t a very deep or wide network. It’s hard to tell by eye whether something is expressive enough, but I would gather that this might not be able to represent your function well. Try something a bit larger in the layer sizes and just see? And decrease the tolerances a bit to improve gradient accuracy. I haven’t played with it, but the neural network is so tiny and relu has that zero-saturating-gradient behavior that I wouldn’t be surprised if it hit a bad local minimum. Maybe try a softplus activation as well.
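
Something like this is what I mean (untested sketch; the sizes are arbitrary):

nn = Chain(Dense(1, 16, softplus),
           Dense(16, 16, softplus),
           Dense(16, 1))
n_ode = NeuralODE(nn, (0.0, 1.0), Tsit5(),
                  save_everystep = false, save_start = false,
                  reltol = 1e-6, abstol = 1e-6)   # tighter tolerances -> more accurate gradients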

Thanks for your input. I do get your point now. I tried even a simple Flux neural network (without the NeuralODE), but even that does not converge. My aim now is to first get a simple neural network to converge on the problem and optimize the network architecture, and then try a similar network with a NeuralODE. I will update with results afterwards.

Hi, sorry for the late continuation of this conversation; the last two weeks were a bit busy.

  1. First, I would like to know your opinion on neural network architectures for NeuralODEs. Is there any rule of thumb on network parameters, activation functions, etc. that usually perform better?

  2. I tried your suggestion of optimizing the neural network first and trying softplus etc. Once I got satisfactory backprop performance, I tried exactly the same network for the neural ODE. The results are now significantly better than my previous attempts, but still way off compared to the plain backprop NN (code below).

using Flux, Plots, DiffEqFlux, DifferentialEquations

x = collect(-2:0.1:2)
y = x.^2 .-2

x_t = Flux.unstack(reshape(x,length(x),1),1)
y_t = Flux.unstack(reshape(y,length(y),1),1)

###############################
# Backprop NN
###############################

model = Chain(Dense(1,10,softplus),
			Dense(10,10,softplus),
			Dense(10,10,softplus),
			Dense(10, 1))
function loss(x,y)
	# mean absolute error; x and y are vectors of 1-element vectors
	l = sum(map(d -> abs.(d), model.(x) .- y))/length(x)
	return l[1]
end

cb = ()->Flux.@show(loss(x_t,y_t))
opt = Flux.ADAM(0.001)

Flux.@epochs 500 Flux.train!(loss, Flux.params(model), [(x_t, y_t)], opt, cb = cb)

########################################
# Neural ODE layer
########################################

nn = Chain(Dense(1,10,softplus),
		   Dense(10,10,softplus),
		   Dense(10,10,softplus),
		   Dense(10, 1))
nn_ode = NeuralODE(nn, (0.0, 10.0),
                   Tsit5(),
                   save_everystep = false,
                   reltol = 1e-6, abstol = 1e-6,
                   save_start = false)
out_node(x) = Array(x)
model_ode = Chain(nn_ode,out_node)

function loss_ode(x,y)
	# same mean absolute error, but through the neural ODE model
	l = sum(map(d -> abs.(d), model_ode.(x) .- y))/length(x)
	return l[1]
end
cb_ode = ()->Flux.@show(loss_ode(x_t,y_t))
opt_ode = Flux.ADAM(0.001)
Flux.@epochs 500 Flux.train!(loss_ode,
                             Flux.params(nn_ode.p),
                             [(x_t, y_t)],
                             opt_ode,
                             cb = cb_ode)

I have played around with different loss functions, more layers, and different learning rates, but the end results remain the same: the loss for the NeuralODE saturates early and learning is rather poor. Below are the final results.

(Plot: Comparison_nn_nnode, comparing the backprop NN fit and the neural ODE fit)

Thank you

Well yes, but that’s not surprising, right? There is actually no autonomous ODE u' = f(u) that will fit that curve, since that would mean f(0) would have to be positive, negative, and non-zero all at once (just look at the plot). Machine learning still has to follow the rules of math, so there is no 1-dimensional neural ODE that will give this curve. You can use something like an augmented neural ODE, or make the neural network dependent on t, but you really cannot expect this to work.
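
A rough sketch of the augmented idea (untested; I believe DiffEqFlux also ships an AugmentedNDELayer that does this padding for you): lift the scalar input into two dimensions by appending a zero, integrate the 2-dimensional neural ODE, and read the prediction off the first component.

nn_aug = Chain(Dense(2, 16, softplus),
               Dense(16, 16, softplus),
               Dense(16, 2))            # dynamics act on the augmented 2-D state
node_aug = NeuralODE(nn_aug, (0.0, 1.0), Tsit5(),
                     save_everystep = false, save_start = false)
model_aug = Chain(x -> vcat(x, zero(x)),   # [x] -> [x, 0]
                  node_aug,
                  sol -> Array(sol),
                  u -> u[1:1, :])          # keep only the first (original) coordinate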

I am not sure I totally understood your point, but I think I get the gist of it.
I was hoping for something similar to Chapter 3: Neural Ordinary Differential Equations (where the NeuralODE acts like a continuous-depth ResNet), especially the vector-field map: mapping the inputs to the outputs under integration from t=0 to t=1. For a parabolic function like the one I mentioned above, that would be impossible, as it would result in a multi-valued map at the same input.

As a corollary, can I state that my program will fail for all functions that are not monotonically increasing?
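
Writing out my own reasoning for why I suspect this (my attempt, please correct me if it is off):

\text{For } u' = f(u),\; u(0) = x, \text{ uniqueness of solutions means trajectories cannot cross, so}
x_1 < x_2 \;\Longrightarrow\; u(1; x_1) < u(1; x_2),
\text{i.e. the map } x \mapsto u(1; x) \text{ is strictly increasing, and a non-monotone target like } y = x^2 - 2 \text{ is unreachable.}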

Is this limitation the same as the one mentioned here (Neural Ordinary Differential Equations and Dynamics Models | by Machine Learning @ Berkeley | Medium), where a ResNet works because of its finite step size whereas a NeuralODE cannot, because for the continuous flow it is impossible?

I read the paper on Augmented Neural ODEs (https://arxiv.org/pdf/1904.01681.pdf). I got what you meant. Thank you so much for your help!