I find I still don’t fully understand neural ordinary differential equations.
As shown in the literature (Lu et al., 2017; Haber and Ruthotto, 2017; Chen et al., 2018), a sequence of
transformations
h(t+1) = h(t) + f(h(t), theta)
can be turned into an ODE in the limiting case
dh(t)/dt = f(h(t), theta)
This formulation holds only if theta is a function of time.
In that sense, we can say we have an equivalent block of infinite depth.
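To make sure I have the limit right, here is a toy sketch of how I read it (the helper names resnet_step and euler_step are my own, not from any library): one residual block is one explicit Euler step with step size 1, and shrinking the step size while composing more steps gives the continuous formulation.

# Toy sketch (illustrative names, not library code): one ResNet block is one Euler step.
W1 = randn(Float32, 50, 2); b1 = zeros(Float32, 50)
W2 = randn(Float32, 2, 50); b2 = zeros(Float32, 2)
f(h) = W2 * tanh.(W1 * h .+ b1) .+ b2        # the vector field f(h, theta)
resnet_step(h) = h + f(h)                    # h(t+1) = h(t) + f(h(t), theta)
euler_step(h, dt) = h + dt * f(h)            # h(t+dt) = h(t) + dt * f(h(t), theta)
# Taking dt -> 0 while composing more and more euler_steps recovers dh/dt = f(h(t), theta).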
However, in practice we just define a network with a fixed number of parameters.
For example, in the following code,
using DiffEqFlux, OrdinaryDiffEq

# two-layer network used as the right-hand side f(h, theta)
dudt = FastChain(FastDense(2, 50, tanh), FastDense(50, 2))
u0 = Float32[2.0; 0.0]
tspan = (0.0f0, 1.0f0)
nn = NeuralODE(dudt, tspan, Tsit5())   # solve du/dt = dudt(u, p) over tspan with Tsit5
nn.p
is a fixed-size vector with 252 entries (2*50 + 50 weights and biases in the first layer, plus 50*2 + 2 in the second).
Although the Tsit5 solver feeds the network's output back in as the input at each internal step, the same parameter vector nn.p is used for every evaluation.
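To check this, here is a plain OrdinaryDiffEq sketch (the rhs function and the call counter are just my own illustration, not part of DiffEqFlux): the solver hands the same fixed p to every right-hand-side evaluation it makes.

using OrdinaryDiffEq
ncalls = Ref(0)                                        # count right-hand-side evaluations
p = (W1 = randn(Float32, 50, 2), W2 = randn(Float32, 2, 50))
function rhs(u, p, t)
    ncalls[] += 1                                      # Tsit5 passes the same p every time
    return p.W2 * tanh.(p.W1 * u)
end
prob = ODEProblem(rhs, Float32[2.0; 0.0], (0.0f0, 1.0f0), p)
sol = solve(prob, Tsit5())
ncalls[]                                               # many calls, all with the identical p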
Does the neural ODE have more learning capacity than a single layer, or are they just the same in this case?