Losses for parameter estimation of Neural SDE

I’m trying to understand the various options for doing parameter estimation on a USDE. I’m completely new to parameter estimation in SDEs, so ELI-dumb-chemist would be appreciated.

Background

I’m trying to model a physical process using Brownian dynamics with a position-dependent diffusion tensor. I can model this as a non-diagonal Ito SDE. Both the drift and diffusion terms depend on neural networks. I can set things up in a way that guarantees that the system has a well-defined stationary distribution and obeys physical constraints, like fluctuation-dissipation relations.

My training data is a small number of long trajectories with observations at regular time intervals.

My questions are about parameter estimation. I’ve seen two different approaches in the documentation for DifferentialEquations.jl and DiffEqFlux.jl. I have some questions about those and how they relate to what I’ve read in other sources.

Moment Matching

The DiffEqFlux examples all seem to use moment matching. As I understand, this involves:

  1. Collecting a set of real observations for many realizations of the system starting from some initial condition.
  2. Compute the per-timestep means and variances.
  3. Run a set of realizations of the model starting from the same initial condition.
  4. Compute the per-timestep predicted means and variances
  5. Construct a loss by comparing the observed to predicted means/variances.

Is that understanding correct? In my case, I don’t think this will work well, as, over time, the distribution sampled by my model tends towards a stationary distribution that is highly non-gaussian, multi-modal, etc.

L2 Loss

The documentation page on parameter estimation has an example of using the L2 loss for parameter estimation of an SDE, but I don’t really understand how that works. The aggregate training data consists of 10,000 trajectories. The problem is formulated as an ensemble problem, solved with num_monte=1000, which samples 1000 trajectories starting from the same initial condition.

What I don’t understand is what the L2Loss is doing in this case. The predicted and observed trajectories have a different dimension, so it’s not clear to me what this is calculating, either conceptually or in practice. This is extended to consider first differences, but again, it’s not clear (to me) what is actually calculated.

Pseudo Maximum Likelihood

One approach I’ve seen works roughly as follows:

  1. Choose a random observed trajectory. Choose a random observed time point to get y_t.
  2. Start an ensemble of simulations starting from x_t = y_t and propagate to the next observed time point to get \{x_{t+\Delta t}\}.
  3. Fit a probability distribution to \{x_{t+\Delta t}\}. For small \Delta t a multivariate gaussian is probably sufficient. For longer times, use something like KDE.
  4. Compute the log-likelihood of the actual observation y_{t+\Delta t} under the model.
  5. Average over a batch of initial y_t to get an estimate of the loss.

Adjoints

How do the different adjoint options available for SDEs relate to these choices of loss?

Recommendations

I guess that is all quite vague and rambling. Here is what I could use help with:

  1. Conceptual understanding of the different ways to do parameter estimation for SDEs.
  2. A better understanding of what the documented examples are doing, especially L2Loss.
  3. Any suggestions for a strategy for the type of problem I described using tools from the Julia ecosystem. I think the key feature is that, over time, my model relaxes towards a stationary distribution, which makes moment matching a bad choice.
3 Likes

Yes, we should really get a page in DiffEqFlux that explains all of this in more detail. I’ve just been working on other areas, but you’re not the first to ask about how to exactly write good losses on SDEs. Would you be interested in helping put together this page?

Yes, then this one isn’t for you. You also need multiple data points at each point in time, which is a heavy restriction for some problems.

L2Loss requires a matrix of data and it’s doing the moment matching thing on it. So it’s the same as moment matching, except it’s our old less flexible infrastructure.

Adjoints related to how the loss is optimized, but is essentially independent of the loss. If you want a gradient, say for gradient-based optimization or Hamiltonian Monte Carlo (for Bayesian psuedo maximum likelihood), then the adjoint will calculate that. But it’s not tied to the choice of the loss function.

I should mention that this paper (https://arxiv.org/abs/2001.01328) mentions another loss function for SDEs in Section 5. I’ve been meaning to do an implementation of it, but haven’t gotten around to it. There are some good properties of it as well. I’d summarize the space as:

  1. Moment matching, which works if you have a ton of data at each point in time (like from a pseudotime timeseries generated from successive single-cell RNA-seqs :wink:)
  2. Pseudo maximum likelihood, which is a general method (and we need to showcase more in DiffEqFlux. Right now it’s only in the Turing docs… https://turing.ml/dev/tutorials/10-bayesiandiffeq/). This one is good for giving you a full sense of the distribution, since it uses Bayesian estimation.
  3. ELBO loss, which is a general method with no tutorial right now. This one is just generally good for point estimation.
  4. Collocation methods, like https://github.com/mschauer/MitosisStochasticDiffEq.jl . This one is likely more stable than the others if you have dense-ish timeseries data.

I tend to use moment matching in my examples because my biological problems fit it well, but we really shouldn’t because it’s really only for specific cases, and we should make a page that explains all 4 likelihoods. In fact, it would be nice to do a paper that investigates the performance and stability for neural SDE fitting between the four. I think @frankschae would be interested in this.

3 Likes

Do you need to train the entire neural network in drift and diffusion coefficient or do you need other parameters on top of a trained neural network?
If you could specify the model for us that would be good – the answers depend on how exactly things look like.

We need to train entire neural networks for the drift and diffusion terms.

The model is:

dx_i(x, \lambda) = \left( -D_{ij}\frac{\partial}{\partial x_j} U(x, \lambda) + \frac{\partial}{\partial x_j} D_{ij}(x, \lambda) \right) dt + \sqrt{2}g_{ij}(x, \lambda)dW_j.

\lambda(t) is a time-dependent control parameter protocol that is specified by the user, U is a scalar valued potential function that is composed of two parts, U(x, \lambda) = U_\lambda(x, \lambda) + U_0(x), where U_\lambda is known and U_0 must be learned. g_{ij} is a positive definite matrix parameterized by a NN, and D=g^2.

I might be missing a few constants (I’m doing this from memory), but that’s the gist of it.

The training data is a series of different control protocols \{\Lambda_i\}, each of which specifies a different time dependence for \lambda(t). For each protocol we have a small number of trajectories, typically n=8.

@ChrisRackauckas, thanks for the info. That clears some things up for me, particularly what L2Loss is doing.

Would you be interested in helping put together this page?

Definitely, but at this point, I don’t feel like I understand things well enough to really be helpful beyond having a good understanding of what a new user finds confusing.

I did spend most of yesterday reading, and some things are clearer now. I don’t think I understood the different classes of parameter estimation approaches, especially the distinction between shooting approaches (which are used in all of DiffEq and Turing examples) and other approaches to parameter estimation.

A few resources I found helpful: SDE Toolbox documentation, and this thesis, especially Chapter 2. I’m a biochemist, so I really find the more technical papers very hard to follow.

Shooting Methods

All of the examples I could find for fitting SDEs in the julia ecosystem are based on shooting approaches, similar to what would do for an ODE. Given some set of starting conditions, an ensemble of trajectories is generated using the current estimate of the parameters. At each time step, statistics are computed that describe the distributions. The loss is formed by comparing these statistics with those calculated on a set of observed trajectories.

In order for this to be effective, the observations at each time point must be well described by some simple distribution, like a gaussian. We also need enough samples to accurately estimate this distribution. In some cases, as shown in the examples, this can work quite well. However, there are many systems that don’t fit these assumptions. For example, my model has chaotic behaviour, where trajectories rapidly diverge, and the distributions at each timestep are highly non-gaussian.

Shooting methods work by generating an ensemble of full trajectories. When applicable, they are simple to understand and simple to implement given the powerful libraries in Julia.

However, there are many other approaches that do not generate full trajectories. Instead, they focus on characterizing the transitions between consecutive observations. I think these approaches are more suitable for the type of problem I am considering and are more generally applicable, but I haven’t seen any examples in the DiffEq ecosystem.

General Method of Moments

One example of a non-shooting approach is the general method of moments. Here one constructs a series of moment conditions that relate observations at consecutive timesteps. For example:

\begin{eqnarray} \psi_1:& \mathbb{E}[X_{k+1} - X_k - \Delta t \mu(X_k, \theta)] = 0\\ \psi_2:& \mathbb{E}[\left(X_{k+1}-X_{k}-\Delta t \mu\left(X_{k} ; \boldsymbol{\theta}\right)\right)^{2}-\Delta t g^{2}\left(X_{k} ; \boldsymbol{\theta}\right)] = 0 \end{eqnarray}

The first equation says that on average, the difference between adjacent observations should equal the drift. The second says that the variance is given by the diffusion. One can also construct higher-order terms. I’m leaving out some details about these conditions are combined to produce a loss function, but the gist is to adjust the parameters to make the statistics of the transitions match. Note that this method only works when \Delta t is small, leading to small discretization errors.

The key distinction between this and shooting methods is that a full trajectory isn’t generated. In fact, this specific algorithm doesn’t use an SDE solver at all.

Maximum Likelihood

Another class of methods tries to maximize the likelihood of the observed trajectories. The likelihood of a trajectory is given by:

L = p(x_0)\prod_{i=1}^Np(x_i | x_{i-1}, \theta),

where x_{i-1} and x_i are observations at consecutive times.

It’s common to assume that x_0 is fixed, so the first factor can be neglected. If the time between observations is small, something like the EM discretization can be used, and the transition probabilities are simply gaussian with the mean and variance determined by the drift and diffusion, respectively. The problem is that as the time between observations becomes larger, we can’t use the single-step EM transition probability due to discretization error. In general, we don’t have a closed-form solution for the transition probability, so there are many approaches that develop approximations.

One that I find appealing due to its simplicity is called simulated maximum likelihood. We take two consecutive observations x_{t_i} and x_{t_{i+1}}. We discretize the interval from t_i to t_{i+1} into N timesteps that are small enough that we can use something like EM. Say N=10. We then simulate an ensemble starting from x_{t_i} up to the penultimate time step, in this case, the 9th. We then compute the transition probability from each of those penultimate states to the observed final state x_{t_{i+1}}. We can do this because the timesteps are small enough that the gaussian approximation is valid. You can show that the average of these penultimate transition probabilities is an approximation of the overall transition probability. It is possible to dramatically improve the accuracy using importance sampling. Hopefully, this makes some sense—I’m too lazy to transcribe the equations, but the thesis I linked to above gives a good overview.

Simulated maximum likelihood does make use of an SDE solver, but only to generate short ensembles starting from observations, rather than shooting full trajectories.

A Caveat

As section 5 of the paper @ChrisRackauckas linked to explains, one has to be careful when using maximum likelihood approaches with highly expressive models for the drift term, like deep NNs. In this case, the model can effectively memorize the data, leading to a drift term that deterministically generates the data and diffusion terms that tend to zero. One would have to use more restricted/constrained models or incorporate regularization of some kind.

Summary

  • The shooting approach is both conceptually easy to understand and easy to implement using tools from the Julia ecosystem, but it is only applicable to a limited set of problems.
  • There are other approaches applicable to a wider range of problems, but they are both more conceptually complex and harder to implement. As far as I know, none of these approaches are currently implemented in the julia ecosystem, and if they are, I couldn’t find any tutorials or documentation.

I think it would be extremely helpful to have library support for some of these alternative approaches, like simulated maximum likelihood with modified Brownian bridge importance sampling from this paper. There are other approaches that may be more accurate or more efficient, but they’re beyond my background to understand. I’m willing to help however I can, but my math background is pretty weak and I’m new to Julia and the Julia way.

1 Like

I linked a method which wasn’t a shooting method: https://github.com/mschauer/MitosisStochasticDiffEq.jl . And one thing you can also do is use a multiple shooting method, which is a pretty simple extension to a shooting method that can be done on the user’s side. You just solve multiple smaller trajectories from a vector of initial conditions and define the time series loss and the initial condition loss.

Not only does that require that dt is small for the discretization errors, meaning that you have dense data, but it can also be unstable if you have not too large of a dt since it’s an explicit derivative discretization. I mentioned that collocation/interpolation based methods like this (also known as “two-stage”) have this issue.

What you described is just a form of multiple shooting. You can use any loss with multiple shooting and you’ll get the improved estimation properties. Shooting methods are maximum likelihood method under some choose of likelihood (L2 -> Normal)

Yes, and I wouldn’t recommend using a full neural SDE for any problem where you have a scientific model. As described in https://arxiv.org/abs/2001.04385 , usually you have some partial knowledge, like you know some chemical reactions in a large system and you know that many connections aren’t possible, and you can learn in that constrained framework instead. And in many cases you can even draw conclusions about the relationship between drift and diffusion terms.

Sorry, but that conclusion isn’t quite close. Not only is there not a hard divide between shooting methods and others (with multiple shooting techniques blurring the lines), these non-shooting methods are implemented in the libraries:


And tons of the differential equation solvers are built for solving Fokker-Plank equations (diffusion-advection equations, you’ll see that most of the examples and tutorials for PDEs are reaction-diffusion-advection equations). I have plans to automate the construction of FP PDEs from SDEProblems though.

That Hermite expansion approach is new to me though. We could automate it with ModelingToolkit.jl.

That looks suspiciously close to https://github.com/mschauer/MitosisStochasticDiffEq.jl

3 Likes

@ChrisRackauckas Thanks for the reply. My apologies if I mischaracterized things, that wasn’t my intent. I really appreciate the effort you and many others have put into this ecosystem. DiffEqFlux drew me here after a decade of python, and I’m here to stay. The speed is nice, but the composability is amazing.

I absolutely appreciate that the existing tools would allow one to implement many of these methods, which is incredibly cool. As you mentioned before, more tutorials/examples that demonstrate and explain different approaches would be extremely helpful. I would be willing to help with this, but don’t think I could do it alone.

What I’ve found challenging isn’t necessarily implementing a particular loss and optimizing it. Rather, it’s understanding what a reasonable loss is in the first place. Searches lead mostly to rather technical papers, which can be quite tough for someone (like me) without the right background. I think a scientist-friendly tutorial could be very helpful.

I did try to have a look at this but will admit that I gave up on the paper after a few minutes. I will have to read it carefully.

I agree100 percent! This really is the key to making ML work in the data-poor regime. I’ve experienced this first hand in trying to fit an autoregressive generative model to data. It fits the training data wonderfully but gives complete unphysical garbage everywhere else. This could potentially be another useful tutorial / demonstration.

1 Like

If you are interested in diffusion estimation with data augmentation/bridges https://projecteuclid.org/download/pdfview_1/euclid.ejs/1495850628 is a good starting point. You have parameters in the diffusion coefficient… there one has to be careful and e.g. in the Bayesian setting do a non-centred reparametrisation. For the time being, https://mmider.github.io/BridgeSDEInference.jl/dev/ is functional and well documented and does all you need, but https://github.com/mschauer/MitosisStochasticDiffEq.jl will supersede that at some point as native SciML solution. In Mitosis, sampling a single diffusion bridge (as in Durham-Gallant) with associated log-likelihood for an EM algorithm looks for example like this

using MitosisStochasticDiffEq
using MitosisStochasticDiffEq: logdensity, WGaussian
using Test, Random
using LinearAlgebra

# set true model parameters
p = [-0.1,0.2,0.9]

# define SDE function
f(u,p,t) = @. p[1]*u + p[2] - 1.5*sin(u*2pi)
g(u,p,t) = p[3] .- 0.2*(1 .-sin.(u))

# time span
tstart = 0.0
tend = 0.2
dt = 0.005

# initial condition
ustart = 0.8
# final condition
uend = 1.1

# Parameters B, β, σ for bridge proposal (constraint σ = g(uend,p,tend) for small observation noise)
plin = [-0.1,0.2, g(uend,p,tend)]


kernel = MitosisStochasticDiffEq.SDEKernel(f,g,tstart,tend,p,plin,dt=dt)


# Backward ODE
endpoint = WGaussian{(:μ,:Σ,:c)}(uend, 0.0001, 0)
message, solbw = MitosisStochasticDiffEq.backwardfilter(kernel, endpoint)


# Forward SDE
solfw, ll = MitosisStochasticDiffEq.forwardguiding(kernel, message, (ustart, 0.0),
            Z=nothing; save_noise=true)

# Bridge sample
t = solfw.t
u = first.(solfw.u)

# Constant for non-parametric reparametrization
lp̃ = logdensity(convert(WGaussian{(:F,:Γ,:c)},solbw), u0)


# Bridge sample and likelihood
u, ll + lp̃

I don’t think that’s your fault. I think what everyone does in the literature is implement their one method and compare it to GMM. That really tells us nothing about how all of the methods play against each other, just that GMM is the simplest and has faults. We need to create reference implementations of all of them and benchmark it thoroughly in order to really pinpoint the details, but you can look to the descriptions of the methods to figure out some pieces, like how collocation and interpolation based methods need a dense enough time series for derivative estimates. In SciML we plan over the next year to really characterize this much better and until then, it is a bit of guesswork. I hope in a year or so we can swap between the methods with one line of code, benchmark it on 8 applications, and just share the results. I think that would be more telling than any other argument, but to get there we first needed as fast of adjoint methods as we could get (which are complete now).

2 Likes

@mschauer Thanks for the info. Somehow I didn’t run across BridgeSDEInference when searching for this. I have a lot of questions, as most of this is over my head.

For my system L is the identity matrix and we have no observation noise. What is the clearest path forward in this case? For my specific application, I am largely interested in maximum likelihood parameter estimation, rather than a Bayesian approach. The code you posted is interesting, but I don’t quite understand some details of what exactly it is calculating.

Could you explain what you mean by this?

I played around with the code you posted and have some questions.

  1. When I run the code, I’m finding that the last point of u doesn’t match the target. There is always a discrepancy, typically on the order of 0.05. I thought that a bridge was supposed to be conditioned to hit both end points?
  2. As I understand, plin are the parameters for the auxiliary linear process. What is the best simple way to guess these?
  3. You mention that \sigma should be g(uend, p, tend) for systems with small noise. Could you elaborate on this?
  4. endpoint is gaussian. I’m assuming that 0.0001 is the measurement nose? For the case with no noise, I’m assuming best practice is to set this something small?
  5. How do I interpret the results of backwardfilter? What does solbw represent? What about messages.u?
  6. What does ll + lp represent? I guess it’s an importance weight. Can I simply average it over sample bridges to get an estimate of the likelihood of the transition under the current model parameters, or do I need to do something like MCMC sampling to get a representative ensemble of bridges? Basically, how do I get from the code you posted to a loss that I can optimize?

Hopefully, at least some of those questions make sense. Thanks for your time.

Great questions.

1.) Yes, the code is not written for exact observation case in mind, so I added a bit of assumed observation noise of order sqrt(dt*sigma^2) (a single Euler step). You can set the last point of the trajectory to the end-point if you want.

2.) Yeah, the simple way is to take B=0 and beta = 0… that gives you a Durham-Gallant type bridge and you were willing to use those in the first place, right?

3.) If you have no observation noise, you don’t have “room” to behave differently close to the endpoint where the forcing becomes strong (in order to end in the right point). If you have little noise, it is still good to respect that.

4.) Yes, 3.) is the answer to 1.)

  1. Backward filter and forward guiding are best thought of in analogy of Kalman filtering. If you want to sample the smoothing distribution, you first run a filtering pass in one direction and then a sampling pass in the opposite direction. The backward filter uses the linearisation B and beta, so it is not exact, but the forward guiding computes an importance weight which accounts for that. This is really at the heart of the two papers linked in MitosisStoch…

6.) Yes, ll + lp is an importance weight which accounts for being unable to exactly filter and therefore to sample the exact conditional/bridge distribution. The lp is a constant we add to make it possible of comparing the importance weight across different parameter choices.

The remaining question: The augmented path tells you exactly what diffusion coefficient was used to sample it. So you cannot estimate the diffusion coefficient of the true process without a trick detailed in the Toy example section of the paper I linked. Example: A Brownian bridge trajectory tells you it’s scale (quvar computes the quadratic variation.):

using Bridge
t = 0:0.0001:1.0
X = sample(t, WienerBridge(1.0, 0.0))
σ̂2 = Bridge.quvar(X)
norm(1-σ̂2) # I can(almost) recover the unknown diffusion parameter (1 in this case) from the bridge

I agree that Moritz’s paper is tough to read – but it’s totally worth it :slight_smile: . For me, it helped to first dig into what the sub-parts of the algorithm are actually supposed to do. Let me try to summarize them from a high-level perspective below. If you want to look directly into the code, the functions are in https://github.com/mschauer/MitosisStochasticDiffEq.jl/tree/main/src which might be a good starting point if you are familiar with the SciML library (in particular with StochasticDiffEq). In MitosisStochasticDiffEq.jl, we describe what happens between observations and thus it might be even easier to understand what’s going on, as compared with the full picture. There is also a gist from @mschauer: https://gist.github.com/mschauer/db807bd6f88daecd19f484dc6507ba28 which contains a full implementation (i.e., “directly for all observations”). I’ll link below to the corresponding code in MitosisStochasticDiffEq but you’ll easily find the equivalents in the linked gist.

There are three important functions:
sample
backward filtering
forward guiding

Sample

This implements the simulation of your “true” model. Numerically, that means you take the model with the “true” parameters you want to fit and run it forward [Eq. (1.1) in the BFFG paper, blue curve below]. From this you take some discrete observations, potentially subject to some measurement errors [Eq. (1.2), orange dots]. As an example, you’ll end up with something like this:
xs_and_ys
Now, the goal is to propose trajectories forward (this will be the forward guiding part) with an unknown parameter (which you’d like to fit) that match nicely with your observations. The first step towards this goal is the backward filtering described in Section 3 of the paper.

Backward filtering

What you want is a trajectory “to which you can guide” further stochastic trajectories such that they match well with your observations (and you’d like to assign likelihood weights, correct for change of measure, details, details…). The backward filtering part introduces an ODE describing the covariance and mean of the latent path. This ODE is simulated backwards in time. If you do this, you get something like:
Backwardfiltered
(sorry for the different trajectories… I didn’t immediately found the corresponding plot where only the sampling is shown…)
The green curve is said evolution of the mean value following your observations. The purple curve represents the covariance which increases in time between observations and always dips when you hit an observation, because this “decreases the uncertainty”.

Forward guiding

This is already pretty great. In the paper, it is now shown how you can write down the guided forward evolution based on the knowledge of these mean and covariance values. In simple words, one modifies the drift function such that new forward trajectories are forced towards the mean value (green curve) with a “force strength” that is somewhat proportional to the inverse of the covariance (see https://github.com/mschauer/MitosisStochasticDiffEq.jl/blob/0f7f11424734ea61205126d286b0e5181ea3e1bd/src/guiding.jl#L54). Additionally, the log-likelihood for the forward-guided trajectory is accumulated. This could look like this:
Guidedforward
WeightedGuidedforward
In the second plot the forward samples are weighted with their associated likelihood. The final step is to wrap this forward guiding part into some Metropolis-Hastings, Hamiltonian Monte Carlo, … scheme to estimate the value of your unknown parameter (see the gist).

Importantly, I think all of this will soon be fully composable with the rest of the SciML ecosystem, e.g., the SDE adjoints.

Indeed that sounds really exciting. I’d definitely be interested to contribute.

3 Likes

@mschauer, @frankschae, thank you so much for your replies. They’re very helpful. I’m slowly starting to piece things together. I appreciate your patience. I don’t have much background in SDEs, measure theory, etc, so I’m definitely struggling with the more technical aspects. But I think I’m starting to get the gist of things.

Ultimately, I’m interested in calculating maximum likelihood parameter estimates for a fully observed model without noise. Essentially, I want to reproduce Eq. 8 from Durham and Gallant 2002, just with an improved importance distribution.

Here is what I think I understand. I would appreciate if you can clear up any misconceptions.

The goal is to sample trajectories that bridge between two time points, where the distribution is as close as possible to the true conditional distribution. This is done through the addition of a guiding drift term that pushes towards the target point.

The first step is to solve for the mean and covariance of the guiding potential. This is done solving a reverse ODE starting from the target point. This ODE is solving for the means, covariances, and a factor c that is later used to compute importance weights.

Next the forward guiding step combines the original SDE with an additional guiding drift term that is governed by the parameters from the reverse ODE. The output of this step is a bridging trajectory along with its log importance weight.

To reproduce something like Eq. 8 of Durham and Gallant, I’m thinking I need to do something like:

  1. Perform the backward filtering step once
  2. Generate N forward guiding trajectories
  3. From each trajectory, take the penultimate step and compute the log probability to hit the final target step.
  4. Combine that log probability with the importance weights as in Eq. 8 of Durham and Gallant.

Does that sound about right?

I still have a number of questions.

  1. The B, \beta, and \sigma parameters of the reverse ODE should be a matrix, vector, and matrix, respectively, in the general case. Correct?

  2. In section 5.3 of the paper seems to indicate that B, \beta, and \sigma can be functions of time, but in the code, these are just constant parameters plin?

  1. I think I understand that it’s problematic to estimate the diffusion term from the imputed trajectory because I’m just going to recover the diffusion that was used to produce it. I didn’t follow the trick in the Toy example section at all—it’s just a bit over my head. Could you give me some intuition for what is happening here and what it looks like in practice? I am also interested in parameterizing the diffusion term by a NN and looking for a maximum likelihood estimate. Could you speculate about how this might change things?

Thanks again for all your help.

@jlmaccal Hi, I was curious: how did you solve it in the end?

@mschauer To be honest, this got put on the back burner. I need to revisit it again.