I’m trying to understand the various options for doing parameter estimation on a universal SDE (an SDE whose drift and diffusion terms include neural networks). I’m completely new to parameter estimation in SDEs, so an ELI-dumb-chemist explanation would be appreciated.

## Background

I’m trying to model a physical process using Brownian dynamics with a position-dependent diffusion tensor. I can model this as a non-diagonal Ito SDE. Both the drift and diffusion terms depend on neural networks. I can set things up in a way that guarantees that the system has a well-defined stationary distribution and obeys physical constraints, like fluctuation-dissipation relations.
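For concreteness, here is a minimal toy sketch of the kind of model I mean, in pure Python: a 1D Itô SDE simulated with Euler–Maruyama, where simple parameterized functions stand in for the neural networks. All names, the drift/diffusion forms, and the parameter values are hypothetical, chosen just to illustrate the setup.

```python
import math
import random

# Toy 1D Ito SDE: dX = f(X, theta) dt + g(X, theta) dW.
# These simple parameterized functions stand in for neural networks.
def drift(x, theta):
    return -theta[0] * x                        # linear restoring force

def diffusion(x, theta):
    return theta[1] * math.sqrt(1.0 + x * x)    # position-dependent noise

def euler_maruyama(x0, theta, dt, n_steps, rng):
    """Simulate one trajectory; return positions at each time step."""
    x, path = x0, [x0]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))      # Brownian increment
        x = x + drift(x, theta) * dt + diffusion(x, theta) * dw
        path.append(x)
    return path

rng = random.Random(0)
path = euler_maruyama(1.0, (2.0, 0.3), 0.01, 500, rng)
```

In the real problem the state is multi-dimensional and the noise is non-diagonal, but the structure of the simulation loop is the same.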

My training data is a small number of long trajectories with observations at regular time intervals.

My questions are about parameter estimation. I’ve seen two different approaches in the documentation for DifferentialEquations.jl and DiffEqFlux.jl. I have some questions about those and how they relate to what I’ve read in other sources.

## Moment Matching

The DiffEqFlux examples all seem to use moment matching. As I understand it, this involves:

- Collect a set of real observations for many realizations of the system starting from some initial condition.
- Compute the per-timestep observed means and variances.
- Run an ensemble of model realizations starting from the same initial condition.
- Compute the per-timestep predicted means and variances.
- Construct a loss by comparing the observed to the predicted means/variances.
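If that reading is right, the loss could be sketched like this (a pure-Python toy using a 1D Ornstein–Uhlenbeck-like model; all function names and parameters are hypothetical, not any library's API):

```python
import math
import random
import statistics

def simulate_ensemble(theta, x0, dt, n_steps, n_traj, rng):
    """Euler-Maruyama ensemble for the toy SDE dX = -theta[0]*X dt + theta[1] dW."""
    trajs = []
    for _ in range(n_traj):
        x, path = x0, [x0]
        for _ in range(n_steps):
            x += -theta[0] * x * dt + theta[1] * rng.gauss(0.0, math.sqrt(dt))
            path.append(x)
        trajs.append(path)
    return trajs

def moment_loss(observed, theta, x0, dt, rng, n_traj=200):
    """Squared error between observed and predicted per-timestep means/variances."""
    n_steps = len(observed[0]) - 1
    predicted = simulate_ensemble(theta, x0, dt, n_steps, n_traj, rng)
    loss = 0.0
    for t in range(n_steps + 1):
        obs_t = [traj[t] for traj in observed]
        pred_t = [traj[t] for traj in predicted]
        loss += (statistics.mean(obs_t) - statistics.mean(pred_t)) ** 2
        loss += (statistics.pvariance(obs_t) - statistics.pvariance(pred_t)) ** 2
    return loss

rng = random.Random(1)
observed = simulate_ensemble((2.0, 0.3), 1.0, 0.01, 100, 200, rng)  # stand-in "data"
loss = moment_loss(observed, (2.0, 0.3), 1.0, 0.01, rng)
```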

Is that understanding correct? In my case, I don’t think this will work well, because over time the distribution sampled by my model tends towards a stationary distribution that is highly non-Gaussian and multi-modal.

## L2 Loss

The documentation page on parameter estimation has an example of using an L2 loss for parameter estimation of an SDE, but I don’t really understand how that works. The aggregate training data consists of 10,000 trajectories. The problem is formulated as an ensemble problem and solved with `num_monte=1000`, which samples 1000 trajectories starting from the same initial condition.

What I don’t understand is what the `L2Loss` is doing in this case. The predicted and observed trajectories have different dimensions, so it’s not clear to me what is being calculated, either conceptually or in practice. The example is extended to consider first differences, but again, it’s not clear (to me) what is actually computed.

## Pseudo Maximum Likelihood

One approach I’ve seen works roughly as follows:

- Choose a random observed trajectory and a random observed time point within it to get y_t.
- Start an ensemble of simulations from x_t = y_t and propagate to the next observed time point to get \{x_{t+\Delta t}\}.
- Fit a probability distribution to \{x_{t+\Delta t}\}. For small \Delta t a multivariate Gaussian is probably sufficient; for longer times, use something like KDE.
- Compute the log-likelihood of the actual observation y_{t+\Delta t} under the model.
- Average over a batch of initial y_t to get an estimate of the loss.
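The steps above could be sketched like this (a pure-Python 1D toy with a Gaussian fit to the ensemble; the toy SDE, names, and parameters are all hypothetical):

```python
import math
import random
import statistics

def step_ensemble(y_t, theta, dt, n_sub, n_traj, rng):
    """Propagate an ensemble from x_t = y_t over one observation interval
    using Euler-Maruyama with n_sub internal substeps."""
    h = dt / n_sub
    samples = []
    for _ in range(n_traj):
        x = y_t
        for _ in range(n_sub):
            x += -theta[0] * x * h + theta[1] * rng.gauss(0.0, math.sqrt(h))
        samples.append(x)
    return samples

def gaussian_loglik(y, samples):
    """Log-likelihood of y under a Gaussian fitted to the ensemble."""
    mu = statistics.mean(samples)
    var = statistics.pvariance(samples)
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mu) ** 2 / var)

def pseudo_mle_loss(pairs, theta, dt, rng, n_sub=10, n_traj=200):
    """Negative mean log-likelihood over a batch of (y_t, y_{t+dt}) pairs."""
    total = 0.0
    for y_t, y_next in pairs:
        samples = step_ensemble(y_t, theta, dt, n_sub, n_traj, rng)
        total -= gaussian_loglik(y_next, samples)
    return total / len(pairs)

rng = random.Random(2)
pairs = [(1.0, 0.9), (0.5, 0.45), (-0.2, -0.15)]   # stand-in observed pairs
loss = pseudo_mle_loss(pairs, (2.0, 0.3), 0.05, rng)
```

For multi-dimensional states, the Gaussian fit would become a multivariate Gaussian (or a KDE for longer \Delta t, as noted above).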

## Adjoints

How do the different adjoint options available for SDEs relate to these choices of loss?

## Recommendations

I guess that is all quite vague and rambling. Here is what I could use help with:

- Conceptual understanding of the different ways to do parameter estimation for SDEs.
- A better understanding of what the documented examples are doing, especially L2Loss.
- Any suggestions for a strategy for the type of problem I described using tools from the Julia ecosystem. I think the key feature is that, over time, my model relaxes towards a stationary distribution, which makes moment matching a bad choice.