DAEs and Turing

1. Model and DifferentialEquations

I’m trying to estimate initial values and parameters in a DAE using Turing. The structure of the DAE is as follows:

\frac{dx}{dt} = f(x,z,u;\theta) \\
0 = g(x,z,u;\theta)

I have available known inputs u(t) and measured outputs y where, in general,

y = h(x,z,u;\theta)

The unknowns (“descriptor” in DAE speak) are x and z; they could also be denoted the differential variables (x; 3 temperatures) and the algebraic variables (z; 3 other temperatures). The DAE is index 1, so the differential variables play the role of states: the system has 3 states, and I need 3 initial values to solve the DAE.
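
In the mass-matrix form used by the Julia code below, the same system can equivalently be written with w = (x, z) collecting all 6 descriptor variables:

M \frac{dw}{dt} = F(w,u;\theta), \qquad M = \mathrm{diag}(1,1,1,0,0,0)

where the first three rows reproduce \frac{dx}{dt} = f(x,z,u;\theta) and the last three rows impose 0 = g(x,z,u;\theta).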

The following Julia code works for simulating the DAE:

function system!(dx,x,p,t)
    # descriptor: differential variables (Tr, Ts, TFe) and algebraic variables (Tac, Tad, Tah)
    Tr,Ts,TFe,Tac,Tad,Tah = x
    # fixed physical constants
    mr,ms,mFe = 9260.,6827.,71200.
    Rr,Rs = 0.127e-3,1.95e-6
    chpCu,chpFe,chpa,chpw = 0.385,0.465,1.15,4.18
    mda,mdw = 49.2,54.2
    # uncertain parameters, unpacked from p (QdFes and Qdfs come from p, so they need not be hard-coded here)
    UAr2d,UAs2Fe,UAFe2a,UAx,QdFes,Qdfs = p
    Nstw = UAx/chpw/mdw
    Nsta = UAx/chpa/mda
    Nstd = Nstw - Nsta
    # differential equations (rows with mass-matrix entry 1)
    dx[1] = (1.1*Rr*If(t)^2-UAr2d*(Tr-Tad))/(mr*chpCu)
    dx[2] = (3*Rs*It(t)^2-UAs2Fe*(Ts-TFe))/(ms*chpCu)
    dx[3] = (UAs2Fe*(Ts-TFe)-UAFe2a*(TFe-Tah)+QdFes)/(mFe*chpFe)
    # algebraic equations (rows with mass-matrix entry 0; these residuals must be zero)
    dx[4] = mda*chpa*(Tac-Tad)+UAr2d*(Tr-Tad)+Qdfs
    dx[5] = mda*chpa*(Tad-Tah)+UAFe2a*(TFe-Tah)
    dx[6] = Tac*(Nstw-Nsta*exp(-Nstd))-Nstd*Tah-Nsta*(1-exp(-Nstd))*Twc(t)
end
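
Note that the known inputs u(t) enter the model through the functions If(t), It(t) and Twc(t) (presumably currents and a cooling-water temperature), which are not defined in the post. For the code above to run they must be callable over tspan; a minimal, purely hypothetical sketch (the constant values below are placeholders, not the real input data) could be:

If(t) = 1000.0    # hypothetical constant current input (placeholder)
It(t) = 500.0     # hypothetical constant current input (placeholder)
Twc(t) = 10.0     # hypothetical constant cooling-water temperature input (placeholder)

In practice these would instead interpolate the measured input series over tspan.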

To set up the model and solve it, I do as follows:

using DifferentialEquations, LinearAlgebra  # LinearAlgebra provides Diagonal()

x0 = [28., 28., 28., 14., 18., 22.]
tspan = (0.0, 583*60.)
UAr2d,UAs2Fe,UAFe2a,UAx,QdFes,Qdfs = 2.7,20.,14.3,44.4,212.,422.4
p = [UAr2d,UAs2Fe,UAFe2a,UAx,QdFes,Qdfs]
probfunc = ODEFunction(system!, mass_matrix=Diagonal([1.,1.,1.,0.,0.,0.])) # semi-explicit index-1 DAE written as an ODE with a singular mass matrix
prob = ODEProblem(probfunc, x0, tspan, p)
sol = solve(prob, saveat=60.)

This set-up works, and produces the solution.

Question 1: One thing I’m not 100% sure of is: what happens if there is an inconsistency between the initial values of x and z (i.e., the descriptor vector x in the Julia code)? Will the solver make adjustments so that the algebraic equations are satisfied at every time step, including at the initial time?

2. Data fitting with Turing

For data fitting, I have available outputs y consisting of 2 temperatures out of 3 from x (differential), and 2 temperatures out of 3 from z (algebraic).

Question 2: Is the above DAE implementation suitable for AD computations with Turing?

I’m assuming that all initial descriptor variables are unknown, i.e., I specify a prior distribution for every element in the descriptor. Of course, if x is known, then z is known from the algebraic equations, but I’m assuming it is possible to postulate uncertainty in both x and z.

Question 3: Does it make sense to postulate priors for both differential (x) and algebraic variables (z)? The answer to this question may depend on how the DAE solver handles satisfying the algebraic equations, I guess?

Here is how I set up the Turing model:

using Turing, LinearAlgebra # LinearAlgebra provides the uniform scaling I used below

Turing.setadbackend(:forwarddiff) # automatic differentiation backend used for gradients of the log density
#
@model function fit_system(data,prob,tdata)
    # Measurement variance prior
    V ~ InverseGamma(2,3) # measurement variance, https://en.wikipedia.org/wiki/Inverse-gamma_distribution
    # Initial differential variable priors
    Tr0 ~ truncated(Normal(30.,3),25,35) # initial value prior
    Ts0 ~ truncated(Normal(30.,3),25,35) # ""
    TFe0 ~ truncated(Normal(30.,2),25,35) # ""
    # Initial algebraic variable priors
    Tac0 ~ truncated(Normal(12.,3),1.,30) # ""
    Tad0 ~ truncated(Normal(18.,5),1,30) # ""
    Tah0 ~ truncated(Normal(22.,3),1,60) # ""
    # Parameter priors
    UAr2d ~ truncated(Normal(2.7,1),0.2,5) # heat transfer prior
    UAs2Fe ~ truncated(Normal(20,5),1,50) # ""
    UAFe2a ~ truncated(Normal(15,2),0.5,40) # ""
    UAx ~ truncated(Normal(44,5),1,100) # ""
    QdFes ~ truncated(Normal(212.,40),20,400) # ""
    Qdfs ~ truncated(Normal(422.,20),200,500) # ""
    # Initial descriptor x0 and parameter p
    x0 = [Tr0,Ts0,TFe0,Tac0,Tad0,Tah0]
    p = [UAr2d,UAs2Fe,UAFe2a,UAx,QdFes,Qdfs]
    # Problem for the differential equation -> solution
    prob_turing = remake(prob,u0=x0,p=p) # u0 and p should be of the same type for autodiff
    sol = solve(prob_turing,saveat=60.,save_idxs=[2,3,4,6]) # measured outputs: Ts, TFe, Tac, Tah
    # Output data model
    Nd = size(data,2) # number of measurement times
    for i = 1:Nd
        data[:,i] ~ MvNormal(sol(tdata[i]), V*I)
    end
end
#
model = fit_system(data,prob,tdata); # instantiate the Turing model with the data

Observe that I assume the same data uncertainty (variance V) in all 4 measured variables.

Question 4: Does this make sense? Perhaps… since all measurements are temperatures of similar size. But in general, this is a problem, I guess (unless data are normalized).
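
As a hedged alternative (not from the original post): each of the 4 measured outputs could be given its own variance by replacing the scalar V prior and the likelihood line in the model above with a vector of variances and a diagonal covariance, e.g.

    V ~ filldist(InverseGamma(2,3), 4)                 # one variance per measured output; replaces V ~ InverseGamma(2,3)
    data[:,i] ~ MvNormal(sol(tdata[i]), Diagonal(V))   # diagonal covariance; replaces the likelihood line

Here filldist is re-exported by Turing and Diagonal comes from LinearAlgebra; alternatively, the data could be normalized so that a single variance is reasonable.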

Finally, I run the data fitting:

Ni = 100
p_est = sample(model,NUTS(0.65),Ni)

Question 5: I’ve seen Ni referred to as the number of iterations. Is this the same as the number of outcomes/realizations in the posterior distribution?

I’ve also tried to run the code with multiple chains, either:

Nc = 3 # number of chains
p_est = sample(model,NUTS(.65),MCMCThreads(),Ni,Nc)

which seems to run… but forever… so I just have to stop it, or

p_est = sample(model,NUTS(.65),Ni,Nc)

which simply crashes…

Question 6: I assume that when I use multiple chains, this is just running the same sampling for Nc chains independently… and that the purpose is to see if I get the same posterior distributions for all chains?? Sort of to check that the found posterior distributions are representative of the true distributions?

Since MCMC is based on Monte Carlo methods, it should be possible to run samples in parallel – or at least chains in parallel. I assume that is what MCMCThreads() does. I have issued the command Threads.nthreads() in Jupyter, and the answer is 8.

Question 7: Does this mean that I have 8 cores available in the current session? Or does it mean that my computer has 8 cores in total, and that I need to specify the number of cores I want to use when I start up Julia?
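
For reference: Threads.nthreads() reports how many threads the current Julia session was started with, not how many cores the machine has. To make more threads available, Julia must be launched accordingly, for example:

julia --threads 8
# or, before launching Julia:
# export JULIA_NUM_THREADS=8

MCMCThreads() then distributes the chains over those threads.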

OK – sorry for the many questions… When I run a single chain, typical results are that the posterior distributions are suspiciously narrow.

Although my code runs (for a single chain), it is relatively slow (which is a relative concept). And if I re-run the notebook, many times I don’t get results at all.


Question 1: Yes. See https://diffeq.sciml.ai/stable/solvers/dae_solve/#Initialization-Schemes for the available initialization schemes.
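
For the mass-matrix formulation above, an initialization scheme can also be requested explicitly when solving. A hedged sketch (scheme and keyword names taken from the linked docs; check the current documentation before relying on them):

using OrdinaryDiffEq
# BrownFullBasicInit adjusts the algebraic variables so that 0 = g holds at t0;
# ShampineCollocationInit may adjust all components of u0 instead.
sol = solve(prob, Rodas5(), initializealg = BrownFullBasicInit(), saveat = 60.)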

Question 2: Yes, both forward- and reverse-mode AD.

Question 3: Yes, though initialization may cause it to be a little odd.

Question 4: Looks fine.

Question 6: Different chains can hit different optima, so this can help explore the space. And it can be done in parallel.

Question 7: Yes.

This is a ton of unrelated questions. You’ll get better answers if you don’t put these in a discussion about DAEs, since I might be the only person to dig through a very long DAE post.

Run it longer?


Nope. But I think so far you have good answers.

The suspiciously narrow output could be due to a number of things: one is that the sampler is stuck at its initial location; another is that the model fits the data extremely well. Are you using real data or simulated data? If simulated, make sure to add some nontrivial measurement noise.

Try running an optimization to get initial parameters for Turing. The docs say to use init_theta, but they are out of date; use the keyword init_params to initialize the sample method.
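
A hedged sketch of that workflow (function and field names assumed from the Turing + Optim mode-estimation interface of that time; verify against the current docs):

using Turing, Optim
map_est = optimize(model, MAP())                       # MAP point estimate of all model parameters
p_est = sample(model, NUTS(0.65), Ni;
               init_params = map_est.values.array)     # start NUTS at the MAP point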


I use real data. So I’m a little suspicious of the narrow posteriors. Running data retrodiction, I cannot distinguish between the different draws from the posterior distribution.

What I find strange is that when I get some spread in the posterior distributions and re-run the code, I may get virtually no variation, or results that look meaningless. Sometimes when I run it with Ni = 1000 iterations (or rather: particles?? or outcomes for the distribution?? … my unanswered Question 5…), I get no reasonable results – other times I get some distribution, but often without much variation.

My Question 7 was poorly phrased… because answer “yes” could mean two things…

Since my code crashed when I tried with 3 chains without MCMCThreads() – does this mean that I can only run more than one chain when I spread computations on several cores?

Start by getting a single chain to run and give reasonable results. When you specify Ni, this is the length of the chain, often called the number of iterations, draws, or samples. They are samples from the posterior distribution.

Most likely it is stuck at the initial position, and likely this position is weird. I suggest running the optimization step to get a not-too-weird initial position. After that I sometimes run sample with the MH() method to fuzz that initial position before handing it to NUTS.


OK – I’ve read parts of the documentation, and I think I understand parts of your procedure:

  1. Run (mode) optimization using Turing, either MLE or MAP, to find initial estimates for the model.
  2. Set the estimate from the optimization step as initial value in the sample method using kwarg init_params.

I can also consider an intermediate step between 1 and 2 above using the MH() sampler to fuzz the initial position before handing to NUTS.

This MH() step is not clear to me.

  • What is the outcome of the MH() step? Point estimates that I can use with init_params in the NUTS() sampler?
  • Or is the outcome of the MH() step a distribution that I can use as prior in the NUTS() step?

One more general question (related to the second bullet point above)… Suppose I run Turing recursively, and each time I run the sampler, I want to use the posterior distribution from the previous step as prior distribution in the current step… can this be done?

  • If x_d is a vector of samples from the distribution of x, can I then in the Turing model part set… x ~ x_d – instead of, say, x ~ Normal(x_, s)
  • Or would I have to use Distributions to fit the distribution sample x_d to some theoretical distribution?
  • And – how could I take correlations between parameters into account?

They are samples from the posterior, so you would use the final sample as input to init_params.
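
A sketch of that step (assuming the chain’s column order matches the parameter order expected by init_params; map_est is from the earlier optimization sketch):

chain_mh = sample(model, MH(), 5_000; init_params = map_est.values.array)  # fuzz around the MAP point
init = Array(chain_mh)[end, :]                     # final MH draw as a plain vector
p_est = sample(model, NUTS(0.65), Ni; init_params = init)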

No: the model needs a distribution represented as a density function, whereas x_d is a set of samples from the distribution, not a density.

When you try to fit a distribution to the samples, inevitably you lose the dependency structure.


Great – then I think I know how to do it.

[I’m a little surprised that I should use the *final* sample of the `MH()` samples… that indicates that the samples get progressively better – I thought they were just random samples from the distribution.]

Using NUTS(), I observe that with a low number of samples (e.g., 10), the distribution gets “wide”, and as I increase the number of samples, it gets more and more narrow… (with 100 samples, the posterior distribution is narrow, and with 1000 samples, the values appear to be constant). This would be consistent with the algorithm getting more and more stuck in a poor initial condition as the number of samples grows. But that would imply that the step size used to escape a poor initial guess varies inversely with the number of samples??

Well, the point here is that you’re trying to find a good starting point. Once you’ve got sampling going right, any random sample is kind of as good as another. But during the initial stages you’re starting well out of equilibrium, and you’re trying to get your point into the high probability set, so generally the last sample will be closest to equilibrium.

Also, in general for MCMC each sample has some correlation with the previous sample, so it takes several steps to get a sample that is “as good as independent”. Hence the ESS (effective sample size) reported in the summaries isn’t the same as the total number of samples (usually it is lower, though if the samples are anti-correlated it can be higher).
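
For instance, the chain summary shows the ESS next to the raw number of draws (summarystats comes from MCMCChains, which Turing re-exports):

summarystats(p_est)   # summary table per parameter, including the ess (and rhat) columns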


If you start far from the high probability set, you will generally converge towards it. Since you are then moving in the same direction all the time (towards the set), the sampled values will appear to be very “wide”. Once you find the high probability set you’ll just move around within it, which, relative to the convergence period, can look quite narrow. The problem comes if you have NUTS learn the step size while it’s converging, because once it has converged it needs a smaller step size. I usually use NUTS(1000, .75) to tell it to do 1000 initial “learning” (adaptation) steps before doing the real sampling (0.75 is the target acceptance rate; this lets NUTS be a little looser with its Hamiltonian solver).

Then I usually have it sample between 1000 and 5000 “real” samples.
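
In code, that corresponds to something like:

p_est = sample(model, NUTS(1000, 0.75), 2000)  # 1000 adaptation steps, target acceptance rate 0.75, then the requested draws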
