Parameter Fitting in Turing

I am trying to run an MCMC to find the posterior of parameters of my model using Turing. The setup I have currently is the model itself, and a randomly sampled dataset I run through the model to yield values from the model. I also have observations for the inputs and outputs with errors. I’m then using the above to calculate a chi-squared value for each set of parameters. Where I’m having difficulty is trying to write that in the Turing construction, because the examples given are all generally very simple models where each variable is clearly linked to another variable and the distributions of all the variables are known. How could I wrap my model in order to allow an MCMC to be run? I’m also looking forward to whether automatic differentiation can work here because I have essentially an arbitrary function as my log-likelihood.

Can you expand on this? This sounds like you’re fitting a model where the observed values are summary statistics of a simulation? This is often called ABC.

Maybe you can pseudocode or even paste your model here?

I’ll try to explain. My model basically takes in a set of data X, and returns y (both of these are n-dimensional). The code for this looks something like

function mymodel(X, parameters)
    'do calculations' using X and parameters
    return y
end

where parameters is the set of parameters I would like to fit using MCMC (it happens to be a 9 dimensional vector, so right now I think I can represent that in Turing using the distribution parameters ~ Product(fill(Uniform(a,b), 9)) where a and b are reasonable bounds for the distribution that I need to put in later.

Where the chi-squared fit comes in is when I create try to quantify the errors on this model. I have observational data which relates various components of X and y, say for example a relationship between X[1] and y[1] and y[1] and y[2]. And these come with a standard deviation as well as the mean relation. So to calculate my chi-squared fit, I run a randomized sample of X (because running through the whole dataset X would be computationally prohibitive) through my model, and get various X and y. I compare that to the observations and calculate the deviations weighted by the variance —i.e. the chi-squared for each datapoint in X. Then I sum the chi-squareds to give me a chi-squared for this specific model parameters. I am aware that the log-likelihood function is related to chi-squared as log L =-1/2 chi^2. But I am unaware how I can make use of this in Turing.

In all, the outline of my code looks like this:

function chi2(Obs, Cal, std)
   return (Obs-Cal)^2/std^2
end

function runmymodel(data, parameters, Obs, std)
   chi2sum = 0
   for i in data
      result = mymodel(I,parameters)
      chi2sum += chi2(Obs, result,std)
   end
   return chi2sum
end

At the end of the day, I am trying to run an MCMC such that the parameters are fitted using this method with a distribution.

I hope this helps explain what I’m doing. The whole model that I’m using is rather complex so I’ve reduced that down to what I hope is a minimum example to demonstrate what I’m trying to achieve.

your “chi2” function to me looks like you are talking about a Multivariate Normal likelihood.

Rather than what you are doing, I suggest that probably you want

@model function yourturingmodel(obs, ...)
    ... some stuff...
   predictions = mymodel(parameters)
   obs ~ MvNormal(predictions,Diagonal(variances))
end

where variances is a vector of measurement variance values that get put on the diagonal of the covariance matrix.

I think I get what you’re saying, but I’m still not sure where I pass in the distribution for the parameters. Does that get defined inside yourturingmodel or do I pass it as an argument for yourturingmodel?

Also, when I do the sample method of Turing, do I pass in this yourturingmodel as the model?

With regard to the obs line, I don’t quite understand how I ‘compare’ the difference between the observation and my calculated output, is this part somewhere outside this function?

I guess I’m having a hard time following what exactly this ‘model’ is because it is clearly not mymodel, and it is also not exactly the chi squared statistic that I calculated earlier? So I guess I’m having a bit of difficulty understand what a ‘model’ is in Turing.

It would be useful for you to try programming and running an example from scratch. Perhaps one of the examples in the Turing docs.

When it comes to all of this there are two senses of “model”. There is the predictive function, which takes some parameters and provides the best guess for the data that would be seen, and then there is the “error model” which describes what deviations from the best guess are likely to occur.

I assume that “mymodel” is a predictive function, from that we build yourturingmodel which is a complete Bayesian description of both the priors for parameters, the method of running the predictive function, and the comparisons to the data to get a reweighed prior = the posterior.

See Turing.jl - Turing.jl the front page has a minimal model called demo

@model function demo(x, y)
  # Assumptions
  σ2 ~ InverseGamma(2, 3)
  σ = sqrt(σ2)
  μ ~ Normal(0, σ)

  # Observations
  x ~ Normal(μ, σ)
  y ~ Normal(μ, σ)
end

In that model the priors are all under the #Assumptions comment and the predictive errors (likelihood) are under the #Observations section

x and y are passed in as fixed observed data

Thanks for replying. I have looked at the minimal models and the issue I have with that is they seem to be too minimal. For example, I don’t understand how you have observations which have a distribution defined by the assumptions.

I think your idea for the Multivariate Normal likelihood is probably correct, but I don’t directly see how it is meant to be applied. I think what obs ~ MvNormal(predictions,Diagonal(variances)) is doing is comparing the observations to the predictions by defining the distribution as a multivariate normal likelihood (which is equivalent to what I had been calling a chi-squared statistic). Where I don’t follow is:

  1. Where does the model know to vary parameters? Do I pass it as an argument to yourturingmodel, or do I need to define its distribution (using the uniform distribution that I had earlier mentioned)?
  2. My model makes predictions based on every data point x I pass through to it, I’m calling the whole dataset X, I had expected the procedure to be something like pass through X[1:1000] (for example) to my model, and get y[1:1000], and then I’d sum up all the ‘errors’ (which I call chi-squared). In your example of yourturingmodel seems to be calculating the multivariate normal likelihood from every datapoint x (please correct me if I’m wrong), so how does the model know that each datapoint is meaningless on its own and that the whole distribution per parameter is needed?
  3. My model makes several predictions, each with its own variance. That process is done via a binning process whereby I bin outputs via say x[1] (the first element of the vector x), and plot that against y_pred[1]. Then for each bin in x[1], I calculate (y_pred[1]-y_obs[1])^2/variance, then sum all the results which gives me my chi-squared errors. I do this for as many times as I have relations between some element of x and y, or between elements in y. I don’t see how this is easily replicated via the obs ~ MvNormal(predictions,Diagonal(variances)) because my variances depend on the specific predictions and also I have multiple sets of predictions/obs. Am I getting what your code means or is there some extra layer I’m missing?

Hopefully this better explains what I’m doing. I’m coming from the python package emcee, where you more or less just had to define a log-probability function and used that as an argument in the sampler. I’m more or less trying to replicate that functionality here, but it seems I’m running into trouble because Turing is a more sophisticated statistically programming package.

Turing.jl’s domain specific language is a language for defining posterior densities.

the

@model function foo(bar)
...

Means something close to “define a function foo which takes data bar and returns a model object whose logpdf can be evaluated at various parameter values”

Turing will do the adding up of the individual terms of the logpdf for you. So when you do:

@model foo(obs)
obs ~ MvNormal(preds,covmat)

It knows that obs is data because it was passed into the model… and it knows that to evaluate the posterior density it must add up all the contributions from the differences between preds and obs.

Turing knows that any time you do:

foo ~ Bar(baz)

and foo is not in the argument to the @model function, then it’s a parameter and that parameter is given distribution Bar(baz).

Each data point has some distribution, in your case it sounds like

obs[i] ~ Normal(pred[i],sd[i])

Which is to say that once you’ve got a prediction, the observation is expected to be Normally distributed around that prediction (think of a line through some points, the data points are approximately Normally distributed around that line… this is the simplest sort of case). In your case, each prediction has its own associated size of the variance as well. This is fine, you can define a diagonal covariance matrix with different entries on the diagonal one for each prediction. That’s what Diagonal(variances) does, it creates a diagonal matrix with the variances on the diagonal (variances is a vector of length N for an NxN covariance matrix).

My suggestion is to think less about which terms you need to add up, and think more about generative modeling which is to say something like this:

Imagine that Turing has given you a random draw from the prior distribution of the parameters. Now you know the parameter values, so you make some predictions using your predictive machinery… and your variances are either something you can calculate directly, or are parameters with a random draw themselves so you know their values either way… at this point, you have predictions and variances… You can either generate fake data, or if you have data (in your case) you can determine how probable vs improbable that data would be to a robot that knows how to make these predictions but is using these particular parameters.

This is the general theory behind bayes: if told the parameters, you can determine how “surprising” vs “probable” the data is… and from that you can infer whether the parameters are “good” or “bad”. The “good” parameters are ones that make the data “not very surprising”.

I hope this helps?

1 Like

I think I get most of what you’re saying. Starting from the bottom, I understand the idea of seeing how likely you are to get that data because I think that’s exactly what the log-likelihood is doing.

My main issue here is that there are two ideas about what an ‘observation’ entails. There is x which is one datapoint which I need to pass through mymodel to get some predictions. I can clearly calculate the log-probability of this one datapoint (which I think I can achieve using MvNormal). However, if the MCMC then changes my parameters based on this one datapoint, I think that’d never converge because the variance in X is quite large. So I think based on this, and also what you said regarding adding up the individual terms of the logpdf, then ‘observation’ must be a set of n x’s. If this is the case, then I think I need to randomly sample my dataset X because its too large to run through the whole thing.

With regard to the datapoint and distribution, its not like every datapoint has a distribution, its more like every datapoint belongs to a bin, which has its own variance in a specific relation. For example, say if I run x[1] through my model, I’ll get y_pred[1]. I have relations relating x[1][1] and y[1][1], and say y[1][1], and y[1][2]. So, mechanically, what I have to do is look through a table with all the bins and find the relation and variance for my specific bin—for x[1][1], I look down the table and find the bin which corresponds to the value for x[1][1], then I look up the relation and get y_obs[1][1] as well as std[1][1]. Are you saying that I have to make a covariance matrix of NxN dimensions where N is the length of my input data (length(X))? Since I am sampling from X, this covariance matrix will necessarily be different every time I run through the data, does this mean I have to make both obs and variance an argument to the model, and have it provided every time ad-hoc to the sampling function?

Finally, when you say make different entries one for each prediction, I’m assuming you mean for each y[1:n][1]. But apart from these predictions, my model also needs to be constrained to y[1][1], and y[1][2], which would have its own covariance matrix etc. Do I need to add the distributions somehow, or do I need to concat the distributions lengthwise?

I think my issue is that, while normally when you run an MCMC, you are making predictions which you have the answer (observations) to. While in my case, I am comparing my predictions to a mean relation generated from observations. This adds an extra layer whereby the predictions can’t be directly compared to observations, but rather I need to make the predictions match the mean (and hopefully distribution) of the observations.

I think what i gather from what you’re saying is that you want to fit a model for some summary statistics of your full data. This is fine. Just calculate the summary statistics and pass them in as data. Pretend that the summary statistics are your data. Just like you can pretend that pressure is a thing, instead of actually being individual collisions of molecules with the walls of a container.

It is possible to write a Bayesian model for any level of summarization. You can talk about “the temperature of the earth” as if you were an astronomer in another solar system who just gets a few pixels of infrared data from the earth, or you can talk about “the temperature observations at each thermometer in each weather station at each separate point on the earth” and when doing that you don’t need to question the variation in average energy of molecules at the top part of the thermometer vs at the bottom part… etc.

ALL data is summary data, even at the quantum level in some sense.

I mean I get that. I understand the idea of summary statistics, I’m more asking for how that’s implemented in code. Because my statistics are generated by running mymodel many times, I get a distribution of points. Do I calculate the chi-squared as my summary statistic? From what you’ve told me about Turing, I need to specify a distribution and it automatically sums up all the contributions to the multivariate log normal likelihood. But how do I specify the distribution in the first place? It’s not as if I’m running a function, which takes in a set of values, returns the normal and standard deviation? I’ll try to outline my pipeline and where I think the gaps in my understanding are:

  1. I have a model mymodel which takes in each datapoint x (which is a matrix) and parameters and returns a vector y.
  2. I have a dataset X which contains a very large number of datapoints x (running every x through my model would be prohibitively long).
  3. I have observations which relate some element of x to some element of y or between various elements of y. (I have been using x[1] and y[1] as my example)
  4. For every parameter I’d like to run some number n of elements of x randomly selected from X.
  5. After running for each x and y I get from running mymodel, I can determine the the deviations or error from my observations by calculating y_obs[1]-y_pred[1]. This can be weighted to form the log-likelihood by the variance, which is given for every bin in x[1].
  6. My previous method was to just sum up all these likelihoods, but from what you’re telling me, there’s a smarter way of achieving the same. But I don’t see how I can specify all the deviations and variances for my random selection.
  7. When I have found the sum of the likelihoods, I’d sum over all the other likelihoods from my other distributions before (say y[1] and y[2]). I don’t know how I can do this now when I still need to specify the deviations and variances for each random selection. Is the only solution to make a covariance matrix of size length(X) by length(X)? Because that’d be prohibitively large.
  8. Once I have the various likelihoods from my distributions, I’d sum them previously, but in the Turing schema, I don’t know how to join the distributions/likelihoods.

I feel like most of what you’re saying makes sense to me. But I just don’t know how that applies to the Turing structure because it feels so different to what I was doing previously.

Trying to see if I understand your problem… You have a model which takes in an x data point and some parameters and outputs a y prediction… You also have some y observed.

Conceptually this could look like this inside the @model :

for i in eachindex(yobs)
   yobs[i] ~ Normal(mymodel(xobs[i],myparams),somesd)
end

This isn’t the fastest way to code it, but conceptually, you’re saying that given the parameters, and some covariates xobs[i] you can predict that yobs[i] should be near mymodel(xobs[i],myparams) to within some error somesd using a gaussian error term.

Now somesd may be either a parameter, or maybe a function of parameters and data… like for example maybe xobs[i] is a complex object and you need somesd(someparam,xobs[i]) which computes the models expected error given both some parameter about variation overall and some information in the observed data.

Without more specifics about your problem it’s hard to give anything but kind of general suggestions.

I’ll mention a few other things based on rereading your post:

So start by taking a sample. Randomly select 100 or 1000 or however many you can afford. Pass in this sub-sample as your data.

If you’re using independent observations / predictions, then you’re talking about a diagonal covariance matrix and it’s really just a vector of N variances, which is represented as a Diagonal() object (ie. it’s a special kind of sparse matrix type)

In Turing you express a generative model and Turing figures out how to calculate its pdf.

Yes that’s exactly how it would work in my mind. However as you’ve said this is pretty un-Turing-like. I read in the docs that multivariate distributions is preferred over this looping of univariate distributions. But I don’t quite know how I’d make it into a multivariate one because yobs has arbitrary length and because what xobs is varies.

So from this step, would the sampler know to pass in a different xobs from my random sampler already? Or would I have to put the random sampler inside my @model too?

Finally, I’m sorry for speaking in rather vague terms throughout. I feel like if I was more explicit about the project it would be quite unfair to you because I would be asking you to do it for me instead. I hope you understand.

Take a subsample of your y and x values and pass that to your turing model. Within a given fit, the x and y data values are constant.

Thanks, I think I better understand how this works now.

Great. Come back and give some plots or something once you’ve got something that looks like it’s working. Look forward to seeing you get some results!