Compositional Sampling in Turing - getting exactly same samples (standard deviation is zero)

Hello Everyone,

I have just started to use Turing. I want to implement a DINA Model ( DINA Model and Parameter Estimation: A Didactic (jstor.org)) which contains both discrete and continuous random variables. Hence, I was looking into Compositional sampling using Turing. But I keep getting exactly same samples of my continuous random variables, resulting in 0 standard deviation. Please help me understand what I may be doing wrong. Below is a version of my code that I used to replicate the issue so that I can post it here.
Thanks for the help!

Code

using Turing

using MCMCChains

@model function fn(I,X)

    #I represents number of students

  
    #discrete random variable - which dummy class a student may below to
    α ~ Bernoulli(0.5)

    # continuous random variable - probability of getting the answer correct based on the class student belongs to
    p ~ filldist(Beta(1,1), I, 2)
   
    #X represents that answers these students gave for a question
    X ~ arraydist(Bernoulli.(p[:, α+1])) # selecting p

end

# Sampling (Prior) from the model
X_sample = fn(50,missing)()

# Sampling (posterior) based on the prior sampled X. The idea here is just to see whether the code works
chain = sample(fn(50,X_sample), Gibbs(MH(:α), NUTS(100,0.85, :p)), 100)

Output (Summary):

Summary Statistics
  parameters      mean       std   naive_se      mcse       ess      rhat   es â‹Ż
      Symbol   Float64   Float64    Float64   Float64   Float64   Float64      â‹Ż

           α    1.0000    0.0000     0.0000    0.0000       NaN       NaN      ⋯
      p[1,1]    0.0811    0.0000     0.0000    0.0000    2.0408    0.9899      â‹Ż
      p[2,1]    0.6207    0.0000     0.0000    0.0000       NaN       NaN      â‹Ż
      p[3,1]    0.9574    0.0000     0.0000    0.0000       NaN       NaN      â‹Ż
      p[4,1]    0.4273    0.0000     0.0000    0.0000       NaN       NaN      â‹Ż
      p[5,1]    0.0832    0.0000     0.0000    0.0000       NaN       NaN      â‹Ż
      p[6,1]    0.7725    0.0000     0.0000    0.0000    2.0408    0.9899      â‹Ż
      p[7,1]    0.8586    0.0000     0.0000    0.0000       NaN       NaN      â‹Ż
      p[8,1]    0.2287    0.0000     0.0000    0.0000    2.0408    0.9899      â‹Ż
      p[9,1]    0.3254    0.0000     0.0000    0.0000       NaN       NaN      â‹Ż
     p[10,1]    0.6004    0.0000     0.0000    0.0000    2.0408    0.9899      â‹Ż

Output (actual samples):

So I tried to use HMC(0.2, 10, :p) instead of NUTS(100,0.85, :p) and PG(30,:α) instead of MH(:α). And the problem got fixed (samples below). Could it mean that I am not using MH and NUTS correctly?

Answer (actual samples):

α is defined as a single variable that is drawn from a Bernoulli distribution, rather than having a different value for each student – is this intended?

NUTS uses those warmup samples to estimate things like the mass matrix and to get into the high probability region, 100 samples is often not enough warm up at all. What happens if you give it 1000 ?

NUTS(1000,0.85,:p)

Thanks @dlakelan. Giving 1000 samples for warm-up solved the issue with continuous variables :smiley:. But still facing issue with the discrete variable alpha. I will keep on trying by further increasing the warm-up.

I did it to just keep the demo problem simple. In the actual model, alpha is different for each student.

I found that increasing the number of iterations did not reliably solve the problem for p. On some runs, it worked, but on other runs it did not work.

I wonder if it is possible to marginalize over p to improve performance.

1 Like

It can be extremely challenging to sample a discrete variable where with the different values of the discrete variable the continuous variables need to be very different it just makes it nearly impossible to jump from one discrete value to another without also changing the continuous variables in a large jump as well

1 Like

I think there are a couple of issues here.

First off, as others have said, it’s quite difficult to sample from such a model, in particular when you have a switching mechanism like this.

To see this, consider the first step for this sampler:

  1. Select a realization for α, e.g. 1.
  2. Take a step using HMC for p, effectively just updating p[:, 2] since the gradient wrt. p[:, 1] is in this case 0.

(2) likely brought p[:, 2] to a region of the parameter space which explains the observed X better than the random initialization, i.e. p[:, 2] before performing (2).

Now suppose we propose α=0 in the next iteration of MH. The MH-ratio will then compare the randomly initialized p[:, 1] (since we did not update this in (2)) with the p[:, 2] for which we have taken a step already. This will likely be rejected, and we move onto the HMC sampler.

The HMC sampler will probably stay in a high-probability region of the space for most of the time, and so we’ll be “stuck” in this scenario where we are comparing the fit of a randomly initialized p[:, 1] to the updated p[:, 2].

This is also probably why you made it work with HMC(0.2, 10): NUTS probably choose a larger stepsize than 0.2 and therefore ended up proposing a “better” p since it could move further in the parameter space. But, for reasons explained below, I don’t think using HMC(0.2, 10) will help here, i.e. you just got lucky with your run.

Secondly (and most importanly for the particular example you’ve given), you’ve simplified the model to a point where whether we choose p[:, 1] or p[:, 2] is completely irrelevant:) α=0 and α=1 are equally probably under your prior, and similarly for all the ps. And so whatever observations X you give me, it’s completely irrelevant which value of α we choose, as long as we choose a good value of p[:, α+1]. Once we’ve made such a choice, and we’ve found "good enough p[:, α+1], it’s very unlikely that we’ll ever accept a move away from this to some randomly initialized values of p.

For example, if you try changing

p ~ filldist(Beta(1,1), I, 2)

to

p ~ filldist(Beta(2, 5), I, 2)

and run the chain for a bit longer, e.g. 1000 steps, you find that both α and p are not stuck anymore.

Thirdly, you need to take care when using NUTS in discrete-continuous models because of how it adapts the parameters. NUTS might end up choosing sampler-parameters that works well for a particular realization of the discrete variables. Then once you change the value of the discrete parameters, the sampler-parameters that NUTS found before, can now be completely useless.

I don’t think this is the cause of the issue here, but it’s something to keep in mind.

3 Likes

If you decide to marginalize out alpha, you might consider defining a custom distribution. Perhaps a starting point is:

import Distributions: ContinuousUnivariateDistribution, logpdf

struct Mixture{T} <: ContinuousUnivariateDistribution
    c1::T
    c2::T
    w::T
end

Broadcast.broadcastable(x::Mixture) = Ref(x)

loglikelihood(d::MixtureModel, y::Int) = logpdf(d, y)

function logpdf(d::Mixture, y::Int)
    LLs = [log(d.w) + logpdf(Bernoulli(d.c1), y),
        log(1 - d.w) + logpdf(Bernoulli(d.c2), y)]
    return logsumexp(LLs)
end
1 Like

I agree, I am seeing the same results. I am not sure how to marginalize over p, but I wrote similar code in Pyro using NUTS and Pyro marginalizes over discrete random variables i.e. alpha. I didn’t face the issue getting same samples in Pyro.

Also, thanks for the code to marginalize p, I’ll surely give it a try :slight_smile: . I wonder should we marginalize over the alpha or p though?

1 Like

You are right. I misspoke. Each student has two p parameters, which forms a mixture α * p1 + (1 - α) * p2. I believe the mixture distribution I created does just that. It marginalizes out the two values of p for each student.

2 Likes

Thanks for the detailed explanation @torfjelde :slight_smile: . Below is the original code that I used to represent the DINA model using Turing. Here based on alpha, I either choose the guess or (1-slip) continuous random variable to represent the probability of the X. This code is giving me the same problem of getting exactly the same samples. Increasing warmups help but not reliably. Marginalizing over alpha and using NUTS (based on my Pyro code) does help reliably, even parameter recovery is great, but doesn’t scale if alpha is more than 100 or so students and have more than 5 skills. I wonder should I:

  • Try to think of implementing Alpha as continuous rather than discrete so that I can use NUTS.
  • Try to marginalize over alpha based on the code recommendation given by @Christopher_Fisher and see how far can I scale the model.
  • Keep alpha discrete but not go for differentiation-based algo like NUTS and try for Particle-based algos so that I don’t need to use Compositional Sampling. I did try Importance sampling algo in Turing. Although it worked but parameter recovery was not so great.

For my parameter recovery test, I generate X using my implementation of the DINA model but parameters are known beforehand and belong to the same distribution as in the model. Then I give the X back to the model and see if similar parameter values can be recovered using posterior sampling (I take mean of posterior to compare).

Thanks for all the comments. It’s helping me to think further about my problem.

In the code below,

  • i = number of students
  • j = number of questions
  • k = number of skills required to answer the questions
  • Q is {j,k} dims and contains {0,1}. Q represents which set of k skills does the jth question requires.
  • X is {i,k} dims and contains {0,1}. X represents whether the answer given by ith student for the jth question is correct or not.

This implementation is similar to the one suggested in Bayesian Estimation of the DINA Model With Gibbs Sampling

@model function dina_model(i,X,Q)

j,k = size(Q)

#Item parameters
guess ~ filldist(Beta(1,1), 1,j)
slip ~ filldist(Beta(1,1), 1,j)

#Student and skill parameters
alpha ~ filldist(Bernoulli(0.5),i,k)

#Calculating skills required to answer correctly
skills_mastered = alpha * Q’
skills_required = sum(Q, dims = 2)

#Represents ideal situation to answer the question i.e. when student is not cheating
eta = skills_mastered .== skills_required’

#Selecting either a or b based on eta
a = guess.^([1].-eta)
b = ([1].-slip).^eta

prob_X = a.*b

X ~ arraydist(Bernoulli.(prob_X))

return X

end

I implemented this code change and used HMC with 1000 steps, it seems to have fixed the problem. Also with 1000 steps, even if I use Beta(1,1), it seems to work reliably.

But then I tried with NUTS(2000,0.85, :p) and Beta(2,5) and got strange results. For some p’s, I got stdev as 0 but for the others, stdev was same as the mean.