# Sampling (no inference) from a mixture model (compound probability distribution)

I’ve never used Turing before and don’t have a lot of statistics knowledge but I’d love to be able to do this kind of thing anyway:

I don’t have any data. I only want to compute the expectation value of a somewhat complicated random variable.

Suppose you have N (e.g. 800_000) things, each of which is “a failure” with (i.i.d.) probability Q (e.g. 5%) and otherwise “a success”. Call the total number of failures K (I write K ~ Binomial(N, Q)). Suppose I draw a sub-sample of size n (e.g. 10_000) and count the number of failures in the sample; call it k (I write k ~ Hypergeometric(K, N-K, n), which is now a compound probability distribution).

Suppose I repeat this a million times. Is Turing.jl a good tool for getting samples for (K, k) even though I don’t want inference? I tried

```julia
using Turing, StatsPlots, Random

@model function gdemo()
    N = 800_000
    Q = 0.05
    n = 10_000

    K ~ Binomial(N, Q)
    k ~ Hypergeometric(K, N - K, n)
end

# Chose some arbitrary sampler...
my_sample = sample(gdemo(), SMC(), 1_000_000)
```


This runs (single-threaded) and produces some samples. However, it takes an hour on my laptop. Is that normal, or am I doing it wrong?

As a sanity check, I expect the distribution of k/n to be very close to Normal(Q, sqrt(Q*(1-Q) / n)), which looks ok:

```julia
mean(my_sample[:k] / 10_000)      # 0.0500006641
std(my_sample[:k] / 10_000)       # 0.002180021202067455
sqrt(0.05 * (1 - 0.05) / 10_000)  # 0.0021794494717703367
```


However, does the way I’m using Turing (with this or some other sampler) get me closer to the true distribution than this Normal approximation? I also want to try different combinations of N, n, Q where the approximation may be bad.

As the next step in the process I’m modeling, after drawing my sub-sample and counting failures k, I build a discrete random variable C (which can e.g. take values 0…10). I have functions $f_0, \dots, f_{10}$ of two arguments (“like K and k”), which are pure and fast to evaluate but have no analytic form (some array look-ups with data). Define

$$C = \operatorname{argmax}_{c \in \{0,\dots,10\}} f_c(K, k).$$

Finally, I want to find an “optimal sub-sample size n”:

$$n_{\text{opt}} = \operatorname{argmax}_n \mathbb{E}_{K, k, C}\left[ f_C(N Q, n Q) \right].$$

Note that C is also a random variable, so “each sample has to use its own f_C”. However, if I put something like C = argmax(... k ... K ...) into the model definition, C isn’t a random variable in the model, and I haven’t seen anything in the Turing guide hinting at how to do this. How can I get the expectation value that depends on C? I don’t necessarily need to really do the argmax over n; maybe just trying a few values and picking the best is fine.

Is it reasonable to do this with Turing.jl? I’m not quite sure yet how to approach it.

Maybe I’m not understanding, but couldn’t you just do this to draw the samples for (K, k)?

```julia
using Distributions

function gdemo(N, Q, n)
    d_binom = Binomial(N, Q)
    K = rand(d_binom)
    d_hyper = Hypergeometric(K, N - K, n)
    k = rand(d_hyper)
    return (K, k)
end

samples = [gdemo(800_000, 0.05, 10_000) for _ in 1:1_000_000]
```

```julia
mean([s[2] / 10_000 for s in samples])  # ≈ 0.05
```

This runs in less than a second on my machine.
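The same direct approach extends naturally to the later C = argmax step, since C is just a deterministic function of each (K, k) draw. Here is a rough sketch; the function `f` below is a made-up stand-in for your real $f_0, \dots, f_{10}$ (which would be array look-ups), and the candidate values of n are arbitrary:

```julia
using Distributions, Statistics

# Hypothetical stand-in for the real f_0, ..., f_10; any pure, fast
# function of (c, K, k) works here.
f(c, K, k) = -abs(k - c * K / 10_000)

function draw(N, Q, n)
    K = rand(Binomial(N, Q))
    k = rand(Hypergeometric(K, N - K, n))
    return (K, k)
end

# Monte Carlo estimate of E[f_C(N*Q, n*Q)] for one candidate n,
# computing C = argmax_c f_c(K, k) per sample.
function expected_payoff(N, Q, n; nsamples = 100_000)
    mean(1:nsamples) do _
        K, k = draw(N, Q, n)
        C = argmax(c -> f(c, K, k), 0:10)  # deterministic given (K, k)
        f(C, N * Q, n * Q)
    end
end

# "Optimal" n by just trying a few values and picking the best:
candidates = (1_000, 5_000, 10_000, 20_000)
best_n = argmax(n -> expected_payoff(800_000, 0.05, n), candidates)
```

Because each sample carries its own C, the per-sample `argmax` automatically gives you “each sample uses its own f_C”.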

While you can use Turing, it’s somewhat overkill, as your model defines an unconditional density, i.e., one without any observed values.
In particular, you do not want to use any inference algorithm (designed to sample from some intractable conditional distribution), but instead just sample from the prior:

```julia
my_sample = sample(gdemo(), Prior(), 1_000_000)
```


which is reasonably fast on my machine, albeit quite a bit slower than the direct approach by @mthelm85.
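If you do want to stay inside Turing, deterministic quantities like C can simply be returned from the model and extracted after sampling with `generated_quantities` (note: in recent Turing versions this function may be called `returned` instead). A sketch, again with a made-up stand-in for the real $f_c$:

```julia
using Turing, Statistics

# Hypothetical stand-in for the real f_0, ..., f_10.
f(c, K, k) = -abs(k - c * K / 10_000)

@model function gdemo_with_C()
    N = 800_000
    Q = 0.05
    n = 10_000
    K ~ Binomial(N, Q)
    k ~ Hypergeometric(K, N - K, n)
    # C is a deterministic function of (K, k); just return it.
    C = argmax(c -> f(c, K, k), 0:10)
    return (; K, k, C)
end

chain = sample(gdemo_with_C(), Prior(), 10_000)
gq = generated_quantities(gdemo_with_C(), chain)  # one (K, k, C) NamedTuple per draw
mean(q.C for q in vec(gq))
```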


The direct sampling is actually faster? I guess I expected the package to do some kind of “magic” that it’s not built for. I will sample directly. Thanks!
