# Sampling (no inference) from a mixture model (compound probability distribution)

I’ve never used Turing before and don’t have a lot of statistics knowledge but I’d love to be able to do this kind of thing anyway:

I don’t have any data. I only want to compute the expectation value of a somewhat complicated random variable.

Suppose you have N (e.g. 800_000) things, each of which is “a failure” with (i.i.d.) probability Q (e.g. 5%) and otherwise “a success”. Call the total number of failures K (I write K ~ Binomial(N, Q)). Suppose I draw a sub-sample of size n (e.g. 10_000) and count the number of failures in the sample; call it k (I write k ~ Hypergeometric(K, N-K, n), which is now a compound probability distribution).

Suppose I repeat this a million times. Is Turing.jl a good tool for getting samples for (K, k) even though I don’t want inference? I tried

```julia
using Turing, StatsPlots, Random

@model function gdemo()
    N = 800_000
    Q = 0.05
    n = 10_000

    K ~ Binomial(N, Q)
    k ~ Hypergeometric(K, N - K, n)
end

# Chose some arbitrary sampler...
my_sample = sample(gdemo(), SMC(), 1_000_000)
```


This runs (single-threaded) and produces some samples. However, it takes an hour on my laptop. Is that normal, or am I doing it wrong?

As a sanity check, I expect the distribution of k/n to be very close to Normal(Q, sqrt(Q*(1-Q) / n)), which looks ok:

```julia
mean(my_sample[:k] / 10_000)      # 0.0500006641
std(my_sample[:k] / 10_000)       # 0.002180021202067455
sqrt(0.05 * (1 - 0.05) / 10_000)  # 0.0021794494717703367
```


However, does the way I’m using Turing (with this or some other sampler) get me closer to the true distribution than this Normal approximation? I also want to try different combinations of N, n, Q where the approximation may be bad.

As the next step in the process I’m modeling, after drawing my sub-sample and counting failures k, I build a discrete random variable C (which can e.g. take values 0…10). I have functions $f_0, \dots, f_{10}$ of two arguments (“like K and k”), which are pure and fast to evaluate but have no analytic form (some array look-ups with data). Define

$$C = \operatorname{argmax}_{c \in \{0,\dots,10\}} f_c(K, k).$$

Finally, I want to find an “optimal sub-sample size n”:

$$n_{\text{opt}} = \operatorname{argmax}_n \mathbb{E}_{K, k, C}\left[ f_C(N Q, n Q) \right].$$

Note that C is also a random variable, so “each sample has to use its own f_C”. However, if I put something like C = argmax(... k ... K ...) into the model definition, C isn’t a random variable in the model, and I haven’t seen anything in the Turing guide hinting at how to do this. How can I get the expectation value that depends on C? I don’t necessarily need to really do the argmax over n; maybe just trying a few values and picking the best is fine.

Is it reasonable to do this with Turing.jl? I’m not quite sure yet how to approach it.

Maybe I’m not understanding, but couldn’t you just do this to draw the samples for (K, k)?

```julia
using Distributions

function gdemo(N, Q, n)
    d_binom = Binomial(N, Q)
    K = rand(d_binom)
    d_hyper = Hypergeometric(K, N - K, n)
    k = rand(d_hyper)
    return (K, k)
end

samples = [gdemo(800_000, 0.05, 10_000) for _ in 1:1_000_000]
```

```julia
mean([s[2] / 10_000 for s in samples])  # ≈ 0.05
```

This runs in less than a second on my machine.
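The same direct approach extends naturally to the later C = argmax step, since C is just a deterministic function of each (K, k) draw. Here is a rough sketch; the function `f` below is a made-up stand-in for your real $f_0, \dots, f_{10}$ (which would be array look-ups), and the candidate values of n are arbitrary:

```julia
using Distributions, Statistics

# Hypothetical stand-in for the real f_0, ..., f_10; any pure, fast
# function of (c, K, k) works here.
f(c, K, k) = -abs(k - c * K / 10_000)

function draw(N, Q, n)
    K = rand(Binomial(N, Q))
    k = rand(Hypergeometric(K, N - K, n))
    return (K, k)
end

# Monte Carlo estimate of E[f_C(N*Q, n*Q)] for one candidate n,
# computing C = argmax_c f_c(K, k) per sample.
function expected_payoff(N, Q, n; nsamples = 100_000)
    mean(1:nsamples) do _
        K, k = draw(N, Q, n)
        C = argmax(c -> f(c, K, k), 0:10)  # deterministic given (K, k)
        f(C, N * Q, n * Q)
    end
end

# "Optimal" n by just trying a few values and picking the best:
candidates = (1_000, 5_000, 10_000, 20_000)
best_n = argmax(n -> expected_payoff(800_000, 0.05, n), candidates)
```

Because each sample carries its own C, the per-sample `argmax` automatically gives you “each sample uses its own f_C”.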

While you can use Turing, it’s somewhat overkill, as your model defines an unconditional density, i.e., one without any observed values.
In particular, you do not want to use any inference algorithm (designed to sample from some intractable conditional distribution), but instead just sample from the prior:

```julia
my_sample = sample(gdemo(), Prior(), 1_000_000)
```


which is reasonably fast on my machine, albeit quite a bit slower than the direct approach by @mthelm85.
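If you do want to stay inside Turing, deterministic quantities like C can simply be returned from the model and extracted after sampling with `generated_quantities` (note: in recent Turing versions this function may be called `returned` instead). A sketch, again with a made-up stand-in for the real $f_c$:

```julia
using Turing, Statistics

# Hypothetical stand-in for the real f_0, ..., f_10.
f(c, K, k) = -abs(k - c * K / 10_000)

@model function gdemo_with_C()
    N = 800_000
    Q = 0.05
    n = 10_000
    K ~ Binomial(N, Q)
    k ~ Hypergeometric(K, N - K, n)
    # C is a deterministic function of (K, k); just return it.
    C = argmax(c -> f(c, K, k), 0:10)
    return (; K, k, C)
end

chain = sample(gdemo_with_C(), Prior(), 10_000)
gq = generated_quantities(gdemo_with_C(), chain)  # one (K, k, C) NamedTuple per draw
mean(q.C for q in vec(gq))
```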


The direct sampling is actually faster? I guess I expected the package to do some kind of “magic” that it’s not built for. I will sample directly. Thanks!
