I’ve never used Turing before and don’t have a lot of statistics knowledge, but I’d love to be able to do this kind of thing anyway:
I don’t have any data. I only want to compute the expectation value of a somewhat complicated random variable.
Suppose you have N (e.g. 800_000) things, each of which is a “failure” with i.i.d. probability Q (e.g. 5%) and otherwise a “success”. Call the total number of failures K (I write K ~ Binomial(N, Q)). Suppose I draw a sub-sample of size n (e.g. 10_000) and count the number of failures in the sample; call it k (I write k ~ Hypergeometric(K, N-K, n), which is now a compound probability distribution).
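To make that concrete: without Turing, the plain forward simulation I have in mind would look something like this (Base Julia only; `draw_K` and `draw_k` are just names I made up for illustration):

```julia
# Illustration only: direct forward simulation of the compound model.

# K ~ Binomial(N, Q): count N independent failures with probability Q each.
draw_K(N, Q) = count(_ -> rand() < Q, 1:N)

# k ~ Hypergeometric(K, N - K, n): draw n items without replacement from a
# population of N items containing K failures, and count the sampled failures.
function draw_k(K, N, n)
    failures = 0
    remaining_failures, remaining_total = K, N
    for _ in 1:n
        if rand() < remaining_failures / remaining_total
            failures += 1
            remaining_failures -= 1
        end
        remaining_total -= 1
    end
    return failures
end

K = draw_K(800_000, 0.05)
k = draw_k(K, 800_000, 10_000)
```

Repeating the last two lines a million times would give me the (K, k) samples, but I assumed Turing would make this nicer to extend as the model grows.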
Suppose I repeat this a million times. Is Turing.jl a good tool for getting samples of (K, k) even though I don’t want inference? I tried:
using Turing, StatsPlots, Random

@model function gdemo()
    N = 800_000
    Q = 0.05
    n = 10_000
    K ~ Binomial(N, Q)
    k ~ Hypergeometric(K, N - K, n)
end

# Choose some arbitrary sampler...
my_sample = sample(gdemo(), SMC(), 1_000_000)
This runs (single-threaded) and produces some samples. However, this takes an hour on my laptop. Is that normal or am I doing it wrong?
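One thing I wasn’t sure about: the docs also mention a Prior sampler, which (if I understand correctly) draws directly from the prior without any inference machinery. Maybe that’s the intended tool here rather than SMC? I haven’t benchmarked it:

```julia
# My assumption: Prior() just draws from the prior, which is all I need here.
my_sample = sample(gdemo(), Prior(), 1_000_000)
```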
As a sanity check, I expect the distribution of k/n to be very close to Normal(Q, sqrt(Q*(1-Q) / n)), which looks ok:
mean(my_sample[:k] / 10_000) # 0.0500006641
std(my_sample[:k] / 10_000) # 0.002180021202067455
sqrt(0.05*(1-0.05) / 10_000) # 0.0021794494717703367
However, does the way I’m using Turing (with this or some other sampler) get me closer to the true distribution than this Normal approximation? I also want to try different combinations of N, n, Q where the approximation may be bad.
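For example, the kind of parameter grid I have in mind (values picked arbitrarily; I expect small n and extreme Q to be where the normal approximation is worst):

```julia
# Arbitrary example grid of (N, Q, n) combinations to compare against
# the normal approximation.
for N in (1_000, 800_000), Q in (0.001, 0.05, 0.5), n in (100, 10_000)
    n <= N || continue  # the sub-sample can't exceed the population
    # ... draw samples of (K, k) for this combination and compare ...
end
```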
As the next step in the process I’m modeling, after drawing my sub-sample and counting the failures k, I build a discrete random variable C (which can e.g. take values 0…10). I have functions f_0, …, f_10 of two arguments (“like K and k”), which are pure and fast to evaluate but have no analytic form (some array look-ups with data). Define C = argmax_i f_i(K, k).
Finally, I want to find an “optimal sub-sample size n”, i.e. the n that maximizes the expectation value of f_C(K, k).
Note that C is also a random variable, so “each sample has to use its own f_C”. However, if I put something like C = argmax(... k ... K ...) into the model definition, C isn’t a random variable in the model, and I haven’t seen anything in the Turing guide hinting at how to do this. How can I get an expectation value that depends on C? I don’t necessarily need to really do the argmax over n; maybe just trying a few values and picking the best is fine.
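From skimming the API docs, my best guess is that deterministic quantities can be computed inside the model, returned, and extracted afterwards with generated_quantities. Here `fs` is a hypothetical placeholder for my real lookup functions f_0 … f_10, and I’m not sure this is the idiomatic pattern:

```julia
using Turing, Statistics

# Hypothetical placeholder for my real lookup tables f_0 ... f_10:
fs = [(K, k) -> -(k / 10_000 - c / 10)^2 for c in 0:10]

@model function gdemo2()
    N = 800_000
    Q = 0.05
    n = 10_000
    K ~ Binomial(N, Q)
    k ~ Hypergeometric(K, N - K, n)
    # Deterministic per-sample quantities, returned for later extraction.
    C = argmax(i -> fs[i + 1](K, k), 0:10)
    return (C = C, value = fs[C + 1](K, k))
end

chain = sample(gdemo2(), Prior(), 1_000)
gq = generated_quantities(gdemo2(), chain)
estimate = mean(q.value for q in gq)
```

If something like this works, then "each sample uses its own f_C" automatically, since C is recomputed per sample.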
Is it reasonable to do this with Turing.jl? I’m not quite sure yet how to approach it.