Assume the following: I repeat a game over and over in which I roll a die a number of times. If I roll a 6 on any of the trials, I win. Otherwise, I lose. I want to model how the probability of winning this game changes as the number of trials increases.
In this case, I obviously know the answer when I’m using a standard 6-sided die:
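Writing it out: the chance of at least one six in N rolls is the complement of never rolling a six, so

$$P(\text{win}) = 1 - \left(1 - \frac{1}{6}\right)^N = 1 - \left(\frac{5}{6}\right)^N$$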
```julia
using StatsPlots

# probability of rolling a six on a single throw
p = 1/6

# probability of winning as a function of N (number of rolls)
P(N) = 1 - (1 - p)^N

plot(
    P, 0:20;
    ylims=(0, 1), label="true",
    xlabel="No. Rolls", ylabel="P(success)",
)
```
Let’s assume, though, that I do not know the probability of success for a given trial and I do not know how this probability is related to the number of trials. I just have data that show how many trials I ran during each round, and whether or not I won:
```julia
using DataFrames
using Distributions

data = DataFrame(num_rolls = rand(5:14, 300))
data.success = [rand(Binomial(row.num_rolls, p)) > 0 for row in eachrow(data)]
```

```
300×2 DataFrame
 Row │ num_rolls  success
     │ Int64      Bool
─────┼────────────────────
   1 │        10     true
   2 │        12    false
   3 │        11     true
   4 │        14     true
   5 │        10     true
  ⋮  │     ⋮         ⋮
 297 │         7     true
 298 │         5     true
 299 │         7    false
 300 │         5     true
          291 rows omitted
```
An important note here is that I've deliberately chosen N to be a random number between 5 and 14, since that range yields an overall success rate of roughly 80% in most simulated datasets.
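As a quick sanity check of that 80% figure: averaging the analytic win probability over the uniform range N ∈ 5:14 used to generate the data comes out close to 0.8.

```julia
using Statistics

# true single-roll probability and win probability (same as above)
p = 1/6
P(N) = 1 - (1 - p)^N

# average win probability over the uniform N ∈ 5:14 used to generate the data
mean(P(N) for N in 5:14)   # ≈ 0.798
```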
Since the variable of interest is binary, logistic regression is a common choice. I implement it as follows:
```julia
using StatsFuns
using Turing

@model function success_model(success, num_rolls)
    α ~ Normal(-5, 0.1)
    β ~ Normal(0, 1.5)
    for i ∈ eachindex(success)
        p = logistic(α + β * num_rolls[i])
        success[i] ~ Bernoulli(p)
    end
end
```
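Just to make the effect of that intercept prior concrete: with α centered at -5, the implied success probability at N = 0 is essentially zero.

```julia
using StatsFuns

# implied P(success) at N = 0 under the prior mean α = -5
logistic(-5.0)   # ≈ 0.0067
```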
I’ve used a really strong prior for my intercept term since I know that the probability of success is zero when the number of rolls is zero. Now I fit my model and compare its average predictions across different values of N to the actual values:
```julia
m = success_model(data.success, data.num_rolls)
chain = sample(m, NUTS(), 1_000)
chaindf = DataFrame(chain)

ᾱ = mean(chaindf.α)
β̄ = mean(chaindf.β)

plot(
    P, 0:20;
    ylims=(0, 1), label="true",
    xlabel="No. Rolls", ylabel="P(success)",
)
plot!(
    x -> logistic(ᾱ + β̄ * x), 0:20,
    label="model",
)
```
As you can see, this is a pretty terrible model. My next thought was to get rid of the strong prior on the intercept term, so I did that by simply increasing the standard deviation to 2:
```julia
@model function success_model(success, num_rolls)
    α ~ Normal(-5, 2)
    β ~ Normal(0, 1.5)
    for i ∈ eachindex(success)
        p = logistic(α + β * num_rolls[i])
        success[i] ~ Bernoulli(p)
    end
end
```
Now, I have this:
This is a lot better, but still not very satisfying. This model tells me I have a nonsensical chance of success of roughly 25% if I don't roll the die a single time. I've played around with different priors for the intercept term, and there seems to be a tradeoff between having the model be grounded in reality at N = 0 and having it produce accurate results for higher values of N. I also tried dropping the intercept term completely, and that wasn't good either. In addition, I tried balancing the data by oversampling the rows where `success == false`, again to no avail.
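For reference, the no-intercept variant I tried was along these lines (note that it can't be grounded at zero either, since logistic(β · 0) = 0.5 regardless of β):

```julia
@model function success_model_no_intercept(success, num_rolls)
    β ~ Normal(0, 1.5)
    for i ∈ eachindex(success)
        # without an intercept, N = 0 forces p = logistic(0) = 0.5
        p = logistic(β * num_rolls[i])
        success[i] ~ Bernoulli(p)
    end
end
```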
I guess my question is: is there a better link function I can use that will give a more accurate representation of my true underlying model? Or, is it simply a matter of being aware that predictions are less accurate for lower values of N? I can visualize a confidence interval for the average probability of success across different values of N and see that the model has less confidence at lower values of N:
```julia
μ = [
    logistic(row.α + row.β * x)
    for row in eachrow(chaindf), x in 0:20
]
μ̄ = [mean(c) for c in eachcol(μ)]
μ₉₀ = hcat([percentile(c, [5, 95]) for c in eachcol(μ)]...)'

plot!(
    0:20, [μ̄ μ̄],
    fillrange=μ₉₀,
    labels=["CI for model" ""],
    color=:lightgrey, alpha=0.5, linewidth=0,
)
```
Or, lastly, is it just a matter of choosing the threshold value wisely when using the model for prediction tasks? I would love to hear thoughts/advice/feedback on how to model this kind of problem. I have a real-world problem that is loosely analogous to this example, and it's pretty scary knowing I can get such dramatically different results depending on my choice of prior. I thought I was doing the right thing with the strong prior on the intercept, since I know for sure there is no chance of success when N = 0, but in this example doing so results in a really terrible fit.