Assume the following: I repeat a game over and over in which I roll a die a number of times. If I roll a 6 on any of the trials, I win. Otherwise, I lose. I want to model how the probability of winning this game changes as the number of trials increases.
In this case, I obviously know the answer when I’m using a standard 6-sided die:
using StatsPlots
# probability of rolling a six on a single throw
p = 1/6 
# probability as a function of N (number of rolls)
P(N) = 1 - (1 - p)^N 
plot(
	P,
	0:20;
	ylims=(0,1),
	label="true",
	xlabel="No. Rolls",
	ylabel="P(success)"
)

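As a quick sanity check on that closed form (nothing new here, just the same `p` and `P(N)` from above):

```julia
# No six appears in N independent throws with probability (1 - p)^N,
# so the win probability is the complement.
p = 1/6
P(N) = 1 - (1 - p)^N

P(0)    # 0.0 — can't win without rolling
P(1)    # ≈ 0.167 — a single roll
P(20)   # ≈ 0.974 — nearly certain after 20 rolls
```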
Let’s assume, though, that I do not know the probability of success for a given trial and I do not know how this probability is related to the number of trials. I just have data that show how many trials I ran during each round, and whether or not I won:
using DataFrames
using Distributions
# each round uses a random number of rolls between 5 and 14
data = DataFrame(num_rolls = rand(5:14, 300))
# a round is a success if at least one of the num_rolls throws comes up six
data.success = [rand(Binomial(row.num_rolls, p)) > 0 for row in eachrow(data)]
300×2 DataFrame
 Row │ num_rolls  success 
     │ Int64      Bool    
─────┼────────────────────
   1 │        10     true
   2 │        12    false
   3 │        11     true
   4 │        14     true
   5 │        10     true
  ⋮  │     ⋮         ⋮
 297 │         7     true
 298 │         5     true
 299 │         7    false
 300 │         5     true
          291 rows omitted
An important note here: I’ve deliberately chosen N to be a random integer between 5 and 14, since that yields an overall success rate of roughly 80% in most simulated datasets.
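To double-check that 80% figure against the closed form from the top of the post (no new assumptions, just the same P(N)):

```julia
using Statistics

# Average win probability when the number of rolls is drawn
# uniformly from 5:14, using P(N) = 1 - (5/6)^N.
mean(1 - (5/6)^N for N in 5:14)   # ≈ 0.798
```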
Since the variable of interest is binary, logistic regression is a common choice. I implement it as follows:
using StatsFuns
using Turing
@model function success_model(success, num_rolls)
	# strong prior: pins the intercept near -5, i.e. P(success) ≈ 0.007 at N = 0
	α ~ Normal(-5, 0.1)
	β ~ Normal(0, 1.5)
	for i ∈ eachindex(success)
		# log-odds linear in the number of rolls
		p = logistic(α + β * num_rolls[i])
		success[i] ~ Bernoulli(p)
	end
end
I’ve used a very strong prior for the intercept term because I know the probability of success is zero when the number of rolls is zero (with α pinned near -5, the model’s success probability at N = 0 is logistic(-5) ≈ 0.007). Now I fit the model and compare its average predictions across different values of N to the true values:
m = success_model(data.success, data.num_rolls)
chain = sample(m, NUTS(), 1_000)
chaindf = DataFrame(chain)
ᾱ = mean(chaindf.α)
β̄ = mean(chaindf.β)
plot(
	P,
	0:20;
	ylims=(0,1),
	label="true",
	xlabel="No. Rolls",
	ylabel="P(success)"
)
	
plot!(
	x -> logistic(ᾱ + β̄ * x),
	0:20,
	label="model"
)

As you can see, this is a pretty terrible model. My next thought was to weaken the strong prior on the intercept term, which I did by increasing its standard deviation from 0.1 to 2:
@model function success_model(success, num_rolls)
	α ~ Normal(-5, 2)  # weaker prior on the intercept
	β ~ Normal(0, 1.5)
	for i ∈ eachindex(success)
		p = logistic(α + β * num_rolls[i])
		success[i] ~ Bernoulli(p)
	end
end
Now, I have this:

This is a lot better, but still not very satisfying. The model is telling me I have a nonsensical chance of success of roughly 25% if I don’t roll the die even once. I’ve played around with different priors for the intercept term, and there seems to be a tradeoff between grounding the model in reality at N = 0 and having it produce accurate results for higher values of N. I also tried dropping the intercept term completely, and that wasn’t good either. In addition, I tried balancing the data by oversampling the rows where success == false, again to no avail.
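One way to make that tradeoff concrete (just evaluating the true curve from the top of the post, nothing fitted): the log-odds of the true model are not linear in N, so no single (α, β) pair in a logistic model can match both ends of the curve.

```julia
# Log-odds of the true win probability P(N) = 1 - (5/6)^N.
# If logistic regression were exactly right, these would grow
# linearly in N — instead the local slope keeps shrinking.
P(N) = 1 - (5/6)^N
logodds(N) = log(P(N) / (1 - P(N)))

(logodds(5) - logodds(1)) / 4     # ≈ 0.50 per roll at the low end
(logodds(20) - logodds(10)) / 10  # ≈ 0.20 per roll at the high end
```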
I guess my question is: is there a better link function I can use that would more accurately represent my true underlying model? Or is it simply a matter of accepting that predictions are less accurate for lower values of N? I can visualize a 90% credible interval for the average probability of success across different values of N and see that the model is less certain at lower values of N:
μ = [
	logistic(row.α + row.β * x) 
	for row in eachrow(chaindf), x in 0:20
]
μ̄ = [mean(c) for c in eachcol(μ)]
μ₉₀ = hcat([percentile(c, [5,95]) for c in eachcol(μ)]...)'
plot!(
    0:20,
    [μ̄ μ̄],
    fillrange=μ₉₀,
    labels=["CI for model" ""],
    color=:lightgrey,
    alpha=0.5,
    linewidth=0,
)

Or, lastly, is it just a matter of wisely choosing the threshold used when the model is applied to prediction tasks? I would love to hear thoughts/advice/feedback on how to model this kind of problem. I have a real-world problem that is loosely analogous to this example, and it’s pretty scary that I can get such dramatically different results depending on my choice of prior. I thought I was doing the right thing with the strong prior on the intercept, since I know for certain that there is no chance of success when N = 0, yet in this example it produces a really terrible fit.
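For concreteness, here is the kind of alternative I have in mind but haven’t validated: the true model can be rewritten as log(-log(1 - P(N))) = log(N) + log(-log(1 - p)), i.e. a complementary log-log link with log(N) as the covariate can represent it exactly (slope β = 1, intercept α = log(-log(5/6)) ≈ -1.70), and it forces P → 0 as N → 0. The model below is only an untested sketch along those lines:

```julia
using Turing

# Untested sketch: cloglog link with log(N) as the covariate.
# The true curve corresponds to β = 1, α = log(-log(5/6)) ≈ -1.70,
# and the success probability goes to zero as N → 0 by construction.
@model function success_model_cloglog(success, num_rolls)
	α ~ Normal(0, 1.5)
	β ~ Normal(1, 0.5)
	for i ∈ eachindex(success)
		p = 1 - exp(-exp(α + β * log(num_rolls[i])))
		success[i] ~ Bernoulli(p)
	end
end
```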
