Bayesian Analysis of Weighted Parameter

I am interested in analyzing the posterior distribution of a Bernoulli parameter, weighted by given amounts.

A simple example:

weights = [100, 50, 50, 150, 200]
trials  = [1, 0, 0, 1, 0]

q_weighted = sum(weights .* trials) / sum(weights)  # 0.4545...

I wasn’t sure how to specify this in Turing (or Stan, which I’m slightly more familiar with from working through Statistical Rethinking). I haven’t seen an example that handles this kind of weighting.

I put together a slightly longer example in this Nextjournal.


I am not sure I understand the problem. If you already have a posterior distribution, then you can just generate any random variable from it. Or is the question about implementing that?

It’s not totally clear to me what your generative model is here, but if you really want to use MCMC you probably want to model this as a binomial response with your weights as the number of trials. Note that, at least in your simple scenario here, this reduces to a binomial with n = 550 trials and 250 successes. And if you use a Beta prior on probability of success the posterior is available analytically.
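For concreteness, a minimal sketch of that reduction with Distributions.jl might look like this (the flat Beta(1, 1) prior is just an illustrative choice):

using Distributions

weights = [100, 50, 50, 150, 200]
trials  = [1, 0, 0, 1, 0]

# Treat each unit of weight as a pseudo Bernoulli trial
n = sum(weights)             # 550 "trials"
k = sum(weights .* trials)   # 250 "successes"

# With a Beta(a, b) prior, the posterior is Beta(a + k, b + n - k)
posterior = Beta(1 + k, 1 + n - k)

mean(posterior)                       # ≈ 0.455
quantile.(posterior, [0.025, 0.975])  # 95% credible interval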

Yeah, I am interested in the posterior distribution of q_weighted (e.g. so I can look at probability mass / credible intervals), and I am not sure how to specify this in a model for Turing or similar.

Wouldn’t that make the posterior way too narrow? The applicable problem I want to apply this to is insurance claims where I know that the probability of a claim is correlated to the size of the risk, but once a claim occurs the full amount at risk is claimed.
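For instance, reusing the flat-prior Beta sketch above, the pseudo-count posterior comes out far tighter than one based on the five actual policies:

using Distributions

# Pseudo-count version: 550 weighted "trials", 250 "successes"
std(Beta(1 + 250, 1 + 300))   # ≈ 0.02

# Actual sample size: 5 policies, 2 claims
std(Beta(1 + 2, 1 + 3))       # ≈ 0.17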

I know that in a perfect world I would simply have a model that conditioned the Bernoulli parameter on the size of the risk, but the downstream use of the estimated Bernoulli parameter is not sophisticated enough to handle a conditional parameter. Therefore I am trying to estimate the weighted parameter, which I can use to better calibrate the downstream model’s output. The Nextjournal link in my original post has an example of what I mean by using the weighted estimate in the prediction of future claim amounts, which gives a better estimate than the unweighted one.

Some of the details are a little unclear to me. However, one model that comes to mind when you say risk and claim probability are correlated is a logistic regression. This would allow you to characterize the relationship between weights and trials with an equation that adjusts the Bernoulli parameter.
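Something like this minimal Turing sketch could work (the priors, the rescaling of the benefit amounts, and the names are illustrative assumptions):

using Turing, StatsFuns

# Claim probability depends on the benefit amount through a logit link
@model function claim_model(benefit, claimed)
    α ~ Normal(0, 2)
    β ~ Normal(0, 2)
    for i in eachindex(claimed)
        claimed[i] ~ Bernoulli(logistic(α + β * benefit[i]))
    end
end

benefit = [100, 50, 50, 150, 200] ./ 100   # crude rescaling of the weights
claimed = [1, 0, 0, 1, 0]
chain   = sample(claim_model(benefit, claimed), NUTS(), 1_000)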

Aside from that, it might help if you could identify three components: (1) the data, (2) deterministic components of the model, and (3) the parameters of your model.


Aside from that, it might help if you could identify three components: (1) the data, (2) deterministic components of the model, and (3) the parameters of your model.

Sorry, my terminology might be a bit lax as I continue to learn more about Bayesian modeling. Let me give it a shot:

The data:

weights = [100, 50 , 50, 150, 200 ]
trials  = [1, 0, 0, 1, 0 ]

What’s deterministic:

Each potential insurance policy has a known benefit (i.e. the weights above). If a claim happens, the amount of benefit paid is fixed.

What’s random:

Whether or not a policyholder has a claim. 
- I know that the probability of this varies a priori by size (i.e. the weight).

one model that comes to mind when you say risk and claim probability are correlated is a logistic regression. This would allow you to characterize the relationship between weights and trials with an equation that adjusts the Bernoulli parameter.

Based on my understanding of logistic regression, that would be useful for predicting whether or not a given risk has a claim, or for discerning the correlations between different data features and the outcome.

However, my end use case is not to predict whether or not an individual claim occurs, but to use a probability (i.e. q_weighted in my OP) to model the aggregate outcome of a similar population.

I admit that the ideal setup for my end use case would be to individually predict whether a claim occurs based on the set of data features relevant to that insurance policy. However, the modeling software I use does not accept that type of input. Therefore, to calibrate the dollar amount of loss, I am looking to derive a weighted estimated parameter (the point estimate, q_weighted above, is easy to compute). However, I want to understand the distribution of that estimate rather than just derive a point estimate, and I am not sure how to accomplish that.

Apologies if I’m being dense; I’m trying to apply some of the things I’ve learned about Bayesian modeling but have had a hard time adapting them to this particular problem.

Sorry, but even after this explanation I don’t understand the setup.

Generally I find it best to start from a data generating process and then a likelihood. Something like that would help clarify what you are trying to do.

I agree with Tamas. If you can share a function that simulates the assumed data generating process, it would help us understand the model setup.

Some clarification of the data would help me also. It seems like the variable “weight” might refer to the insurance benefit in thousands of dollars and “trial” is an indicator variable in which 1 denotes the insurance benefit was paid and 0 indicates it was not paid. If that is true, I would use something like “benefit” instead of “weight” and “paid” instead of trials. Even though the math doesn’t care about naming conventions, they can help or hinder people who are unfamiliar with your application.

I think clarifying these points would give us enough information to put you in the right direction.


Yes, the weights in my original post would be insurance policy benefit amounts.

# weights represent the potential benefit amount on each policy
# so in this example, there are 5 policies with potential benefit amounts
# of $100, $50, $50, $150, and $200 respectively. They are all in force at the
# beginning of the prior year and the only way for them to exit is via a claim
# (i.e. a death claim)
policy_benefit_amounts = [100, 50, 50, 150, 200]

# After one year has elapsed, we observe that two of the five have died.
# I presume that this is driven by a Bernoulli random variable for each person,
# called `q`. I also presume that `q` varies by the policy benefit amount.
experience = [1, 0, 0, 1, 0]

# the unweighted estimate
q_unweighted = sum(experience) / length(experience)  # 0.4

# the weighted estimate
q_weighted = sum(policy_benefit_amounts .* experience) / sum(policy_benefit_amounts)  # 0.4545...

I am interested in the posterior distribution of q_weighted because, say, the next year I have a very similar population of policyholders. In lieu of precisely modeling q conditioned on the policy benefit amount, I can use q_weighted to estimate the mean expected claims.

If this last part is not clear, this Nextjournal shows a slightly extended example where q_weighted correctly predicts the anticipated claim amount (in that example, instead of a year-over-year analysis, I used the example of applying the experience from one state to a similar population in another state).
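Roughly, the idea is something like this (the next-year benefit amounts below are made up purely for illustration):

# Hypothetical next-year population with a similar benefit profile
next_year_benefits = [120, 60, 40, 160, 180]

# Point estimate of expected claim dollars using the weighted rate
expected_claims = q_weighted * sum(next_year_benefits)  # ≈ 0.4545 * 560 ≈ 254.5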

Hopefully that clarifies the situation enough?

I still don’t understand the setup. If experience is data, then so is q_weighted, how can it have a posterior distribution?

Are you after estimating the expected q_weighted in some more complicated model? E.g. assume that people die with some probability q_i, which depends on some covariates,

\text{logit}(q_i) = X_i \beta + \varepsilon_i

where X_i is known, and \beta and \sigma = \text{std}(\varepsilon) are parameters. These you can estimate from the data using a multilevel model and then obtain a posterior for q_weighted.
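A rough Turing sketch of that approach, using the benefit amount as the only covariate (the priors, scaling, and names are illustrative assumptions, not a recommendation):

using Turing, StatsFuns, Statistics

@model function weighted_q(x, claimed)
    β0 ~ Normal(0, 2)
    β1 ~ Normal(0, 2)
    σ  ~ truncated(Normal(0, 1), 0, Inf)
    ε  ~ filldist(Normal(0, σ), length(claimed))
    q  = logistic.(β0 .+ β1 .* x .+ ε)
    for i in eachindex(claimed)
        claimed[i] ~ Bernoulli(q[i])
    end
    return q
end

benefits = [100.0, 50, 50, 150, 200]
x        = (benefits .- mean(benefits)) ./ std(benefits)
model    = weighted_q(x, [1, 0, 0, 1, 0])
chain    = sample(model, NUTS(), 1_000)

# Posterior draws of q_weighted, one per MCMC sample
qs = vec(generated_quantities(model, chain))
q_weighted_draws = [sum(benefits .* q) / sum(benefits) for q in qs]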

But if you are new to Bayesian modeling, I would really recommend working through parts of a textbook first; otherwise you will run into conceptual and practical problems all the time. Gelman and Hill is a great intro.