# Simple multi-parameter Bayesian linear regression returning unexpected results!

The problem is from chapter 10, question 7 of “Regression and Other Stories”. The general idea is to use data that includes a continuous predictor, “beauty”, plus several binary predictors. I have solved the problem using stan_glm in R, so I know that out of the 7 predictors only three have sizable effects: beauty (continuous), gender (0/1), and non-native English speaker (0/1). It is easy to recover these results using Julia’s GLM.jl package, as shown below:

```julia
using CSV, DataFrames, GLM

df = DataFrame(CSV.File("beauty.csv"))

sanity_check = lm(@formula(eval ~ beauty + female + age + minority + nonenglish + lower + course_id), df)
```


These are the results I would expect. However, when I create a Turing model for Bayesian linear regression, I find that only beauty is significant.

My code for the Turing model is shown below. I use a small helper function (also shown) to standardize the predictors. HOWEVER, in this version of my Turing model I only use the helper function to calculate the standard deviations and means in one go. I leave the data un-transformed but center and scale the priors, because I am trying to emulate what stan_glm does here: Prior Distributions for rstanarm Models • rstanarm
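For reference, rstanarm’s default autoscaling sets the prior scale for each slope to 2.5·sd(y)/sd(x_k), which is what the priors below try to emulate. A minimal sketch of that calculation with made-up numbers (the toy vectors here are hypothetical, not from beauty.csv):

```julia
using Statistics

# Toy stand-ins for the response and one predictor column.
y = [1.0, 3.0, 5.0, 7.0]
x = [0.0, 1.0, 0.0, 1.0]

# rstanarm-style autoscaled prior scale for the slope on x.
prior_scale = 2.5 * std(y) / std(x)
```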

```julia
function standardize(X)
    μ_X = mean(X)
    σ_X = std(X)
    X_std = (X .- μ_X) / σ_X
    return X_std, μ_X, σ_X
end
```

```julia
@model function beauty_regression(y, x1, x2, x3, x4, x5, x6, x7)
    x1_std, μ_x1, σ_x1 = standardize(x1)
    x2_std, μ_x2, σ_x2 = standardize(x2)
    x3_std, μ_x3, σ_x3 = standardize(x3)
    x4_std, μ_x4, σ_x4 = standardize(x4)
    x5_std, μ_x5, σ_x5 = standardize(x5)
    x6_std, μ_x6, σ_x6 = standardize(x6)
    x7_std, μ_x7, σ_x7 = standardize(x7)

    σ_y = std(y)
    ϵ ~ Exponential(1/σ_y)

    my = 0

    β0 ~ Normal(my, 2.5*σ_y)

    β1 ~ Normal(0, 2.5*σ_y/σ_x1)
    β2 ~ Normal(0, 2.5*σ_y/σ_x2)
    β3 ~ Normal(0, 2.5*σ_y/σ_x3)
    β4 ~ Normal(0, 2.5*σ_y/σ_x4)
    β5 ~ Normal(0, 2.5*σ_y/σ_x5)
    β6 ~ Normal(0, 2.5*σ_y/σ_x6)
    β7 ~ Normal(0, 2.5*σ_y/σ_x7)

    μ = β0 .+ β1*(x1 .- μ_x1)
    + β2*(x2 .- μ_x2)
    + β3*(x3 .- μ_x3)
    + β4*(x4 .- μ_x4)
    + β5*(x5 .- μ_x5)
    + β6*(x6 .- μ_x6)
    + β7*(x7 .- μ_x7)

    y ~ MvNormal(μ, ϵ)
end
```



I run my simulation with the following code:

```julia
model = beauty_regression(df[:, 1], df[:, 2], df[:, 3], df[:, 4], df[:, 5], df[:, 6], df[:, 7], df[:, 8])
chain = sample(model, NUTS(0.65), 3_000);
```



And these are the results:

Same order but different names (sorry). We see that beauty (β1) is the only significant one… My guess is that I am doing something wrong specifying priors because I have a combination of continuous and binary predictors; however, the stan_glm methodology makes no mention of doing anything different. Why am I getting pretty different results with my model? I appreciate any insights!

EDIT1: I accidentally submitted the version where I DID standardize the data; it should be the non-standardized version now.

EDIT2: Fixed some embarrassing simple mistakes, but problem still persists. The code posted above has been updated to remove said mistakes.

Did you mean to call standardize(x1) 7 times?

Also, did you mean for this to be β7?

No and yes. Sorry, I copied these from an old file. Let me post an updated version, but it’s the same result of only one significant predictor!

Thanks for pointing that out; I’ve updated the original post and results.

A few things I see:

• rstanarm uses mean(y) as the mean of the intercept prior for a normal likelihood with identity link function, but you are using 0.
• Stan and Distributions.jl use different parameterizations of the exponential distribution: Stan’s exponential takes a rate, while Distributions.jl’s Exponential takes a scale. Use Exponential(σ_y).
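The parameterization difference is easy to verify directly: in Distributions.jl, Exponential(θ) is parameterized by the scale θ (so its mean is θ), whereas Stan’s exponential(λ) uses the rate λ = 1/θ. A quick check:

```julia
using Distributions

d = Exponential(2.0)   # scale parameterization: mean is 2.0
mean(d)                # 2.0
rate(d)                # 0.5, the corresponding Stan-style rate
```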

In general, the model could be written more succinctly by passing x as Matrix(df[:, 2:8]):

```julia
@model function beauty_regression(
    y,
    x,
    μ_x = vec(mean(x; dims=1)),
    σ_x = vec(std(x; dims=1)),
    x_c = x .- μ_x',
    μ_y = mean(y),
    σ_y = std(y),
    βprior = arraydist(Normal.(0, 2.5*σ_y ./ σ_x)),
)
    ϵ ~ Exponential(σ_y)
    α ~ Normal(μ_y, 2.5*σ_y)
    β ~ βprior
    μ = muladd(x_c, β, α)
    y .~ Normal.(μ, ϵ)
end
```


This implementation moves all of the computations that will not change from one run to another out of the body of the model. muladd fuses the matrix-vector multiplication and the addition of a scalar. Another advantage of writing it this way is that the same model can be used for any number of predictors.
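To illustrate the muladd point: for a matrix, a vector, and a scalar, muladd(x_c, β, α) computes x_c*β .+ α in a single fused call. A toy example (the numbers are made up, not from the thread’s data):

```julia
# 3 observations, 2 predictors.
x_c = [1.0 2.0; 3.0 4.0; 5.0 6.0]
β   = [0.5, -1.0]
α   = 2.0

μ = muladd(x_c, β, α)   # same result as x_c*β .+ α
```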

In case you’re not aware, there’s also TuringGLM, which is the rstanarm/brms of Turing.


Thank you so much, seth! I am aware of TuringGLM, but at least for the next few months I want to try and build my own models, because it helps keep me honest about what I do and don’t know (case in point)! Since I have centered my data and I am just using the identity link function, won’t my mean(y) just be zero? I appreciate the more concise form; I’m going to take a bit to unpack it and make sure I understand everything that’s going on there hahaha!

Ah, I had interpreted the below quote to mean your data was not centered.

Either way, if the mean is 0, no harm in centering on the mean.

Sounds like a great plan!

Ok, so I think that all makes sense! I made one change I want to verify, and then I have a follow-up question. If we are setting the priors to have a mean of 0, that means we should center our data, correct? I added a few lines to your code to center x, shown below. Is this the correct thing to do?

```julia
@model function beauty_regression_new(
    y,
    x,
    μ_x = vec(mean(x; dims=1)),
    σ_x = vec(std(x; dims=1)),
    μ_y = mean(y),
    σ_y = std(y),
    βprior = arraydist(Normal.(0, 2.5*σ_y ./ σ_x)),
)
    ϵ ~ Exponential(σ_y)
    α ~ Normal(μ_y, 2.5*σ_y)
    β ~ βprior

    N = length(μ_x)
    x_c = copy(x)
    for i in 1:N
        x_c[:, i] = x[:, i] .- μ_x[i]
    end
    μ = muladd(x_c, β, α)

    y .~ Normal.(μ, ϵ)
end
```



After centering the data as above, I got basically the same results as from GLM :)!

HOWEVER, I went back and changed my old script to use the correct dispersion term, Exponential(σ_y), and it still doesn’t return the same results… which baffles me. I agree your version of the code is much more concise and reusable, but functionally we are doing the same thing… I must be missing something!

This is the current version of my old script:

```julia
@model function beauty_regression(y, x1, x2, x3, x4, x5, x6, x7)
    x1_std, μ_x1, σ_x1 = standardize(x1)
    x2_std, μ_x2, σ_x2 = standardize(x2)
    x3_std, μ_x3, σ_x3 = standardize(x3)
    x4_std, μ_x4, σ_x4 = standardize(x4)
    x5_std, μ_x5, σ_x5 = standardize(x5)
    x6_std, μ_x6, σ_x6 = standardize(x6)
    x7_std, μ_x7, σ_x7 = standardize(x7)

    σ_y = std(y)
    ϵ ~ Exponential(σ_y)

    my = mean(y)

    β0 ~ Normal(my, 2.5*σ_y)

    β1 ~ Normal(0, 2.5*σ_y/σ_x1)
    β2 ~ Normal(0, 2.5*σ_y/σ_x2)
    β3 ~ Normal(0, 2.5*σ_y/σ_x3)
    β4 ~ Normal(0, 2.5*σ_y/σ_x4)
    β5 ~ Normal(0, 2.5*σ_y/σ_x5)
    β6 ~ Normal(0, 2.5*σ_y/σ_x6)
    β7 ~ Normal(0, 2.5*σ_y/σ_x7)

    μ = β0 .+ β1*(x1 .- μ_x1)
    + β2*(x2 .- μ_x2)
    + β3*(x3 .- μ_x3)
    + β4*(x4 .- μ_x4)
    + β5*(x5 .- μ_x5)
    + β6*(x6 .- μ_x6)
    + β7*(x7 .- μ_x7)

    y ~ MvNormal(μ, ϵ)
end
```



I feel like I have to be missing something obvious!

Only the prior of the intercept is sensitive to centering. Let’s write down the same $\mu$ with coefficients for centered and uncentered predictors:

$$\begin{aligned} \mu_i &= \tilde{\beta}_0 + \sum_{k=1}^n (x_{ik} - m_k) \tilde{\beta}_k\\ &= \tilde{\beta}_0 - \sum_{k=1}^n m_k \tilde{\beta}_k + \sum_{k=1}^n x_{ik} \tilde{\beta}_k\\ &= \beta_0 + \sum_{k=1}^n x_{ik} \beta_k \end{aligned}$$

So $\tilde{\beta}_k = \beta_k$ for $k > 0$ and $\beta_0 = \tilde{\beta}_0 - \sum_{k=1}^n m_k \tilde{\beta}_k$. So we only need to worry about the mean of the intercept prior when we translate our predictors. And yes, it’s only when predictors are centered that the mean of y should be used as the mean of the intercept prior. In general, centering (and standardizing) is recommended both for computational efficiency (all parameters on the same scale) and for interpretability.
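This identity can be checked numerically with ordinary least squares on toy data (everything below is made up purely for illustration):

```julia
using Statistics

# Hypothetical design: 4 observations, 2 predictors.
X = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0]
y = [1.0, 2.0, 3.0, 5.0]
m = vec(mean(X; dims=1))   # column means

b  = [ones(4) X] \ y            # uncentered fit: [β0, β1, β2]
bc = [ones(4) (X .- m')] \ y    # centered fit:   [β̃0, β̃1, β̃2]

# Slopes agree; intercepts are related by β0 = β̃0 - Σ_k m_k β̃_k.
```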

It’s best to also do the centering outside of the model body, and you could do it in the signature like this:

```julia
    ...
    μ_x = vec(mean(x; dims=1)),
    σ_x = vec(std(x; dims=1)),
    x_c = x .- μ_x',
    μ_y = mean(y),
    ...
```

I’ve edited the above post to reflect this.

Yeah I can’t see the issue either, but it’s tricky to diagnose without a complete minimal working example I could run.

That makes it all click; thanks for taking the time to write that out. I also didn’t realize that broadcasting could generalize to work between matrices and vectors the way it works between vectors and scalars, but now that I see it, it makes a lot of sense!
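For anyone following along, that broadcasting pattern looks like this: taking the adjoint of the mean vector turns it into a 1×k row, which then broadcasts down the rows of the matrix (toy numbers, not the thread’s data):

```julia
using Statistics

x   = [1.0 10.0; 2.0 20.0; 3.0 30.0]
μ_x = vec(mean(x; dims=1))   # column means: [2.0, 20.0]
x_c = x .- μ_x'              # the 1×2 row broadcasts against every row of x
```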


My final thought for pinning down the difference between our two models is to see what y (in the last line) actually is. Is there a way to extract this information from the model object? I’ve tried a few different things, but outside of sampling from the model object, I don’t know how else to interact with it.

Update: I had both versions @show what y was, and both were a Vector{Float64} with length equal to the number of observations, which is expected! I also moved all of the calculations of the means and standard deviations of the x’s out of the body, similar to what’s in your script, but I still get different results. Mysterious stuff!

No problem! Yes, it’s very useful. Also, broadcasting, mapping, and reducing are generally more friendly when automatically differentiating than rolling your own loops.

Mmm, not that I know of, sorry.

Well, thanks for all your help. After tinkering with it this morning, I am just going to chalk it up to my coefficients not being in an array. Turing does find a smaller step size for your script, so perhaps the linear predictors just shouldn’t be loose… it’s the best rationalization I can find hahaha.