# Simple multi-parameter Bayesian linear regression returning unexpected results!

The problem is from Chapter 10, Exercise 7 of “Regression and Other Stories”. The general idea is to use data that includes a continuous predictor, “beauty”, plus several binary predictors. I have solved the problem using stan_glm in R, so I know that out of the 7 predictors only three have sizable effects: beauty (continuous), gender (1/0), and non-native English speaker (1/0). It is easy to recover these results using Julia’s GLM.jl package, as I show below:

using CSV, DataFrames, GLM

df = DataFrame(CSV.File("beauty.csv"))

sanity_check = lm(@formula(eval ~ beauty + female + age + minority + nonenglish + lower + course_id), df)


These are the results I would expect. However, when I create a Turing model for Bayesian linear regression I get that only beauty is significant.

My code for the Turing model is shown below. I use a small helper function (also shown) to standardize the predictors, HOWEVER, in this version of my Turing model I only use the helper function to calculate the standard deviations and means in one go. I leave the data untransformed but center the priors, because I am trying to emulate what stan_glm does here: Prior Distributions for rstanarm Models • rstanarm

function standardize(X)
    μ_X = mean(X)
    σ_X = std(X)
    X_std = (X .- μ_X) / σ_X
    return X_std, μ_X, σ_X
end
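For example, when I only want the mean and standard deviation of a column, I just discard the first return value (shown here on the age column of the data):

_, μ_age, σ_age = standardize(df.age)   # keep μ and σ, ignore the standardized values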

@model function beauty_regression(y, x1, x2, x3, x4, x5, x6, x7)

    x1_std, μ_x1, σ_x1 = standardize(x1)
    x2_std, μ_x2, σ_x2 = standardize(x2)
    x3_std, μ_x3, σ_x3 = standardize(x3)
    x4_std, μ_x4, σ_x4 = standardize(x4)
    x5_std, μ_x5, σ_x5 = standardize(x5)
    x6_std, μ_x6, σ_x6 = standardize(x6)
    x7_std, μ_x7, σ_x7 = standardize(x7)

    σ_y = std(y)
    ϵ ~ Exponential(1/σ_y)

    my = 0

    β0 ~ Normal(my, 2.5*σ_y)

    β1 ~ Normal(0, 2.5*σ_y/σ_x1)
    β2 ~ Normal(0, 2.5*σ_y/σ_x2)
    β3 ~ Normal(0, 2.5*σ_y/σ_x3)
    β4 ~ Normal(0, 2.5*σ_y/σ_x4)
    β5 ~ Normal(0, 2.5*σ_y/σ_x5)
    β6 ~ Normal(0, 2.5*σ_y/σ_x6)
    β7 ~ Normal(0, 2.5*σ_y/σ_x7)

    μ = β0 .+ β1*(x1 .- μ_x1)
    + β2*(x2 .- μ_x2)
    + β3*(x3 .- μ_x3)
    + β4*(x4 .- μ_x4)
    + β5*(x5 .- μ_x5)
    + β6*(x6 .- μ_x6)
    + β7*(x7 .- μ_x7)

    y ~ MvNormal(μ, ϵ)
end



I run my simulation with the following code

model = beauty_regression(df[:,1],df[:,2],df[:,3],df[:,4],df[:,5],df[:,6],df[:,7],df[:,8])
chain = sample(model,NUTS(0.65),3_000);



And these are the results:

Same order but different names (sorry). We see that beauty (b1) is the only significant one… My guess is that I am doing something wrong specifying the priors because I have a combination of continuous and binary predictors; however, the stan_glm methodology makes no mention of doing anything different. Why am I getting such different results with my model? I appreciate any insights.

EDIT1: I accidentally submitted the version where I DID standardize the data; it should be the non-standardized version now.

EDIT2: Fixed some embarrassingly simple mistakes, but the problem still persists. The code posted above has been updated to remove said mistakes.

Did you mean to call standardize(x1) 7 times?

Also, did you mean for this to be β7?

No and yes, sorry, copied these from an old file. Let me post an updated version, but it’s the same result of only one significant predictor!

Thanks for pointing that out, updated the original post and results

A few things I see:

• rstanarm uses mean(y) as the mean of the intercept prior for a normal likelihood with identity link function, but you are using 0.
• Stan and Distributions use different parameterizations of the exponential distribution (rate vs. scale). Use Exponential(σ_y); see the quick check below.
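To see the difference concretely, here is a quick check (a small sketch, just to illustrate the two parameterizations): Stan’s exponential(rate) has mean 1/rate, while Distributions.Exponential takes a scale, so its mean equals its argument.

using Distributions, Statistics

σ_y = std(df.eval)          # df.eval is the response
mean(Exponential(σ_y))      # == σ_y, i.e. a prior mean of sd(y), matching rstanarm's default
mean(Exponential(1/σ_y))    # == 1/σ_y, the much tighter prior used in the model above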

In general, the model could be written more succinctly by writing x as Matrix(df[:, 2:8]):

@model function beauty_regression(
    y,
    x,
    μ_x = vec(mean(x; dims=1)),
    σ_x = vec(std(x; dims=1)),
    x_c = x .- μ_x',
    μ_y = mean(y),
    σ_y = std(y),
    βprior = arraydist(Normal.(0, 2.5*σ_y ./ σ_x))
)
    ϵ ~ Exponential(σ_y)
    α ~ Normal(μ_y, 2.5*σ_y)
    β ~ βprior
    μ = muladd(x_c, β, α)   # linear predictor: x_c*β .+ α in one fused call
    y .~ Normal.(μ, ϵ)
end


This implementation moves all of the computations that will not change from one run to another out of the body of the model. muladd fuses the matrix-vector multiplication and addition by a scalar. Another advantage to writing it this way is that the same model can be used for different numbers of predictors.
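For reference, a sketch of how this version would be instantiated and sampled (same sampler settings as in your original post, and assuming the first column of df is the response):

x = Matrix(df[:, 2:8])                    # predictors as a matrix
model = beauty_regression(df[:, 1], x)
chain = sample(model, NUTS(0.65), 3_000)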

In case you’re not aware, there’s also TuringGLM, which is the rstanarm/brms of Turing.
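A minimal TuringGLM sketch (not run on this data, and the default priors may differ slightly from rstanarm’s) would look something like:

using Turing, TuringGLM

fm = @formula(eval ~ beauty + female + age + minority + nonenglish + lower + course_id)
model = turing_model(fm, df)       # builds a Turing model with default rstanarm/brms-style priors
chain = sample(model, NUTS(), 2_000)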


Thank you so much, seth! I am aware of TuringGLM, but at least for the next few months I want to try and build my own models because it helps keep me honest about what I do and don’t know (case in point)! Since I have centered my data and I am just using the identity link, won’t my mean of y just be zero? I appreciate the more concise form; I’m going to take a bit to unpack it and make sure I understand everything that’s going on there hahaha!

Ah, I had interpreted your earlier statement (that you leave the data un-transformed) to mean your data was not centered.

Either way, if the mean is 0, no harm in centering on the mean.

Sounds like a great plan!

Ok so I think that all makes sense! I made one change I want to verify, and then I have a follow up question! If we are setting the priors to have a mean of 0 that means we should center our data, correct? I added a few lines to your code to center x, shown below. Is this the correct thing to do?

@model function beauty_regression_new(
    y,
    x,
    μ_x = vec(mean(x; dims=1)),
    σ_x = vec(std(x; dims=1)),
    μ_y = mean(y),
    σ_y = std(y),
    βprior = arraydist(Normal.(0, 2.5*σ_y ./ σ_x))
)
    ϵ ~ Exponential(σ_y)
    α ~ Normal(μ_y, 2.5*σ_y)
    β ~ βprior

    N = size(μ_x)[1]
    x_c = copy(x)
    for i in 1:N
        x_c[:, i] = x[:, i] .- μ_x[i]
    end

    μ = muladd(x_c, β, α)
    y .~ Normal.(μ, ϵ)
end



After centering the data as above, I got basically the same results as from GLM :)!

HOWEVER, I went back and changed my old script to have the correct dispersion term Exponential(σ_y), and it still doesn’t return the same results… which baffles me. I agree your version of the code is much more concise and reusable, but functionally we are doing the same thing… I must be missing something!

This is the current version of my old script:

@model function beauty_regression(y, x1, x2, x3, x4, x5, x6, x7)
    x1_std, μ_x1, σ_x1 = standardize(x1)
    x2_std, μ_x2, σ_x2 = standardize(x2)
    x3_std, μ_x3, σ_x3 = standardize(x3)
    x4_std, μ_x4, σ_x4 = standardize(x4)
    x5_std, μ_x5, σ_x5 = standardize(x5)
    x6_std, μ_x6, σ_x6 = standardize(x6)
    x7_std, μ_x7, σ_x7 = standardize(x7)

    σ_y = std(y)
    ϵ ~ Exponential(σ_y)

    my = mean(y)

    β0 ~ Normal(my, 2.5*σ_y)

    β1 ~ Normal(0, 2.5*σ_y/σ_x1)
    β2 ~ Normal(0, 2.5*σ_y/σ_x2)
    β3 ~ Normal(0, 2.5*σ_y/σ_x3)
    β4 ~ Normal(0, 2.5*σ_y/σ_x4)
    β5 ~ Normal(0, 2.5*σ_y/σ_x5)
    β6 ~ Normal(0, 2.5*σ_y/σ_x6)
    β7 ~ Normal(0, 2.5*σ_y/σ_x7)

    μ = β0 .+ β1*(x1 .- μ_x1)
    + β2*(x2 .- μ_x2)
    + β3*(x3 .- μ_x3)
    + β4*(x4 .- μ_x4)
    + β5*(x5 .- μ_x5)
    + β6*(x6 .- μ_x6)
    + β7*(x7 .- μ_x7)

    y ~ MvNormal(μ, ϵ)
end



I feel like I have to be missing something obvious!

Only the prior of the intercept is sensitive to centering. Let’s write down the same \mu with coefficients for centered and uncentered predictors:

$$
\begin{aligned}
\mu_i &= \tilde{\beta}_0 + \sum_{k=1}^n (x_{ik} - m_k) \tilde{\beta}_k \\
&= \tilde{\beta}_0 - \sum_{k=1}^n m_k \tilde{\beta}_k + \sum_{k=1}^n x_{ik} \tilde{\beta}_k \\
&= \beta_0 + \sum_{k=1}^n x_{ik} \beta_k
\end{aligned}
$$

So $\tilde{\beta}_k = \beta_k$ for $k > 0$ and $\beta_0 = \tilde{\beta}_0 - \sum_{k=1}^n m_k \tilde{\beta}_k$. So we only need to worry about the mean of the intercept prior when we translate our predictors. And yes, it’s only when predictors are centered that the mean of y should be used as the mean of the intercept prior. In general, centering (and standardizing) is recommended both for computational efficiency (all parameters on the same scale) and for interpretability.
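In code, recovering the uncentered intercept from the centered fit is a one-liner. A sketch using posterior means, assuming chain and x come from the matrix-form model above:

using Statistics, LinearAlgebra

μ_x = vec(mean(x; dims=1))                                   # same predictor means used inside the model
α_centered = mean(chain[:α])                                 # posterior mean of the centered intercept
β_means = [mean(chain[Symbol("β[$k]")]) for k in eachindex(μ_x)]
β0 = α_centered - dot(μ_x, β_means)                          # intercept on the original (uncentered) scale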

It’s best to also do the centering outside of the model body, and you could do it in the signature like this:

    ...
μ_x = vec(mean(x; dims=1)),
σ_x = vec(std(x; dims=1)),
x_c = x .- μ_x',
μ_y = mean(y),
...


I’ve edited the above post to reflect this.

Yeah I can’t see the issue either, but it’s tricky to diagnose without a complete minimal working example I could run.

That makes it all click, thanks for taking the time to write that out. I also didn’t realize that broadcasting could generalize to work between matrices and vectors like how it works between vectors and scalars, but now that I see it, it makes a lot of sense!


My final thought for trying to get at a difference between our two models is to see what y (in the last line) actually is. Is there a way to extract this information from the model object? I’ve tried a few different things, but outside of sampling from the model object, I don’t know how else I can interact with it.

Update: I had both versions @show what y was, and both were a Vector{Float64} with length equal to the number of observations, which is expected! I also moved all of the calculations of the means and standard deviations of the x’s out of the body, similar to what’s in your script, but still got different results. Mysterious stuff!
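In case it’s useful to anyone reading later, the @show check was just a line dropped into the model body; a toy sketch (not the actual beauty model):

using Turing

@model function debug_model(y)
    @show typeof(y) length(y)   # prints each time the model body is evaluated
    σ ~ Exponential(1)
    y .~ Normal.(0, σ)
end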

No problem! Yes, it’s very useful. Also, broadcasting, mapping, and reducing are generally more friendly when automatically differentiating than rolling your own loops.

Mmm, not that I know of, sorry.

Well, thanks for all your help. After tinkering with it this morning, I am just going to chalk it up to my coefficients not being in an array. Turing does find a smaller step size for your script, so perhaps the linear predictors just shouldn’t be loose… it’s the best rationalization I can find hahaha.

Hi

I just thought I would give it a try with the CRRao package that we developed recently. CRRao can handle both simple linear regression and Bayesian linear regression with a Gaussian (Ridge) prior, a Laplace (Lasso) prior, a Cauchy prior, or a TDist prior. The results came out as expected. Here is what I did.

using DataFrames, CSV
using StatsBase
using Statistics
using GLM
using CRRao

df = DataFrame(CSV.File("beauty.csv"));
first(df, 5)

5×8 DataFrame
│ Row │ eval    │ beauty    │ female │ age   │ minority │ nonenglish │ lower │ course_id │
│     │ Float64 │ Float64   │ Int64  │ Int64 │ Int64    │ Int64      │ Int64 │ Int64     │
├─────┼─────────┼───────────┼────────┼───────┼──────────┼────────────┼───────┼───────────┤
│ 1   │ 4.3     │ 0.201567  │ 1      │ 36    │ 1        │ 0          │ 0     │ 3         │
│ 2   │ 4.5     │ -0.826081 │ 0      │ 59    │ 0        │ 0          │ 0     │ 0         │
│ 3   │ 3.7     │ -0.660333 │ 0      │ 51    │ 0        │ 0          │ 0     │ 4         │
│ 4   │ 4.3     │ -0.766312 │ 1      │ 40    │ 0        │ 0          │ 0     │ 2         │
│ 5   │ 4.4     │ 1.42145   │ 1      │ 31    │ 0        │ 0          │ 0     │ 0         │

## Fit simple linear regression
fit = @fitmodel((eval~beauty+female+age+ minority+nonenglish+lower+course_id)
, df
, LinearRegression())

Model Class: Linear Regression
Likelihood Mode: Gauss
Computing Method: Optimization
────────────────────────────────────────────────────────────────────────────────
Coef.  Std. Error      t  Pr(>|t|)    Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────
(Intercept)   4.19361      0.146437    28.64    <1e-99   3.90583      4.48138
beauty        0.140902     0.0333132    4.23    <1e-04   0.0754356    0.206369
female       -0.19745      0.0528073   -3.74    0.0002  -0.301226    -0.0936735
age          -0.00217404   0.00280125  -0.78    0.4381  -0.00767904   0.00333096
minority     -0.0688714    0.0785761   -0.88    0.3812  -0.223288     0.0855457
nonenglish   -0.274837     0.110696    -2.48    0.0134  -0.492376    -0.0572989
lower         0.098818     0.054249     1.82    0.0692  -0.00779174   0.205428
course_id    -0.000390066  0.00298497  -0.13    0.8961  -0.0062561    0.00547596
────────────────────────────────────────────────────────────────────────────────


Based on the above analysis, we can say that beauty, female, and nonenglish are the three variables with a significant effect on the evaluation. As I was able to reproduce the results above, I then tried Bayesian linear regression with a Ridge (i.e., Gaussian) prior using the CRRao package.

fit_ridge_prior = @fitmodel((eval~beauty+female+age+ minority+nonenglish+lower+course_id)
, df
, LinearRegression()
, Prior_Ridge())



First I found the following MCMC chain summary statistics.

Chains MCMC chain (10000×22×1 Array{Float64, 3}):

Iterations        = 1001:1:11000
Number of chains  = 1
Samples per chain = 10000
Wall duration     = 16.08 seconds
Compute duration  = 16.08 seconds
parameters        = v, σ, α, β[1], β[2], β[3], β[4], β[5], β[6], β[7]
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
parameters      mean       std   naive_se      mcse          ess      rhat   ess_per_sec
Symbol   Float64   Float64    Float64   Float64      Float64   Float64       Float64

v    3.0633    0.9052     0.0091    0.0110    7687.9312    1.0000      478.1646
σ    0.5327    0.0178     0.0002    0.0001   12334.4027    1.0000      767.1603
α    4.1559    0.1478     0.0015    0.0018    6478.9031    0.9999      402.9670
β[1]    0.1430    0.0335     0.0003    0.0003   10478.2933    0.9999      651.7162
β[2]   -0.1924    0.0523     0.0005    0.0006    9189.6967    0.9999      571.5696
β[3]   -0.0015    0.0028     0.0000    0.0000    6912.4739    0.9999      429.9337
β[4]   -0.0673    0.0780     0.0008    0.0007   10673.4607    0.9999      663.8550
β[5]   -0.2722    0.1126     0.0011    0.0011   11336.8591    0.9999      705.1163
β[6]    0.1013    0.0549     0.0005    0.0005   10938.8474    0.9999      680.3612
β[7]   -0.0005    0.0030     0.0000    0.0000   14629.2527    0.9999      909.8926



I found that the rhat statistics for all parameters are very close to 1, which means MCMC had no problem with convergence. Then I looked at the parameters; the 95% posterior credible interval for each parameter is shown below:

Quantiles
parameters      2.5%     25.0%     50.0%     75.0%     97.5%
Symbol   Float64   Float64   Float64   Float64   Float64

v    1.8556    2.4446    2.8946    3.4768    5.3237
σ    0.4991    0.5205    0.5323    0.5444    0.5684
α    3.8624    4.0571    4.1567    4.2567    4.4421
β[1]    0.0779    0.1202    0.1432    0.1653    0.2078
β[2]   -0.2938   -0.2280   -0.1923   -0.1569   -0.0883
β[3]   -0.0069   -0.0034   -0.0015    0.0004    0.0041
β[4]   -0.2193   -0.1208   -0.0666   -0.0147    0.0847
β[5]   -0.4950   -0.3472   -0.2733   -0.1955   -0.0533
β[6]   -0.0065    0.0643    0.1012    0.1382    0.2086
β[7]   -0.0063   -0.0025   -0.0005    0.0015    0.0055


The first parameter, v, is a hyper-parameter, and σ is the residual standard deviation. α is the intercept, and β[1]–β[7] are the regression coefficients. Based on the 95% Bayesian credible intervals, β[1], β[2], and β[5] are statistically significant, so the corresponding predictors beauty, female, and nonenglish have a statistically significant effect on the evaluation.

I made the following table to compare the OLS and Bayesian estimates with Gauss prior.

Predictor     OLS           Bayes (Gauss/Ridge prior)
(Intercept)    4.19361       4.1559
beauty         0.140902      0.1430
female        -0.19745      -0.1924
age           -0.0022       -0.0015
minority      -0.0689       -0.0673
nonenglish    -0.2748       -0.2722
lower          0.098818      0.1013
course_id     -0.00039      -0.0005


Visual inspection indicates that the estimates of the regression coefficients from the two methods are very close to each other.

I tried Laplace prior as follows:

fit_Laplace_prior = @fitmodel((eval~beauty+female+age+ minority+nonenglish+lower+course_id)
, df
, LinearRegression()
, Prior_Laplace())



The summary statistics and rhat were:

Summary Statistics
parameters      mean       rhat   ess_per_sec

v    1.3229     1.0000      658.6089
σ    0.5323     1.0000     1018.0162
α    4.1576     1.0000      574.6666
β[1]    0.1405     1.0000      877.7872
β[2]   -0.1883     1.0000      765.7493
β[3]   -0.0016     1.0000      634.6961
β[4]   -0.0661     0.9999      980.9478
β[5]   -0.2590     0.9999      773.6782
β[6]    0.0986     1.0004      805.5721
β[7]   -0.0004     0.9999     1524.5410



The results were very similar. I also tried a Cauchy prior on the regression coefficients:


fit_Cauchy_prior = @fitmodel((eval~beauty+female+age+ minority+nonenglish+lower+course_id)
, df
, LinearRegression()
, Prior_Cauchy())

Summary Statistics
parameters      mean   rhat   ess_per_sec

σ    0.5299    0.9999      830.7017
α    4.1817    1.0000      692.7191
β[1]    0.1403    1.0000      814.6251
β[2]   -0.1930    0.9999      871.6434
β[3]   -0.0020    1.0001      726.4616
β[4]   -0.0691    1.0003      897.4862
β[5]   -0.2608    1.0001      958.6221
β[6]    0.0997    0.9999      961.6356
β[7]   -0.0004    1.0001     1298.7876


I combine all the results below:

Predictor     OLS           Bayes (Gauss prior)   Bayes (Laplace prior)   Bayes (Cauchy prior)
(Intercept)    4.19361       4.1559                4.1576                  4.1817
beauty         0.140902      0.1430                0.1405                  0.1403
female        -0.19745      -0.1924               -0.1883                 -0.1930
age           -0.0022       -0.0015               -0.0016                 -0.0020
minority      -0.0689       -0.0673               -0.0661                 -0.0691
nonenglish    -0.2748       -0.2722               -0.2590                 -0.2608
lower          0.098818      0.1013                0.0986                  0.0997
course_id     -0.00039      -0.0005               -0.0004                 -0.0004

Note: for simple linear regression, CRRao.jl calls GLM.jl, and for Bayesian linear regression it calls Turing.jl.

Thanks and regards,
Sourish