Getting model parameter names from DataFrame

Suppose I have

df = DataFrame(
    age=[20, 30, 40], 
    height=[80, 130, 200], 
)
weight = [100, 120, 200]

I want to predict weight from the other columns using linear regression. There are two options: explicitly write each variable into the model, or build a covariate matrix X and pass it to the model.

The linear regression tutorial uses a covariate matrix x:

# Bayesian linear regression.
@model function linear_regression(x, y)
    # Set variance prior.
    σ₂ ~ truncated(Normal(0, 100), 0, Inf)
    
    # Set intercept prior.
    intercept ~ Normal(0, sqrt(3))
    
    # Set the priors on our coefficients.
    nfeatures = size(x, 2)
    coefficients ~ MvNormal(nfeatures, sqrt(10))
    
    # Calculate all the mu terms.
    mu = intercept .+ x * coefficients
    y ~ MvNormal(mu, sqrt(σ₂))
end

It is easy to just use Array(df) as my covariate matrix, but that means all my coefficients have opaque names like coefficients[2] in the output.

Summary Statistics
        parameters     mean     std  naive_se    mcse       ess   r_hat
  ────────────────  ───────  ──────  ────────  ──────  ────────  ──────
   coefficients[1]  -0.0413  0.5648    0.0126  0.0389  265.1907  1.0010
   coefficients[2]   0.2770  0.6994    0.0156  0.0401  375.2777  1.0067
         intercept   0.0058  0.1179    0.0026  0.0044  580.0222  0.9995
                σ₂   0.3017  0.1955    0.0044  0.0132  227.2322  1.0005

Is it possible to use the names() from the dataframe to create the coefficient names? This could happen (1) during model building or (2) after the trace is constructed. I think doing it during model building is the most flexible, since every part of the analysis would then automatically include the names.
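For option (2), one possible sketch is to rename the parameters on the chain after sampling, using replacenames from MCMCChains. The chain below is filled with random numbers as a stand-in for real sampler output, and the mapping assumes the covariate matrix was built as Matrix(df), so that coefficients[i] lines up with the i-th column of df.

```julia
using DataFrames
using MCMCChains

df = DataFrame(age=[20, 30, 40], height=[80, 130, 200])

# Stand-in for a sampled chain: 100 draws, 4 parameters, 1 chain.
# In practice this would be the object returned by Turing's `sample`.
param_names = ["coefficients[1]", "coefficients[2]", "intercept", "σ₂"]
chain = Chains(randn(100, 4, 1), param_names)

# Assumption: column i of the covariate matrix is column i of df,
# so "coefficients[i]" can be mapped to names(df)[i].
renames = Dict("coefficients[$i]" => name for (i, name) in enumerate(names(df)))

# Returns a new Chains object with the renamed parameters.
renamed = replacenames(chain, renames)
println(names(renamed))
```

This only changes the labels on the finished chain, so anything computed inside the model still sees coefficients as a plain vector.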

The StatsModels package provides a formula language to convert from a symbolic description of a regression-like model to the model matrix.

…and @cpfiffer at some point had mocked up a brms-style integration with Turing, I think it was here: https://github.com/cpfiffer/BayesModels

If you want to use StatsModels directly, then you could do something like

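using StatsModels

# The formula looks up `weight` in df, so it has to be a column there.
df.weight = weight

f = @formula(weight ~ 1 + age + height)
f_concrete = apply_schema(f, schema(df))

# modelcols needs the schema-applied formula, not the raw one.
y, x = modelcols(f_concrete, df)
# ... turing magic

respname, prednames = coefnames(f_concrete)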

Note that this way the intercept is included in x as a column of ones, so you’d have to modify your model (e.g. drop the separate intercept term), OR force no intercept by using @formula(weight ~ 0 + age + height).
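A minimal sketch of the no-intercept variant, assuming weight is stored as a column of df. With 0 in the formula, x contains only the age and height columns, so it can be passed to a model that keeps its own intercept parameter, and prednames lines up with the columns of x.

```julia
using DataFrames
using StatsModels

df = DataFrame(age=[20, 30, 40], height=[80, 130, 200], weight=[100, 120, 200])

# `0 +` suppresses the intercept column in the model matrix.
f = @formula(weight ~ 0 + age + height)
f_concrete = apply_schema(f, schema(f, df))

# y is the response vector, x the covariate matrix (one column per predictor).
y, x = modelcols(f_concrete, df)
respname, prednames = coefnames(f_concrete)

# Column j of x corresponds to prednames[j], i.e. to coefficients[j] in the model.
for (j, name) in enumerate(prednames)
    println("coefficients[$j] <-> ", name)
end
```

The prednames vector can then be used to label the coefficients in whatever output you build from the chain.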

One of these days I may have to turn that from a mockup into a real package.

it’d make a great GSOC project actually…pretty clear scope and just need someone to do it…

Oh, very true! I’ll keep that in mind when we’re adding projects for Turing.