Getting model parameter names from DataFrame

jzr · November 11, 2020, 11:03pm

Suppose I have

df = DataFrame(
    age=[20, 30, 40], 
    height=[80, 130, 200], 
)
weight = [100, 120, 200]

I want to predict weight from the other columns using linear regression. There are two options: explicitly write each variable into the model, or build a covariate matrix X and pass that to the model.

The linear regression tutorial uses a covariate matrix x:

# Bayesian linear regression.
@model function linear_regression(x, y)
    # Set variance prior.
    σ₂ ~ truncated(Normal(0, 100), 0, Inf)
    
    # Set intercept prior.
    intercept ~ Normal(0, sqrt(3))
    
    # Set the priors on our coefficients.
    nfeatures = size(x, 2)
    coefficients ~ MvNormal(nfeatures, sqrt(10))
    
    # Calculate all the mu terms.
    mu = intercept .+ x * coefficients
    y ~ MvNormal(mu, sqrt(σ₂))
end

It is easy to just use Array(df) as my covariate matrix, but that means all my coefficients have opaque names like coefficients[2] in the output.

Summary Statistics
        parameters     mean     std  naive_se    mcse       ess   r_hat
  ────────────────  ───────  ──────  ────────  ──────  ────────  ──────
   coefficients[1]  -0.0413  0.5648    0.0126  0.0389  265.1907  1.0010
   coefficients[2]   0.2770  0.6994    0.0156  0.0401  375.2777  1.0067
         intercept   0.0058  0.1179    0.0026  0.0044  580.0222  0.9995
                σ₂   0.3017  0.1955    0.0044  0.0132  227.2322  1.0005

Is it possible to use the names() from the dataframe to create the coefficient names? This could happen (1) during model building or (2) after the trace is constructed. I think the idea of doing it during model building is most flexible, so every part of the analysis will automatically include the names.

dmbates · November 12, 2020, 6:43pm

The StatsModels package provides a formula language to convert from a symbolic description of a regression-like model to the model matrix.

dave.f.kleinschmidt · November 12, 2020, 7:13pm

…and @cpfiffer at some point had mocked up a brms-style integration with Turing, I think it was here: https://github.com/cpfiffer/BayesModels

If you want to use StatsModels directly, then you could do something like

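using StatsModels

# Assumes weight has been added as a column, e.g. df.weight = weight
f = @formula(weight ~ 1 + age + height)
# Resolve the symbolic terms against the columns of df
f_concrete = apply_schema(f, schema(df))

# Use the concrete formula here, not the symbolic f
y, x = modelcols(f_concrete, df)
# ... turing magic

# Response name and the names of the columns of x
respname, prednames = coefnames(f_concrete)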

Note that this way, the intercept would be included in x so you’d have to modify your model, OR force no intercept by using @formula(weight ~ 0 + age + height).
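
To make the naming step concrete, here is a minimal sketch of the no-intercept variant, so the model from the first post can keep its own intercept term. It assumes weight is stored as a column of df, and labels is just a hypothetical helper name for a lookup from the opaque coefficients[i] names to the column names.

```julia
using DataFrames, StatsModels

# Assumption: weight lives in the same DataFrame as the predictors.
df = DataFrame(age=[20, 30, 40], height=[80, 130, 200], weight=[100, 120, 200])

# `0 +` drops the intercept column, so x holds only age and height.
f = @formula(weight ~ 0 + age + height)
f_concrete = apply_schema(f, schema(f, df))

# y is the response vector, x the model matrix (one column per predictor).
y, x = modelcols(f_concrete, df)
respname, prednames = coefnames(f_concrete)

# Hypothetical helper: map the positional names to the DataFrame column names.
labels = Dict("coefficients[$i]" => name for (i, name) in enumerate(prednames))

# x and y can now be passed to linear_regression(x, y). After sampling, the
# chain's parameters could be renamed with this mapping; I believe
# MCMCChains.replacenames(chain, labels) does this, but check its docs.
```

This only covers option (2) from the question, renaming after the trace is constructed. The order of prednames follows the columns of x, which is what keeps the mapping to coefficients[i] valid.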

cpfiffer · November 13, 2020, 5:41pm

One of these days I may have to turn that from a mockup into a real package.

dave.f.kleinschmidt · November 18, 2020, 9:44pm

it’d make a great GSOC project actually…pretty clear scope and just need someone to do it…

cpfiffer · November 19, 2020, 3:15am

Oh, very true! I’ll keep that in mind when we’re adding projects for Turing.
