Factors for regression models in Julia

I want to build a linear mixed model in Julia, but the R code I am basing it on requires factors for the model. I came across a Stack Overflow post on just this topic, but it was from 2017 and I am hoping there is a better solution by now: something as clean and easy as the single line of R that converts a column with as.factor, for example data$columnx <- as.factor(data$columnx). Does Julia have a better solution than PooledDataArray?

That answer is indeed quite old. If you have a string (or character) column in your data, StatsModels will automatically interpret it as categorical and dummy code it:

julia> using GLM, DataFrames

julia> df = DataFrame(y = rand(100), x1 = rand('a':'d', 100), x2 = rand(100));

julia> lm(@formula(y ~ x1 + x2), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x1 + x2

Coefficients:
───────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   0.551823     0.0647513   8.52    <1e-12   0.423275  0.68037
x1: b        -0.120328     0.0671553  -1.79    0.0764  -0.253648  0.0129918
x1: c        -0.00982324   0.0750967  -0.13    0.8962  -0.158909  0.139263
x1: d        -0.116038     0.0774532  -1.50    0.1374  -0.269802  0.0377261
x2           -0.0496876    0.0955251  -0.52    0.6042  -0.239329  0.139954
───────────────────────────────────────────────────────────────────────────

If you need more control over your categoricals, look at CategoricalArrays.jl (JuliaData/CategoricalArrays.jl on GitHub), which provides arrays for working with categorical data (both nominal and ordinal).
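As a rough analogue of the R one-liner from the question, here is a minimal sketch using `categorical` from CategoricalArrays.jl. The data frame and the column name `columnx` are made up for illustration, and the sketch assumes CategoricalArrays is installed alongside DataFrames.

```julia
using DataFrames, CategoricalArrays

# Hypothetical data; `columnx` mirrors the column name used in the question
df = DataFrame(y = rand(6), columnx = ["low", "high", "mid", "low", "mid", "high"])

# Rough equivalent of R's data$columnx <- as.factor(data$columnx)
df.columnx = categorical(df.columnx)

# Inspect the levels; for sortable values the default order is sorted,
# so here the order is high, low, mid
levels(df.columnx)
```

Once the column is a CategoricalArray, it can be used in a `@formula` in the same way as the plain character column in the example above.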


Is there a way to call the levels of the factor to make sure it's been done the way I expect?

Not sure I understand the question - the regression output shows you the levels (and implicitly what the base level is)?

You can specify the base level (and even other forms of coding) explicitly if necessary; see the "Contrast coding categorical variables" section of the StatsModels.jl documentation.

Another way of choosing the reference level is to use a CategoricalArray (equivalent of factor in R) as @nilshg suggested, and use levels! to reorder its levels.
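A small sketch of the `levels!` approach, assuming CategoricalArrays.jl is installed. The vector below is illustrative; the point is that the first level is the one default dummy coding treats as the reference.

```julia
using CategoricalArrays

# Illustrative data with the same four levels as the regression example
x1 = categorical(['a', 'b', 'c', 'd', 'a', 'c'])

# Default order is sorted, so 'a' would be the reference level
levels(x1)

# Reorder in place so that 'c' comes first; every level present in the
# data has to appear in the new list
levels!(x1, ['c', 'a', 'b', 'd'])

# 'c' is now the first level
levels(x1)
```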


Yup, pass in a contrasts= argument to lm, something like contrasts=Dict(:x1 => DummyCoding(levels=['a', 'b', 'c', 'd'])). Then you’ll get an error if the levels don’t match. You can also use this to control the ordering of the levels, or the type of contrasts used (as described in the docs that @nilshg linked to).
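Spelled out as a full call, that might look like the sketch below. It assumes that `using GLM` makes `DummyCoding` and `coefnames` available (they come from StatsModels; add `using StatsModels` if they are not found), and it uses `base = 'b'` purely as an example of picking a different reference level. The data is synthetic, with `x1` built by `repeat` so that all four levels are guaranteed to occur.

```julia
using GLM, DataFrames

# Synthetic data; every level 'a'..'d' appears exactly 25 times
df = DataFrame(y = rand(100), x1 = repeat(['a', 'b', 'c', 'd'], 25), x2 = rand(100))

# Declare the expected levels explicitly and choose 'b' as the base level
model = lm(@formula(y ~ x1 + x2), df;
           contrasts = Dict(:x1 => DummyCoding(base = 'b', levels = ['a', 'b', 'c', 'd'])))

# The coefficient names show which levels got a dummy column;
# the base level 'b' is the one that is absent
coefnames(model)
```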