Change Base Level Categorical Vector in GLM

Is there a way to change the base level of a categorical vector in GLM? It looks like the first level in the vector is chosen as the base level. I have tried to change the order of the levels using levels!, but it does not seem to make a difference.

Check out the use of Contrasts in the StatsModels package, which is what generates the model matrix in a GLM model.


Sorry for necroposting; I looked in the docs referenced above but couldn’t find a simple example.

How do I specify that the base level in the following example should be "none" (because m comes before n and s, the base level is currently "many")?

using DataFrames, GLM
df = DataFrame((x = x, y = y) for x in ("none", "some", "many") for y in rand(10))
ols = lm(@formula(y ~ x), df)

Thanks!

If you convert the variable x to a CategoricalArray you can use relevel! to change the order of the levels.


The StatsModels documentation says the default categorical encoding is DummyCoding, with the first level as the reference level. After constructing the untyped FormulaTerm and applying the schema, you can create the ModelFrame and use setcontrasts! to choose the contrast coding system and the base level you want:

mf = ModelFrame(@formula(y ~ 1 + a + b), df)          # create ModelFrame

c1 = DummyCoding(base="a", levels=["a", "b", "c"])    # build desired contrasts
c2 = HelmertCoding(base=2, levels=[1, 2, 3])

setcontrasts!(mf, Dict(:a => c1, :b => c2))           # set preferred contrasts
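
For the concrete case in this thread (a single categorical column :x and the df from the question), a minimal sketch of that route might look like the following; it also relies on the levels being picked up from the data, so only the base needs to be given:

using StatsModels                                         # for ModelFrame, ModelMatrix, setcontrasts!, DummyCoding

mf = ModelFrame(@formula(y ~ x), df)                      # build the ModelFrame from the untyped formula
setcontrasts!(mf, Dict(:x => DummyCoding(base="none")))   # make "none" the reference level
mm = ModelMatrix(mf)                                      # model matrix now has columns for "x: many" and "x: some"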

@dmbates’s solution is easier if you don’t need to change the contrast coding system.


I went with this, and just for posterity’s sake, the function is levels! (not relevel!, though I agree that name would make more sense).
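
For completeness, a minimal sketch of that route, assuming the df from the question and the CategoricalArrays package:

using CategoricalArrays, DataFrames, GLM

df.x = categorical(df.x)                 # turn the String column into a CategoricalArray
levels!(df.x, ["none", "some", "many"])  # put "none" first so it becomes the base level
ols = lm(@formula(y ~ x), df)            # default DummyCoding uses the first level as the reference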

IIRC you can pass that dict as the contrasts= kw arg in the GLM functions…

Edit: you also don’t need to specify the levels just to change the base (they are extracted when creating the ContrastsMatrix later)


The docs on this could probably be better to be fair :slight_smile:


Yes, thank you. The flexibility of the GLM design allows one to define characteristics of categorical variables in several places.

You should be able to do

ols = lm(@formula(y ~ x), df, contrasts = Dict(:x => DummyCoding(base="none")))

But that only works on GLM master at the moment (for some reason kwargs weren’t being passed to fit…). For now you can do

ols = fit(LinearModel, @formula(y ~ x), df, contrasts = Dict(:x => DummyCoding(base="none")))

Here’s an example to prove it works :wink:

julia> df = DataFrame(y = rand(12), x = repeat(["some", "none", "many"], 4))
12×2 DataFrame
│ Row │ y           │ x      │
│     │ Float64     │ String │
├─────┼─────────────┼────────┤
│ 1   │ 0.886704    │ some   │
│ 2   │ 0.226046    │ none   │
│ 3   │ 0.196006    │ many   │
│ 4   │ 0.40502     │ some   │
│ 5   │ 0.000587951 │ none   │
│ 6   │ 0.395679    │ many   │
│ 7   │ 0.773263    │ some   │
│ 8   │ 0.260962    │ none   │
│ 9   │ 0.795655    │ many   │
│ 10  │ 0.757367    │ some   │
│ 11  │ 0.401757    │ none   │
│ 12  │ 0.732157    │ many   │

julia> fit(LinearModel, @formula(y ~ x), df, contrasts = Dict(:x => DummyCoding(base="none")))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x

Coefficients:
───────────────────────────────────────────────────────────────────────────
             Estimate  Std. Error  t value  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)  0.222338    0.112337  1.9792     0.0792  -0.0317865   0.476463
x: many      0.307536    0.158869  1.93578    0.0849  -0.0518506   0.666923
x: some      0.48325     0.158869  3.04182    0.0140   0.123864    0.842637
───────────────────────────────────────────────────────────────────────────

What’s happening internally here is that the contrasts argument gets passed to schema as a "hint" about how to code the :x variable:

julia> sch = schema(df, Dict(:x => DummyCoding(base="none")))
StatsModels.Schema with 2 entries:
  y => y
  x => x

julia> coefnames(sch[term(:x)])
2-element Array{String,1}:
 "x: many"
 "x: some"

julia> f = apply_schema(@formula(y ~ x), sch, RegressionModel)
FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  x(DummyCoding:3→2)

julia> modelcols(f.rhs, df)
12×3 Array{Float64,2}:
 1.0  0.0  1.0
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  0.0
 1.0  1.0  0.0

julia> coefnames(f.rhs)
3-element Array{String,1}:
 "(Intercept)"
 "x: many"
 "x: some"

Update: once GLM v1.3.7 is merged (New version: GLM v1.3.7 by JuliaRegistrator · Pull Request #9611 · JuliaRegistries/General · GitHub) you should be able to use the lm(::FormulaTerm, ::Table; contrasts=...) syntax.


Thank you! :grinning:
