Change Base Level Categorical Vector in GLM

Is there a way to change the base level of a categorical vector in GLM? It looks like the first level in the vector is chosen as the base level. I have tried to change the order of the levels using levels!, but it does not seem to make a difference.

Check out the use of Contrasts in the StatsModels package, which is what generates the model matrix in a GLM model.


Sorry for necroposting; I looked in the docs referenced above but couldn’t find a simple example.

How do I specify that the base level in the following example should be "none" (because m comes before n and s, the base level is currently "many")?

using DataFrames, GLM
df = DataFrame((x = x, y = y) for x in ("none", "some", "many") for y in rand(10))
ols = lm(@formula(y ~ x), df)

Thanks!

If you convert the variable x to a CategoricalArray you can use relevel! to change the order of the levels.


The StatsModels documentation says the default categorical encoding is DummyCoding, with the first level as the reference level. After constructing the untyped FormulaTerm and applying the schema, you can create the ModelFrame and use setcontrasts! to choose the contrast coding system and the base level you want:

mf = ModelFrame(@formula(y ~ 1 + a + b), df)          # create ModelFrame

c1 = DummyCoding(base="a", levels=["a", "b", "c"])    # build desired contrasts
c2 = HelmertCoding(base=2, levels=[1, 2, 3])

setcontrasts!(mf, Dict(:a => c1, :b => c2))           # set preferred contrasts
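
For the concrete case in this thread (a single categorical column :x and the df from the question), a minimal sketch of that route might look like the following; it also relies on the levels being picked up from the data, so only the base needs to be given:

using StatsModels                                         # for ModelFrame, ModelMatrix, setcontrasts!, DummyCoding

mf = ModelFrame(@formula(y ~ x), df)                      # build the ModelFrame from the untyped formula
setcontrasts!(mf, Dict(:x => DummyCoding(base="none")))   # make "none" the reference level
mm = ModelMatrix(mf)                                      # model matrix now has columns for "x: many" and "x: some"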

@dmbates’s solution is easier if you don’t need to change the contrast coding system.


I went with this, and just for posterity’s sake, the function is levels! (not relevel!, though I agree that name would make more sense).
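
For completeness, a minimal sketch of that route, assuming the df from the question and the CategoricalArrays package:

using CategoricalArrays, DataFrames, GLM

df.x = categorical(df.x)                 # turn the String column into a CategoricalArray
levels!(df.x, ["none", "some", "many"])  # put "none" first so it becomes the base level
ols = lm(@formula(y ~ x), df)            # default DummyCoding uses the first level as the reference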

IIRC you can pass that dict as the contrasts= kw arg in the GLM functions…

Edit: you also don’t need to specify the levels just to change the base (they are extracted when creating the ContrastsMatrix later)


The docs on this could probably be better to be fair :slight_smile:


Yes, thank you. The flexibility of the GLM design allows one to define characteristics of categorical variables in several places.

You should be able to do

ols = lm(@formula(y ~ x), df, contrasts = Dict(:x => DummyCoding(base="none")))

But that only works on GLM master at the moment (for some reason kwargs weren’t being passed to fit…). For now you can do

ols = fit(LinearModel, @formula(y ~ x), df, contrasts = Dict(:x => DummyCoding(base="none")))

Here’s an example to prove it works :wink:

julia> df = DataFrame(y = rand(12), x = repeat(["some", "none", "many"], 4))
12×2 DataFrame
│ Row │ y           │ x      │
│     │ Float64     │ String │
├─────┼─────────────┼────────┤
│ 1   │ 0.886704    │ some   │
│ 2   │ 0.226046    │ none   │
│ 3   │ 0.196006    │ many   │
│ 4   │ 0.40502     │ some   │
│ 5   │ 0.000587951 │ none   │
│ 6   │ 0.395679    │ many   │
│ 7   │ 0.773263    │ some   │
│ 8   │ 0.260962    │ none   │
│ 9   │ 0.795655    │ many   │
│ 10  │ 0.757367    │ some   │
│ 11  │ 0.401757    │ none   │
│ 12  │ 0.732157    │ many   │

julia> fit(LinearModel, @formula(y ~ x), df, contrasts = Dict(:x => DummyCoding(base="none")))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x

Coefficients:
───────────────────────────────────────────────────────────────────────────
             Estimate  Std. Error  t value  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)  0.222338    0.112337  1.9792     0.0792  -0.0317865   0.476463
x: many      0.307536    0.158869  1.93578    0.0849  -0.0518506   0.666923
x: some      0.48325     0.158869  3.04182    0.0140   0.123864    0.842637
───────────────────────────────────────────────────────────────────────────

What’s happening internally here is that the contrasts argument gets passed to schema as a "hint" about how to code the :x variable:

julia> sch = schema(df, Dict(:x => DummyCoding(base="none")))
StatsModels.Schema with 2 entries:
  y => y
  x => x

julia> coefnames(sch[term(:x)])
2-element Array{String,1}:
 "x: many"
 "x: some"

julia> f = apply_schema(@formula(y ~ x), sch, RegressionModel)
FormulaTerm
Response:
  y(continuous)
Predictors:
  1
  x(DummyCoding:3→2)

julia> modelcols(f.rhs, df)
12×3 Array{Float64,2}:
 1.0  0.0  1.0
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  0.0
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  0.0
 1.0  1.0  0.0

julia> coefnames(f.rhs)
3-element Array{String,1}:
 "(Intercept)"
 "x: many"
 "x: some"

Update: once GLM v1.3.7 is merged (New version: GLM v1.3.7 by JuliaRegistrator · Pull Request #9611 · JuliaRegistries/General · GitHub) you should be able to use the lm(::FormulaTerm, ::Table; contrasts=...) syntax.


Thank you! :grinning:
