Is there a way to change the base level of a categorical vector in GLM? It looks like the first level in the vector is chosen as the base level. I have tried to change the order of levels using levels!
but it does not seem to make a difference
Check out the use of Contrasts
in the StatModels
package, which is what generates the model matrix in a GLM model.
Sorry for necroposting, I looked in the docs referred above but couldnβt find a simple example.
How do I specify that the base level in the following example should be "none"
(because m
comes before n
and s
, the base level is currently "many"
)?
using DataFrames, GLM
df = DataFrame((x = x, y = y) for x in ("none", "some", "many") for y in rand(10))
ols = lm(@formula(y ~ x), df)
Thanks!
If you convert the variable x
to a CategoricalArray
you can use relevel!
to change the order of the levels.
The StatsModels documentations says the default categorical encoding is DummyEncoding with the first level as the reference level. After constructing the untyped FormulaTerm and applying schema, you can create the ModelFrame and setcontrasts! to choose the contrast coding system and the base level you want:
mf = ModelFrame(@formula(y ~ 1 + a + b), df) # create ModelFrame
c1 = DummyCoding(base="a", levels=["a", "b", "c"]) # build desired contrasts
c2 = HelmertCoding(base=2, levels=[1, 2, 3])
setcontrast!(mf, Dict(:a => c1, :b => c2)) # set preferred contrast
@dmbatesβs solution is easier if you donβt need to change contrast coding system.
I went with this, and just for posterityβs sake the function is levels!
(not relevel!
, though I agree it makes more sense).
IIRC you can pass that dict as the contrasts=
kw arg in the GLM functionsβ¦
Edit: you also donβt need to specify the levels just to change the base (they are extracted when creating the ContrastsMatrix later)
The docs on this could probably be better to be fair
Yes, thank you. The flexibility of the GLM design allows one to define characteristics of categorical variables at several places.
You should be able to do
ols = lm(@formula(y ~ x), df, contrasts = Dict(:x => DummyCoding(base="none")))
But that only works on GLM master at the moment (for some reason kwargs werenβt being passed to fitβ¦). For now you can do
ols = fit(LinearModel, df, contrasts = Dict(:x => DummyCoding(base="none")))
Hereβs an example to prove it works
julia> df = DataFrame(y = rand(12), x = repeat(["some", "none", "many"], 4))
12Γ2 DataFrame
β Row β y β x β
β β Float64 β String β
βββββββΌββββββββββββββΌβββββββββ€
β 1 β 0.886704 β some β
β 2 β 0.226046 β none β
β 3 β 0.196006 β many β
β 4 β 0.40502 β some β
β 5 β 0.000587951 β none β
β 6 β 0.395679 β many β
β 7 β 0.773263 β some β
β 8 β 0.260962 β none β
β 9 β 0.795655 β many β
β 10 β 0.757367 β some β
β 11 β 0.401757 β none β
β 12 β 0.732157 β many β
julia> fit(LinearModel, @formula(y ~ x), df, contrasts = Dict(:x => DummyCoding(base="none")))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
y ~ 1 + x
Coefficients:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Estimate Std. Error t value Pr(>|t|) Lower 95% Upper 95%
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
(Intercept) 0.222338 0.112337 1.9792 0.0792 -0.0317865 0.476463
x: many 0.307536 0.158869 1.93578 0.0849 -0.0518506 0.666923
x: some 0.48325 0.158869 3.04182 0.0140 0.123864 0.842637
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Whatβs happening internally here is that the contrasts argument gets passed to schema
as a βhintβ about how to code the :x
variable:
julia> sch = schema(df, Dict(:x => DummyCoding(base="none")))
StatsModels.Schema with 2 entries:
y => y
x => x
julia> coefnames(sch[term(:x)])
2-element Array{String,1}:
"x: many"
"x: some"
julia> f = apply_schema(@formula(y ~ x), sch, RegressionModel)
FormulaTerm
Response:
y(continuous)
Predictors:
1
x(DummyCoding:3β2)
julia> modelcols(f.rhs, df)
12Γ3 Array{Float64,2}:
1.0 0.0 1.0
1.0 0.0 0.0
1.0 1.0 0.0
1.0 0.0 1.0
1.0 0.0 0.0
1.0 1.0 0.0
1.0 0.0 1.0
1.0 0.0 0.0
1.0 1.0 0.0
1.0 0.0 1.0
1.0 0.0 0.0
1.0 1.0 0.0
julia> coefnames(f.rhs)
3-element Array{String,1}:
"(Intercept)"
"x: many"
"x: some"
Update: once GLM v.1.3.7 is merged (New version: GLM v1.3.7 by JuliaRegistrator Β· Pull Request #9611 Β· JuliaRegistries/General Β· GitHub) you should be able to use the lm(::FormulaTerm, ::Table; contrasts=...)
syntax
Thank you!