Regression with categorical variables. Reference previous levels. Dummy

Juan · November 13, 2022, 12:20pm

Hello.

If I fit a linear regression model, I can include the year as a categorical variable (instead of a numerical one) in order to estimate a different coefficient for each year. This is commonly done when the effect of the variable is not linear or when you want to detect something unusual in a specific year…

When you do this with common regression packages, they automatically take one of the levels (classes) as the reference and calculate the coefficients with respect to this base level. It can also be done taking the grand mean of all levels as the reference (I think packages expose this under the name contrasts, but it comes down to how the dummy variables are coded).

My question is…
What if I want each year’s coefficient to be expressed relative to the previous year instead of a common reference? How would you code it?

dmbates · November 13, 2022, 6:57pm

The StatsModels package provides several “contrast” specifications that can be used with modeling functions. They generally have names that end in Coding. I think you are looking for SeqDiffCoding.

Juan · November 13, 2022, 9:08pm

Thank you, I will read it.

But how would you do it without extra packages or contrasts?

Usually you would write something like this (the coefficients compare each level to the base level):
y ~ F+G+H, where F, G and H are dummy variables that take the values zero or one in the dataset.
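
To make the base-level interpretation concrete, here is a minimal sketch in plain Python. The toy data and the base level named "E" are assumptions made up for illustration; they are not from the thread. It relies on the fact that a model with one parameter per level reproduces each level's mean, so the coefficients can be read off from those means.

```python
from statistics import mean

# Hypothetical toy data: responses grouped by level. "E" is the base level,
# so it has no dummy column; F, G and H are the 0/1 dummies from the formula.
y_by_level = {
    "E": [10.0, 12.0],
    "F": [13.0, 15.0],
    "G": [20.0, 22.0],
    "H": [18.0, 20.0],
}
dummies = ["F", "G", "H"]
level_means = {lvl: mean(ys) for lvl, ys in y_by_level.items()}

# With one parameter per level, least squares reproduces each level mean,
# so the coefficients can be read off directly from those means.
intercept = level_means["E"]
coefs = {d: level_means[d] - intercept for d in dummies}

# Check: intercept + sum(coef * dummy) gives back every level mean.
for lvl in y_by_level:
    row = {d: 1 if d == lvl else 0 for d in dummies}
    fitted = intercept + sum(coefs[d] * row[d] for d in dummies)
    assert fitted == level_means[lvl]

print(intercept, coefs)
```

Every coefficient here is a difference from the same base mean, which is exactly what the question wants to replace with a difference from the previous level.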

If instead I want sequential differences (but keeping the same zeroes and ones in the dataset), I should specify a regression on the differences…, but how exactly can I do it?

Something like y ~ F + ((G+Not(F))/2) + ((H+Not(G))/2)

Juan · November 14, 2022, 1:09am

PS: I’ve been thinking about this, and the idea is easy but its implementation is tedious if we have many levels.

For example, if we originally have dummy variables F, G, H and I, and only one of them can be 1,
the model should calculate coefficients for the new incremental dummy variables:
(1-F) + (1-F)(1-G) + (1-F)(1-G)(1-H) + (1-F)(1-G)(1-H)(1-I)
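
As a quick check of these products, the sketch below (plain Python; the helper names one_hot and step_products are hypothetical) evaluates them for every level, assuming the levels are ordered F, G, H, I and exactly one dummy is 1 per row. Under that assumption the k-th product is 1 only for rows whose level comes after the k-th level, and the last product is 0 on every row, so it would have to be dropped as a regressor while the intercept takes the role of level F.

```python
# Levels in order; exactly one of the dummies F, G, H, I is 1 in each row.
levels = ["F", "G", "H", "I"]

def one_hot(level):
    """Return the original 0/1 dummies for a row at the given level."""
    return {name: int(name == level) for name in levels}

def step_products(row):
    """Cumulative products (1-F), (1-F)(1-G), ... from the one-hot dummies."""
    out, running = [], 1
    for name in levels:
        running *= 1 - row[name]
        out.append(running)
    return out

table = {lvl: step_products(one_hot(lvl)) for lvl in levels}
for pos, lvl in enumerate(levels):
    # The k-th product is 1 exactly when the row's level comes after level k.
    expected = [int(pos > k) for k in range(len(levels))]
    assert table[lvl] == expected
    print(lvl, table[lvl])
```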

If I originally have a variable YEAR with many possible values, then it’s better to create all of this automatically, with contrasts as you suggested.

And even simpler, we can work directly with the year.
Each dummy variable would be:
(YEAR > 2000), (YEAR > 2001), (YEAR > 2002), (YEAR > 2003), …
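
A minimal, package-free sketch of this threshold idea follows, in plain Python with exact fractions. The toy data, the cutoffs list and the solve helper are assumptions for illustration only. It builds an intercept plus one (YEAR > cutoff) dummy per cutoff and fits ordinary least squares through the normal equations; if the coding behaves as intended, the intercept should equal the first year's mean and each coefficient the change from the previous year.

```python
from fractions import Fraction

# Hypothetical toy data: (year, y) pairs, two observations per year.
data = [(2000, 10), (2000, 12), (2001, 13), (2001, 15),
        (2002, 20), (2002, 22), (2003, 18), (2003, 20)]
cutoffs = [2000, 2001, 2002]  # one dummy (YEAR > cutoff) per cutoff

# Design matrix: intercept column followed by the step dummies.
X = [[Fraction(1)] + [Fraction(int(year > c)) for c in cutoffs]
     for year, _ in data]
y = [Fraction(v) for _, v in data]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (exact, using Fractions)."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [row[-1] for row in M]

# Ordinary least squares via the normal equations (X'X) beta = X'y.
p = len(X[0])
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
beta = solve(XtX, Xty)

print([float(b) for b in beta])
```

On this toy data the yearly means are 11, 14, 21 and 19, so the fit returns 11, 3, 7 and -2: the first year's mean followed by the year-over-year changes.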

dmbates · November 14, 2022, 9:37pm

I’m not sure I understand the point. The whole formula mechanism, including contrast specifications, is implemented in the StatsModels package. So if you are going to be using a formula you will have a direct or indirect dependence on that package anyway. I don’t see the point of recreating such a contrast specification when it has already been developed (and tested and checked) in StatsModels.

Juan · November 14, 2022, 11:08pm

I just wanted to understand the concept: how it works and how it can be done independently of the package and language.
