# Regression with categorical variables. Reference previous levels. Dummy

Hello.

If I fit a linear regression model, I can include the year as a categorical variable (instead of a numerical one) in order to estimate a different coefficient for each year. This is commonly done when the effect of the variable is not linear, or when you want to detect something unusual in a specific year…

When you do this with common regression packages, they automatically take one of the levels (classes) as the reference and calculate the coefficients relative to this base level. It can also be done taking the grand mean of all levels as the reference (I think packages provide this under the name “contrasts”, but it is related to the way we code the dummy variables).

My question is…
What if I want to fit the model so that each year’s coefficient is relative to the previous year instead of a common reference level? How would you code it?

The StatsModels package provides several “contrast” specifications that can be used with modeling functions. They generally have names that end in `Coding`. I think you are looking for `SeqDiffCoding`.

Thank you, I will read it.

But how would you do it without extra packages or contrasts?

Usually you would write something like this (the contrasts compare each level to the base level):
`y ~ F + G + H`, where F, G and H are dummy variables; they are zeros or ones in the dataset.

If instead I want sequential differences (but keeping the same zeros and ones in the dataset), I should specify a regression with the differences …, how can I do it exactly?

Something like y ~ F + ((G+Not(F))/2) + ((H+Not(G))/2)

PS: I’ve been thinking about this, and the idea is easy but its implementation is tedious if we have many levels.

For example if we originally have dummy variables F, G, H and I, and only one of them can be 1,
the model should calculate coefficients for the new dummy incremental variables as:
(1-F) + (1-F)(1-G) + (1-F)(1-G)(1-H) + (1-F)(1-G)(1-H)(1-I)
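The cumulative products above can be generated mechanically instead of being written out by hand. Here is a minimal Python sketch, assuming the dummies are ordered in time (F first, I last); the helper name `incremental_dummies` is made up for illustration and does not come from any package.

```python
from itertools import accumulate

def incremental_dummies(one_hot_row):
    """Running products (1-F), (1-F)(1-G), ... over the ordered dummies."""
    return list(accumulate((1 - d for d in one_hot_row), lambda a, b: a * b))

levels = ["F", "G", "H", "I"]  # assumed to be in chronological order
for i, name in enumerate(levels):
    # One-hot row: only the dummy for the current level is 1.
    row = [1 if j == i else 0 for j in range(len(levels))]
    print(name, row, "->", incremental_dummies(row))

# A row with every dummy equal to 0 (a level that comes after I, if the data has one).
print("after I", [0, 0, 0, 0], "->", incremental_dummies([0, 0, 0, 0]))
```

As far as I can tell, the last product `(1-F)(1-G)(1-H)(1-I)` is nonzero only for a row where all four dummies are 0. If exactly one of F, G, H, I is always 1, that column is all zeros and would have to be dropped before fitting.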

If I originally have a variable YEAR with many possible options then it’s better to use some automatic way to create all this, with contrasts as you suggested.

And even simpler, we can work directly with the year.
Each dummy variable would be:
(YEAR >2000), (YEAR >2001), (YEAR >2002), (YEAR >2003), …
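To see why these step dummies give year-over-year coefficients: an observation from a given year switches on every dummy whose threshold lies below that year, so its fitted value is the intercept plus the sum of those coefficients. Moving from one year to the next adds exactly one coefficient, which therefore has to be the change in the mean between the two years. Here is a small Python sketch that checks this with plain least squares. The data are made up, and `design_row` and `solve` are hypothetical helper names written for this example, not part of any regression package.

```python
from statistics import mean

# Made-up (year, y) observations, for illustration only.
data = [(2000, 10.0), (2000, 12.0), (2001, 15.0), (2001, 17.0),
        (2002, 14.0), (2002, 16.0), (2003, 20.0), (2003, 22.0)]

years = sorted({year for year, _ in data})
thresholds = years[:-1]  # one step dummy (YEAR > t) per year except the last

def design_row(year):
    """Intercept followed by the step dummies (YEAR > t)."""
    return [1.0] + [1.0 if year > t else 0.0 for t in thresholds]

X = [design_row(year) for year, _ in data]
y = [value for _, value in data]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Ordinary least squares through the normal equations (X'X) beta = X'y.
k = len(X[0])
XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
Xty = [sum(row[i] * v for row, v in zip(X, y)) for i in range(k)]
beta = solve(XtX, Xty)

# Yearly means and their successive differences, for comparison.
means = [mean(v for yr, v in data if yr == year) for year in years]
diffs = [later - earlier for earlier, later in zip(means, means[1:])]

print("coefficients:", [round(b, 6) for b in beta])
print("first-year mean, then successive differences:", [means[0]] + diffs)
```

The two printed lists should agree up to rounding: the intercept is the mean of the first year, and each remaining coefficient is the difference between a year's mean and the previous year's mean. The zeros and ones in the original dataset are not needed for this; the step dummies are built from YEAR alone.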

I’m not sure I understand the point. The whole formula mechanism, including contrast specifications, is implemented in the StatsModels package. So if you are going to be using a formula you will have a direct or indirect dependence on that package anyway. I don’t see the point of recreating such a contrast specification when it has already been developed (and tested and checked) in StatsModels.

I just wanted to know the concept, how it works, how it can be done independently from the package and language.