[FixedEffectModels.jl] Switching from Dummy to Categorical Variables

I am trying to use FixedEffectModels.jl with a dataset that has a continuous explanatory variable, and I want to run regressions on a binned version of this variable. I have two approaches: a) produce categorical variables via `cut`, or b) via `map`, produce numerical assignments of the bins that can then be treated as dummy variables via `contrasts` (I am sure there are more efficient ways to do the binning).

I have two questions:

  1. If I use categorical variables, is there a way to pick a “base” as one is allowed to for dummy variables?
  2. It seems that the regressions results differ slightly between the categorical regressions and the dummy variable regression. Why would that be?
# Categorical regression
                                       Linear Model                                       
===========================================================================================
Number of obs:                          1380   Degrees of freedom:                        2
R2:                                    0.092   R2 Adjusted:                           0.091
F-Stat:                              70.1648   p-value:                               0.000
===========================================================================================
Sales                            | Estimate Std.Error  t value Pr(>|t|) Lower 95% Upper 95%
-------------------------------------------------------------------------------------------
CategoricalPrice: [75.0, 100.0)  | -5.06421   2.77783 -1.82308    0.069  -10.5135  0.385043
CategoricalPrice: [100.0, 201.9] | -22.4024   1.89182 -11.8417    0.000  -26.1135  -18.6912
(Intercept)                      |  129.814  0.974587  133.199    0.000   127.902   131.726
===========================================================================================

# Dummy regression
                              Linear Model                              
========================================================================
Number of obs:                 1380  Degrees of freedom:               2
R2:                           0.095  R2 Adjusted:                  0.093
F-Stat:                     71.9556  p-value:                      0.000
========================================================================
Sales         | Estimate Std.Error  t value Pr(>|t|) Lower 95% Upper 95%
------------------------------------------------------------------------
DummyPrice: 2 |  -4.5169   2.76519 -1.63349    0.103  -9.94134  0.907545
DummyPrice: 3 | -22.6697   1.89169 -11.9838    0.000  -26.3806  -18.9588
(Intercept)   |  129.814  0.973439  133.356    0.000   127.904   131.723
========================================================================
# CODE
using DataFrames, RDatasets, FixedEffectModels, CategoricalArrays, StatsBase
# importing data from RDatasets
df = dataset("plm", "Cigar")
# creating categorical price column
c = cut(df.Price, [0; 75; 100], extend = true)
insertcols!(df, ncol(df) + 1, :CategoricalPrice => c)
# creating dummy price column
d = map(x -> x>100 ? 3 : (x<75 ? 1 : 2), df[:,"Price"])
insertcols!(df, ncol(df) + 1, :DummyPrice => d)
# regressions
reg(df, @formula(Sales ~ CategoricalPrice)) # running regression with categorical price variable (not sure how to pick base)
reg(df, @formula(Sales ~ DummyPrice); contrasts = Dict(:DummyPrice => DummyCoding())) # running regression with dummy variable (this defaults to base=1)
reg(df, @formula(Sales ~ DummyPrice); contrasts = Dict(:DummyPrice => DummyCoding(base = 3))) # running regression with dummy variable (with base=3 imposed)

My first question would be why you are using FixedEffectModels when you are not actually including fixed effects in your model? Seems like what you’re doing should be doable in plain GLM.
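For reference, the same specification can be run with GLM.jl alone when no fixed effects are needed (a sketch, assuming the same `df` and `CategoricalPrice` column as in the original post):

```julia
# Sketch: the original regression in plain GLM.jl, no fixed effects needed
using DataFrames, RDatasets, CategoricalArrays, GLM

df = dataset("plm", "Cigar")
df.CategoricalPrice = cut(df.Price, [0; 75; 100], extend = true)

# lm accepts the same @formula (and a contrasts keyword) as reg
lm(@formula(Sales ~ CategoricalPrice), df)
```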

I believe the answer to your second question is this:

julia> using DataFrames, RDatasets

julia> df = dataset("plm", "Cigar");

julia> df.CategoricalPrice = cut(df.Price, [0; 75; 100], extend = true);

julia> df.DummyPrice = (x -> x > 100 ? 3 : x < 75 ? 1 : 2).(df.Price);

julia> combine(groupby(df, [:CategoricalPrice, :DummyPrice]), nrow)
4×3 DataFrame
 Row │ CategoricalPrice  DummyPrice  nrow
     │ Categorical…      Int64       Int64
─────┼─────────────────────────────────────
   1 │ [0.0, 75.0)                1    919
   2 │ [75.0, 100.0)              2    129
   3 │ [100.0, 201.9]             2      1
   4 │ [100.0, 201.9]             3    331

julia> df[string.(df.CategoricalPrice) .== "[100.0, 201.9]" .&& df.DummyPrice .== 2, :]
1×11 DataFrame
 Row │ State  Year   Price    Pop      Pop16    CPI      NDI      Sales    Pimin    CategoricalPrice  DummyPrice
     │ Int64  Int64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  Categorical…      Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │    30     86    100.0   1027.0    802.8    109.6  14550.9    195.9    104.9  [100.0, 201.9]             2

so DummyPrice and CategoricalPrice do not put all observations into the same bins: the single observation with Price == 100.0 falls into [100.0, 201.9] for the categorical variable but gets dummy 2. You probably want x >= 100 in the function that assigns dummy variables.
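Concretely, changing the strict inequality to `>=` makes the two binnings agree, since `cut`'s intervals here are right-open except the last (a sketch on the same data):

```julia
# Corrected dummy assignment: x == 100.0 now lands in bin 3, matching
# cut's intervals [0.0, 75.0), [75.0, 100.0), [100.0, 201.9]
df.DummyPrice = (x -> x >= 100 ? 3 : x < 75 ? 1 : 2).(df.Price)

# every CategoricalPrice level now maps to exactly one DummyPrice value,
# so this cross-tabulation has 3 rows instead of 4
combine(groupby(df, [:CategoricalPrice, :DummyPrice]), nrow)
```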


Yes, this was an embarrassing mistake with the binning and the inequality on my part. Thank you so much for catching it and thus resolving question 2).

My question 1) was answered by a friend of mine: to set the base level of a categorical variable (here, the level corresponding to numerical dummy 3), use:

reg(df, @formula(Sales ~ CategoricalPrice); contrasts = Dict(:CategoricalPrice => DummyCoding(base = "[100.0, 201.9]")))

which yields result

                                      Linear Model                                      
=========================================================================================
Number of obs:                         1380   Degrees of freedom:                       2
R2:                                   0.092   R2 Adjusted:                          0.091
F-Stat:                             70.1648   p-value:                              0.000
=========================================================================================
Sales                           | Estimate Std.Error t value Pr(>|t|) Lower 95% Upper 95%
-----------------------------------------------------------------------------------------
CategoricalPrice: [0.0, 75.0)   |  22.4024   1.89182 11.8417    0.000   18.6912   26.1135
CategoricalPrice: [75.0, 100.0) |  17.3382   3.06524 5.65638    0.000   11.3251   23.3512
(Intercept)                     |  107.411   1.62147 66.2432    0.000   104.231   110.592
=========================================================================================

To answer your other question about why I’m using this package at all: I do use fixed effects in my real regression, hence FixedEffectModels.jl, but my question above was unrelated to the FE functionality, so I left it out of my minimal working example.
