Help fitting linear probability model with GLM.jl

I’m looking to fit a linear probability model of the form:

y_i = \alpha X_i + PT_i \beta + \phi_j + \lambda_r + \epsilon_i

Where:

  • y_i is 1 if worker is independent contractor on main job, else 0
  • X_i is a vector of personal characteristics
  • PTᵢ is 1 if the worker is part-time, else 0
  • \phi_j is a vector of occupation dummies
  • \lambda_r is a vector of region dummies

The personal characteristics are age and then a few other categorical variables (gender, educational attainment, race, etc.).

I don’t have any experience with this kind of model but, from what I’ve read, a linear probability model is just a GLM with a Binomially-distributed error term and an Identity link function. So, shouldn’t I be able to just do something like this, with GLM.jl?

model = glm(@formula(pric ~ prtage + pehspnon + ptdtrace + prftlf + pesex + peeduca + prdtocc1 + gereg), model_df, Binomial(), IdentityLink(), wts=model_df.pwsupwgt)

When I do this, I get the following error:

PosDefException: matrix is not positive definite; Cholesky factorization failed.

checkpositivedefinite@factorization.jl:18[inlined]
var"#cholesky!#125"(::Bool, ::typeof(LinearAlgebra.cholesky!), ::LinearAlgebra.Hermitian{Float64, Matrix{Float64}}, ::Val{false})@cholesky.jl:253
cholesky!@cholesky.jl:252[inlined]
delbeta!(::GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}, ::Vector{Float64}, ::Vector{Float64})@linpred.jl:153
_fit!(::GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Distributions.Binomial{Float64}, GLM.IdentityLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, ::Bool, ::Int64, ::Float64, ::Float64, ::Float64, ::Nothing)@glmfit.jl:307
#fit!#12@glmfit.jl:373[inlined]
fit!@glmfit.jl:353[inlined]
var"#fit#14"(::Bool, ::Vector{Float64}, ::Vector{Int64}, ::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ::typeof(StatsBase.fit), ::Type{GLM.GeneralizedLinearModel}, ::Matrix{Float64}, ::Vector{Int64}, ::Distributions.Binomial{Float64}, ::GLM.IdentityLink)@glmfit.jl:484
var"#fit#63"(::Dict{Symbol, Any}, ::Base.Iterators.Pairs{Symbol, Vector{Float64}, Tuple{Symbol}, NamedTuple{(:wts,), Tuple{Vector{Float64}}}}, ::typeof(StatsBase.fit), ::Type{GLM.GeneralizedLinearModel}, ::StatsModels.FormulaTerm{StatsModels.Term, NTuple{8, StatsModels.Term}}, ::DataFrames.DataFrame, ::Distributions.Binomial{Float64}, ::Vararg{Any, N} where N)@statsmodel.jl:88
var"#glm#16"(::Base.Iterators.Pairs{Symbol, Vector{Float64}, Tuple{Symbol}, NamedTuple{(:wts,), Tuple{Vector{Float64}}}}, ::typeof(GLM.glm), ::StatsModels.FormulaTerm{StatsModels.Term, NTuple{8, StatsModels.Term}}, ::DataFrames.DataFrame, ::Distributions.Binomial{Float64}, ::Vararg{Any, N} where N)@glmfit.jl:504
top-level scope@Local: 1

If I remove the prdtocc1 variable, it works. I’m using CategoricalArrays to convert all of my categorical values in my DataFrame. For example, the first row looks like this:

Any ideas as to why it would break with the prdtocc1 variable?

Lastly (not a Julia question), I’m a bit confused about the specification of this model. What is the interpretation of \alpha and \beta in the formula above? Is \alpha not the intercept (doesn’t seem like it is since it’s multiplied by X) and \beta not the coefficient for a particular independent variable (looks like it’s just applied to the PT variable)? Why are there no coefficients for \phi or \lambda?

Can you show the results of describe for just the variables in the model? i.e. DataFrames.describe(df[:, vars_in_model])

EDIT: Also maybe don’t use categorical arrays here. It’s automatically expanding those into fixed effects. Try the regression where all the categorical variables are instead integers.

1 Like

Description (formatting is a bit messed up because I’m in Pluto):

10×7 DataFrame
 Row │ variable  mean      min      median   max      nmissing  eltype
 │ Symbol    Union…    Any      Union…   Any      Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────────────────────
 1 │ pric      0.073156  0        0.0      1               0  Int64
 2 │ prtage    43.1526   15       43.0     85              0  Int64
 3 │ pwsupwgt  3207.41   225.676  3568.65  11820.5         0  Float64
 4 │ pehspnon            1                 2               0  CategoricalValue{Int64, UInt32}
 5 │ ptdtrace            1                 5               0  CategoricalValue{Int64, UInt32}
 6 │ peeduca             31                46              0  CategoricalValue{Int64, UInt32}
 7 │ prftlf              1                 2               0  CategoricalValue{Int64, UInt32}
 8 │ pesex               1                 2               0  CategoricalValue{Int64, UInt32}
 9 │ prdtocc1            1                 23              0  CategoricalValue{Int64, UInt32}
10 │ gereg               1                 4               0  CategoricalValue{Int64, UInt32}

If I don’t use CategoricalArrays, my DataFrame is now like this:

      From worker 5:    10×7 DataFrame
      From worker 5:     Row │ variable  mean         min      median   max      nmissing  eltype
      From worker 5:         │ Symbol    Float64      Real     Float64  Real     Int64     DataType
      From worker 5:    ─────┼──────────────────────────────────────────────────────────────────────
      From worker 5:       1 │ pric         0.073156    0         0.0       1           0  Int64
      From worker 5:       2 │ prtage      43.1526     15        43.0      85           0  Int64
      From worker 5:       3 │ pwsupwgt  3207.41      225.676  3568.65  11820.5         0  Float64
      From worker 5:       4 │ pehspnon     1.86237     1         2.0       2           0  Int64
      From worker 5:       5 │ ptdtrace     1.31701     1         1.0       5           0  Int64
      From worker 5:       6 │ peeduca     40.798      31        40.0      46           0  Int64
      From worker 5:       7 │ prftlf       1.18239     1         1.0       2           0  Int64
      From worker 5:       8 │ pesex        1.47913     1         1.0       2           0  Int64
      From worker 5:       9 │ prdtocc1    12.0863      1        14.0      23           0  Int64
      From worker 5:      10 │ gereg        2.70993     1         3.0       4           0  Int64

and my regression works, but now I don’t get a coefficient for each category. For example, if I just use the race variable and it’s included as a categorical array:

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, IdentityLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

pric ~ 1 + ptdtrace

Coefficients:
───────────────────────────────────────────────────────────────────────────────
                  Coef.   Std. Error        z  Pr(>|z|)   Lower 95%   Upper 95%
───────────────────────────────────────────────────────────────────────────────
(Intercept)  0.0678105   3.47127e-5   1953.48    <1e-99  0.0677424   0.0678785
ptdtrace: 2  0.0039614   0.000103516    38.27    <1e-99  0.00375851  0.00416428
ptdtrace: 3  0.0335458   0.000373904    89.72    <1e-99  0.032813    0.0342787
ptdtrace: 4  0.00177455  0.000140273    12.65    <1e-35  0.00149962  0.00204948
ptdtrace: 5  0.0138233   0.000503525    27.45    <1e-99  0.0128364   0.0148102

When it’s not categorical, I get this:

pric ~ 1 + ptdtrace

Coefficients:
──────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error        z  Pr(>|z|)   Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  0.0657825   6.23563e-5  1054.95    <1e-99  0.0656603   0.0659047
ptdtrace     0.00226042  4.14767e-5    54.50    <1e-99  0.00217912  0.00234171

Then I think the problem is that you have too many categorical variables, the combination of which becomes perfectly collinear somehow. I recommend adding and removing variables from your regression to pinpoint the problem.

1 Like

If you want a coefficient for each category you’ll need to dummy code the categories and not fit an intercept. When using a categorical variable as a predictor, the default behaviour of GLM is to dummy code the variable, omit the first category, and fit an intercept. You want to override this default behaviour.

This applies whether you have 1 or several categorical variables.

2 Likes