I’m looking to fit a linear probability model of the form:
y_i = \alpha X_i + PT_i \beta + \phi_j + \lambda_r + \epsilon_i
Where:
- y_i is 1 if the worker is an independent contractor on their main job, else 0
- X_i is a vector of personal characteristics
- PT_i is 1 if the worker is part-time, else 0
- \phi_j is a vector of occupation dummies
- \lambda_r is a vector of region dummies
The personal characteristics are age plus a few categorical variables (gender, educational attainment, race, etc.).
I don’t have any experience with this kind of model, but from what I’ve read, a linear probability model is just a GLM with a binomially distributed error term and an identity link function. So shouldn’t I be able to do something like this with GLM.jl?
```julia
model = glm(@formula(pric ~ prtage + pehspnon + ptdtrace + prftlf + pesex + peeduca + prdtocc1 + gereg),
            model_df, Binomial(), IdentityLink(), wts=model_df.pwsupwgt)
```
When I do this, I get the following error:
```
PosDefException: matrix is not positive definite; Cholesky factorization failed.
checkpositivedefinite@factorization.jl:18[inlined]
var"#cholesky!#125"(::Bool, ::typeof(LinearAlgebra.cholesky!), ::LinearAlgebra.Hermitian{Float64, Matrix{Float64}}, ::Val{false})@cholesky.jl:253
cholesky!@cholesky.jl:252[inlined]
delbeta!(::GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}, ::Vector{Float64}, ::Vector{Float64})@linpred.jl:153
_fit!(::GLM.GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Distributions.Binomial{Float64}, GLM.IdentityLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, ::Bool, ::Int64, ::Float64, ::Float64, ::Float64, ::Nothing)@glmfit.jl:307
#fit!#12@glmfit.jl:373[inlined]
fit!@glmfit.jl:353[inlined]
var"#fit#14"(::Bool, ::Vector{Float64}, ::Vector{Int64}, ::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ::typeof(StatsBase.fit), ::Type{GLM.GeneralizedLinearModel}, ::Matrix{Float64}, ::Vector{Int64}, ::Distributions.Binomial{Float64}, ::GLM.IdentityLink)@glmfit.jl:484
var"#fit#63"(::Dict{Symbol, Any}, ::Base.Iterators.Pairs{Symbol, Vector{Float64}, Tuple{Symbol}, NamedTuple{(:wts,), Tuple{Vector{Float64}}}}, ::typeof(StatsBase.fit), ::Type{GLM.GeneralizedLinearModel}, ::StatsModels.FormulaTerm{StatsModels.Term, NTuple{8, StatsModels.Term}}, ::DataFrames.DataFrame, ::Distributions.Binomial{Float64}, ::Vararg{Any, N} where N)@statsmodel.jl:88
var"#glm#16"(::Base.Iterators.Pairs{Symbol, Vector{Float64}, Tuple{Symbol}, NamedTuple{(:wts,), Tuple{Vector{Float64}}}}, ::typeof(GLM.glm), ::StatsModels.FormulaTerm{StatsModels.Term, NTuple{8, StatsModels.Term}}, ::DataFrames.DataFrame, ::Distributions.Binomial{Float64}, ::Vararg{Any, N} where N)@glmfit.jl:504
top-level scope@Local: 1
```
If I remove the `prdtocc1` variable, it works. I’m using CategoricalArrays to convert all of the categorical columns in my DataFrame. Any ideas as to why it would break with the `prdtocc1` variable?
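One thing I wondered about is collinearity: if the `prdtocc1` dummies are linearly dependent with the intercept or with another set of dummies (or one occupation level has no observations), then the X′WX matrix is singular and its Cholesky factorization fails with exactly this `PosDefException`. A minimal stdlib-only sketch of that failure mode (toy data, not my real columns):

```julia
using LinearAlgebra

# Toy design matrix: an intercept plus a *full* set of two group dummies.
# The two dummy columns sum to the intercept column, so X is rank-deficient.
X = [1.0 1.0 0.0;
     1.0 1.0 0.0;
     1.0 0.0 1.0;
     1.0 0.0 1.0]

XtX = Symmetric(X' * X)
println(rank(X))      # 2 rather than 3: the columns are linearly dependent

# This is the same factorization GLM.jl attempts internally
try
    cholesky(XtX)
catch err
    println(typeof(err))   # a PosDefException, as in the stack trace above
end
```

If that's what's going on, I'd expect dropping one redundant level (or the intercept) to restore full rank, which would be consistent with the model fitting once `prdtocc1` is removed.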
Lastly (not a Julia question), I’m a bit confused about the specification of this model. What is the interpretation of \alpha and \beta in the formula above? \alpha doesn’t seem to be the intercept, since it multiplies X_i, and \beta doesn’t look like a generic coefficient, since it applies only to the PT variable. And why are there no coefficients on \phi or \lambda?
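For what it’s worth, my best guess at a fully explicit version of the model (this is just my reading, not anything from where I got the equation) is:

```latex
y_i = X_i'\alpha + \beta \, PT_i
      + \sum_{j} \phi_j \, D^{occ}_{ij}
      + \sum_{r} \lambda_r \, D^{reg}_{ir}
      + \epsilon_i
```

where \alpha is a vector of coefficients on the personal characteristics X_i, \beta is the scalar coefficient on the part-time dummy, and D^{occ}_{ij} and D^{reg}_{ir} are the occupation and region dummies, so \phi_j and \lambda_r are themselves the coefficients on those dummies. Is that the right way to read it?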