You still haven’t provided a full MWE that reproduces the error you described. Here is one that does, along with a solution:
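julia> function _onehot(df, symb)
           # work on a copy so the caller's DataFrame is not mutated
           out = copy(df)
           # add one Boolean indicator column per distinct value of `symb`
           for c in unique(out[!, symb])
               out[!, Symbol(c)] = out[!, symb] .== c
           end
           return out
       end;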
julia> begin
           using DataFrames, Chain, GLM
           teams = ["Jazz", "Heat", "Hawks"]
           rank = ["first", "second", "third"]
           outcome = [true, false]
           df = DataFrame(Id = 1:50, team = rand(teams, 50), rank = rand(rank, 50), outcome = rand(outcome, 50))
           # add indicator columns for every level of :team and :rank
           df2 = @chain df begin
               _onehot(:team)
               _onehot(:rank)
           end
           fm_bad = @formula(outcome ~ Jazz + Heat + Hawks + first + second + third)
           # will fail: each full set of dummies sums to the intercept column, so the predictors are collinear
           # logit_bad = glm(fm_bad, df2, Binomial(), ProbitLink())
           fm_good1 = @formula(outcome ~ Jazz + Heat + second + third)
           # will work: one dummy is excluded from each set
           # (note that ProbitLink() fits a probit model; use LogitLink() for a logit)
           logit_good1 = glm(fm_good1, df2, Binomial(), ProbitLink())
           fm_good2 = @formula(outcome ~ team + rank)
           # even better: the formula dummy-codes the string columns and drops a reference level for you
           logit_good2 = glm(fm_good2, df2, Binomial(), ProbitLink())
       end
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, ProbitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}
outcome ~ 1 + team + rank
Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      z  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)    0.128893    0.343977   0.37    0.7079  -0.54529   0.803075
team: Heat    -0.94062     0.511527  -1.84    0.0659  -1.94319   0.0619533
team: Jazz    -0.432373    0.465327  -0.93    0.3528  -1.3444    0.479652
rank: second  -0.502133    0.474511  -1.06    0.2900  -1.43216   0.427891
rank: third   -0.217286    0.464381  -0.47    0.6399  -1.12746   0.692884
──────────────────────────────────────────────────────────────────────────
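
If it helps to see why the formula with every dummy breaks, here is a minimal sketch using only the standard library. The toy vector `toy_team` and the matrix names are made up for illustration and are not part of the MWE above. It builds the indicator columns by hand and compares the rank of the design matrix with and without one dummy dropped. `rank` is called as `LinearAlgebra.rank` because the MWE defines its own variable named `rank`.

```julia
using LinearAlgebra

# Hypothetical toy data: six observations of a three-level factor
toy_team = ["Jazz", "Heat", "Hawks", "Jazz", "Heat", "Hawks"]
lvls = unique(toy_team)

# One 0/1 indicator column per level, the same idea as the _onehot helper
dummies = hcat([Float64.(toy_team .== l) for l in lvls]...)
intercept = ones(length(toy_team))

# Intercept plus all three dummies: the dummies sum to the intercept column
X_full = hcat(intercept, dummies)
# Intercept plus two dummies: the dropped level becomes the reference
X_drop = hcat(intercept, dummies[:, 2:end])

println("full: ", size(X_full, 2), " columns, rank ", LinearAlgebra.rank(X_full))
println("drop: ", size(X_drop, 2), " columns, rank ", LinearAlgebra.rank(X_drop))
```

The full matrix has four columns but only rank 3, so the coefficients are not identified. Dropping one dummy per factor gives a full-rank matrix, which is what `fm_good1` does by hand and what `team + rank` does automatically.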