There are quite a few issues in your code, and I would strongly echo Peter’s suggestion from one of your previous threads that you start by reading the Julia documentation and maybe specifically for your use case Bogumil’s excellent DataFrames tutorial.
You might also want to read up on regression modelling, as your problem here isn’t really Julia-related but a design problem with how you build your regression model. There are plenty of resources out there; a recent one I liked is Regression and Other Stories by Gelman et al.
Now what’s the problem with what you are doing here? Consider your DataFrame:
```julia
julia> Season = DataFrame(Id = 1:50, Gate = rand(50:15:3000),
                          Top3 = rand(Teams, 50),
                          Position = rand(Rank, 50),
                          Column = rand(Outcome .== "Win", 50))
50×5 DataFrame
 Row │ Id     Gate   Top3    Position  Column
     │ Int64  Int64  String  String    Bool
─────┼────────────────────────────────────────
   1 │     1    590  Hawks   1st        false
   2 │     2    590  Hawks   2nd         true
   3 │     3    590  Jazz    3rd         true
   4 │     4    590  Heat    2nd        false
   5 │     5    590  Heat    3rd        false
```
What has happened here? You constructed `Gate` as `rand(50:15:3000)`, and you probably intended for that to pick a different random number between 50 and 3,000 for each row. Instead, `rand(50:15:3000)` draws a single random number:
```julia
julia> rand(50:15:3000)
2690
```
and the `DataFrame` constructor will fill the entire column with this one number. You therefore have a constant column, which will by construction be multicollinear with the intercept that GLM adds by default to each regression model.
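The fix is to pass the number of draws as a second argument to `rand` — a minimal sketch (the variable name `gate` is just illustrative):

```julia
# rand(range, n) draws n independent values from the range,
# giving one value per row instead of a single repeated constant
gate = rand(50:15:3000, 50)

length(gate)  # 50
```

With 50 independent draws the column is almost surely no longer constant, so the multicollinearity with the intercept goes away.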
This isn’t your only problem, though: in the regression formula you’ve written, you transform your categorical variables into a set of one-hot encoded columns and then include all of those columns in your regression model. This again leads by definition to multicollinearity: the column sum `Jazz + Heat + Hawks` will be 1 in every row, so again constant and multicollinear with the intercept. If this isn’t clear to you, I again encourage you to read introductory texts on regression modelling.
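You can check this directly — a sketch with the one-hot matrix built by hand (not with your `one_hot` function), using the team names from your example:

```julia
teams = ["Jazz", "Heat", "Hawks"]
top3  = rand(teams, 10)  # a made-up Top3 column

# One 0/1 indicator column per team
onehot = [t == name for t in top3, name in teams]

# Every row sums to exactly 1, i.e. the three columns add up to
# the intercept's constant column of ones: perfect multicollinearity
all(sum(onehot, dims=2) .== 1)  # true
```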
This also goes a long way toward answering the question in your second post: when you include a categorical variable like `Position` in your regression model via `@formula`, GLM will automatically contrast code the variable for you. You can think of this as the same thing you’re trying to achieve with your `one_hot` function, except that one of the categories (the “base level”) is automatically dropped from the regression to avoid multicollinearity. So if you have categorical variables, there’s no need to do any one-hot encoding: just include the column in your regression model and GLM will contrast code it for you. The relevant documentation is here: Contrast coding categorical variables · StatsModels.jl
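To make the dropped base level concrete, here is a base-Julia sketch of what default dummy coding produces (the level names are assumptions taken from your example; GLM/StatsModels do this for you, so you would never write this by hand):

```julia
ranks    = ["1st", "2nd", "3rd"]  # assumed levels, in order
position = ["1st", "2nd", "2nd", "3rd", "1st"]

# Dummy coding: one indicator column per level EXCEPT the base
# level ("1st"), which is absorbed into the intercept
dummies = [p == lvl for p in position, lvl in ranks[2:end]]

size(dummies)  # (5, 2): only two columns for three levels
```

A row of all zeros means "base level", so the three categories are fully encoded by two columns plus the intercept, and no column combination is constant.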