PosDefException: matrix is not positive definite; Cholesky factorization failed

There are quite a few issues in your code, and I would strongly echo Peter’s suggestion from one of your previous threads that you start by reading the Julia documentation and maybe specifically for your use case Bogumil’s excellent DataFrames tutorial.

You might also want to read up on regression modelling, as your problem here isn’t really Julia related, but a design problem with how you build your regression model. There are plenty of resources out there; a recent one I liked is Regression and Other Stories by Gelman et al.

Now what’s the problem with what you are doing here? Consider your DataFrame:

julia> Season = DataFrame(Id = 1:50, Gate = rand(50:15:3000),
                          Top3 = rand(Teams, 50),
                          Position = rand(Rank, 50),
                          Column = rand(Outcome .== "Win", 50))
50×5 DataFrame
 Row │ Id     Gate   Top3    Position  Column 
     │ Int64  Int64  String  String    Bool   
─────┼────────────────────────────────────────
   1 │     1    590  Hawks   1st        false
   2 │     2    590  Hawks   2nd         true
   3 │     3    590  Jazz    3rd         true
   4 │     4    590  Heat    2nd        false
   5 │     5    590  Heat    3rd        false

What has happened here? You constructed Gate as rand(50:15:3000), and you probably intended for that to pick a different random number between 50 and 3,000 for each row. Instead, rand(50:15:3000) will draw a single random number:

julia> rand(50:15:3000)
2690

and the DataFrame constructor will fill an entire column with this one number. You therefore have a column which is a constant, which will by construction be multicollinear with the intercept that GLM adds by default to each regression model.
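If a different draw per row was the intent, the fix is to pass the number of draws as a second argument to rand (a minimal sketch, assuming 50 rows as in your DataFrame):

julia> gate = rand(50:15:3000, 50);   # 50 independent draws from 50, 65, ..., 2990

julia> length(gate)
50

julia> all(in(50:15:3000), gate)   # every draw comes from the range
true

With 50 distinct-ish values the column is no longer constant, and the collinearity with the intercept goes away.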

This isn’t your only problem though: in the regression formula you’ve written, you are transforming your categorical variables into a set of one-hot encoded columns, and then you include all of those columns in your regression model. This again leads by definition to multicollinearity: the column combination Jazz + Heat + Hawks will be 1 in each row, so again a constant column that is multicollinear with the intercept. If this isn’t clear to you I again encourage you to read introductory texts on regression modelling.
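You can see this in base Julia without any packages: the full set of dummy columns always sums to a vector of ones, i.e. a constant column (a toy sketch with made-up team names, not your actual data):

julia> teams = ["Hawks", "Jazz", "Heat", "Hawks", "Heat"];

julia> dummies = [teams .== level for level in unique(teams)];  # one 0/1 column per level

julia> sum(dummies) == ones(Int, length(teams))   # element-wise sum: exactly 1 per row
true

That element-wise sum being constant is precisely the linear dependence with the intercept that makes the Cholesky factorization fail.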

This also goes a long way to answering the question in your second post: when you include a categorical variable like Position in your regression model via @formula, GLM will automatically contrast code the variable for you, which you can think of as the same thing you’re trying to achieve with your one_hot function, while automatically dropping one of the categories (the “base level”) from the regression to avoid multicollinearity. So if you have categorical variables, there’s no need to do any one-hot encoding; just include the column in your regression model and GLM will contrast code it for you. The relevant documentation is here: Contrast coding categorical variables · StatsModels.jl
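What dropping the base level buys you can be mimicked in base Julia (again a toy sketch, not GLM’s actual implementation): keep a dummy for every level except the first, and the remaining columns no longer sum to a constant.

julia> positions = ["1st", "2nd", "3rd", "2nd", "3rd"];

julia> levels = unique(positions);   # "1st" acts as the base level

julia> dummies = [positions .== level for level in levels[2:end]];  # drop the base level

julia> sum(dummies) == ones(Int, length(positions))   # no longer a constant column
false

Rows at the base level are all zeros across the remaining dummies, so the linear dependence with the intercept is broken, which is exactly why GLM’s default dummy coding avoids the PosDefException.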
