There are quite a few issues in your code, and I would strongly echo Peter’s suggestion from one of your previous threads that you start by reading the Julia documentation and maybe specifically for your use case Bogumil’s excellent DataFrames tutorial.
You might also want to read up on regression modelling, as your problem here isn’t really Julia-related but a design problem with how you build your regression model. There are plenty of resources out there; a recent one I liked is Regression and Other Stories by Gelman et al.
Now what’s the problem with what you are doing here? Consider your DataFrame:
```julia
julia> Season = DataFrame(Id = 1:50, Gate = rand(50:15:3000),
                          Top3 = rand(Teams, 50),
                          Position = rand(Rank, 50),
                          Column = rand(Outcome .== "Win", 50))
50×5 DataFrame
 Row │ Id     Gate   Top3    Position  Column
     │ Int64  Int64  String  String    Bool
─────┼────────────────────────────────────────
   1 │     1    590  Hawks   1st        false
   2 │     2    590  Hawks   2nd         true
   3 │     3    590  Jazz    3rd         true
   4 │     4    590  Heat    2nd        false
   5 │     5    590  Heat    3rd        false
```
What has happened here? You constructed `Gate` as `rand(50:15:3000)`, and you probably intended for that to pick a different random number between 50 and 3,000 for each row. Instead, `rand(50:15:3000)` draws a single random number:
```julia
julia> rand(50:15:3000)
2690
```
and the `DataFrame` constructor will fill the entire column with this one number. You therefore have a constant column, which will by construction be multicollinear with the intercept that GLM adds by default to each regression model.
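The fix is to pass the number of draws as a second argument to `rand` — a minimal sketch (the variable name `gate` is just illustrative):

```julia
# rand(range, n) draws n independent values from the range,
# giving one value per row instead of a single repeated constant
gate = rand(50:15:3000, 50)

length(gate)  # 50
```

With 50 independent draws the column is almost surely no longer constant, so the multicollinearity with the intercept goes away.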
This isn’t your only problem, though: in the regression formula you’ve written, you transform your categorical variables into a set of one-hot encoded columns and then include all of those columns in your regression model. This again leads by definition to multicollinearity: the column sum `Jazz + Heat + Hawks` will be 1 in every row, so again constant and multicollinear with the intercept. If this isn’t clear to you, I again encourage you to read introductory texts on regression modelling.
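You can check this directly — a sketch with the one-hot matrix built by hand (not with your `one_hot` function), using the team names from your example:

```julia
teams = ["Jazz", "Heat", "Hawks"]
top3  = rand(teams, 10)  # a made-up Top3 column

# One 0/1 indicator column per team
onehot = [t == name for t in top3, name in teams]

# Every row sums to exactly 1, i.e. the three columns add up to
# the intercept's constant column of ones: perfect multicollinearity
all(sum(onehot, dims=2) .== 1)  # true
```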
This also goes a long way toward answering the question in your second post: when you include a categorical variable like `Position` in your regression model via `@formula`, GLM will automatically contrast code the variable for you. You can think of this as the same thing you’re trying to achieve with your `one_hot` function, except that one of the categories (the “base level”) is automatically dropped from the regression to avoid multicollinearity. So if you have categorical variables, there’s no need to do any one-hot encoding: just include the column in your regression model and GLM will contrast code it for you. The relevant documentation is here: Contrast coding categorical variables · StatsModels.jl
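To make the dropped base level concrete, here is a base-Julia sketch of what default dummy coding produces (the level names are assumptions taken from your example; GLM/StatsModels do this for you, so you would never write this by hand):

```julia
ranks    = ["1st", "2nd", "3rd"]  # assumed levels, in order
position = ["1st", "2nd", "2nd", "3rd", "1st"]

# Dummy coding: one indicator column per level EXCEPT the base
# level ("1st"), which is absorbed into the intercept
dummies = [p == lvl for p in position, lvl in ranks[2:end]]

size(dummies)  # (5, 2): only two columns for three levels
```

A row of all zeros means "base level", so the three categories are fully encoded by two columns plus the intercept, and no column combination is constant.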