PosDefException: matrix is not positive definite; Cholesky factorization failed

Hello Julians:

I am encountering the issue above while
building a logistic regression model
using the GLM package.

I came across a reply from @tim.holy
that covered some functions that could
potentially address this issue HERE.

I am not sure what to consider when
attempting to apply the different
PositiveFactorizations.jl functions.
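
For reference, this is roughly how I understand those functions
are called on a plain matrix (a sketch based on the package
README; I am not sure how to plug this into the GLM fit itself):

using LinearAlgebra, PositiveFactorizations

A = [2.0 1.0; 1.0 -1.0]      # symmetric, but not positive definite
# cholesky(A) throws PosDefException; the Positive variant still factorizes
F = cholesky(Positive, A)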

My DataFrame has the following structure/content:

using DataFrames

Teams = ["Jazz", "Heat", "Hawks"]
Rank = ["1st", "2nd", "3rd"]
Outcome = ["Win", "Loss"]

# Make sure to add the row parameter for EACH attribute (i.e. 50)
Season = DataFrame(Id = 1:50, Gate = rand(50:15:3000, 50),
                   Top3 = rand(Teams, 50),
                   Position = rand(Rank, 50),
                   Column = rand(Outcome .== "Win", 50))

I then ran the _onehot function from:

begin
    function _onehot(df, symb)
        copy = df
        for c in unique(copy[!, symb])
            copy[!, Symbol(c)] = copy[!, symb] .== c
        end
        return copy
    end
end
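
I call it on each categorical column, roughly like this (a sketch;
the new Boolean columns are named after the unique values in each
column, and since copy = df only aliases the frame, Season is
modified in place):

Season = _onehot(Season, :Top3)      # adds Boolean columns :Jazz, :Heat, :Hawks
Season = _onehot(Season, :Position)  # adds Boolean columns for "1st", "2nd", "3rd"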

Then, when I attempted to perform the logistic regression build
via:

fm = @formula(Column ~ Top3 + Position + Gate + Jazz + Heat + Hawks + 1st + 2nd + 3rd +
              Win + Loss)
logit = glm(fm, train, Binomial(), ProbitLink())

I get the error in the subject line.

YES! I understand that not using the encoded columns would
yield a result. But I am curious what I need to do to build a
logistic regression WITH the encoded column values.

As a follow-up to the post:

Something strange (to me) happens when I
run:

fm = @formula(Column ~ Top3 + Position + Gate)
logit = glm(fm, train, Binomial(), ProbitLink())

I get back:

(Intercept)
Top3: Jazz
Top3: Heat
Position: 2nd
Position: 3rd
Gate

You will notice that ‘Top3’ is missing ‘Hawks’ (at index 3)
and that ‘Position’ is missing ‘1st’ (at index 1).

Am I missing something?

There are quite a few issues in your code, and I would strongly echo Peter’s suggestion from one of your previous threads that you start by reading the Julia documentation and maybe specifically for your use case Bogumil’s excellent DataFrames tutorial.

You might also want to read up on regression modelling, as your problem here isn’t really Julia related, but a design problem with how you build your regression model. There are plenty of resources out there; a recent one I liked is Regression and Other Stories by Gelman et al.

Now what’s the problem with what you are doing here? Consider your DataFrame:

julia> Season = DataFrame(Id = 1:50, Gate = rand(50:15:3000),
                          Top3 = rand(Teams, 50),
                          Position = rand(Rank, 50),
                          Column = rand(Outcome .== "Win", 50))
50×5 DataFrame
 Row │ Id     Gate   Top3    Position  Column 
     │ Int64  Int64  String  String    Bool   
─────┼────────────────────────────────────────
   1 │     1    590  Hawks   1st        false
   2 │     2    590  Hawks   2nd         true
   3 │     3    590  Jazz    3rd         true
   4 │     4    590  Heat    2nd        false
   5 │     5    590  Heat    3rd        false

What has happened here? You constructed Gate as rand(50:15:3000), and you probably intended for that to pick a different random number between 50 and 3,000 for each row. Instead, rand(50:15:3000) will draw a single random number:

julia> rand(50:15:3000)
2690

and the DataFrame constructor will fill an entire column with this one number. You therefore have a column which is constant, which will by construction be multicollinear with the intercept that GLM adds by default to each regression model.
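
To make this concrete, here is a small sketch of the design matrix GLM builds: an intercept column of ones next to a constant Gate column is rank deficient, which is exactly what makes the Cholesky factorization fail:

julia> using LinearAlgebra

julia> X = [ones(3) fill(590.0, 3)]  # intercept column next to a constant Gate column
3×2 Matrix{Float64}:
 1.0  590.0
 1.0  590.0
 1.0  590.0

julia> rank(X)
1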

This isn’t your only problem though: in the regression formula you’ve written, you transform your categorical variables into a set of one-hot encoded columns and then include all of those columns in your regression model. This will again lead, by definition, to multicollinearity: the column sum Jazz + Heat + Hawks is 1 in every row, so it is again constant and multicollinear with the intercept. If this isn’t clear to you, I again encourage you to read introductory texts on regression modelling.
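
You can verify this on your one-hot encoded frame, something along these lines (a sketch, assuming the Boolean columns your _onehot call added to Season):

julia> all(Season.Jazz .+ Season.Heat .+ Season.Hawks .== 1)  # exactly one team dummy per row
true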

This also goes a long way to answering the question in your second post: when you include a categorical variable like Position in your regression model via @formula, GLM will automatically contrast code the variable for you, which you can think of as the same thing you’re trying to achieve with your _onehot function, while automatically dropping one of the categories (the “base level”) from the regression to avoid multicollinearity. So if you have categorical variables, there’s no need to do any one-hot encoding: just include the column in your regression model and GLM will contrast code it for you. The relevant documentation is here: Contrast coding categorical variables · StatsModels.jl
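
If you do want control over which level gets dropped, you can pass a contrasts dictionary when fitting. Here is a sketch, assuming your train frame (DummyCoding and its base keyword come from StatsModels, and glm forwards the contrasts keyword):

using GLM, StatsModels

fm = @formula(Column ~ Top3 + Position + Gate)
logit = glm(fm, train, Binomial(), ProbitLink(),
            contrasts = Dict(:Top3 => DummyCoding(base = "Hawks"),
                             :Position => DummyCoding(base = "1st")))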


@nilshg Thank you for your explanation and
resources.

  1. @nilshg – You are correct about the duplicate
    array values; I made the adjustment to rand(50:15:3000, 50),
    as I indicated to @pdeffebach below.

  2. I understand that adding the encoded columns will
    create multi-collinearity (a relationship between
    explanatory variables) that could confound results. I
    was following a tutorial (from October 2020 by Kabir) on
    Logistic Regression with Julia.

  3. I attempted both the one_hot and non-one_hot
    approaches. The LogReg model was built in both cases;
    I did NOT use the encoded columns to build the model
    in either case, but I still got the same index-skipping
    issue I mentioned before.

Potential solutions may involve adjusting the ‘Cholesky factorization’
definition using the @tim.holy approach. From the StatsModels.jl
source you shared, could you provide a MWE (in context)? @nilshg

Please run the code you supply, confirm that it works, and carefully inspect the results. It’s very easy to see what Nils is referring to:

julia> df = DataFrame(a = [1, 2, 3], b = rand(1:100:1000))
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1    901
   2 │     2    901
   3 │     3    901
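
Compare that with passing the number of draws explicitly, which is presumably what you intended:

julia> df = DataFrame(a = [1, 2, 3], b = rand(1:100:1000, 3));  # one independent draw of b per row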

@pdeffebach

I did not mention that I am using Pluto
and Julia v1.6.0.

The only discrepancy is that in the
example above I provided:

Gate = rand(50:15:3000)

In my original code I added a row
argument as:

Gate = rand(50:15:3000, 50)

Is this sufficient for an MWE
data frame?

You still haven’t provided a full MWE that gives the error you described. But here is an MWE that does, along with a solution:

julia> function _onehot(df,symb)
           copy = df
           for c in unique(copy[!,symb])
               copy[!,Symbol(c)] = copy[!,symb] .== c
           end
           return(copy)
       end;

julia> begin
       using DataFrames, Chain, GLM
       teams = ["Jazz", "Heat", "Hawks"]
       rank = ["first", "second", "third"]
       outcome = [true, false]
       df = DataFrame(Id = 1:50, team = rand(teams, 50), rank = rand(rank, 50), outcome = rand(outcome, 50))
       df2 = @chain df begin
           _onehot(:team)
           _onehot(:rank)
       end
       fm_bad = @formula(outcome ~ Jazz + Heat + Hawks + first + second + third)
       # will fail, you include too many dummy variables
       # logit_bad = glm(fm_bad, df2, Binomial(), ProbitLink())
       fm_good1 = @formula(outcome ~ Jazz + Heat + second + third)
       # will work, excluding one dummy from each
       logit_good1 = glm(fm_good1, df2, Binomial(),ProbitLink())
       fm_good2 = @formula(outcome ~ team + rank)
       # even better, GLM handles the collinearity
       logit_good2 = glm(fm_good2, df2, Binomial(), ProbitLink())
       end
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, ProbitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

outcome ~ 1 + team + rank

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      z  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)    0.128893    0.343977   0.37    0.7079   -0.54529  0.803075
team: Heat    -0.94062     0.511527  -1.84    0.0659   -1.94319  0.0619533
team: Jazz    -0.432373    0.465327  -0.93    0.3528   -1.3444   0.479652
rank: second  -0.502133    0.474511  -1.06    0.2900   -1.43216  0.427891
rank: third   -0.217286    0.464381  -0.47    0.6399   -1.12746  0.692884
──────────────────────────────────────────────────────────────────────────

@pdeffebach Thank you.

Follow-Up Questions:

What is the threshold for ‘too many dummy variables’?

Why do the ‘good’ fm and logit variables ‘exclude one
dummy variable from each’?

Maybe this link will answer your questions.

This doesn’t have anything to do with Julia, but rather with statistics.

@pdeffebach – dummy variable trap.

You dropped one dummy from each to avoid multicollinearity.
You could have also dropped the intercept.
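
A sketch of that alternative using the df2 from your MWE (the 0 drops
the intercept, which lets all three team dummies stay in; only one of
the two categorical variables can keep its full set of dummies, and
this assumes every team/rank combination shows up in the random data):

fm_noint = @formula(outcome ~ 0 + Jazz + Heat + Hawks + second + third)
logit_noint = glm(fm_noint, df2, Binomial(), ProbitLink())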

Thanks for sharing.