PosDefException: matrix is not positive definite; Cholesky factorization failed

Hello Julians:

I am encountering the issue above while
building a logistic regression model
using the GLM package.

I came across a reply from @tim.holy
that covered some functions that could
potentially address this issue HERE.

I am not sure what to consider when
attempting to apply the different
PositiveFactorizations.jl functions.
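
For reference, this is roughly how I understand those functions
are called on a plain matrix (a sketch based on the package
README; I am not sure how to plug this into the GLM fit itself):

using LinearAlgebra, PositiveFactorizations

A = [2.0 1.0; 1.0 -1.0]      # symmetric, but not positive definite
# cholesky(A) throws PosDefException; the Positive variant still factorizes
F = cholesky(Positive, A)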

My DataFrame has the following structure/content:

using DataFrames

Teams = ["Jazz", "Heat", "Hawks"]
Rank = ["1st", "2nd", "3rd"]
Outcome = ["Win", "Loss"]

# Make sure to add the row parameter for EACH attribute (i.e. 50)
Season = DataFrame(Id = 1:50, Gate = rand(50:15:3000, 50),
                   Top3 = rand(Teams, 50),
                   Position = rand(Rank, 50),
                   Column = rand(Outcome .== "Win", 50))

I then ran the _onehot function from:

begin
    function _onehot(df, symb)
        copy = df
        for c in unique(copy[!, symb])
            copy[!, Symbol(c)] = copy[!, symb] .== c
        end
        return copy
    end
end
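
I call it on each categorical column, roughly like this (a sketch;
the new Boolean columns are named after the unique values in each
column, and since copy = df only aliases the frame, Season is
modified in place):

Season = _onehot(Season, :Top3)      # adds Boolean columns :Jazz, :Heat, :Hawks
Season = _onehot(Season, :Position)  # adds Boolean columns for "1st", "2nd", "3rd"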

Then, when I attempted to perform the logistic regression build
via:

fm = @formula(Column ~ Top3 + Position + Gate + Jazz + Heat + Hawks + 1st + 2nd + 3rd +
              Win + Loss)
logit = glm(fm, train, Binomial(), ProbitLink())

I get the error in the subject line.

YES! I understand that not using the encoded columns would
yield a result. But I am curious what I need to do to build a
logistic regression WITH the encoded column values.

As a follow-up to the post:

Something strange (to me) happens when I
run:

fm = @formula(Column ~ Top3 + Position + Gate)
logit = glm(fm, train, Binomial(), ProbitLink())

I get back:

(Intercept)
Top3: Jazz
Top3: Heat
Position: 2nd
Position: 3rd
Gate

You will notice that ‘Top3’ is missing ‘Hawks’ (at index 3)
and that ‘Position’ is missing ‘1st’ (at index 1).

Am I missing something?

There are quite a few issues in your code, and I would strongly echo Peter’s suggestion from one of your previous threads that you start by reading the Julia documentation and maybe specifically for your use case Bogumil’s excellent DataFrames tutorial.

You might also want to read up on regression modelling, as your problem here isn’t really Julia related, but a design problem with how you build your regression model. There are plenty of resources out there; a recent one I liked is Regression and Other Stories by Gelman et al.

Now what’s the problem with what you are doing here? Consider your DataFrame:

julia> Season = DataFrame(Id = 1:50, Gate = rand(50:15:3000),
                          Top3 = rand(Teams, 50),
                          Position = rand(Rank, 50),
                          Column = rand(Outcome .== "Win", 50))
50×5 DataFrame
 Row │ Id     Gate   Top3    Position  Column 
     │ Int64  Int64  String  String    Bool   
─────┼────────────────────────────────────────
   1 │     1    590  Hawks   1st        false
   2 │     2    590  Hawks   2nd         true
   3 │     3    590  Jazz    3rd         true
   4 │     4    590  Heat    2nd        false
   5 │     5    590  Heat    3rd        false

What has happened here? You constructed Gate as rand(50:15:3000), and you probably intended for that to pick a different random number between 50 and 3,000 for each row. Instead, rand(50:15:3000) will draw a single random number:

julia> rand(50:15:3000)
2690

and the DataFrame constructor will fill an entire column with this one number. You therefore have a column which is constant, which will by construction be multicollinear with the intercept that GLM adds by default to each regression model.
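
To make this concrete, here is a small sketch of the design matrix GLM builds: an intercept column of ones next to a constant Gate column is rank deficient, which is exactly what makes the Cholesky factorization fail:

julia> using LinearAlgebra

julia> X = [ones(3) fill(590.0, 3)]  # intercept column next to a constant Gate column
3×2 Matrix{Float64}:
 1.0  590.0
 1.0  590.0
 1.0  590.0

julia> rank(X)
1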

This isn’t your only problem though: in the regression formula you’ve written, you transform your categorical variables into a set of one-hot encoded columns and then include all of those columns in your regression model. This will again lead, by definition, to multicollinearity: the column sum Jazz + Heat + Hawks is 1 in every row, so it is again constant and multicollinear with the intercept. If this isn’t clear to you, I again encourage you to read introductory texts on regression modelling.
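
You can verify this on your one-hot encoded frame, something along these lines (a sketch, assuming the Boolean columns your _onehot call added to Season):

julia> all(Season.Jazz .+ Season.Heat .+ Season.Hawks .== 1)  # exactly one team dummy per row
true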

This also goes a long way to answering the question in your second post: when you include a categorical variable like Position in your regression model via @formula, GLM will automatically contrast code the variable for you, which you can think of as the same thing you’re trying to achieve with your _onehot function, while automatically dropping one of the categories (the “base level”) from the regression to avoid multicollinearity. So if you have categorical variables, there’s no need to do any one-hot encoding: just include the column in your regression model and GLM will contrast code it for you. The relevant documentation is here: Contrast coding categorical variables · StatsModels.jl
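
If you do want control over which level gets dropped, you can pass a contrasts dictionary when fitting. Here is a sketch, assuming your train frame (DummyCoding and its base keyword come from StatsModels, and glm forwards the contrasts keyword):

using GLM, StatsModels

fm = @formula(Column ~ Top3 + Position + Gate)
logit = glm(fm, train, Binomial(), ProbitLink(),
            contrasts = Dict(:Top3 => DummyCoding(base = "Hawks"),
                             :Position => DummyCoding(base = "1st")))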


@nilshg Thank you for your explanation and
resources.

  1. @nilshg – You are correct about the duplicate
    array values; I made the adjustment to rand(50:15:3000, 50),
    as I indicated to @pdeffebach below.

  2. I understand that adding the encoded columns will
    create multi-collinearity (a relationship between
    explanatory variables) that could confound results. I
    was following a tutorial (from October 2020 by Kabir) on
    Logistic Regression with Julia.

  3. I attempted both the one_hot and non-one_hot
    approaches. The LogReg model was built in both cases;
    I did NOT use the encoded columns to build the model
    in either case, but I still got the same index-skipping
    issue I mentioned before.

Potential solutions may involve adjusting the ‘Cholesky factorization’
definition using the @tim.holy approach. From the StatsModels.jl
source you shared, could you provide a MWE (in context)? @nilshg

Please run the code you supply, confirm that it works, and carefully inspect the results. It’s very easy to see what Nils is referring to:

julia> df = DataFrame(a = [1, 2, 3], b = rand(1:100:1000))
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1    901
   2 │     2    901
   3 │     3    901
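
Compare that with passing the number of draws explicitly, which is presumably what you intended:

julia> df = DataFrame(a = [1, 2, 3], b = rand(1:100:1000, 3));  # one independent draw of b per row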

@pdeffebach

I did not mention that I am using Pluto
and Julia v1.6.0.

The only discrepancy is that in the
example above I provided:

Gate = rand(50:15:3000)

In my original code I added a row
argument as:

Gate = rand(50:15:3000, 50)

Is this sufficient for an MWE
data frame?

You still haven’t provided a full MWE that gives the error you described. But here is an MWE that does, along with a solution:

julia> function _onehot(df,symb)
           copy = df
           for c in unique(copy[!,symb])
               copy[!,Symbol(c)] = copy[!,symb] .== c
           end
           return(copy)
       end;

julia> begin
       using DataFrames, Chain, GLM
       teams = ["Jazz", "Heat", "Hawks"]
       rank = ["first", "second", "third"]
       outcome = [true, false]
       df = DataFrame(Id = 1:50, team = rand(teams, 50), rank = rand(rank, 50), outcome = rand(outcome, 50))
       df2 = @chain df begin
           _onehot(:team)
           _onehot(:rank)
       end
       fm_bad = @formula(outcome ~ Jazz + Heat + Hawks + first + second + third)
       # will fail, you include too many dummy variables
       # logit_bad = glm(fm_bad, df2, Binomial(), ProbitLink())
       fm_good1 = @formula(outcome ~ Jazz + Heat + second + third)
       # will work, excluding one dummy from each
       logit_good1 = glm(fm_good1, df2, Binomial(),ProbitLink())
       fm_good2 = @formula(outcome ~ team + rank)
       # even better, GLM handles the collinearity
       logit_good2 = glm(fm_good2, df2, Binomial(), ProbitLink())
       end
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, ProbitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

outcome ~ 1 + team + rank

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      z  Pr(>|z|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)    0.128893    0.343977   0.37    0.7079   -0.54529  0.803075
team: Heat    -0.94062     0.511527  -1.84    0.0659   -1.94319  0.0619533
team: Jazz    -0.432373    0.465327  -0.93    0.3528   -1.3444   0.479652
rank: second  -0.502133    0.474511  -1.06    0.2900   -1.43216  0.427891
rank: third   -0.217286    0.464381  -0.47    0.6399   -1.12746  0.692884
──────────────────────────────────────────────────────────────────────────

@pdeffebach Thank you.

Follow-Up Questions:

What is the threshold for ‘too many dummy variables’?

Why do the ‘good’ fm and logit variables ‘exclude one
dummy variable from each’?

Maybe this link will answer your questions.

This doesn’t have anything to do with Julia, but rather with statistics.

@pdeffebach – dummy variable trap.

You dropped one dummy from each to avoid multicollinearity.
You could have also dropped the intercept.
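
A sketch of that alternative using the df2 from your MWE (the 0 drops
the intercept, which lets all three team dummies stay in; only one of
the two categorical variables can keep its full set of dummies, and
this assumes every team/rank combination shows up in the random data):

fm_noint = @formula(outcome ~ 0 + Jazz + Heat + Hawks + second + third)
logit_noint = glm(fm_noint, df2, Binomial(), ProbitLink())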

Thanks for sharing.