Linear regression with a positive definite matrix in GLM.jl?

sebastianskejoe · September 6, 2018, 7:26pm

Hi

I’m trying to perform linear regression on a small dataset (these are just sample data, the real dataset will be much larger), but I’m running into a problem with the matrix not being positive definite, which it very well might be, but I wouldn’t know since I have no idea what that means However, if I perform linear regression in R, it does not complain. So how can I work around this in Julia?

Julia:

julia> using GLM

julia> using CSV

julia> d = CSV.read("test.csv")
10×6 DataFrame
│ Row │ x1    │ x2      │ x3      │ x4 │ x5    │ y       │
├─────┼───────┼─────────┼─────────┼────┼───────┼─────────┤
│ 1   │ 187.7 │ 84.1768 │ 82.4464 │ 10 │ 76.81 │ 2714.95 │
│ 2   │ 187.7 │ 87.5767 │ 84.3167 │ 12 │ 76.81 │ 2754.11 │
│ 3   │ 187.7 │ 91.9267 │ 86.0667 │ 14 │ 76.81 │ 4480.88 │
│ 4   │ 187.7 │ 91.5833 │ 89.1167 │ 16 │ 76.81 │ 4760.54 │
│ 5   │ 187.7 │ 86.915  │ 92.35   │ 18 │ 76.81 │ 5205.43 │
│ 6   │ 178.4 │ 106.757 │ 81.1379 │ 10 │ 78.97 │ 2598.67 │
│ 7   │ 178.4 │ 108.783 │ 82.9833 │ 12 │ 78.97 │ 4295.96 │
│ 8   │ 178.4 │ 103.832 │ 86.5667 │ 14 │ 78.97 │ 4647.4  │
│ 9   │ 178.4 │ 96.5583 │ 90.2203 │ 16 │ 78.97 │ 4300.89 │
│ 10  │ 178.4 │ 93.7767 │ 92.9333 │ 18 │ 78.97 │ 4254.61 │

julia> lm(@formula(y ~ x1 + x2 + x3 + x4 + x5), d)
ERROR: PosDefException: matrix is not positive definite; Cholesky factorization failed.
Stacktrace:
...

R:

> d = read.csv("test.csv")
> lm(y ~ x1 + x2 + x3 + x4 + x5, d)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = d)

Coefficients:
(Intercept)           x1           x2           x3           x4           x5  
   -9234.73        91.71        64.45      -206.79       590.43           NA

tshort · September 6, 2018, 8:22pm

Someone else can give you a more technical answer on the meaning of positive definite.

My guess is that the error is because your x1 and x5 terms vary together, so you don’t want to use them as separate terms. Note the NA in the R results for x5. Try taking one of those terms out.

pdeffebach · September 6, 2018, 9:22pm

R and Stata silently omit duplicate variables from regression output, and I like this feature a lot. It’s useful for heterogeneity analysis with something in your existing vector of covariates.

It would be nice if Julia had this behavior, too.

alejandromerchan · September 7, 2018, 12:05am

As @tshort pointed out, this is due to collinear variables X1 and X5. Is that a feature of your dataset or something you inserted to test the package? There are ways to deal with collinearity problems, such as PCA, but if you’re just testing, try removing X5, or change its values and you should obtain an answer.

tk3369 · September 7, 2018, 6:29am

It may be good to figure out how R does it and consider making an enhancement in GLM?

Tamas_Papp · September 7, 2018, 7:13am

FWIW, I don’t think that automagically doing something clever (automatic removal of dependent columns, regularization, shrinkage priors) is the right solution, as it just hides the problem from the user.

OTOH, an informative error message would go a long way.

sebastianskejoe · September 7, 2018, 7:35am

Thanks for the quick replies! The two columns in question are weight and height, so this might sort itself out, when I get more data, but it is probably better to just combine these into body mass index. Removing either column or using BMI instead, makes it run perfectly.

nalimilan · September 7, 2018, 7:51am

At least we could improve the error message. In case of errors like that, it would be nice to identify the problematic columns and mention then in the message. Can you file an issue?

Being able to get the same behavior as R and Stata could also be useful, maybe as an option. Though an explicit error can be more user-friendly than setting coefficients to NA without any explanation.

andreasnoack · September 7, 2018, 8:18am

We have this functionality already but, unfortunately, it is currently undocumented. It also seems to remove the intercept which might be a little annoying.

julia> lm(@formula(y ~ x1 + x2 + x3 + x4 + x5), df, true)
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2 + x3 + x4 + x5

Coefficients:
             Estimate Std.Error    t value Pr(>|t|)
(Intercept)       0.0       NaN        NaN      NaN
x1            73.8966   47.6088    1.55216   0.1956
x2            64.4484   164.694   0.391322   0.7155
x3           -206.787    1058.1  -0.195432   0.8546
x4            590.431   1353.83   0.436118   0.6853
x5           -76.7016   1076.81 -0.0712303   0.9466

sebastianskejoe · September 7, 2018, 11:37am

I have opened an issue suggesting to add what @nalimilan and @andreasnoack wrote.

sagearbor · January 31, 2019, 6:56pm

I wouldnt combine them into BMI. In that particular example, BMI is not the same as weight and height. For a 5 foot and 6 foot person could have the same BMI but if the height alone was responsible for an effect you would have lost that. Fat and muscular people can also have the same BMI.

Juan · February 1, 2019, 7:37pm

What functionality?

Topic		Replies	Views
Error in Regression but I don't think there is collinearity: "PosDefException: matrix is not positive definite; Cholesky factorization failed." New to Julia dataframes , glm	0	315	March 7, 2021
GLM.jl LogisticRegression errors: matrix is not positive definite; Cholesky factorization failed Statistics question , glm	14	5002	June 9, 2022
How to check if GLM will be able to run OLS beforehand? General Usage question , glm	8	507	November 1, 2020
PosDefException: matrix is not positive definite; Cholesky factorization failed Statistics	9	1634	July 30, 2021
Fail to do multiple linear regression using GLM Statistics regression , glm	7	1879	December 18, 2018

Linear regression with a positive definite matrix in GLM.jl?

Related topics