Linear regression with a positive definite matrix in GLM.jl?

Hi

I’m trying to perform linear regression on a small dataset (these are just sample data, the real dataset will be much larger), but I’m running into a problem with the matrix not being positive definite, which it very well might be, but I wouldn’t know since I have no idea what that means :slight_smile: However, if I perform linear regression in R, it does not complain. So how can I work around this in Julia?

Julia:

julia> using GLM

julia> using CSV, DataFrames

julia> d = CSV.read("test.csv", DataFrame)
10×6 DataFrame
│ Row │ x1    │ x2      │ x3      │ x4 │ x5    │ y       │
├─────┼───────┼─────────┼─────────┼────┼───────┼─────────┤
│ 1   │ 187.7 │ 84.1768 │ 82.4464 │ 10 │ 76.81 │ 2714.95 │
│ 2   │ 187.7 │ 87.5767 │ 84.3167 │ 12 │ 76.81 │ 2754.11 │
│ 3   │ 187.7 │ 91.9267 │ 86.0667 │ 14 │ 76.81 │ 4480.88 │
│ 4   │ 187.7 │ 91.5833 │ 89.1167 │ 16 │ 76.81 │ 4760.54 │
│ 5   │ 187.7 │ 86.915  │ 92.35   │ 18 │ 76.81 │ 5205.43 │
│ 6   │ 178.4 │ 106.757 │ 81.1379 │ 10 │ 78.97 │ 2598.67 │
│ 7   │ 178.4 │ 108.783 │ 82.9833 │ 12 │ 78.97 │ 4295.96 │
│ 8   │ 178.4 │ 103.832 │ 86.5667 │ 14 │ 78.97 │ 4647.4  │
│ 9   │ 178.4 │ 96.5583 │ 90.2203 │ 16 │ 78.97 │ 4300.89 │
│ 10  │ 178.4 │ 93.7767 │ 92.9333 │ 18 │ 78.97 │ 4254.61 │

julia> lm(@formula(y ~ x1 + x2 + x3 + x4 + x5), d)
ERROR: PosDefException: matrix is not positive definite; Cholesky factorization failed.
Stacktrace:
...

R:

> d = read.csv("test.csv")
> lm(y ~ x1 + x2 + x3 + x4 + x5, d)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = d)

Coefficients:
(Intercept)           x1           x2           x3           x4           x5  
   -9234.73        91.71        64.45      -206.79       590.43           NA 

Someone else can give you a more technical answer on the meaning of positive definite.

My guess is that the error is because your x1 and x5 terms vary together, so you don’t want to use them as separate terms. Note the NA in the R results for x5. Try taking one of those terms out.
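The `NA` in the R output is the clue: the design matrix is rank deficient. A minimal sketch of how to check this yourself, using a toy stand-in for the data (substitute the real `d` directly):

```julia
using DataFrames, LinearAlgebra, Statistics

# Toy version of the data above: x5 takes exactly one value per value
# of x1, so the two columns carry the same information.
d = DataFrame(x1 = [187.7, 187.7, 178.4, 178.4],
              x2 = [84.2, 87.6, 106.8, 103.8],
              x5 = [76.81, 76.81, 78.97, 78.97])

X = [ones(nrow(d)) d.x1 d.x2 d.x5]  # design matrix with intercept
rank(X)          # 3, not 4: one column adds no new information
cor(d.x1, d.x5)  # -1.0: x1 and x5 are perfectly (anti)correlated
```

With the full dataset from the original post, the same check on the six-column design matrix should come out as 5 rather than 6, which is exactly what the Cholesky factorization chokes on.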


R and Stata silently omit collinear variables from regression output, and I like this feature a lot. It’s useful for heterogeneity analysis, where an interaction can duplicate something already in your vector of covariates.

It would be nice if Julia had this behavior, too.


As @tshort pointed out, this is due to the collinear variables x1 and x5. Is that a feature of your dataset, or something you inserted to test the package? There are ways to deal with collinearity, such as PCA, but if you’re just testing, try removing x5 or changing its values and you should obtain an answer.

It might be good to figure out how R handles this and consider making an enhancement to GLM?

FWIW, I don’t think that automagically doing something clever (automatic removal of dependent columns, regularization, shrinkage priors) is the right solution, as it just hides the problem from the user.

OTOH, an informative error message would go a long way.


Thanks for the quick replies! The two columns in question are weight and height, so this might sort itself out when I get more data, but it is probably better to just combine them into body mass index. Removing either column, or using BMI instead, makes it run perfectly.
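For reference, BMI is weight in kilograms divided by the square of height in metres. A sketch with hypothetical column names (`weight_kg` and `height_cm` are my invention, not the actual columns):

```julia
using DataFrames

# Hypothetical column names; substitute whatever the real columns are called.
d = DataFrame(weight_kg = [76.81, 78.97], height_cm = [187.7, 178.4])
d.bmi = d.weight_kg ./ (d.height_cm ./ 100) .^ 2
```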


At least we could improve the error message. For errors like this, it would be nice to identify the problematic columns and mention them in the message. Can you file an issue?

Being able to get the same behavior as R and Stata could also be useful, maybe as an option. Though an explicit error can be more user-friendly than setting coefficients to NA without any explanation.


We have this functionality already but, unfortunately, it is currently undocumented. It also seems to remove the intercept, which might be a little annoying.

julia> lm(@formula(y ~ x1 + x2 + x3 + x4 + x5), d, true)
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: y ~ 1 + x1 + x2 + x3 + x4 + x5

Coefficients:
             Estimate Std.Error    t value Pr(>|t|)
(Intercept)       0.0       NaN        NaN      NaN
x1            73.8966   47.6088    1.55216   0.1956
x2            64.4484   164.694   0.391322   0.7155
x3           -206.787    1058.1  -0.195432   0.8546
x4            590.431   1353.83   0.436118   0.6853
x5           -76.7016   1076.81 -0.0712303   0.9466
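For anyone reading this later: newer GLM.jl releases expose this as a documented `dropcollinear` keyword argument (and, if I recall correctly, it now defaults to `true`, so rank-deficient fits no longer error out). A sketch with toy collinear data:

```julia
using GLM, DataFrames

# x5 is an exact multiple of x1, so the design matrix is rank deficient.
d = DataFrame(x1 = [1.0, 2.0, 3.0, 4.0],
              x5 = [2.0, 4.0, 6.0, 8.0],
              y  = [1.1, 2.0, 2.9, 4.2])
m = lm(@formula(y ~ x1 + x5), d; dropcollinear = true)
coef(m)  # one redundant coefficient is pinned at 0.0 (possibly the intercept, as above)
```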

I have opened an issue suggesting to add what @nalimilan and @andreasnoack wrote.

I wouldn’t combine them into BMI. BMI is not the same information as weight and height: a 5-foot and a 6-foot person can have the same BMI, so if height alone were responsible for an effect, you would have lost it. Fat and muscular people can also have the same BMI.

What functionality?