I’m trying to perform linear regression on a small dataset (these are just sample data; the real dataset will be much larger), but I’m running into a problem with the matrix not being positive definite. That may well be the case, but I wouldn’t know, since I have no idea what that means. However, if I perform linear regression in R, it does not complain. So how can I work around this in Julia?
Someone else can give you a more technical answer on the meaning of positive definite.
My guess is that the error is because your x1 and x5 terms vary together, so you don’t want to use them as separate terms. Note the NA in the R results for x5. Try taking one of those terms out.
R and Stata silently omit duplicate variables from regression output, and I like this feature a lot. It’s useful for heterogeneity analysis with something in your existing vector of covariates.
As @tshort pointed out, this is due to collinear variables X1 and X5. Is that a feature of your dataset or something you inserted to test the package? There are ways to deal with collinearity problems, such as PCA, but if you’re just testing, try removing X5, or change its values and you should obtain an answer.
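To make the collinearity point concrete: the normal equations of least squares involve the matrix X'X, and when one column of X is an exact multiple of another, X'X is singular and therefore not positive definite. The following is a minimal sketch in plain Python (not Julia, and not the package's actual code). The data and the helper names `gram` and `is_positive_definite` are made up for illustration, and the check uses exact fractions, which noisy floating-point data would not allow.

```python
from fractions import Fraction

def gram(X):
    """Return X'X for a matrix X given as a list of rows."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)]
            for i in range(p)]

def is_positive_definite(A):
    """Eliminate a symmetric matrix without pivoting, in exact arithmetic.

    A symmetric matrix is positive definite exactly when every pivot
    met during this elimination is strictly positive.
    """
    n = len(A)
    A = [[Fraction(v) for v in row] for row in A]
    for k in range(n):
        if A[k][k] <= 0:
            return False
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
    return True

# Hypothetical data: intercept, x1, and a third column.
# In X_bad the third column is exactly 2 * x1 (collinear).
X_bad = [[1, 1, 2], [1, 2, 4], [1, 3, 6], [1, 4, 8]]
# In X_ok the third column is not a combination of the other two.
X_ok = [[1, 1, 3], [1, 2, 1], [1, 3, 4], [1, 4, 2]]

print(is_positive_definite(gram(X_ok)))
print(is_positive_definite(gram(X_bad)))
```

In the collinear case the last pivot comes out as exactly zero, which is the point where a Cholesky-style factorization gives up.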
FWIW, I don’t think that automagically doing something clever (automatic removal of dependent columns, regularization, shrinkage priors) is the right solution, as it just hides the problem from the user.
OTOH, an informative error message would go a long way.
Thanks for the quick replies! The two columns in question are weight and height, so this might sort itself out when I get more data, but it is probably better to just combine these into body mass index. Removing either column, or using BMI instead, makes it run perfectly.
At least we could improve the error message. In case of errors like that, it would be nice to identify the problematic columns and mention them in the message. Can you file an issue?
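As a rough idea of what identifying the problematic columns could involve, here is a minimal sketch in plain Python. It is not how any existing package does it. The function name `dependent_columns`, the column names, and the data are all hypothetical. It uses exact arithmetic, so it only catches exact linear dependence; real data would need a tolerance-based method such as a pivoted QR decomposition.

```python
from fractions import Fraction

def dependent_columns(X, names):
    """Return the names of columns that are exact linear combinations
    of the columns to their left (X is a list of rows)."""
    basis = []    # (pivot_index, reduced_vector) for each independent column
    flagged = []
    for j, name in enumerate(names):
        v = [Fraction(row[j]) for row in X]
        # Subtract off the part of v explained by earlier columns.
        for piv, b in basis:
            if v[piv] != 0:
                f = v[piv] / b[piv]
                v = [vi - f * bi for vi, bi in zip(v, b)]
        # Position of the first nonzero entry left over, if any.
        piv = next((i for i, vi in enumerate(v) if vi != 0), None)
        if piv is None:
            flagged.append(name)      # nothing left: column is redundant
        else:
            basis.append((piv, v))
    return flagged

# Hypothetical data in which x5 is exactly 2 * x1.
names = ["intercept", "x1", "x2", "x5"]
X = [[1, 1, 3, 2],
     [1, 2, 1, 4],
     [1, 3, 4, 6],
     [1, 4, 2, 8]]

flagged = dependent_columns(X, names)
if flagged:
    print("Collinear columns:", ", ".join(flagged))
```

Which column gets flagged depends on column order: the sketch blames the later column of a dependent set, which is a design choice and not the only reasonable one.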
Being able to get the same behavior as R and Stata could also be useful, maybe as an option. Though an explicit error can be more user-friendly than setting coefficients to NA without any explanation.
We have this functionality already but, unfortunately, it is currently undocumented. It also seems to remove the intercept, which might be a little annoying.
I wouldn’t combine them into BMI. In that particular example, BMI is not the same as weight and height. A 5-foot person and a 6-foot person could have the same BMI, so if height alone were responsible for an effect, you would lose it. Fat and muscular people can also have the same BMI.
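A quick arithmetic illustration of that point, as a small Python sketch. It uses the standard definition of BMI (weight in kilograms divided by height in metres squared). The two people are invented, with heights of 1.5 m and 1.8 m standing in for roughly 5 and 6 feet.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in metres squared."""
    return weight_kg / height_m ** 2

# Two hypothetical people with clearly different heights.
shorter = {"height_m": 1.5, "weight_kg": 56.25}
taller = {"height_m": 1.8, "weight_kg": 81.0}

for person in (shorter, taller):
    value = bmi(person["weight_kg"], person["height_m"])
    print(person["height_m"], "m ->", round(value, 2))
```

Both come out at about 25, so a model that sees only BMI cannot tell these two people apart, even though their heights differ by 30 cm.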