Can someone reproduce this GLM.jl linear regression problem on their computer?

The CSV file is here: https://github.com/JuliaStats/GLM.jl/files/6384056/Car-Training.csv

The GLM issue that I reported is here: https://github.com/JuliaStats/GLM.jl/issues/426

The code is very simple:

using GLM
using DataFrames
using CSV

# read the CSV into a DataFrame and fit Price ~ Year + Mileage by OLS
data = CSV.read("Car-Training.csv", DataFrame)
model = @formula(Price ~ Year + Mileage)
results = lm(model, data)

Could you see if I did something wrong here? It seems so basic, and yet I get strange results that are clearly wrong:

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}

Price ~ 1 + Year + Mileage

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                  Coef.    Std. Error       t  Pr(>|t|)    Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.0        NaN           NaN       NaN     NaN          NaN
Year          8.17971      0.167978     48.70    <1e-73    7.84664      8.51278
Mileage      -0.0580528    0.00949846   -6.11    <1e-7    -0.0768865   -0.0392191
─────────────────────────────────────────────────────────────────────────────────

Could this be a problem with input data normalization?
In any event, subtracting 2000 from the input Year seems to work around the issue.
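
Concretely, something like this (YearShifted is just an illustrative column name; this assumes the data loaded in the first post):

# shift Year so its values are small relative to Mileage, then refit
data.YearShifted = data.Year .- 2000
results = lm(@formula(Price ~ YearShifted + Mileage), data)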


Thank you very much!

I wasn’t sure whether I was going crazy, because the problem is so basic and yet the error was there.

I think normalization could be the culprit.

I’m teaching a course that uses basic regression to do simple forecasting. Subtracting 2000 does solve the problem. However, it also means that the shift has to be remembered whenever one forecasts, which complicates the exercise unnecessarily.

And this is part of a take-home exam, so the extra complexity will not be very welcome to students.
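
To illustrate the extra bookkeeping: with the shifted column from the workaround above, every forecast has to repeat the same shift. A sketch (predict is GLM's standard prediction function; the new values here are made up):

# new data must use the same Year - 2000 shift as the training data
newdata = DataFrame(YearShifted = [2022 - 2000], Mileage = [30000.0])
predict(results, newdata)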

I hope that a GLM.jl maintainer, or someone with better Julia-fu, can fix the issue and publish a new version of GLM, so that I can simply ask students to update the package. That would be a better and simpler solution, I think.

Linear regression is the most basic kind of regression. Is there an alternative package that can perform linear regression?

Check out LsqFit.jl.

Thank you very much again!

LsqFit.jl would probably be fine. However, one needs to specify the x data, the y data, and initial parameter values.
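
Roughly like this, as far as I understand LsqFit's curve_fit API (the model function, starting values, and Float64 conversion are my own sketch):

using LsqFit

# linear predictor: p[1] + p[2]*Year + p[3]*Mileage
linmodel(X, p) = p[1] .+ X * p[2:3]

X = [data.Year data.Mileage]
fit = curve_fit(linmodel, X, Float64.(data.Price), [0.0, 0.0, 0.0])
fit.param  # estimated intercept and slopes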

I plan to wait a bit and see whether a maintainer of GLM.jl can fix the issue. The take-home exam deadline is two weeks away, so there is some time.

If GLM.jl cannot be fixed, then I will probably have to use LsqFit.jl, or write the linear regression myself.
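
For what it’s worth, ordinary least squares itself is nearly a one-liner with Julia’s backslash solver, so a hand-rolled fallback could look like this (no standard errors or p-values, of course):

# design matrix with an explicit intercept column; \ solves least squares via QR
X = [ones(nrow(data)) data.Year data.Mileage]
beta = X \ data.Price  # [intercept, Year coef, Mileage coef]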

Hmm, it just does not sound right for LsqFit.jl to be the solution here; it would only be a workaround.

I’m not an expert on numerical analysis, but there should be a best practice for normalizing the data (assuming that is the problem) when the input variables are on such different scales, before throwing them into GLM (e.g., subtract the mean and divide by the standard deviation, or something like that).
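
For instance, z-scoring the predictors before the fit would look something like this (my own sketch; the YearZ/MileageZ column names are made up):

using Statistics

# standardize each predictor: subtract the mean, divide by the standard deviation
data.YearZ = (data.Year .- mean(data.Year)) ./ std(data.Year)
data.MileageZ = (data.Mileage .- mean(data.Mileage)) ./ std(data.Mileage)
results_z = lm(@formula(Price ~ YearZ + MileageZ), data)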

Thank you.

For the most elementary use of regression, I think one does not transform the data by default. The data is thrown at the regression as-is, and results are produced.

The silly part is that I have been telling students that Julia is much better than Excel for basic data analysis. Now Excel runs this simple regression correctly, but Julia does not. The laugh is on me. :rofl:

Please wait for feedback from the GLM experts on this matter.

Yeah!

The bug comes from the dropcollinear keyword argument. Set dropcollinear=false as a keyword argument in lm and you will get the same results as R. I’m commenting on the issue as well and will explore this throughout the day.
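
Concretely, reusing the data and formula from the first post:

# disabling collinearity dropping restores the expected intercept estimate
results = lm(@formula(Price ~ Year + Mileage), data; dropcollinear=false)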


This solved the issue. Thank you very much!

As a side comment, I’m not sure it is a sane choice to make dropcollinear=true the default. I have used a number of statistical packages, and this is the first time I have seen such an option be the default.

It’s the default in Stata, at least.

FWIW, I think this is a bug, possibly introduced by me. dropcollinear=true should not behave differently when X is full rank, so having it as the default shouldn’t cause this type of problem.

EDIT: It’s also the default in R.


Good to know. I have used Stata on some datasets, but I didn’t know about that. Thanks!