Can someone replicate this GLM problem with linear regression on your computer?

Chen · April 27, 2021, 11:26am

The CSV file is here: https://github.com/JuliaStats/GLM.jl/files/6384056/Car-Training.csv

The GLM issue that I reported is here: https://github.com/JuliaStats/GLM.jl/issues/426

The code is very simple:

using GLM
using DataFrames
using CSV

data = CSV.read( "Car-Training.csv", DataFrame )
model = @formula( Price ~ Year + Mileage )
results = lm( model, data )

Could you check whether I did something wrong here? It seems so basic, and yet I got results that are clearly wrong:

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}

Price ~ 1 + Year + Mileage

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                  Coef.    Std. Error       t  Pr(>|t|)    Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.0        NaN           NaN       NaN     NaN          NaN
Year          8.17971      0.167978     48.70    <1e-73    7.84664      8.51278
Mileage      -0.0580528    0.00949846   -6.11    <1e-7    -0.0768865   -0.0392191
─────────────────────────────────────────────────────────────────────────────────

rafael.guerra · April 27, 2021, 12:03pm

Could this be a problem of input data normalization?
In any event subtracting 2000 from the input Year seems to unlock the issue.
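A minimal sketch of why the shift helps, assuming only the column names from the first post. The numbers are made-up toy values, not rows from Car-Training.csv, and the fit uses Julia's built-in least-squares solve rather than GLM:

```julia
using LinearAlgebra

# Hypothetical toy data for illustration only (NOT taken from Car-Training.csv).
year    = [2008.0, 2010.0, 2012.0, 2014.0, 2016.0, 2018.0]
mileage = [110.0, 60.0, 95.0, 40.0, 70.0, 20.0]
price   = [5.0, 8.0, 7.5, 11.0, 11.5, 16.0]

n = length(year)
X_raw     = hcat(ones(n), year, mileage)          # intercept, raw Year, Mileage
X_shifted = hcat(ones(n), year .- 2000, mileage)  # same, with Year - 2000

# A raw Year column is almost a multiple of the intercept column,
# which inflates the condition number of the design matrix.
println("cond(X_raw)     = ", cond(X_raw))
println("cond(X_shifted) = ", cond(X_shifted))

# QR-based least squares; the shift changes only the intercept, not the slopes.
beta_raw     = X_raw \ price
beta_shifted = X_shifted \ price
println("slopes (raw):     ", beta_raw[2:3])
println("slopes (shifted): ", beta_shifted[2:3])
```

On data like this the shifted matrix should be far better conditioned while the Year and Mileage slopes stay the same.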

Chen · April 27, 2021, 12:09pm

Thank you very much!

I wasn’t sure if I had become crazy, because the problem was so basic and yet there was this error.

I think normalization could be the culprit.

I’m teaching a course that uses basic regression and conducts simple forecasting. Subtracting 2000 will solve the problem. However, it also means that one has to remember the shift when forecasting, which complicates things unnecessarily.
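For what it is worth, the shift need not leak into forecasting. Writing the shifted fit with hypothetical coefficient names a', b and c (notation introduced here, not GLM output), expanding the bracket gives back a model in the raw Year:

```latex
\begin{aligned}
\text{Price} &= a' + b\,(\text{Year} - 2000) + c\,\text{Mileage} \\
             &= (a' - 2000\,b) + b\,\text{Year} + c\,\text{Mileage}
\end{aligned}
```

So the slopes are unchanged and only the intercept moves, by 2000 b. Forecasts from either form agree as long as Year is treated the same way in fitting and in prediction.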

And this is part of a take-home exam, so students will not welcome the extra complexity.

I hope that the GLM maintainers or someone with better Julia-fu can fix the issue and publish a new version of GLM, so that I can simply ask students to update the package. I think that would be a better and simpler solution.

Chen · April 27, 2021, 12:16pm

Linear regression is the most basic regression. Is there an alternative package that can perform linear regression?

rafael.guerra · April 27, 2021, 12:21pm

Check LsqFit.jl out.

Chen · April 27, 2021, 12:31pm

Thank you very much again!

LsqFit.jl is probably fine. However, one needs to specify x data, y data, and initial values.

I plan to wait a bit and see if a maintainer of GLM.jl can fix the issue. The take-home exam has a deadline that is in two weeks. I can wait a bit.

If GLM.jl cannot be fixed, then probably LsqFit.jl will have to be used, or I have to write a linear regression package myself.

rafael.guerra · April 27, 2021, 12:56pm

Uhm, it does not sound right to call LsqFit.jl a solution here; it is just a workaround.

I’m not an expert on numerical analysis, but there should be a best practice for normalizing the data (assuming this is the problem) when the input variables differ so much in scale, before throwing them into GLM (e.g., subtract the mean and divide by the standard deviation, or something like that).
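As a rough sketch of the "subtract the mean and divide by the standard deviation" idea, using only Julia's standard library. The data are made-up toy values (not from Car-Training.csv), and `standardize` is a helper name introduced here for illustration:

```julia
using Statistics, LinearAlgebra

# Hypothetical toy data for illustration only (NOT taken from Car-Training.csv).
year    = [2008.0, 2010.0, 2012.0, 2014.0, 2016.0, 2018.0]
mileage = [110.0, 60.0, 95.0, 40.0, 70.0, 20.0]
price   = [5.0, 8.0, 7.5, 11.0, 11.5, 16.0]

# z-score: subtract the mean, divide by the sample standard deviation.
standardize(v) = (v .- mean(v)) ./ std(v)

n = length(year)
X_raw = hcat(ones(n), year, mileage)
X_std = hcat(ones(n), standardize(year), standardize(mileage))

println("cond(X_raw) = ", cond(X_raw))
println("cond(X_std) = ", cond(X_std))

beta_raw = X_raw \ price   # QR-based least squares on the raw columns
beta_std = X_std \ price   # same fit on standardized columns

# A slope per standard deviation converts back to a slope per raw unit.
println("Year slope, raw fit:          ", beta_raw[2])
println("Year slope, back-transformed: ", beta_std[2] / std(year))
```

The price of standardizing is that coefficients are reported per standard deviation, so they have to be converted back before being read as "price change per year".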

Chen · April 27, 2021, 12:58pm

Thank you.

For the most elementary use of regression, I think the default is not to transform the data. The data are passed to the regression as they are, and results are produced.

The silly part is that I have been telling students that Julia is much better than Excel for basic data analysis. Now Excel runs this simple regression correctly, but Julia does not. The joke is on me.

rafael.guerra · April 27, 2021, 1:00pm

Please wait for the feedback from the GLM experts on this matter.

Chen · April 27, 2021, 1:08pm

Yeah!

pdeffebach · April 27, 2021, 1:15pm

The bug comes from the dropcollinear keyword argument. Pass dropcollinear=false as a keyword argument to lm and you will get the same results as R. I’m commenting on the issue as well and will explore this throughout the day.
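In code, the suggested workaround would look roughly like this. It is a sketch on a made-up toy DataFrame (not Car-Training.csv) that reuses the column names from the first post; with the real file you would pass the `data` read from the CSV instead:

```julia
using GLM, DataFrames

# Hypothetical toy data for illustration only (NOT taken from Car-Training.csv).
data = DataFrame(
    Price   = [5.0, 8.0, 7.5, 11.0, 11.5, 16.0],
    Year    = [2008, 2010, 2012, 2014, 2016, 2018],
    Mileage = [110.0, 60.0, 95.0, 40.0, 70.0, 20.0],
)

model = @formula(Price ~ Year + Mileage)

# dropcollinear=false disables the rank-deficiency handling that the post
# above identifies as the source of the zeroed intercept.
results = lm(model, data; dropcollinear=false)
println(results)
```

With the option off, a design matrix that really is rank deficient is no longer handled for you, so this is best treated as a workaround until the underlying issue is fixed.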

Chen · April 27, 2021, 1:24pm

This solved the issue. Thank you very much!

As a side comment, I’m not sure it is sane to have dropcollinear=true as the default. I have used a number of statistical software packages, and this is the first time I have seen such an option set by default.

pdeffebach · April 27, 2021, 1:26pm

It’s the default in Stata, at least.

Fwiw, I think this is a bug, possibly introduced by me. I don’t think that having dropcollinear=true should behave differently when X is full rank. So having this be the default shouldn’t cause this type of problem.
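Whether X really is full rank can be checked independently of GLM. A minimal sketch with the standard library, on made-up toy numbers rather than the rows of Car-Training.csv:

```julia
using LinearAlgebra

# Hypothetical toy data for illustration only (NOT taken from Car-Training.csv).
year    = [2008.0, 2010.0, 2012.0, 2014.0, 2016.0, 2018.0]
mileage = [110.0, 60.0, 95.0, 40.0, 70.0, 20.0]

# Design matrix for Price ~ 1 + Year + Mileage.
X = hcat(ones(length(year)), year, mileage)

# SVD-based numerical rank: full rank means it equals the number of columns.
println("rank(X) = ", rank(X), " with ", size(X, 2), " columns")

# Forming X'X roughly squares the condition number, so a factorization of X'X
# with a rank tolerance has much less headroom than one working on X itself.
println("cond(X)   = ", cond(X))
println("cond(X'X) = ", cond(X' * X))
```

If the rank equals the column count but the fitted model still reports a dropped coefficient, that would suggest the problem lies in the tolerance used during factorization rather than in the data. That reading is an assumption on my part, not something confirmed in this thread.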

EDIT: It’s also the default in R

Chen · April 27, 2021, 1:31pm

Good to know. I have used Stata on some datasets, but I didn’t know about this. Thanks!
