using GLM
using DataFrames
using CSV
data = CSV.read("Car-Training.csv", DataFrame)
model = @formula(Price ~ Year + Mileage)
results = lm(model, data)
Could you see if I did something wrong here? It seems so basic, and yet I got strange results that are wrong:
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}

Price ~ 1 + Year + Mileage

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error       t  Pr(>|t|)   Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.0         NaN          NaN        NaN   NaN          NaN
Year          8.17971       0.167978   48.70    <1e-73    7.84664     8.51278
Mileage      -0.0580528     0.00949846 -6.11    <1e-7    -0.0768865  -0.0392191
─────────────────────────────────────────────────────────────────────────────────
I wasn't sure whether I was going crazy, because the problem was so basic and yet there was this error.
I think normalization could be the culprit.
I'm teaching a course that uses basic regression for simple forecasting. Subtracting 2000 from Year solves the problem. However, it also means one has to remember that shift when forecasting, which unnecessarily complicates the issue.
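The workaround could look like the sketch below. The column names follow the original post, but the numbers are invented for illustration (the real data lives in Car-Training.csv); the key point is that the same shift has to be applied to any new observations at prediction time.

```julia
using GLM, DataFrames

# Invented data with the same columns as the post's CSV.
data = DataFrame(
    Year    = [2011, 2013, 2015, 2017, 2019],
    Mileage = [80_000, 60_000, 45_000, 30_000, 15_000],
    Price   = [6_000, 8_500, 11_000, 13_500, 17_000],
)

# Workaround: shift Year so its magnitude is closer to the other columns.
data.Year0 = data.Year .- 2000
results = lm(@formula(Price ~ Year0 + Mileage), data)

# Forecasting now requires the same shift on new data:
newcar = DataFrame(Year0 = [2021 - 2000], Mileage = [20_000])
predict(results, newcar)
```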
And this is part of a take-home exam, so students will not welcome the extra complexity.
I hope the GLM maintainers, or someone with better Julia-fu, can fix the issue and publish a newer version of GLM, so that I can simply ask students to update the package. That would be a better and simpler solution, I think.
Hmm, it just does not sound right that LsqFit.jl is a solution here; it is only a workaround.
I'm not an expert on numerical analysis, but there should be a best practice for data normalization (assuming that is the problem) when the input variables are on such different scales, before throwing them into GLM (e.g., subtract the mean and divide by the standard deviation, or something like that).
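The normalization mentioned above (z-scoring each predictor) could be sketched as follows. This is just an illustration with made-up numbers, not a GLM.jl recommendation; the `zscore` helper is defined here by hand.

```julia
using Statistics, GLM, DataFrames

# Z-score normalization: subtract the mean and divide by the standard
# deviation, so every predictor has mean 0 and standard deviation 1.
zscore(x) = (x .- mean(x)) ./ std(x)

# Invented data with the same columns as the post's CSV.
data = DataFrame(
    Year    = [2012.0, 2014.0, 2016.0, 2018.0, 2020.0],
    Mileage = [75_000.0, 55_000.0, 42_000.0, 28_000.0, 12_000.0],
    Price   = [6_500.0, 9_000.0, 11_500.0, 14_000.0, 17_500.0],
)

data.YearZ    = zscore(data.Year)
data.MileageZ = zscore(data.Mileage)

# Both predictors are now on the same scale before fitting.
results = lm(@formula(Price ~ YearZ + MileageZ), data)
```

The trade-off, as noted in the thread, is that coefficients and forecasts must then be interpreted on the standardized scale.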
For the most elementary use of regression, I think one does not transform the data by default. Data is thrown at the regression, without transformation, and results are produced.
The silly part is that I have been telling students that Julia is much better than Excel for basic data analysis. Now Excel runs the simple regression correctly, but Julia does not. The laugh is on me.
The bug comes from the dropcollinear keyword argument. Set dropcollinear=false as a keyword in lm and you will get the same results as R. I'm commenting on the issue as well and will explore this throughout the day.
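A minimal sketch of that fix, using a tiny made-up dataset with the same column names as the original post:

```julia
using GLM, DataFrames

# Invented numbers, same columns as the post's CSV.
data = DataFrame(
    Year    = [2010, 2012, 2014, 2016, 2018, 2020],
    Mileage = [90_000, 70_000, 55_000, 40_000, 25_000, 10_000],
    Price   = [5_000, 7_000, 9_500, 12_000, 15_000, 18_500],
)

# With dropcollinear=false the pivoted-Cholesky collinearity check is
# skipped, and the intercept is estimated instead of being pinned at 0.
results = lm(@formula(Price ~ Year + Mileage), data; dropcollinear=false)
coef(results)
```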
As a side comment, I'm not sure it is sane to have dropcollinear=true as the default. I have used a number of statistical packages, and this is the first time I've seen such an option enabled by default.
Fwiw, I think this is a bug, possibly introduced by me. I don't think dropcollinear=true should behave differently when X is full rank, so having it as the default shouldn't cause this type of problem.