LinearRegressionKit results correctness

Eric · June 30, 2022, 7:52pm

I haven’t posted for a while, but I noticed some discussion on the forum about results correctness, and that got me thinking that maybe I should do more on my little package.
Hence I used the tests that everyone seems to be looking at https://itl.nist.gov/div898/strd/lls/lls.shtml
And LinearRegressionKit against them; I also tested GLM because it is the standard, and the method I used to compute the least-square is not famous for being remarkably accurate. Instead, the focus is on speed. The original paper indicates a need to tune the sweep operator with some tolerance value when working with polynomials starting from the 5th degree, which I didn’t experiment with yet.

I run some tests on Julia Version 1.7.2.

LinearRegressionKit v0.7.4
GLM v1.8.0
for the main packages.

The tests are available as Gist (links below).

Norris

Norris - LRK
delta B0 1.5815126985785355e-13
delta B1 3.9968028886505635e-15
delta std err B0 4.755917881738014e-13
delta std err B1 8.790169113426227e-16
delta resid std dev 1.8096635301390052e-12
delta R2 3.3306690738754696e-16
Norris - GLM
delta B0 1.084687895058778e-13
delta B1 4.6629367034256575e-15
delta std err B0 3.1363800445660672e-15
delta std err B1 4.716279450311944e-18
delta resid std dev 0.10193173351428586
delta R2 3.3306690738754696e-16

Not much to comment; the resulting deltas are relatively low and comparable to GLM. It is not surprising as it is categorized as low difficulty.

Pontius:

Pontius - LRK
delta B0 3.0665574246580007e-15
delta B1 5.293955920339377e-21
delta B2 1.828185147849695e-27
delta std err B0 1.7718994132186194e-14
delta std err B1 2.5906728098489697e-20
delta std err B2 7.988615660874347e-27
delta resid std dev 3.3683261258313224e-14
delta R2 2.220446049250313e-16
Pontius - GLM
delta B0 0.000673565789473684
delta B1 8.753152890137734e-10
delta B2 2.3721281545096935e-16
delta std err B0 NaN
delta std err B1 5.557349197992017e-11
delta std err B2 5.709830460148961e-18
delta resid std dev 0.00020509329395895348
delta R2 1.0505786940395723e-7

The LRK results are within my expectations, given that it is a low-difficulty dataset. However, the GLM results appear to show an error as one of the std error gives back a NaN; probably as a consequence, the RMSE delta is relatively large.

NotInt1 & 2
NoInt1
NoInt2

NoInt1 - LRK
delta B1 3.9968028886505635e-15
delta std err B1 5.238864897449957e-16
delta resid std dev 1.0969003483296547e-13
delta R2 0.0
NoInt1 - GLM
delta B1 3.9968028886505635e-15
delta std err B1 0.0
delta resid std dev 9.159742387209336
delta R2 1.1563902856870918

NoInt2 - LRK
delta B1 3.3306690738754696e-16
delta std err B1 0.0
delta resid std dev 6.661338147750939e-16
delta R2 0.0
NoInt2 - GLM
delta B1 0.0
delta std err B1 0.0
delta resid std dev 0.23291083657436149
delta R2 0.4024390243902445

Here for the medium-difficulty cases, LRK gives ok results.
With GLM, the problem with the RMSE, is probably my fault as I did not remember the correct formula for the RMSE using the deviance() and dof_residuals(). For the RSquare, the formula is different when there is no intercept, and the GLM package does not appear to account for that.

The Wrangler (1 to 5). The results are ok for LRK and GLM, sometimes a slight advantage for one package and some other time for the others.
[wrangler 1]
(wrangler1.jl · GitHub)
[wrangler 2]
(test_wrangler2.jl · GitHub)
[wrangler 3]
(https://gist.github.com/ericqu/8dceb0c6fbae8ea811cd4a594748c0cf)
[wrangler 4]
(https://gist.github.com/ericqu/0157983611ab3cb9765c7caf36f2e718)
[wrangler 5]
(https://gist.github.com/ericqu/91e52a2d299b8f42ec8b326acad07b2c)

Finally Filip!
This is the most challenging of the lot (at least for LRK and GLM).

Filip - LRK
delta B0 42.181664600308686
delta B1 85.56594026913072
delta B2 71.04442186725043
delta std err B0 73.65537847950405
delta std err B1 116.11298373784764
delta std err B2 78.04574485659762
delta resid std dev 0.0037046249359867383
delta R2 0.004462898883547672
Filip - GLM
delta B0 0.000673565789473684
delta B1 7.32059160401003e-7
delta B2 3.16081871345029e-15
delta std err B0 NaN
delta std err B1 NaN
delta std err B2 NaN
delta resid std dev 0.014246850475811437
delta R2 4.575918810437186

LRK gives results relatively far from the target. I am a bit disappointed but not too surprised; this is probably where the algorithm might require some tuning. It seems the situation is similar on GLM side.

So it seems that high polynomials have a relatively higher risk of inaccurate results. Unfortunately, I have little time these days, so this will be put as a future to-do.
If you encounter any other issues with the package, let me know through the Github (I check regularly), and I will try to fix the issue(s) asap.
If you get different results (for instance with another version of Julia, please let me know too. Thanks

Topic		Replies	Views
This linear regression fails in Julia (GLM) vs Python (sklearn) Data linearalgebra , linear-regression	7	411	July 4, 2023
GLM - inconsistency with R on ISLR dataset? Performance question , glm	2	555	January 30, 2022
GLM is slow on large datasets. Using OnlineStats for regressions? MixedModels? Performance glm	25	5093	November 26, 2018
Accuracy_score General Usage	2	377	October 3, 2020
Problems with simple linear model (R and Julia comparison) General Usage	3	93	March 21, 2025

LinearRegressionKit results correctness

Related topics