[ANN] LinearRegressionKit

I don’t have a strong reference or a robust quantitative analysis. The limited, hence in my view anecdotal, comparison I have is the one based on the NIST datasets:

I mentioned it in a post some time ago (LinearRegressionKit results correctness), but since some time has passed, I made another gist:
Here

I only looked at the values of the estimated coefficients (I was lazy enough not to want to calculate them for the Base version), and I chose to take the average of the errors. There are probably many other ways to look at the results.
In my view, the Sweep operator sits between Cholesky (GLM) and QR (Base) in accuracy. I still note that a few datasets cause issues for GLM/Cholesky (Filip, where the standard errors become NaN, and Longley, where the estimated coefficients are particularly wrong), although this may not be directly related to the Cholesky decomposition itself.
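To give an idea of the kind of comparison, here is a minimal sketch (my own toy example, not the gist’s code) that fits an ill-conditioned polynomial model both with Base Julia’s QR (X \ y) and with the normal equations via Cholesky, then takes the average absolute error of the estimated coefficients:

```julia
using LinearAlgebra, Statistics

# Toy ill-conditioned design: a Vandermonde (polynomial) matrix, in the
# spirit of the NIST Filip/Wampler datasets (not the actual NIST data).
x = range(0.1, 1.0; length = 60)
X = [xi^j for xi in x, j in 0:5]       # 60×6 Vandermonde matrix
β_true = ones(6)                       # known coefficients
y = X * β_true                         # noise-free response

β_qr   = X \ y                                   # Base Julia: QR factorization
β_chol = cholesky(Symmetric(X' * X)) \ (X' * y)  # normal equations + Cholesky

# Average absolute coefficient error, the same kind of summary as in the gist.
mce(β̂) = mean(abs.(β̂ .- β_true))
@show cond(X) mce(β_qr) mce(β_chol)
```

Squaring the condition number through X'X is why the Cholesky route typically loses roughly twice as many digits as QR on such designs.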

I ran it with the following setup:

Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 
Status `~/.julia/environments/v1.8/Project.toml`
  [336ed68f] CSV v0.10.7
  [324d7699] CategoricalArrays v0.10.7
⌅ [a93c6f00] DataFrames v1.2.2
  [1313f7d8] DataFramesMeta v0.12.0
  [38e38edf] GLM v1.8.1
  [e91d531d] LinearRegressionKit v0.7.8
  [3eaba693] StatsModels v0.6.33
  [112f6efa] VegaLite v2.6.0

and obtained this (the mces columns are the average coefficient errors for LinearRegressionKit’s sweep, GLM’s Cholesky, and the base QR; conds is the condition number, one row per dataset):

 Row │ lrk_mces       glm_mces          lrqr_mces      conds         
     │ Float64        Float64           Float64        Float64       
─────┼───────────────────────────────────────────────────────────────
   1 │   8.1074e-14        5.65659e-14    1.33227e-14  1362.31
   2 │   1.02219e-15       0.000224522    1.49222e-16     2.16738e16
   3 │   3.9968e-15        3.9968e-15     3.9968e-15    101.499
   4 │   3.33067e-16       2.22045e-16    1.19209e-7     25.7108
   5 │ 708.127           738.694        738.754           2.05099e15
   6 │   1.95531e-5   497730.0            2.04614e-7      8.07878e9
   7 │   5.77698e-8        6.59178e-8     2.07236e-10     3.8327e16
   8 │   1.36005e-12       7.42606e-13    4.25137e-16     1.20013e21
   9 │   5.77698e-8        6.59178e-8     6.95803e-11     9.30321e6
  10 │   5.77698e-8        6.59178e-8     8.65831e-10     9.3434e6
  11 │   5.77698e-8        6.59178e-8     8.65831e-10     9.3434e6

and running it again with LinearRegressionKit v0.7.10 (which fixes the condition number computation) gives:

 Row │ lrk_mces       glm_mces          lrqr_mces      conds        
     │ Float64        Float64           Float64        Float64      
─────┼──────────────────────────────────────────────────────────────
   1 │   8.1074e-14        5.65659e-14    1.33227e-14  855.223
   2 │   1.02219e-15       0.000224522    1.49222e-16    1.42303e13
   3 │   3.9968e-15        3.9968e-15     3.9968e-15     1.0
   4 │   3.33067e-16       2.22045e-16    1.19209e-7     1.0
   5 │ 708.127           738.694        738.754          1.76796e15
   6 │   1.95531e-5   497730.0            2.04614e-7     4.85926e9
   7 │   5.77698e-8        6.59178e-8     2.07236e-10    6.39893e6
   8 │   1.36005e-12       7.42606e-13    4.25137e-16    6.39893e6
   9 │   5.77698e-8        6.59178e-8     6.95803e-11    6.39893e6
  10 │   5.77698e-8        6.59178e-8     8.65831e-10    6.39893e6
  11 │   5.77698e-8        6.59178e-8     8.65831e-10    6.39893e6

Sorry, I mainly meant using a “Kit” to get the needed stats and some plots quickly, without doing too much coding.

Now there is also a small advantage to the Sweep operator: the Error Sum of Squares (SSE), also known as the Residual Sum of Squares (RSS), is a by-product of the sweep, hence the MSE and RMSE are almost precomputed. Maybe it is also the case with Cholesky/QR; I did not check.
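On that point, here is a minimal sketch of the sweep operator (my own simplified implementation, not LinearRegressionKit’s code) showing that after sweeping the p predictor pivots of the augmented cross-products matrix [X'X X'y; y'X y'y], the coefficients sit in the last column and the SSE sits in the bottom-right corner:

```julia
using LinearAlgebra

# Sweep the k-th pivot of a square matrix (Goodnight's SWEEP operator).
function sweep!(A, k)
    d = A[k, k]
    A[k, :] ./= d                  # normalize the pivot row
    for i in axes(A, 1)
        if i != k
            c = A[i, k]
            A[i, :] .-= c .* A[k, :]   # eliminate column k from row i
            A[i, k] = -c / d
        end
    end
    A[k, k] = 1 / d
    return A
end

# Small regression problem: y ≈ β₁ + β₂·x
X = [ones(5) collect(1.0:5.0)]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

p = size(X, 2)
M = [X'X X'y; y'X y'y]          # (p+1)×(p+1) augmented cross-products matrix
foreach(k -> sweep!(M, k), 1:p) # sweep the p predictor pivots

β̂   = M[1:p, end]              # estimated coefficients
sse = M[end, end]               # SSE appears "for free" in the corner
```

As a bonus, after the sweeps the top-left p×p block of M holds (X'X)⁻¹, which is exactly what the coefficient standard errors need.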

I think the error in the estimated coefficients is not only due to the condition number. You can look at the code’s results; at first sight, it seems there are about two digits of difference in most cases, but there is some variability.

I will finish with a word of caution: while I try to avoid bugs by including tests against SAS and R results, there could still be errors in the code; please let me know if you find any. The package is used by only a few users, so there may be edge cases that have not been seen yet.