I don’t have a strong reference or a robust quantitative analysis. The limited, and hence in my view anecdotal, evidence I have is from NIST:
- the introduction: (Linear Least Squares Regression Background Information)
- the data: (StRD Linear Least Squares Regression Datasets)
I mentioned it in a post some time ago (LinearRegressionKit results correctness), but since some time has passed, I made another gist:
Here
I only looked at the values of the estimated coefficients (I was lazy enough not to want to compute them for Base’s `\`), and I chose to take the average of the errors. There are probably many other ways to look at the results.
In my view, the Sweep operator sits between Cholesky (GLM) and QR (Base) in accuracy. I still note that a few datasets cause issues for GLM/Cholesky (Filip: the standard errors become NaN; Longley: the estimated coefficients are particularly wrong). Admittedly, this may not be directly related to the Cholesky decomposition itself.
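To make the comparison concrete, here is a minimal sketch of the kind of check the gist does. `X`, `y`, and `beta_cert` are placeholder names for a design matrix, response, and NIST-certified coefficients loaded from an StRD dataset; the exact error metric in the gist may differ:

```julia
# Minimal sketch, assuming `X` (design matrix with an intercept column),
# `y`, and the NIST-certified coefficients `beta_cert` have been loaded
# from one of the StRD datasets (placeholder names, not the gist's code).
using GLM

# Average of the absolute coefficient errors vs the certified values.
mean_coef_error(beta_est, beta_cert) =
    sum(abs.(beta_est .- beta_cert)) / length(beta_cert)

beta_qr  = X \ y            # Base: least squares via pivoted QR
beta_glm = coef(lm(X, y))   # GLM: Cholesky on the normal equations

@show mean_coef_error(beta_qr, beta_cert)
@show mean_coef_error(beta_glm, beta_cert)
```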
I ran it with the following setup:
```
Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

Status `~/.julia/environments/v1.8/Project.toml`
  [336ed68f] CSV v0.10.7
  [324d7699] CategoricalArrays v0.10.7
⌅ [a93c6f00] DataFrames v1.2.2
  [1313f7d8] DataFramesMeta v0.12.0
  [38e38edf] GLM v1.8.1
  [e91d531d] LinearRegressionKit v0.7.8
  [3eaba693] StatsModels v0.6.33
  [112f6efa] VegaLite v2.6.0
```
and obtained this:
```
 Row │ lrk_mces     glm_mces     lrqr_mces    conds
     │ Float64      Float64      Float64      Float64
─────┼──────────────────────────────────────────────────
   1 │ 8.1074e-14   5.65659e-14  1.33227e-14  1362.31
   2 │ 1.02219e-15  0.000224522  1.49222e-16  2.16738e16
   3 │ 3.9968e-15   3.9968e-15   3.9968e-15   101.499
   4 │ 3.33067e-16  2.22045e-16  1.19209e-7   25.7108
   5 │ 708.127      738.694      738.754      2.05099e15
   6 │ 1.95531e-5   497730.0     2.04614e-7   8.07878e9
   7 │ 5.77698e-8   6.59178e-8   2.07236e-10  3.8327e16
   8 │ 1.36005e-12  7.42606e-13  4.25137e-16  1.20013e21
   9 │ 5.77698e-8   6.59178e-8   6.95803e-11  9.30321e6
  10 │ 5.77698e-8   6.59178e-8   8.65831e-10  9.3434e6
  11 │ 5.77698e-8   6.59178e-8   8.65831e-10  9.3434e6
```
and after LinearRegressionKit v0.7.10 (with the condition number computation fixed), it gives:
```
 Row │ lrk_mces     glm_mces     lrqr_mces    conds
     │ Float64      Float64      Float64      Float64
─────┼──────────────────────────────────────────────────
   1 │ 8.1074e-14   5.65659e-14  1.33227e-14  855.223
   2 │ 1.02219e-15  0.000224522  1.49222e-16  1.42303e13
   3 │ 3.9968e-15   3.9968e-15   3.9968e-15   1.0
   4 │ 3.33067e-16  2.22045e-16  1.19209e-7   1.0
   5 │ 708.127      738.694      738.754      1.76796e15
   6 │ 1.95531e-5   497730.0     2.04614e-7   4.85926e9
   7 │ 5.77698e-8   6.59178e-8   2.07236e-10  6.39893e6
   8 │ 1.36005e-12  7.42606e-13  4.25137e-16  6.39893e6
   9 │ 5.77698e-8   6.59178e-8   6.95803e-11  6.39893e6
  10 │ 5.77698e-8   6.59178e-8   8.65831e-10  6.39893e6
  11 │ 5.77698e-8   6.59178e-8   8.65831e-10  6.39893e6
```
Sorry, I mainly meant that using a Kit lets you get the needed stats and some plots quickly, without doing too much coding.
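For instance, something along these lines (a sketch based on my memory of the LinearRegressionKit README; check the docs for the exact signature, in particular the plot-request argument is an assumption):

```julia
# Sketch from memory of LinearRegressionKit's README; the "all"
# plot-request argument is an assumption and may differ in the real API.
using DataFrames, StatsModels, LinearRegressionKit

df = DataFrame(x = 1.0:10.0, y = 2.0 .* (1.0:10.0) .+ randn(10))

# One call gives the usual statistics (coefficients, standard errors, ...).
lr = regress(@formula(y ~ 1 + x), df)

# And, as I recall, a third argument also requests Vega-Lite plots
# (which is why VegaLite shows up in the environment above).
lr, plots = regress(@formula(y ~ 1 + x), df, "all")
```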
There is also a small advantage to the Sweep operator: the Error Sum of Squares (SSE), also known as the Residual Sum of Squares (RSS), is a by-product of the sweep, hence the MSE and RMSE are almost precomputed. Maybe it is also the case with Cholesky/QR; I did not check. A toy sketch of this is below.
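Here is a toy sweep operator (my own simplified version, not LinearRegressionKit's implementation): after sweeping the X columns of the augmented cross-product matrix, both the coefficients and the SSE are sitting in the matrix.

```julia
# Toy sweep operator (a simplified sketch, not LinearRegressionKit's code).
# Sweeping the X columns of M = [X y]'[X y] leaves (X'X)⁻¹ in the top-left
# block, the coefficients in the last column, and the SSE in M[end, end].
using LinearAlgebra

function sweep!(A::Matrix{Float64}, k::Int)
    d = A[k, k]
    A[k, :] ./= d                     # pivot row
    for i in axes(A, 1)
        if i != k
            c = A[i, k]
            A[i, :] .-= c .* A[k, :]  # eliminate the other rows
            A[i, k] = -c / d          # pivot column
        end
    end
    A[k, k] = 1 / d
    return A
end

X = [ones(5) collect(1.0:5.0)]        # toy design matrix with intercept
y = [1.1, 1.9, 3.2, 3.9, 5.1]
Z = [X y]
M = Z' * Z

for k in 1:size(X, 2)                 # sweep the X columns
    sweep!(M, k)
end

beta = M[1:size(X, 2), end]           # estimated coefficients
sse  = M[end, end]                    # SSE, free of charge
mse  = sse / (size(X, 1) - size(X, 2))
rmse = sqrt(mse)
```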
Yes, in particular: since forming the normal equations squares the condition number, it depends on the conditioning of your matrix. E.g. if your matrix has a condition number of 100, then you lose about 2 extra digits to roundoff errors compared to QR, but if it has a condition number of 10^8, then you lose about 8 extra digits. See also this discussion: Efficient way of doing linear regression - #33 by stevengj
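A quick toy illustration of this (my own example, not from the NIST data): cond(X'X) ≈ cond(X)^2, so the normal-equations solve loses roughly twice the digits of the QR solve.

```julia
# Toy illustration: the normal equations square the condition number,
# so Cholesky on X'X loses roughly twice the digits of a QR solve.
using LinearAlgebra

x = range(0, 1; length = 50)
X = [x[i]^j for i in 1:50, j in 0:6]   # ill-conditioned polynomial design
beta_true = ones(7)
y = X * beta_true                      # exact response: errors are pure roundoff

@show cond(X)                          # condition number of X
@show cond(X' * X)                     # ≈ cond(X)^2

beta_qr   = X \ y                                    # QR-based solve
beta_chol = cholesky(Symmetric(X' * X)) \ (X' * y)   # normal equations

@show maximum(abs.(beta_qr   .- beta_true))   # roughly eps * cond(X)
@show maximum(abs.(beta_chol .- beta_true))   # roughly eps * cond(X)^2
```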
I think the error in the estimated coefficients is not only due to the condition number. If you look at the results above, at first sight there seem to be about two digits of difference in most cases, but there is some variability.
I will finish with a word of caution: while I try to avoid bugs by adding tests against SAS and R results, there could still be errors in the code, so please let me know if you find any. The package is only used by a few users, so there may be edge cases that have not been surfaced yet.