[ANN] LinearRegressionKit

LinearRegressionKit is now at version 0.7.4 and is registered (you can use add LinearRegressionKit from the package prompt). See the LinearRegressionKit project page for details.
The package LinearRegression will not be updated anymore.

In version 0.7.4 the main change is the addition of ridge regression (and weighted ridge regression). An example has been added to the documentation.
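For readers unfamiliar with it, ridge regression adds an L2 penalty λ‖β‖² to the least-squares objective, which shrinks the coefficients and stabilizes ill-conditioned fits. Here is a minimal sketch of the underlying computation in plain Julia — this is not the package's API; see the LinearRegressionKit documentation for actual usage:

```julia
using LinearAlgebra

# Ridge regression minimizes ‖y - Xβ‖² + λ‖β‖².
# Closed-form solution: β = (X'X + λI) \ (X'y).
function ridge(X::AbstractMatrix, y::AbstractVector, λ::Real)
    p = size(X, 2)
    return (X'X + λ * I(p)) \ (X'y)
end

X = [ones(5) collect(1.0:5.0)]
y = [2.1, 3.9, 6.2, 8.0, 9.9]

β_ols   = ridge(X, y, 0.0)   # λ = 0 recovers ordinary least squares
β_ridge = ridge(X, y, 10.0)  # λ > 0 shrinks the coefficients toward zero
```

With λ = 0 this reduces to the ordinary least-squares solution; increasing λ trades a little bias for a smaller coefficient norm.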

As usual, please share your view in the comments, or if you spot a problem, feel free to open an issue in the GitHub project.


The package is now at version 0.7.8 :sweat_smile:, and a few items have been added:

  • The sweep operator has been modified to leverage the column-major memory layout of Julia arrays.
  • The F value (SAS) / F statistic (R) is now computed by default.
  • The significance codes, as presented in R, are now also shown when p-values are requested.
  • And, of course, some bug fixes.
    Please let me know if there are more bugs to squash :sweat_smile: or features that would be useful :smiley:.
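For reference, the F statistic compares the full model against the intercept-only model: with n observations and p slope coefficients, F = ((TSS − RSS)/p) / (RSS/(n − p − 1)). A hand-rolled sketch on made-up data (not the package's code):

```julia
using LinearAlgebra

# F statistic for an OLS fit with an intercept:
# F = ((TSS - RSS)/p) / (RSS/(n - p - 1)),
# testing all slope coefficients jointly against the intercept-only model.
X = [ones(8) collect(1.0:8.0)]               # intercept + one regressor
y = [2.0, 2.9, 4.2, 4.8, 6.1, 7.0, 8.2, 8.8]

β = X \ y
n = length(y)
p = size(X, 2) - 1                           # number of slope coefficients
rss = sum(abs2, y - X * β)                   # residual sum of squares
tss = sum(abs2, y .- sum(y) / n)             # total sum of squares
F = ((tss - rss) / p) / (rss / (n - p - 1))
```

SAS and R report this same quantity (with degrees of freedom p and n − p − 1) in their standard regression output.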

Isn’t the sweep operator closely related to Cholesky on the normal equations? Doesn’t this have the disadvantage (compared to QR or SVD) of squaring the condition number? Also, don’t you lose the advantage of highly optimized BLAS/LAPACK functions if you implement your own elimination?

I know that the sweep operator is widely used in statistics, but as an outsider I’m unclear on why you would prefer it over QR. What are the advantages?


Thank you for a very on-point and concise critique of the sweep operator; I agree with the essence of your remarks, although I would nuance the conclusion you seem to reach.
Others have put forward its pros and cons better than I could; you can find some of them in the project’s README.
I see it as an advantage that the sweep operator sits somewhere between Cholesky and QR/SVD: from the anecdotal examples I have seen, it is less prone to numerical errors than Cholesky, while being faster (though less precise) than QR/SVD. I often use linear regression as a first tool to study the problem at hand, since it quickly takes the analysis beyond what visual/graph-driven inspection can show; the analysis of residuals, for instance, is usually what helps me choose the best approach for working further on the question (be that a non-linear method, QR/SVD, or anything in the myriad of tools/algorithms/methods available).
To answer your points in more detail: yes, the condition number is negatively impacted; this is why I added a parameter to make it easier for the user to compute and access that value.
About BLAS/LAPACK: as you noted, the sweep operator works on the normal equations, and performance-wise most of the computation is spent forming those equations, which I believe can rely on BLAS/LAPACK when available (also, I think the BLAS/LAPACK implementation is not very efficient on my personal computer…). Performance could be improved with multithreading/multiprocessing; so far, I haven’t heard from users that this is needed.
And finally, the loss of precision compared to QR/SVD is, I think, very dependent on the data studied; on the initial paper’s examples and the NIST test data, up to a polynomial of 6th degree, it seems to give relatively similar results.
Hence I think there is no ground-breaking result with the sweep operator, but the package gives access to a wide range of statistics/tools that enable a prompt analysis and, in most cases, correct results, which can then be followed up with more adequate methods to address the problem.
I fear that this doesn’t fully address your concerns; please let me know if there is something that could help.

Do you have a reference? Is there a quantitative analysis somewhere? The condition number seems to be the same as for Cholesky, so at best it seems you would gain a small constant factor.

How is this different from QR?

Yes — in particular, since it squares the condition number, the accuracy depends on the conditioning of your matrix. E.g. if your matrix has a condition number of 100, you lose about 2 extra digits to roundoff errors compared to QR, but if it has a condition number of 10^8 then you lose about 8 extra digits. See also this discussion: Efficient way of doing linear regression - #33 by stevengj

The reason why libraries often default to QR is that it can be hard for users to predict (or comprehend) whether their problem is well-conditioned, and reliability is prioritized over speed.

(I’ve also seen references to the ability of the sweep approach to incrementally add additional columns/variables, but of course you can update QR incrementally as well.)
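The digit-loss estimate above is easy to check with a quick experiment. Here is a hedged sketch using a synthetic polynomial design matrix (not data from this thread), comparing Julia's QR-based `\` against an explicit normal-equations solve:

```julia
using LinearAlgebra

# Monomial-basis design matrix for a degree-6 polynomial fit on [0, 1]:
# moderately ill-conditioned, so the κ vs. κ² difference is visible.
x = range(0, 1, length = 100)
A = x .^ (0:6)'                # 100×7 Vandermonde-type matrix
βtrue = ones(7)
y = A * βtrue                  # data generated from known coefficients

κ = cond(A)                    # note cond(A'A) = κ²

β_qr   = A \ y                 # QR-based least squares: error ~ κ · eps
β_norm = (A'A) \ (A'y)         # normal equations: error ~ κ² · eps

err_qr   = norm(β_qr   .- βtrue) / norm(βtrue)
err_norm = norm(β_norm .- βtrue) / norm(βtrue)
```

On a matrix like this (κ on the order of 10^4), err_norm should come out several orders of magnitude larger than err_qr, consistent with the κ vs. κ² scaling.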


I don’t have a strong reference or a robust quantitative analysis. The limited (and hence, in my view, anecdotal) evidence I have is the NIST test data:

I mentioned it in a post some time ago (LinearRegressionKit results correctness), but since some time has passed, I made another gist:

I only looked at the values of the estimated coefficients (I was lazy enough not to want to calculate them for Base), and I chose to take the average of the errors. There are probably many other ways to look at the results.
In my view, the sweep operator sits between Cholesky (GLM) and QR (Base); I still note that a few datasets cause issues for GLM/Cholesky (Filip, where the standard errors become NaN, and Longley, where the estimated coefficients are particularly wrong). Indeed, this may not be directly related to the Cholesky decomposition.

I ran it with the following setup:

Julia Version 1.8.3
Commit 0434deb161e (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Status `~/.julia/environments/v1.8/Project.toml`
  [336ed68f] CSV v0.10.7
  [324d7699] CategoricalArrays v0.10.7
⌅ [a93c6f00] DataFrames v1.2.2
  [1313f7d8] DataFramesMeta v0.12.0
  [38e38edf] GLM v1.8.1
  [e91d531d] LinearRegressionKit v0.7.8
  [3eaba693] StatsModels v0.6.33
  [112f6efa] VegaLite v2.6.0

and obtained this:

 Row │ lrk_mces       glm_mces          lrqr_mces      conds         
     │ Float64        Float64           Float64        Float64       
   1 │   8.1074e-14        5.65659e-14    1.33227e-14  1362.31
   2 │   1.02219e-15       0.000224522    1.49222e-16     2.16738e16
   3 │   3.9968e-15        3.9968e-15     3.9968e-15    101.499
   4 │   3.33067e-16       2.22045e-16    1.19209e-7     25.7108
   5 │ 708.127           738.694        738.754           2.05099e15
   6 │   1.95531e-5   497730.0            2.04614e-7      8.07878e9
   7 │   5.77698e-8        6.59178e-8     2.07236e-10     3.8327e16
   8 │   1.36005e-12       7.42606e-13    4.25137e-16     1.20013e21
   9 │   5.77698e-8        6.59178e-8     6.95803e-11     9.30321e6
  10 │   5.77698e-8        6.59178e-8     8.65831e-10     9.3434e6
  11 │   5.77698e-8        6.59178e-8     8.65831e-10     9.3434e6

and after LinearRegressionKit v0.7.10 (the condition number being fixed) gives:

 Row │ lrk_mces       glm_mces          lrqr_mces      conds        
     │ Float64        Float64           Float64        Float64      
   1 │   8.1074e-14        5.65659e-14    1.33227e-14  855.223
   2 │   1.02219e-15       0.000224522    1.49222e-16    1.42303e13
   3 │   3.9968e-15        3.9968e-15     3.9968e-15     1.0
   4 │   3.33067e-16       2.22045e-16    1.19209e-7     1.0
   5 │ 708.127           738.694        738.754          1.76796e15
   6 │   1.95531e-5   497730.0            2.04614e-7     4.85926e9
   7 │   5.77698e-8        6.59178e-8     2.07236e-10    6.39893e6
   8 │   1.36005e-12       7.42606e-13    4.25137e-16    6.39893e6
   9 │   5.77698e-8        6.59178e-8     6.95803e-11    6.39893e6
  10 │   5.77698e-8        6.59178e-8     8.65831e-10    6.39893e6
  11 │   5.77698e-8        6.59178e-8     8.65831e-10    6.39893e6

Sorry, I mainly meant using a Kit to get the needed stats and some plots quickly, without doing too much coding.

Now there is also a small advantage to the sweep operator: the Error Sum of Squares (SSE), also known as the Residual Sum of Squares (RSS), is a by-product of the sweep operator, hence the MSE and RMSE are almost precomputed. Maybe this is also the case with Cholesky/QR; I did not check.
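To illustrate that point, here is a minimal textbook sweep (following the convention of Goodnight's 1979 tutorial; a sketch, not the package's internal implementation). Sweeping the first p pivots of the augmented cross-product matrix [X y]'[X y] leaves the coefficients in the last column and the SSE in the bottom-right corner:

```julia
using LinearAlgebra

# Sweep the k-th pivot of a square matrix A in place.
function sweep!(A::AbstractMatrix, k::Integer)
    d = A[k, k]
    A[k, :] ./= d                  # scale the pivot row
    for i in axes(A, 1)
        i == k && continue
        b = A[i, k]
        A[i, :] .-= b .* A[k, :]   # eliminate column k from the other rows
        A[i, k] = -b / d
    end
    A[k, k] = 1 / d
    return A
end

X = [ones(6) collect(1.0:6.0)]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1]
G = [X y]' * [X y]                 # (p+1)×(p+1) cross-product matrix
for k in 1:size(X, 2)
    sweep!(G, k)
end
β   = G[1:size(X, 2), end]         # regression coefficients
sse = G[end, end]                  # residual sum of squares, as a by-product
```

After the sweeps, the top-left block also holds (X'X)⁻¹, which is what makes the standard errors cheap to obtain as well.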

I think the error of the estimated coefficients is not only due to the condition number. You can look at the code’s results; at first sight, it seems there are two digits of difference in most cases, but there is some variability.

I will finish with a word of caution: while I try to avoid bugs by adding tests against SAS and R results, there could still be errors in the code; please let me know if you find any. The package is only used by a few users, so there may be edge cases that have not been seen yet.

There’s a huge qualitative difference between algorithms whose error bounds scale as the condition number squared (Cholesky and presumably sweep — anything that uses the normal equations) and algorithms whose error bounds scale as the condition number (QR and SVD). It’s not just a matter of being “between” the two.

Second, I’m really confused by your test data. For example, if I take the Pontius data

y = [0.11019, 0.21956, 0.32949, 0.43899, 0.54803, 0.65694, 0.76562, 0.87487, 0.98292, 1.09146, 1.20001, 1.30822, 1.41599, 1.52399, 1.63194, 1.73947, 1.84646, 1.95392, 2.06128, 2.16844, 0.11052, 0.22018, 0.32939, 0.43886, 0.54798, 0.65739, 0.76596, 0.87474, 0.983, 1.0915, 1.20004, 1.30818, 1.41613, 1.52408, 1.63159, 1.73965, 1.84696, 1.95445, 2.06177, 2.16829]
x = [150000, 300000, 450000, 600000, 750000, 900000, 1050000, 1200000, 1350000, 1500000, 1650000, 1800000, 1950000, 2100000, 2250000, 2400000, 2550000, 2700000, 2850000, 3000000, 150000, 300000, 450000, 600000, 750000, 900000, 1050000, 1200000, 1350000, 1500000, 1650000, 1800000, 1950000, 2100000, 2250000, 2400000, 2550000, 2700000, 2850000, 3000000]

and form the matrix A = [x x.^2] corresponding to the model y ~ x + x^2, then the condition number is:

julia> cond(A)

which doesn’t appear to correspond closely to any of the numbers in the “conds” column of your table, so I’m not sure what you are showing. Also, if I compute the “exact” least-squares result A \ y in BigFloat with 4096-bit precision (> 1000 decimal digits, which should be plenty!), then I get:

julia> setprecision(4096);

julia> Float64.(BigFloat.(A) \ parse.(BigFloat, string.(y)))
2-element Vector{Float64}:

whereas the “correct” results you are comparing against in your script are [7.32059160401003e-7, -3.16081871345029e-15], which seem to be extremely inaccurate in comparison. Am I misunderstanding the problem you are solving?

Condition numbers only give you an upper bound on the amplification of forward error from the backward error. There are other coefficients that appear in the backward error (for a given algorithm), and there are also the specifics of the roundoff errors from a particular set of data. So no, the condition number alone doesn’t tell you the error, but the dependence on the condition number is still a big factor.
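In symbols (a standard perturbation-theory sketch; the constants and the small-residual caveat are exactly the hedges mentioned above): for a backward-stable least-squares solve,

```latex
% Error bound for a backward-stable solve (small-residual case):
\frac{\|\hat\beta - \beta\|}{\|\beta\|} \;\lesssim\; C \,\kappa(A)\,\epsilon_{\mathrm{mach}},
% whereas forming the normal equations replaces A by A^T A, and since
% \kappa(A^T A) = \kappa(A)^2, the bound degrades to
\frac{\|\hat\beta - \beta\|}{\|\beta\|} \;\lesssim\; C'\,\kappa(A)^2\,\epsilon_{\mathrm{mach}}.
```

The constants C and C′ depend on the algorithm and the problem size, which is why the condition number bounds the error without predicting it exactly.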


Good morning,

The reference is here: Pontius reference data

To be sure, are you forming the matrix with or without an intercept (y ~ x + x^2 or y ~ 1 + x + x^2)? Often the intercept is implied but not shown in the formula.

I guess this is an “uh oh” moment :cold_sweat: — I realize that I computed the condition number of the [X y] matrix and not of X. While it seemed weird to me that the y vector(s) were included, I should have checked.
I will file a bug for this.
Edit: fixed starting from version 0.7.10.
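A toy illustration of the difference (synthetic data, not one of the NIST sets): when y lies close to the column space of X, the augmented matrix [X y] is nearly rank-deficient, so its condition number can be far larger than that of X alone:

```julia
using LinearAlgebra

# Design matrix with an intercept and a single regressor.
X = [ones(10) collect(1.0:10.0)]
# Response nearly (but not exactly) in the column space of X.
y = 2 .* X[:, 2] .+ 1 .+ 0.01 .* sin.(1:10)

cond_design    = cond(X)       # condition number of the design matrix X
cond_augmented = cond([X y])   # condition number of the augmented [X y]
```

The two numbers answer different questions: cond(X) characterizes the least-squares problem itself, while cond([X y]) also reflects how well the model fits.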

I suppose this is because of the lack of an intercept in your model. The same code with an intercept gives me accurate results:

3-element Vector{Float64}:

I am still not too sure about the condition number; in this dataset the most obvious issue I have is with Wampler 4, in which it seems obvious that using only x doesn’t describe the complexity of the full problem.

Ah, thanks, I didn’t realize that the intercept term was implied; now it makes more sense.

In particular, a more systematic comparison is to plot the relative error as a function of the condition number.

For example, here is a plot of error (in the least-square solution A \ b) vs condition number for randomly generated 1000\times 5 matrices A with condition numbers \kappa ranging from 10 to 10^9 and log-spaced singular values. It clearly shows that both the Cholesky / normal-equations approach and the sweep approach have similar errors, and both scale with \kappa^2, while the QR approach scales with \kappa:

Source code
using LinearAlgebra

# generate a random (Haar-uniform) m×n real orthogonal-column matrix
# using algorithm adapted from https://arxiv.org/abs/math-ph/0609050
function randQ(m::Integer,n::Integer)
    QR = qr(randn(m,n))
    return QR.Q * Diagonal(sign.(diag(QR.R)))
end

# random m×n matrix with condition number κ and
# logarithmically-spaced singular values.
function randcond(m::Integer,n::Integer,κ::Real)
    κ ≥ 1 || throw(ArgumentError("κ=$κ should be ≥ 1"))
    r = min(m,n)
    σ = exp.(range(0, -log(κ), length=r))
    U = randQ(m,r)
    V = randQ(n,r)
    return U * Diagonal(σ) * V'
end

import LinearRegressionKit

function leastsquare_errors(m, n, κ)
    A = randcond(m, n, κ)
    b = randn(m)
    x_exact = setprecision(4096) do
        Float64.(big.(A) \ big.(b))
    end
    x_qr = A \ b
    x_chol = try
        cholesky!(Hermitian(A'A)) \ (A'b)
    catch # cholesky can fail if κ is large
        (A'A) \ (A'b)
    end
    x_sweep = LinearRegressionKit.sweep_linreg(A, b)
    return norm(x_qr - x_exact) / norm(x_exact),
           norm(x_chol - x_exact) / norm(x_exact),
           norm(x_sweep - x_exact) / norm(x_exact)
end

κ = exp10.(range(1, 9, length=15))
errs = leastsquare_errors.(1000, 5, κ)

using PyPlot

loglog(κ, getindex.(errs, 1), "bo-")
loglog(κ, getindex.(errs, 2), "r*-")
loglog(κ, getindex.(errs, 3), "gs-")
loglog(κ, κ * 1e-16, "k--")
loglog(κ, κ.^2 * 1e-18, "k:")
xlabel(L"condition number $\kappa$")
ylabel("relative error in fit")
legend(["QR", "Cholesky", "Sweep", L"\sim \kappa", L"\sim \kappa^2"], loc="upper left")