Efficient way of doing linear regression

This is perfectly reasonable. In my previous post I was trying to expand on the use of \. If you do not use regressions or modelling very often, it is more convenient to use something like linreg.

However, I think that a technical audience would find it somewhat harder to memorise the name of a function for simple regressions - plus, with small changes in the formula, you can also do ridge regression :slight_smile:

I totally agree, except that the function should be called linear_regression or similar, rather than linreg, and should probably be in a package like StatsBase.jl.

11 Likes

Solving the normal equations will likely use a Cholesky decomposition, while the \ operator uses a QR decomposition. Cholesky is more efficient, which is why it may be faster; QR decomposition is more stable and precise. There are a few variants of Cholesky-type factorizations implemented in Julia (LL', LDL', Bunch-Kaufman, and their pivoted/sparse variants), and calling factorize on a dense symmetric/Hermitian matrix will pick the appropriate one.

using LinearAlgebra
F = Hermitian(A'A) |> factorize
x = F \ (A'b) # solves the normal equations (A'A)x = A'b, i.e. the least-squares problem for Ax ≈ b

Cholesky factorization of the n-by-n matrix A'A is O(n^3), plus O(mn^2) work to form A'A, while most implementations of QR decomposition (e.g., Householder reflections) applied to A are O(mn^2) with a larger constant. This means that if the linear system is strongly overdetermined (many more rows than columns in A), the normal-equations/Cholesky route will typically be faster.
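
To make the trade-off concrete, here is a rough timing sketch for a strongly overdetermined system (the sizes are arbitrary, and the exact numbers depend on the machine and BLAS):

using LinearAlgebra, BenchmarkTools

m, n = 100_000, 20
A = randn(m, n)
b = randn(m)

@btime cholesky(Hermitian($A' * $A)) \ ($A' * $b)  # normal equations (Cholesky)
@btime qr($A) \ $b                                 # Householder QR on A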

6 Likes

Can you please link the discussion with that message? I am rather surprised that this was suggested; one should never use the textbook OLS formula for actual calculations (except in some special cases, e.g. a single covariate), because it is very easy to run into ill-conditioning in practice.

The polyalgorithm for \ picks a very common and reasonable solution based on the QR decomposition. For “small” problems it is not uncommon to use the SVD, while for large datasets an iterative algorithm is often used.
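
As a quick sanity check of that first point, the rectangular case of \ should agree with an explicit pivoted-QR solve (a small sketch; the matrix here is just random):

using LinearAlgebra

X = randn(1_000, 3)
y = randn(1_000)
X \ y ≈ qr(X, Val(true)) \ y   # both go through a column-pivoted QR factorization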

Since the original problem of this discussion seems to be 1D, just using the closed-form formulas directly, without forming any matrices, may be fastest:
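
For the single-covariate case, a minimal sketch of those closed-form expressions (the function name is illustrative) could look like:

# Simple (single-covariate) least squares without forming a model matrix.
function simple_linreg(x::AbstractVector, y::AbstractVector)
    n = length(x)
    x̄ = sum(x) / n
    ȳ = sum(y) / n
    β = sum((x .- x̄) .* (y .- ȳ)) / sum(abs2, x .- x̄)  # slope
    α = ȳ - β * x̄                                       # intercept
    return α, β
end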

2 Likes

I was paraphrasing, not arguing that someone exactly said that. But that is the gist I get: every time someone asks “why was linreg removed” the answer is “just use \”, and then the asker goes “huh”, because of course that ignores the intercept.

I don’t think I have seen people asking for linreg for more than a year now, and I have a very different impression: when someone asks about linreg or linear regression in general, they are usually just directed to the appropriate package, e.g.:

There are now excellent packages for various implementations of linear regression and more general models. They should be the first choice; there is little need for most users to implement OLS in Julia by hand now.

1 Like

More details on this can be found in, for instance, Golub & Van Loan (3rd ed.). I like the very last sentence of Section 5.3: “At the very minimum, this discussion should convince you how difficult it can be to choose the ‘right’ algorithm!”

1 Like

Could you point out which packages those are, please?

1 Like

I have only used

and

in practice.

3 Likes

I use Polynomials.jl and LsqFit.jl. Since I don’t care about speed, I’ve never benchmarked them.

Example:

using Polynomials
p = polyfit(x, y, 1)  # fit a first-order (straight-line) polynomial
coeffs(p)             # [intercept, slope]
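
For LsqFit.jl, a comparable straight-line fit might look like this (the model function and starting values are illustrative):

using LsqFit
line(x, p) = p[1] .+ p[2] .* x           # intercept + slope model
fit = curve_fit(line, x, y, [0.0, 0.0])  # initial guess for [intercept, slope]
fit.param                                # estimated [intercept, slope]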
2 Likes

I describe a number of different ways of performing a simple linear regression in

https://github.com/dmbates/CopenhagenEcon/blob/master/jmd/03-LinearAlgebra.jmd

As shown in the README.md file for the repository, an effective way of running the .jmd file is to use Weave.convert_doc to create a Jupyter notebook and run that. (By the way, this notebook also shows how to use the R ggplot2 graphics package from Julia through RCall.)

It is difficult to benchmark a simple calculation like this, but I suspect that the method of augmenting the model matrix with the response vector and using a Cholesky factorization will be competitive. It has the advantage of also providing the sum of squared residuals without needing to evaluate the residuals themselves.
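
A minimal sketch of that augmented-matrix idea (assuming X already contains an intercept column; the function name is illustrative):

using LinearAlgebra

function cholesky_ols(X::AbstractMatrix, y::AbstractVector)
    Xy = [X y]                                # model matrix augmented with the response
    R = cholesky!(Symmetric(Xy'Xy, :U)).U     # upper-triangular Cholesky factor of [X y]'[X y]
    p = size(X, 2)
    β = UpperTriangular(R[1:p, 1:p]) \ R[1:p, end]  # coefficient estimates
    rss = abs2(R[end, end])                   # residual sum of squares, essentially for free
    return β, rss
end

The last diagonal entry of the Cholesky factor carries the residual sum of squares, which is why no separate pass over the residuals is needed.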

9 Likes

What about the following?

using Random

N = 10000
x = rand(N)
X = [ones(N) x]
y = 10 .+ x .* 0.3

function linreg1(y, X)
    β_hat = (X' * X) \ (X' * y)  # normal equations
    return β_hat
end

function linreg2(y, X)
    β_hat = X \ y  # QR-based least squares
    return β_hat
end

using GLM
GLM.fit(LinearModel, X, y, true)

using DataFrames, GLM
data = DataFrame(X = x, Y = y)
lm(@formula(Y ~ X), data)


using BenchmarkTools
@benchmark linreg1(y, X)
@benchmark linreg2(y, X)
@benchmark GLM.fit(LinearModel, X, y, true)
@benchmark lm(@formula(Y ~ X), data)

and the results are

linreg1:

BenchmarkTools.Trial: 
  memory estimate:  156.78 KiB
  allocs estimate:  8
  --------------
  minimum time:     213.900 μs (0.00% GC)
  median time:      289.351 μs (0.00% GC)
  mean time:        552.780 μs (4.66% GC)
  maximum time:     76.192 ms (0.00% GC)
  --------------
  samples:          8836
  evals/sample:     1

linreg2:

BenchmarkTools.Trial: 
  memory estimate:  303.69 KiB
  allocs estimate:  45
  --------------
  minimum time:     118.300 μs (0.00% GC)
  median time:      170.300 μs (0.00% GC)
  mean time:        344.598 μs (17.20% GC)
  maximum time:     57.911 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

fit():

BenchmarkTools.Trial: 
  memory estimate:  470.45 KiB
  allocs estimate:  26
  --------------
  minimum time:     133.700 μs (0.00% GC)
  median time:      189.299 μs (0.00% GC)
  mean time:        428.768 μs (17.15% GC)
  maximum time:     45.171 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

lm():

BenchmarkTools.Trial: 
  memory estimate:  1.08 MiB
  allocs estimate:  186
  --------------
  minimum time:     546.600 μs (0.00% GC)
  median time:      839.500 μs (0.00% GC)
  mean time:        1.917 ms (10.94% GC)
  maximum time:     127.190 ms (0.00% GC)
  --------------
  samples:          2588
  evals/sample:     1
4 Likes

First you should decide what you want the function to be. Here you are assuming that the model matrix X has been externally created. I believe the original suggestion was to have two vectors, x and y, as the arguments. In that case it is best to go old school and create X'X and X'y without ever creating X. I would say that this knowledge was from back in the old days except that, astonishingly, you could probably get these formulas from a recent introductory stats text.

julia> linreg1(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat} = [ones(length(x)) x]\y
linreg1 (generic function with 1 method)

julia> linreg1(x, y)
2-element Array{Float64,1}:
 0.4986028246024461
 0.29993048137652306

julia> @benchmark linreg1($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  538.31 KiB
  allocs estimate:  52
  --------------
  minimum time:     135.517 μs (0.00% GC)
  median time:      140.943 μs (0.00% GC)
  mean time:        150.165 μs (3.99% GC)
  maximum time:     832.710 μs (78.93% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> function linreg2(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
           X = [ones(length(x)) x]
           ldiv!(cholesky!(Symmetric(X'X, :U)), X'y)
       end
linreg2 (generic function with 1 method)

julia> linreg2(x, y)
2-element Array{Float64,1}:
 0.4986028246024445
 0.29993048137652634

julia> @benchmark linreg2($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  234.94 KiB
  allocs estimate:  13
  --------------
  minimum time:     67.063 μs (0.00% GC)
  median time:      70.251 μs (0.00% GC)
  mean time:        74.033 μs (3.30% GC)
  maximum time:     652.841 μs (86.41% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> function linreg3(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
           (N = length(x)) == length(y) || throw(DimensionMismatch())
           ldiv!(cholesky!(Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)), [sum(y), dot(x, y)])
       end
linreg3 (generic function with 1 method)

julia> linreg3(x, y)
2-element Array{Float64,1}:
 0.49860282460244537
 0.299930481376524

julia> @benchmark linreg3($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  368 bytes
  allocs estimate:  9
  --------------
  minimum time:     10.041 μs (0.00% GC)
  median time:      10.260 μs (0.00% GC)
  mean time:        10.473 μs (0.00% GC)
  maximum time:     105.643 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

If you look closely, the big differences are in the amount of storage that is allocated.

12 Likes

Sounds like a pretty strong case for having an efficient linreg function in the Statistics stdlib.

8 Likes

Wouldn’t you always have to create X externally in a for loop?

I’m not sure I understand the question. If you have the vector x and you want X'X and X'y, you don’t need to allocate and populate the N-by-2 matrix X. That’s what the section of the code with

Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)

and

[sum(y); dot(x, y)]

is doing.

Once upon a time (I took my first statistics course in 1968) we needed to learn this kind of short cut because we were doing the calculations by hand. But these days you would need to be doing a very large number of such calculations before saving a few hundred microseconds would be of importance.

And, from the viewpoint of a statistician, there is a lot of information in addition to the coefficient estimates that you would want.

5 Likes

I just learned so much from just those two lines!
Thank you! I didn’t even know there was such a thing…

Sorry to come late to this thread, but I’m surprised that no one seems to have given the correct explanation here for the performance differences between various methods to solve \min_x \Vert Ax-b\Vert_2.

  • x̂ = (A'A)\(A'b) is the “normal equations” approach. (You could also use cholesky(Hermitian(A'A))\(A'b) to explicitly take advantage of the fact that A'A is SPD). This (and variations thereof) is the fastest approach, but it is also the least accurate. In particular, it squares the condition number of A, so it doubles the number of digits you lose to roundoff errors and similar. You should only use this method if you know that you have a well-conditioned A.

  • x̂ = A \ b is equivalent to qr(A, Val{true}()) \ b — it uses a pivoted QR factorization. This is slower than the normal-equations approach, but it is much more accurate for badly conditioned matrices because it doesn’t square the condition number.

  • x̂ = pinv(A) * b uses the SVD of A to apply the pseudo-inverse. This is the slowest method, but it gives you the most control over accuracy for ill-conditioned matrices because you can specify a tolerance in pinv to regularize the problem by dropping noisy singular values.

QR is the default choice used for A \ b (in Julia and many other systems) because it is reliable without being too slow.
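
To make the accuracy differences concrete, here is a small illustrative comparison on a mildly ill-conditioned Vandermonde-type matrix (the sizes are arbitrary):

using LinearAlgebra

m, n = 100, 8
t = LinRange(0, 1, m)
A = [t[i]^j for i in 1:m, j in 0:n-1]   # condition number grows quickly with n
xtrue = ones(n)
b = A * xtrue

x_normal = cholesky(Hermitian(A'A)) \ (A'b)  # normal equations: fastest, squares cond(A)
x_qr     = A \ b                             # pivoted QR: the default
x_svd    = pinv(A) * b                       # SVD/pseudo-inverse (a tolerance can be passed to pinv)

norm(x_normal - xtrue), norm(x_qr - xtrue), norm(x_svd - xtrue)

The normal-equations error should come out visibly larger than the QR and SVD errors, reflecting the squared condition number.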

75 Likes

Does x’ mean the derivative of x? And is x the independent variable, while y is the dependent one?

1 Like

x’ means the transpose of x in this context.

If we were talking about ODEs, it would probably mean x’(t).

3 Likes