Efficient way of doing linear regression

I use polynomials.jl and lsqfit.jl. Not caring about speed I’ve never benchmarked.

Example:

using Polynomials
p = polyfit(x,y,1) # fit first order polynomial
coeffs(p)
1 Like

I describe a number of different ways of performing a simple linear regression in

As shown in the README.md file for the repository an effective way of running the .jmd file is to use Weave.convert_doc to create a Jupyter notebook and run that. (By the way, this notebook also shows using the R ggplot2 graphics package from Julia through RCall.).

It is difficult to benchmark a simple calculation like this but I suspect that the method of augmenting the model matrix with the response vector and using a Cholesky factorization will be competitive. It has the advantage of also providing the sum of squared residuals without needing to evaluate the residuals.

8 Likes

what about?

using Random

N = 10000
x = rand(N)
X = [ones(N) x]
y = 10 .+ x .* 0.3

function linreg1(y, X)
    β_hat = (X' * X) \ X' * y
    return(β_hat)
end

function linreg2(y, X)
    β_hat = X \ y
    return(β_hat)
end

using GLM
GLM.fit(LinearModel, X, y, true)

using DataFrames, GLM
data = DataFrame(X = x, Y = y)
lm(@formula(Y ~ X), data)


using BenchmarkTools
@benchmark linreg1(y, X)
@benchmark linreg2(y, X)
@benchmark GLM.fit(LinearModel, X, y, true)
@benchmark lm(@formula(Y ~ X), data)

and the results are

lingerg1:

BenchmarkTools.Trial: 
  memory estimate:  156.78 KiB
  allocs estimate:  8
  --------------
  minimum time:     213.900 μs (0.00% GC)
  median time:      289.351 μs (0.00% GC)
  mean time:        552.780 μs (4.66% GC)
  maximum time:     76.192 ms (0.00% GC)
  --------------
  samples:          8836
  evals/sample:     1

linreg2:

BenchmarkTools.Trial: 
  memory estimate:  303.69 KiB
  allocs estimate:  45
  --------------
  minimum time:     118.300 μs (0.00% GC)
  median time:      170.300 μs (0.00% GC)
  mean time:        344.598 μs (17.20% GC)
  maximum time:     57.911 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

fit():

BenchmarkTools.Trial: 
  memory estimate:  470.45 KiB
  allocs estimate:  26
  --------------
  minimum time:     133.700 μs (0.00% GC)
  median time:      189.299 μs (0.00% GC)
  mean time:        428.768 μs (17.15% GC)
  maximum time:     45.171 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

lm():

BenchmarkTools.Trial: 
  memory estimate:  1.08 MiB
  allocs estimate:  186
  --------------
  minimum time:     546.600 μs (0.00% GC)
  median time:      839.500 μs (0.00% GC)
  mean time:        1.917 ms (10.94% GC)
  maximum time:     127.190 ms (0.00% GC)
  --------------
  samples:          2588
  evals/sample:     1
2 Likes

First you should decide what you want the function to be. Here you are assuming that the model matrix X has been externally created. I believe the original suggestion was to have two vectors, x and y, as the arguments. In that case it is best to go old school and create X'X and X'y without ever creating X. I would say that this knowledge was from back in the old days except that, astonishingly, you could probably get these formulas from a recent introductory stats text.

julia> linreg1(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat} = [ones(length(x)) x]\y
linreg1 (generic function with 1 method)

julia> linreg1(x, y)
2-element Array{Float64,1}:
 0.4986028246024461
 0.29993048137652306

julia> @benchmark linreg1($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  538.31 KiB
  allocs estimate:  52
  --------------
  minimum time:     135.517 μs (0.00% GC)
  median time:      140.943 μs (0.00% GC)
  mean time:        150.165 μs (3.99% GC)
  maximum time:     832.710 μs (78.93% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> function linreg2(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
           X = [ones(length(x)) x]
           ldiv!(cholesky!(Symmetric(X'X, :U)), X'y)
       end
linreg2 (generic function with 1 method)

julia> linreg2(x, y)
2-element Array{Float64,1}:
 0.4986028246024445
 0.29993048137652634

julia> @benchmark linreg2($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  234.94 KiB
  allocs estimate:  13
  --------------
  minimum time:     67.063 μs (0.00% GC)
  median time:      70.251 μs (0.00% GC)
  mean time:        74.033 μs (3.30% GC)
  maximum time:     652.841 μs (86.41% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> function linreg3(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
           (N = length(x)) == length(y) || throw(DimensionMismatch())
           ldiv!(cholesky!(Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)), [sum(y), dot(x, y)])
       end
linreg3 (generic function with 1 method)

julia> linreg3(x, y)
2-element Array{Float64,1}:
 0.49860282460244537
 0.299930481376524

julia> @benchmark linreg3($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  368 bytes
  allocs estimate:  9
  --------------
  minimum time:     10.041 μs (0.00% GC)
  median time:      10.260 μs (0.00% GC)
  mean time:        10.473 μs (0.00% GC)
  maximum time:     105.643 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

If you look closely the big differences are in the amount of storage that is allocated.

7 Likes

Sounds like a pretty strong case for having an efficient linreg function in the Statistics stdlib.

6 Likes

Wouldn’t you always have to create X externally in a for loop?

I’m not sure I understand the question. If you have the vector x and you want X'X and X'y you don’t need to allocate and populate the N by 2 matrix X. That’s what the section in the code with

Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)

and

[sum(y); dot(x, y)]

is doing.

Once upon a time (I took my first statistics course in 1968) we needed to learn this kind of short cut because we were doing the calculations by hand. But these days you would need to be doing a very large number of such calculations before saving a few hundred microseconds would be of importance.

And, from the viewpoint of a statistician, there is a lot of information in addition to the coefficient estimates that you would want.

5 Likes

I just learned so much from just those two lines!
Thank you! I didn’t even know there was such a thing…

Sorry to come late to this thread, but I’m surprised that no one seems to have given the correct explanation here for the performance differences between various methods to solve \min_x \Vert Ax-b\Vert_2.

  • x̂ = (A'A)\(A'b) is the “normal equations” approach. (You could also use cholesky(Hermitian(A'A))\(A'b) to explicitly take advantage of the fact that A'A is SPD). This (and variations thereof) is the fastest approach, but it is also the least accurate. In particular, it squares the condition number of A, so it doubles the number of digits you lose to roundoff errors and similar. You should only use this method if you know that you have a well-conditioned A.

  • x̂ = A \ b is equivalent to qr(A, Val{true}()) \ b — it uses a pivoted QR factorization. This is slower than the normal-equations approach, but it is much more accurate for badly conditioned matrices because it doesn’t square the condition number.

  • x̂ = pinv(A) * b uses the SVD of A to apply the pseudo-inverse. This is the slowest method, but it gives you the most control over accuracy for ill-conditioned matrices because you can specify a tolerance in pinv to regularize the problem by dropping noisy singular values.

QR is the default choice used for A \ b (in Julia and many other systems) because it is reliable without being too slow.

50 Likes

does x’ mean the derivative of x? and is x independent, while y is dependent?

1 Like

x’ means transpose of x in this context.

If we were talking about odes it’d probably mean x’(t)

3 Likes

that makes sense.

1 Like

Sorry to bring this discussion back. But I think that many times the user is just searching for a function which given two vectors x and y, returns the coefficients of the linear fit and the Pearson correlation coefficient, and perhaps a third vector with the fitted Y values in the same x positions, such that plotting a linear regression over some data is direct. Something like:

x = [...]
y = [...]
L = linreg(x,y)
scatter(x,y)
plot!(x,L.Y,label="R^2=$(L.R2)")

Actually I think one such alternative would be nice to exist in the Plots package perhaps.

I mean, that is what I was searching for now, and it was easier to just plot what I wanted somewhere else at the end. A suite of the most simple regressions (linear, square, log, perhaps) would be nice accompanying Plots for a fast visual inspection of data.

Ps. I know this is three-lines of code anyway. But I assure you that most people in the world doing linear regressions have no idea of how they are done.

Ps2. Noticed now that this discussion is the “Performance” section, so that was not the case here. But this is one of the threads one reaches searching for a simple way to plot a linear regression.

You can do exactly that already. I will agree that it isn’t as “in your face” or easy to find as it could be though.

using GLM
x = [...]
y = [...]
reg = lm(x,y)
scatter(x,y)
plot!(x,predict(reg),label="R^2=$(r2(reg))")

also Plots.jl or StatsPlots.jl (I can’t remember which one) provides a direct option in the scatter command to put the regression line on the plot for you.

5 Likes

The keyword is smooth = true. However, it doesn’t provide any analytics beyond the curve itself.

1 Like

see also https://github.com/JuliaPlots/StatsPlots.jl/pull/293

1 Like

This (linear regression with confidence interval) is also implemented in AlgebraOfGraphics with the linear analysis.

1 Like

Related to your question:

1 Like

If anyone is interested, i have implemented a faster and also numerically stable method for underdetermined linear systems in this post:

2 Likes

Just to mention that I wrote a simple package to do what I was searching for when I found this thread, which was a very simple interface to do simple fits of 2D data:

1 Like