# Efficient way of doing linear regression

``````using Random

N = 10000
x = rand(N)
X = [ones(N) x]
y = 10 .+ x .* 0.3

function linreg1(y, X)
β_hat = (X' * X) \ X' * y
return(β_hat)
end

function linreg2(y, X)
β_hat = X \ y
return(β_hat)
end

using GLM
GLM.fit(LinearModel, X, y, true)

using DataFrames, GLM
data = DataFrame(X = x, Y = y)
lm(@formula(Y ~ X), data)

using BenchmarkTools
@benchmark linreg1(y, X)
@benchmark linreg2(y, X)
@benchmark GLM.fit(LinearModel, X, y, true)
@benchmark lm(@formula(Y ~ X), data)

``````

and the results are

lingerg1:

``````BenchmarkTools.Trial:
memory estimate:  156.78 KiB
allocs estimate:  8
--------------
minimum time:     213.900 μs (0.00% GC)
median time:      289.351 μs (0.00% GC)
mean time:        552.780 μs (4.66% GC)
maximum time:     76.192 ms (0.00% GC)
--------------
samples:          8836
evals/sample:     1
``````

linreg2:

``````BenchmarkTools.Trial:
memory estimate:  303.69 KiB
allocs estimate:  45
--------------
minimum time:     118.300 μs (0.00% GC)
median time:      170.300 μs (0.00% GC)
mean time:        344.598 μs (17.20% GC)
maximum time:     57.911 ms (0.00% GC)
--------------
samples:          10000
evals/sample:     1
``````

fit():

``````BenchmarkTools.Trial:
memory estimate:  470.45 KiB
allocs estimate:  26
--------------
minimum time:     133.700 μs (0.00% GC)
median time:      189.299 μs (0.00% GC)
mean time:        428.768 μs (17.15% GC)
maximum time:     45.171 ms (0.00% GC)
--------------
samples:          10000
evals/sample:     1
``````

lm():

``````BenchmarkTools.Trial:
memory estimate:  1.08 MiB
allocs estimate:  186
--------------
minimum time:     546.600 μs (0.00% GC)
median time:      839.500 μs (0.00% GC)
mean time:        1.917 ms (10.94% GC)
maximum time:     127.190 ms (0.00% GC)
--------------
samples:          2588
evals/sample:     1
``````
2 Likes

First you should decide what you want the function to be. Here you are assuming that the model matrix `X` has been externally created. I believe the original suggestion was to have two vectors, `x` and `y`, as the arguments. In that case it is best to go old school and create `X'X` and `X'y` without ever creating X. I would say that this knowledge was from back in the old days except that, astonishingly, you could probably get these formulas from a recent introductory stats text.

``````julia> linreg1(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat} = [ones(length(x)) x]\y
linreg1 (generic function with 1 method)

julia> linreg1(x, y)
2-element Array{Float64,1}:
0.4986028246024461
0.29993048137652306

julia> @benchmark linreg1(\$x, \$y)
BenchmarkTools.Trial:
memory estimate:  538.31 KiB
allocs estimate:  52
--------------
minimum time:     135.517 μs (0.00% GC)
median time:      140.943 μs (0.00% GC)
mean time:        150.165 μs (3.99% GC)
maximum time:     832.710 μs (78.93% GC)
--------------
samples:          10000
evals/sample:     1

julia> function linreg2(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
X = [ones(length(x)) x]
ldiv!(cholesky!(Symmetric(X'X, :U)), X'y)
end
linreg2 (generic function with 1 method)

julia> linreg2(x, y)
2-element Array{Float64,1}:
0.4986028246024445
0.29993048137652634

julia> @benchmark linreg2(\$x, \$y)
BenchmarkTools.Trial:
memory estimate:  234.94 KiB
allocs estimate:  13
--------------
minimum time:     67.063 μs (0.00% GC)
median time:      70.251 μs (0.00% GC)
mean time:        74.033 μs (3.30% GC)
maximum time:     652.841 μs (86.41% GC)
--------------
samples:          10000
evals/sample:     1

julia> function linreg3(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
(N = length(x)) == length(y) || throw(DimensionMismatch())
ldiv!(cholesky!(Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)), [sum(y), dot(x, y)])
end
linreg3 (generic function with 1 method)

julia> linreg3(x, y)
2-element Array{Float64,1}:
0.49860282460244537
0.299930481376524

julia> @benchmark linreg3(\$x, \$y)
BenchmarkTools.Trial:
memory estimate:  368 bytes
allocs estimate:  9
--------------
minimum time:     10.041 μs (0.00% GC)
median time:      10.260 μs (0.00% GC)
mean time:        10.473 μs (0.00% GC)
maximum time:     105.643 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1
``````

If you look closely the big differences are in the amount of storage that is allocated.

11 Likes

Sounds like a pretty strong case for having an efficient `linreg` function in the Statistics stdlib.

8 Likes

Wouldn’t you always have to create X externally in a for loop?

I’m not sure I understand the question. If you have the vector `x` and you want `X'X` and `X'y` you don’t need to allocate and populate the N by 2 matrix `X`. That’s what the section in the code with

``````Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)
``````

and

``````[sum(y); dot(x, y)]
``````

is doing.

Once upon a time (I took my first statistics course in 1968) we needed to learn this kind of short cut because we were doing the calculations by hand. But these days you would need to be doing a very large number of such calculations before saving a few hundred microseconds would be of importance.

And, from the viewpoint of a statistician, there is a lot of information in addition to the coefficient estimates that you would want.

5 Likes

I just learned so much from just those two lines!
Thank you! I didn’t even know there was such a thing…

Sorry to come late to this thread, but I’m surprised that no one seems to have given the correct explanation here for the performance differences between various methods to solve \min_x \Vert Ax-b\Vert_2.

• `x̂ = (A'A)\(A'b)` is the “normal equations” approach. (You could also use `cholesky(Hermitian(A'A))\(A'b)` to explicitly take advantage of the fact that `A'A` is SPD). This (and variations thereof) is the fastest approach, but it is also the least accurate. In particular, it squares the condition number of `A`, so it doubles the number of digits you lose to roundoff errors and similar. You should only use this method if you know that you have a well-conditioned `A`.

• `x̂ = A \ b` is equivalent to `qr(A, Val{true}()) \ b` — it uses a pivoted QR factorization. This is slower than the normal-equations approach, but it is much more accurate for badly conditioned matrices because it doesn’t square the condition number.

• `x̂ = pinv(A) * b` uses the SVD of `A` to apply the pseudo-inverse. This is the slowest method, but it gives you the most control over accuracy for ill-conditioned matrices because you can specify a tolerance in `pinv` to regularize the problem by dropping noisy singular values.

QR is the default choice used for `A \ b` (in Julia and many other systems) because it is reliable without being too slow.

67 Likes

does x’ mean the derivative of x? and is x independent, while y is dependent?

1 Like

x’ means transpose of x in this context.

If we were talking about odes it’d probably mean x’(t)

3 Likes

that makes sense.

1 Like

Sorry to bring this discussion back. But I think that many times the user is just searching for a function which given two vectors x and y, returns the coefficients of the linear fit and the Pearson correlation coefficient, and perhaps a third vector with the fitted Y values in the same x positions, such that plotting a linear regression over some data is direct. Something like:

``````x = [...]
y = [...]
L = linreg(x,y)
scatter(x,y)
plot!(x,L.Y,label="R^2=\$(L.R2)")
``````

Actually I think one such alternative would be nice to exist in the Plots package perhaps.

I mean, that is what I was searching for now, and it was easier to just plot what I wanted somewhere else at the end. A suite of the most simple regressions (linear, square, log, perhaps) would be nice accompanying Plots for a fast visual inspection of data.

Ps. I know this is three-lines of code anyway. But I assure you that most people in the world doing linear regressions have no idea of how they are done.

Ps2. Noticed now that this discussion is the “Performance” section, so that was not the case here. But this is one of the threads one reaches searching for a simple way to plot a linear regression.

You can do exactly that already. I will agree that it isn’t as “in your face” or easy to find as it could be though.

``````using GLM
x = [...]
y = [...]
reg = lm(x,y)
scatter(x,y)
plot!(x,predict(reg),label="R^2=\$(r2(reg))")
``````

also `Plots.jl` or `StatsPlots.jl` (I can’t remember which one) provides a direct option in the `scatter` command to put the regression line on the plot for you.

8 Likes

The keyword is `smooth = true`. However, it doesn’t provide any analytics beyond the curve itself.

2 Likes
1 Like

This (linear regression with confidence interval) is also implemented in AlgebraOfGraphics with the linear analysis.

3 Likes

1 Like

If anyone is interested, i have implemented a faster and also numerically stable method for underdetermined linear systems in this post:

2 Likes

Just to mention that I wrote a simple package to do what I was searching for when I found this thread, which was a very simple interface to do simple fits of 2D data:

8 Likes

I also needed basic linear regression, a bit beyond X \ y, and didn’t want the full GLM dependency chain, so I tried to incorporate the various suggestions from this thread in a very simple, minimal-dependency package, see GitHub - st--/LinearRegression.jl: Linear regression - in case anyone else who stumbles across this thread may find it useful!

I also put links to various packages for more fancy versions of linear regression (generalized, ridge, sparse, bayesian, online, …) into the readme. 7 Likes

great stuff, I just walked the same path

1 Like