More details on this are found in, for instance, Golub & Van Loan (3rd ed.). I like the very last sentence of Section 5.3: “At the very minimum, this discussion should convince you how difficult it can be to choose the ‘right’ algorithm!”
Could you point out which packages those are, please?
I have only used
and
in practice.
I use Polynomials.jl and LsqFit.jl. Since I don’t care about speed, I’ve never benchmarked them.
Example:
using Polynomials
p = polyfit(x, y, 1)  # fit a first-order (degree-1) polynomial
coeffs(p)             # coefficients, lowest order first: [intercept, slope]
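For the LsqFit.jl side, here is a minimal sketch of fitting the same straight line with curve_fit; the model function and the starting values are my own illustration, not something from the post above.

using LsqFit

# a straight-line model: p[1] is the intercept, p[2] the slope
line(x, p) = p[1] .+ p[2] .* x

fit = curve_fit(line, x, y, [0.0, 0.0])  # [0.0, 0.0] is an arbitrary starting guess
fit.param                                # estimated [intercept, slope]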
I describe a number of different ways of performing a simple linear regression in https://github.com/dmbates/CopenhagenEcon/blob/master/jmd/03-LinearAlgebra.jmd. As shown in the README.md file for the repository, an effective way of running the .jmd file is to use Weave.convert_doc to create a Jupyter notebook and run that. (By the way, this notebook also shows how to use the R ggplot2 graphics package from Julia through RCall.)
It is difficult to benchmark a simple calculation like this but I suspect that the method of augmenting the model matrix with the response vector and using a Cholesky factorization will be competitive. It has the advantage of also providing the sum of squared residuals without needing to evaluate the residuals.
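A minimal sketch of that augmented-matrix idea (my own illustration, not code from the thread): append y as an extra column of the model matrix, factor the cross-products matrix, and read the sum of squared residuals off the corner of the Cholesky factor.

using LinearAlgebra

function linreg_chol(X::AbstractMatrix, y::AbstractVector)
    A = [X y]                                # model matrix augmented with the response
    R = cholesky!(Symmetric(A'A, :U)).U      # upper-triangular Cholesky factor of [X y]'[X y]
    p = size(X, 2)
    β = UpperTriangular(R[1:p, 1:p]) \ R[1:p, end]  # solves the normal equations
    rss = abs2(R[end, end])                  # sum of squared residuals, no residual vector needed
    return β, rss
end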
What about the following?
using Random
N = 10000
x = rand(N)
X = [ones(N) x]
y = 10 .+ x .* 0.3

function linreg1(y, X)
    β_hat = (X' * X) \ X' * y   # solve via the cross-products matrix X'X
    return β_hat
end

function linreg2(y, X)
    β_hat = X \ y               # let \ pick the factorization (pivoted QR for a tall X)
    return β_hat
end

using GLM
GLM.fit(LinearModel, X, y, true)

using DataFrames, GLM
data = DataFrame(X = x, Y = y)
lm(@formula(Y ~ X), data)

using BenchmarkTools
@benchmark linreg1(y, X)
@benchmark linreg2(y, X)
@benchmark GLM.fit(LinearModel, X, y, true)
@benchmark lm(@formula(Y ~ X), data)
and the results are:
linreg1:
BenchmarkTools.Trial:
memory estimate: 156.78 KiB
allocs estimate: 8
--------------
minimum time: 213.900 μs (0.00% GC)
median time: 289.351 μs (0.00% GC)
mean time: 552.780 μs (4.66% GC)
maximum time: 76.192 ms (0.00% GC)
--------------
samples: 8836
evals/sample: 1
linreg2:
BenchmarkTools.Trial:
memory estimate: 303.69 KiB
allocs estimate: 45
--------------
minimum time: 118.300 μs (0.00% GC)
median time: 170.300 μs (0.00% GC)
mean time: 344.598 μs (17.20% GC)
maximum time: 57.911 ms (0.00% GC)
--------------
samples: 10000
evals/sample: 1
fit():
BenchmarkTools.Trial:
memory estimate: 470.45 KiB
allocs estimate: 26
--------------
minimum time: 133.700 μs (0.00% GC)
median time: 189.299 μs (0.00% GC)
mean time: 428.768 μs (17.15% GC)
maximum time: 45.171 ms (0.00% GC)
--------------
samples: 10000
evals/sample: 1
lm():
BenchmarkTools.Trial:
memory estimate: 1.08 MiB
allocs estimate: 186
--------------
minimum time: 546.600 μs (0.00% GC)
median time: 839.500 μs (0.00% GC)
mean time: 1.917 ms (10.94% GC)
maximum time: 127.190 ms (0.00% GC)
--------------
samples: 2588
evals/sample: 1
First you should decide what you want the function to be. Here you are assuming that the model matrix X has been created externally. I believe the original suggestion was to have two vectors, x and y, as the arguments. In that case it is best to go old school and create X'X and X'y without ever creating X. I would say that this knowledge was from back in the old days except that, astonishingly, you could probably get these formulas from a recent introductory stats text.
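(For reference, the textbook formulas alluded to here are just the sample-covariance form of the slope and intercept; a short sketch of my own, not part of the original post:)

using Statistics

# slope = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²,  intercept = ȳ - slope * x̄
function simple_ols(x, y)
    slope = cov(x, y) / var(x)
    (intercept = mean(y) - slope * mean(x), slope = slope)
end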
julia> linreg1(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat} = [ones(length(x)) x]\y
linreg1 (generic function with 1 method)
julia> linreg1(x, y)
2-element Array{Float64,1}:
0.4986028246024461
0.29993048137652306
julia> @benchmark linreg1($x, $y)
BenchmarkTools.Trial:
memory estimate: 538.31 KiB
allocs estimate: 52
--------------
minimum time: 135.517 μs (0.00% GC)
median time: 140.943 μs (0.00% GC)
mean time: 150.165 μs (3.99% GC)
maximum time: 832.710 μs (78.93% GC)
--------------
samples: 10000
evals/sample: 1
julia> using LinearAlgebra  # for Symmetric, cholesky!, ldiv! and dot

julia> function linreg2(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
           X = [ones(length(x)) x]
           ldiv!(cholesky!(Symmetric(X'X, :U)), X'y)
       end
linreg2 (generic function with 1 method)
julia> linreg2(x, y)
2-element Array{Float64,1}:
0.4986028246024445
0.29993048137652634
julia> @benchmark linreg2($x, $y)
BenchmarkTools.Trial:
memory estimate: 234.94 KiB
allocs estimate: 13
--------------
minimum time: 67.063 μs (0.00% GC)
median time: 70.251 μs (0.00% GC)
mean time: 74.033 μs (3.30% GC)
maximum time: 652.841 μs (86.41% GC)
--------------
samples: 10000
evals/sample: 1
julia> function linreg3(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
           (N = length(x)) == length(y) || throw(DimensionMismatch())
           ldiv!(cholesky!(Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)), [sum(y), dot(x, y)])
       end
linreg3 (generic function with 1 method)
julia> linreg3(x, y)
2-element Array{Float64,1}:
0.49860282460244537
0.299930481376524
julia> @benchmark linreg3($x, $y)
BenchmarkTools.Trial:
memory estimate: 368 bytes
allocs estimate: 9
--------------
minimum time: 10.041 μs (0.00% GC)
median time: 10.260 μs (0.00% GC)
mean time: 10.473 μs (0.00% GC)
maximum time: 105.643 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
If you look closely the big differences are in the amount of storage that is allocated.
Sounds like a pretty strong case for having an efficient linreg function in the Statistics stdlib.
Wouldn’t you always have to create X externally in a for loop?
I’m not sure I understand the question. If you have the vector x and you want X'X and X'y, you don’t need to allocate and populate the N by 2 matrix X. That’s what the section of the code with
Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)
and
[sum(y), dot(x, y)]
is doing.
Once upon a time (I took my first statistics course in 1968) we needed to learn this kind of shortcut because we were doing the calculations by hand. But these days you would need to be doing a very large number of such calculations before saving a few hundred microseconds would be of importance.
And, from the viewpoint of a statistician, there is a lot of information in addition to the coefficient estimates that you would want.
I just learned so much from just those two lines!
Thank you! I didn’t even know there was such a thing…
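Regarding the extra information a statistician would want beyond the coefficient estimates, a small sketch of what that could look like with GLM and DataFrames (my own illustration, not part of the discussion above):

using DataFrames, GLM

df = DataFrame(x = x, y = y)
m = lm(@formula(y ~ x), df)
coeftable(m)   # coefficients with standard errors, t statistics and p values
r2(m)          # coefficient of determination
stderror(m)    # standard errors of the coefficient estimates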
Sorry to come late to this thread, but I’m surprised that no one seems to have given the correct explanation here for the performance differences between the various methods to solve \min_x \Vert Ax-b\Vert_2.

- x̂ = (A'A)\(A'b) is the “normal equations” approach. (You could also use cholesky(Hermitian(A'A))\(A'b) to explicitly take advantage of the fact that A'A is SPD.) This (and variations thereof) is the fastest approach, but it is also the least accurate. In particular, it squares the condition number of A, so it doubles the number of digits you lose to roundoff errors and similar. You should only use this method if you know that you have a well-conditioned A.
- x̂ = A \ b is equivalent to qr(A, Val{true}()) \ b: it uses a pivoted QR factorization. This is slower than the normal-equations approach, but it is much more accurate for badly conditioned matrices because it doesn’t square the condition number.
- x̂ = pinv(A) * b uses the SVD of A to apply the pseudo-inverse. This is the slowest method, but it gives you the most control over accuracy for ill-conditioned matrices because you can specify a tolerance in pinv to regularize the problem by dropping noisy singular values.

QR is the default choice used for A \ b (in Julia and many other systems) because it is reliable without being too slow.
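To make the comparison concrete, here is a small sketch of my own (not from the post) running all three approaches on the same least-squares problem; for a well-conditioned A they agree closely, and the differences only matter as cond(A) grows.

using LinearAlgebra

A = [ones(100) collect(range(0, 1; length = 100))]
b = 2 .+ 3 .* A[:, 2] .+ 0.01 .* randn(100)

x_normal = (A'A) \ (A'b)   # normal equations: fastest, but squares the condition number
x_qr     = A \ b           # pivoted QR: the default, accurate without being too slow
x_pinv   = pinv(A) * b     # SVD / pseudo-inverse: slowest, tolerance-controllable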
Does x’ mean the derivative of x? And is x independent, while y is dependent?
x’ means the transpose of x in this context.
If we were talking about ODEs, it’d probably mean x’(t).
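A quick illustration of the notation (my own, not from the thread):

v = [1.0, 2.0, 3.0]
v'        # 1×3 adjoint, i.e. the transpose for real vectors
v' * v    # the dot product, a scalar (14.0)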
That makes sense.
Sorry to bring this discussion back, but I think that many times the user is just searching for a function which, given two vectors x and y, returns the coefficients of the linear fit and the Pearson correlation coefficient, and perhaps a third vector with the fitted Y values at the same x positions, so that plotting a linear regression over some data is straightforward. Something like:
x = [...]
y = [...]
L = linreg(x,y)
scatter(x,y)
plot!(x,L.Y,label="R^2=$(L.R2)")
Actually, I think it would be nice for one such alternative to exist in the Plots package, perhaps.
I mean, that is what I was searching for just now, and in the end it was easier to just plot what I wanted somewhere else. A suite of the simplest regressions (linear, quadratic, log, perhaps) would be nice to have accompanying Plots for a fast visual inspection of data.
P.S. I know this is three lines of code anyway. But I assure you that most people in the world doing linear regressions have no idea how they are done.
P.S. 2: I noticed just now that this discussion is in the “Performance” section, so that was not the case here. But this is one of the threads one reaches when searching for a simple way to plot a linear regression.
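Just to make the request concrete, here is a rough sketch of such a helper; the linreg name and the .Y / .R2 fields are hypothetical, mirroring the snippet above rather than any existing package API.

using Statistics

function linreg(x, y)
    slope = cov(x, y) / var(x)
    intercept = mean(y) - slope * mean(x)
    Ŷ = intercept .+ slope .* x            # fitted values at the same x positions
    R2 = cor(x, y)^2                        # squared Pearson correlation coefficient
    (a = intercept, b = slope, Y = Ŷ, R2 = R2)
end

With that, L = linreg(x, y) followed by plot!(x, L.Y, label = "R^2=$(L.R2)") would do exactly what the snippet above asks for.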
You can do exactly that already. I will agree that it isn’t as “in your face” or as easy to find as it could be, though.
using GLM
x = [...]
y = [...]
reg = lm([ones(length(x)) x], y)  # lm needs a model matrix; include a column of ones for the intercept
scatter(x,y)
plot!(x,predict(reg),label="R^2=$(r2(reg))")
Also, Plots.jl or StatsPlots.jl (I can’t remember which one) provides a direct option in the scatter command to put the regression line on the plot for you. The keyword is smooth = true. However, it doesn’t provide any analytics beyond the curve itself.
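For example (a one-line sketch, assuming Plots.jl is loaded):

using Plots

scatter(x, y; smooth = true)   # adds a least-squares line to the scatter, but returns no statistics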
This (linear regression with confidence interval) is also implemented in AlgebraOfGraphics with the linear analysis.
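Roughly along these lines (a sketch assuming AlgebraOfGraphics with a Makie backend; the DataFrame and column names are illustrative):

using AlgebraOfGraphics, CairoMakie, DataFrames

df = DataFrame(x = x, y = y)
plt = data(df) * mapping(:x, :y) * (visual(Scatter) + linear())  # linear() adds the fit with a confidence band
draw(plt)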