Efficient way of doing linear regression

First you should decide what you want the function to be. Here you are assuming that the model matrix X has been externally created. I believe the original suggestion was to have two vectors, x and y, as the arguments. In that case it is best to go old school and create X'X and X'y without ever creating X. I would say that this knowledge was from back in the old days except that, astonishingly, you could probably get these formulas from a recent introductory stats text.

julia> linreg1(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat} = [ones(length(x)) x]\y
linreg1 (generic function with 1 method)

julia> linreg1(x, y)
2-element Array{Float64,1}:
 0.4986028246024461
 0.29993048137652306

julia> @benchmark linreg1($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  538.31 KiB
  allocs estimate:  52
  --------------
  minimum time:     135.517 μs (0.00% GC)
  median time:      140.943 μs (0.00% GC)
  mean time:        150.165 μs (3.99% GC)
  maximum time:     832.710 μs (78.93% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> function linreg2(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
           X = [ones(length(x)) x]
           ldiv!(cholesky!(Symmetric(X'X, :U)), X'y)
       end
linreg2 (generic function with 1 method)

julia> linreg2(x, y)
2-element Array{Float64,1}:
 0.4986028246024445
 0.29993048137652634

julia> @benchmark linreg2($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  234.94 KiB
  allocs estimate:  13
  --------------
  minimum time:     67.063 μs (0.00% GC)
  median time:      70.251 μs (0.00% GC)
  mean time:        74.033 μs (3.30% GC)
  maximum time:     652.841 μs (86.41% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> function linreg3(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
           (N = length(x)) == length(y) || throw(DimensionMismatch())
           ldiv!(cholesky!(Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)), [sum(y), dot(x, y)])
       end
linreg3 (generic function with 1 method)

julia> linreg3(x, y)
2-element Array{Float64,1}:
 0.49860282460244537
 0.299930481376524

julia> @benchmark linreg3($x, $y)
BenchmarkTools.Trial: 
  memory estimate:  368 bytes
  allocs estimate:  9
  --------------
  minimum time:     10.041 μs (0.00% GC)
  median time:      10.260 μs (0.00% GC)
  mean time:        10.473 μs (0.00% GC)
  maximum time:     105.643 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

If you look closely the big differences are in the amount of storage that is allocated.

12 Likes