Efficient way of doing linear regression

Hello,

I need an efficient way to perform linear regression because I have to fit several segments inside a big for loop, which takes time. In the past, there was a linreg function, but it has been deprecated. Currently I am using polyfit to do a polynomial fit of order 1 for each segment.

Is there a more efficient way to do this?
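
For context, the per-segment fit currently looks something like this (a sketch assuming Polynomials.jl, where the old polyfit call is now fit):

using Polynomials

xs = collect(1.0:100.0)
ys = 2.0 .* xs .+ 1.0 .+ randn(100)   # illustrative data

p = fit(xs, ys, 1)   # degree-1 fit, i.e. a linear regression
coeffs(p)            # coefficients, lowest order first: [intercept, slope]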

2 Likes

The \ operation is for linear regression. Just build the matrix.

7 Likes

Oh I see. I’ll try it, thanks!

OK, so I didn’t understand this as well as I thought I did. Considering that I have two 1D arrays, and I want to find the best linear fit that gives an element of y from the corresponding element of x, how do I do it with \ ?

You can estimate the coefficients via OLS: β̂ = (x'*x)\(x'*y).

The estimated regression line is ŷ=x*β̂.
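
In code, a minimal sketch of that formula (here x is the design matrix; include a column of ones if you want an intercept):

x = [ones(100) randn(100)]   # design matrix with an intercept column
y = randn(100)

β̂ = (x' * x) \ (x' * y)      # OLS coefficient estimates
ŷ = x * β̂                    # fitted values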

3 Likes

OK, say you have the data points (x1, y1), (x2, y2), and (x3, y3).
In this case you could simply describe it via M*v = d with:

M = [1 x1
     1 x2
     1 x3]
v = [a
     b]
d = [y1
     y2
     y3]

v then contains the coefficients of your regression, so that y = a + b*x.
@fipelle’s answer is a shortcut for that process. v is then obtained with v = M\d.
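
Putting that into runnable form (the data values are illustrative):

x = [1.0, 2.0, 3.0]
y = [2.1, 3.9, 6.2]

M = [ones(length(x)) x]   # column of ones for the intercept a
v = M \ y                 # v == [a, b], so y ≈ a + b*x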

4 Likes

Super clear thanks.

To come back to the original issue, it looks like doing the linear regression this way is slower than with polyfit?

with polyfit:

0.058074 seconds (48.54 k allocations: 2.364 MiB)

with matrix inversion:

0.061431 seconds (65.20 k allocations: 3.137 MiB)

Are you benchmarking in global scope? I wonder because of the allocs. Could type instability play a role here?

What do you mean by global scope?

I think the Performance tips section of the manual answers that question.
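
A sketch of the usual advice from that page: wrap the work in a function and time a second call, so that compilation is excluded:

function fit_segment(x, y)
    M = [ones(length(x)) x]
    return M \ y
end

x = collect(1.0:1_000.0)
y = 2 .* x .+ randn(1_000)

@time fit_segment(x, y)   # first call includes compilation
@time fit_segment(x, y)   # second call measures the actual runtime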

3 Likes

I have recently noticed that b = (x'x)\(x'y) is often twice as fast as b = x\y, for instance when x is 1000×10 and y is 1000×1. So far, I have no good explanation for it.
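
A sketch of that comparison with BenchmarkTools (interpolating the globals with $ to avoid the global-scope penalty):

using BenchmarkTools

x = randn(1000, 10)
y = randn(1000)

@btime $x \ $y                   # pivoted-QR least squares
@btime ($x' * $x) \ ($x' * $y)   # normal equations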

2 Likes

I find it rather dissatisfying that linreg was replaced, with the message of “just do (x'*x)\x'*y”, like “it’s only 5 more characters and much more explicit”. Right, but a small percentage of people might not have the formula for linear regression memorized at all times.

18 Likes

To answer your question: no, I was not running my benchmark in global scope :confused:

Something about making it positive semidefinite so that \ will use Cholesky?
(Don’t quote me on that.)

1 Like

This is perfectly reasonable. In my previous post I was trying to expand on the use of \. If you do not use regressions or modelling very often, it is more convenient to use something like linreg.

However, I think that a technical audience would find it somewhat harder to memorise the name of a function for simple regressions; plus, with small changes in the formula you can also do ridge :slight_smile:
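
For example, a minimal ridge sketch along those lines (X, y, and the penalty λ are illustrative):

using LinearAlgebra

X = randn(100, 5)
y = randn(100)

λ = 0.1                     # illustrative penalty
β̂ = (X'X + λ * I) \ (X'y)   # ridge: penalised normal equations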

I totally agree, except that the function should be called linear_regression or similar, rather than linreg, and should probably be in a package like StatsBase.jl.

11 Likes

Using the normal equations will likely use a Cholesky decomposition, while the \ operator uses a QR decomposition. Cholesky is more efficient, which is why it may be faster; QR decomposition is more stable and precise. There are a few variants of Cholesky-type factorizations implemented in Julia (LL', LDL', Bunch-Kaufman, and their pivoted/sparse variants). Using factorize on a dense symmetric/Hermitian matrix will pick the appropriate one.

using LinearAlgebra

A = randn(1000, 10); b = randn(1000)   # example data
F = Hermitian(A'A) |> factorize        # picks Cholesky for a positive-definite matrix
x = F \ (A'b)                          # solves the normal equations A'A*x = A'b

Cholesky on the n×n normal matrix is O(n^3), while most implementations of QR decomposition (e.g., Householder reflections) on an m×n matrix are O(mn^2), which means that if the system is strongly overdetermined (many more rows than columns in A), then Cholesky will be faster.
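
A sketch contrasting the two routes on a tall matrix (dimensions illustrative):

using LinearAlgebra

A = randn(10_000, 20)   # strongly overdetermined system
b = randn(10_000)

x_qr   = A \ b                              # dense QR route
x_chol = cholesky(Hermitian(A'A)) \ (A'b)   # normal-equations route

x_qr ≈ x_chol   # true when A is well conditioned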

6 Likes

Can you please link the discussion with that message? I am rather surprised that this was suggested; one should never use the textbook OLS formula for actual calculation (except for some special cases, eg 1 covariate), since it is very easy to run into ill-conditioning in practice.

The polyalgorithm for \ picks a very common and reasonable solution, based on the QR decomposition. For “small” problems it is not uncommon to use the SVD; for large datasets, an iterative algorithm is often used.

Since the original problem of this discussion seems to be 1D, just using the specific formula without forming the matrices may be fastest:
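
(A minimal sketch; the function name linreg1d is illustrative.)

using Statistics

function linreg1d(x, y)
    x̄, ȳ = mean(x), mean(y)
    b = sum((x .- x̄) .* (y .- ȳ)) / sum(abs2, x .- x̄)
    a = ȳ - b * x̄
    return a, b
end

a, b = linreg1d(1:10, 2 .* (1:10) .+ 1)   # recovers a ≈ 1, b ≈ 2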

2 Likes

I was paraphrasing, not arguing that someone said exactly that. But that is the gist I get: every time someone asks “why was linreg removed?”, the answer is “just use \”, and then the asker goes “huh”, because of course that ignores the intercept (a plain x\y fits no intercept column).

I don’t think I have seen people asking for linreg for more than a year now, but I have a very different impression: when someone asks about linreg or linear regression in general, they are generally just directed to the appropriate package, eg:

There are now excellent packages for various implementations of linear regression and more general models. They should be the first choice; there is little need for most users to implement OLS in Julia by hand now.

1 Like