Efficient way of doing linear regression

Hello,

I need an efficient way to perform linear regression because I have to fit several segments inside a big for loop, which takes time. In the past, there was a linreg function, but it has been deprecated. Currently I am using polyfit to do a polynomial fit of order 1 for each segment.

Is there a more efficient way to do this?
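
For context, the per-segment fit currently looks something like this (a sketch assuming Polynomials.jl, where the old polyfit call is now fit):

using Polynomials

xs = collect(1.0:100.0)
ys = 2.0 .* xs .+ 1.0 .+ randn(100)   # illustrative data

p = fit(xs, ys, 1)   # degree-1 fit, i.e. a linear regression
coeffs(p)            # coefficients, lowest order first: [intercept, slope]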

2 Likes

The \ operation is for linear regression. Just build the matrix.

7 Likes

Oh I see. I’ll try it, thanks!

OK, so I didn’t understand this as well as I thought I did. Considering that I have two 1D arrays, and I want to find the best linear fit that gives an element of y from the corresponding element of x, how do I do it with \ ?

You can estimate the coefficients via OLS: β̂ = (x'*x)\(x'*y).

The estimated regression line is ŷ=x*β̂.
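
In code, a minimal sketch of that formula (here x is the design matrix; include a column of ones if you want an intercept):

x = [ones(100) randn(100)]   # design matrix with an intercept column
y = randn(100)

β̂ = (x' * x) \ (x' * y)      # OLS coefficient estimates
ŷ = x * β̂                    # fitted values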

3 Likes

OK, say you have the data points (x1, y1), (x2, y2), and (x3, y3).
In this case you could simply describe it via M*v = d with:

M = [1 x1
     1 x2
     1 x3]
v = [a
     b]
d = [y1
     y2
     y3]

v then contains the coefficients of your regression, so that y = a + b*x.
@fipelle’s answer is a shortcut for that process. v is then obtained with v = M\d.
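
Putting that into runnable form (the data values are illustrative):

x = [1.0, 2.0, 3.0]
y = [2.1, 3.9, 6.2]

M = [ones(length(x)) x]   # column of ones for the intercept a
v = M \ y                 # v == [a, b], so y ≈ a + b*x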

4 Likes

Super clear thanks.

To come back to the original issue, it looks like doing the linear regression this way is slower than with polyfit?

with polyfit:

0.058074 seconds (48.54 k allocations: 2.364 MiB)

with matrix inversion:

0.061431 seconds (65.20 k allocations: 3.137 MiB)

Are you benchmarking in global scope? I wonder because of the allocs. Could type instability play a role here?

What do you mean by global scope?

I think the Performance tips section of the manual answers that question.
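
A sketch of the usual advice from that page: wrap the work in a function and time a second call, so that compilation is excluded:

function fit_segment(x, y)
    M = [ones(length(x)) x]
    return M \ y
end

x = collect(1.0:1_000.0)
y = 2 .* x .+ randn(1_000)

@time fit_segment(x, y)   # first call includes compilation
@time fit_segment(x, y)   # second call measures the actual runtime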

3 Likes

I have recently noticed that b = (x'x)\(x'y) is often twice as fast as b = x\y, for instance when x is 1000×10 and y is 1000×1. So far, I have no good explanation for it.
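
A sketch of that comparison with BenchmarkTools (interpolating the globals with $ to avoid the global-scope penalty):

using BenchmarkTools

x = randn(1000, 10)
y = randn(1000)

@btime $x \ $y                   # pivoted-QR least squares
@btime ($x' * $x) \ ($x' * $y)   # normal equations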

2 Likes

I find it rather dissatisfying that linreg was replaced, with the message of “just do (x'*x)\x'*y”, like “it’s only 5 more characters and much more explicit”. Right, but a small percentage of people might not have the formula for linear regression memorized at all times.

18 Likes

To answer your question: no, I was not running my benchmark in global scope :confused:

Something about making it positive semidefinite so that \ will use Cholesky?
(Don’t quote me on that.)

1 Like

This is perfectly reasonable. In my previous post I was trying to expand on the use of \. If you do not use regressions or modelling very often, it is more convenient to use something like linreg.

However, I think that a technical audience would find it somewhat harder to memorise the name of a function for simple regressions; plus, with small changes in the formula you can also do ridge :slight_smile:
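
For example, a minimal ridge sketch along those lines (X, y, and the penalty λ are illustrative):

using LinearAlgebra

X = randn(100, 5)
y = randn(100)

λ = 0.1                     # illustrative penalty
β̂ = (X'X + λ * I) \ (X'y)   # ridge: penalised normal equations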

I totally agree, except that the function should be called linear_regression or similar, rather than linreg, and should probably be in a package like StatsBase.jl.

11 Likes

Using the normal equations will likely use a Cholesky decomposition, while the \ operator uses a QR decomposition. Cholesky is more efficient, which is why it may be faster; QR decomposition is more stable and precise. There are a few variants of Cholesky-type factorizations implemented in Julia (LL', LDL', Bunch-Kaufman, and their pivoted/sparse variants). Using factorize on a dense symmetric/Hermitian matrix will pick the appropriate one.

using LinearAlgebra

A = randn(1000, 10); b = randn(1000)   # example data
F = Hermitian(A'A) |> factorize        # picks Cholesky for a positive-definite matrix
x = F \ (A'b)                          # solves the normal equations A'A*x = A'b

Cholesky on the n×n normal matrix is O(n^3), while most implementations of QR decomposition (e.g., Householder reflections) on an m×n matrix are O(mn^2), which means that if the system is strongly overdetermined (many more rows than columns in A), then Cholesky will be faster.
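
A sketch contrasting the two routes on a tall matrix (dimensions illustrative):

using LinearAlgebra

A = randn(10_000, 20)   # strongly overdetermined system
b = randn(10_000)

x_qr   = A \ b                              # dense QR route
x_chol = cholesky(Hermitian(A'A)) \ (A'b)   # normal-equations route

x_qr ≈ x_chol   # true when A is well conditioned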

6 Likes

Can you please link the discussion with that message? I am rather surprised that this was suggested; one should never use the textbook OLS formula for actual calculation (except for some special cases, eg 1 covariate), since it is very easy to run into ill-conditioning in practice.

The polyalgorithm for \ picks a very common and reasonable solution, based on the QR decomposition. For “small” problems it is not uncommon to use the SVD; for large datasets, an iterative algorithm is often used.

Since the original problem of this discussion seems to be 1D, just using the specific formula without forming the matrices may be fastest:
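
(A minimal sketch; the function name linreg1d is illustrative.)

using Statistics

function linreg1d(x, y)
    x̄, ȳ = mean(x), mean(y)
    b = sum((x .- x̄) .* (y .- ȳ)) / sum(abs2, x .- x̄)
    a = ȳ - b * x̄
    return a, b
end

a, b = linreg1d(1:10, 2 .* (1:10) .+ 1)   # recovers a ≈ 1, b ≈ 2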

2 Likes

I was paraphrasing, not arguing that someone said exactly that. But that is the gist I get: every time someone asks “why was linreg removed?”, the answer is “just use \”, and then the asker goes “huh”, because of course that ignores the intercept (a plain x\y fits no intercept column).

I don’t think I have seen people asking for linreg for more than a year now, but I have a very different impression: when someone asks about linreg or linear regression in general, they are generally just directed to the appropriate package, eg:

There are now excellent packages for various implementations of linear regression and more general models. They should be the first choice; there is little need for most users to implement OLS in Julia by hand now.

1 Like