Matrix formulation of multiple regression

I’ve been studying linear algebra and I decided I would see what it takes to code a multiple regression model from scratch to deepen my understanding of some of the things I’ve been learning. I couldn’t really find any Julia examples so I’m posting this here in the event that others (as I did) search for an example here.

I must say:

  1. I cannot believe how easy it is to do this in Julia (I expected to spend several hours on this task)

  2. I’m a bit disappointed at how easy it is to do this in Julia :laughing: (I’m not sure I actually gained anything from this coding exercise because I literally just had to copy a formula from the excellent resource I was following.

There’s obviously a lot more to regression analysis that’s not included here (e.g. determining the fit of the model, ANOVA, etc.) so I may add to this post at a later date if I code some of those things.

Nonetheless, here it is. The link to the data I used is here.

using DelimitedFiles

data = readdlm(raw"C:\Users\mthel\Downloads\Fish.csv", ',', header=true)

X = Array{Float64, 2}(hcat(ones(size(data[1], 1)), data[1][:, 3:7]))
Y = convert.(Float64, data[1][:, 2])

β = inv(X'X)X'Y

Isn’t that ridiculous? In 5 lines we have our model:

julia> β
1-element Array{Array{Float64,1},1}:
 [-499.5869553569656, 62.35521443224927, -6.5267524918569135, -29.026218612712704, 28.29735132227026, 22.473306652229603]

# y = -499.59 + 62.36x1 - 6.53x2 - 29.03x3 + 28.3x4 + 22.47x5

EDIT:

As @stevengj kindly points out, β = X \ Y is preferable to β = inv(X'X)X'Y

Don’t do this. This is the textbook “normal equations” formulation of linear least-squares problems, and is often a bad idea in a practical numerical context because it squares the condition number (basically: it doubles the number of digits that you lose due to roundoff or other errors).

Just do β = X \ Y and it will use a more reliable approach.

(If you want to repeat this for multiple Y right hand sides, set QR = qr(X) and then do QR \ Y. In numerics, we rarely compute matrix inverses explicitly.)

(If you google “normal equations least squares condition number” you will find lots of references. See also any textbook on numerical linear algebra, e.g. Trefethen and Bau.)

3 Likes

See also Efficient way of doing linear regression

1 Like