Matrix formulation of multiple regression

I’ve been studying linear algebra and I decided I would see what it takes to code a multiple regression model from scratch to deepen my understanding of some of the things I’ve been learning. I couldn’t really find any Julia examples so I’m posting this here in the event that others (as I did) search for an example here.

I must say:

  1. I cannot believe how easy it is to do this in Julia (I expected to spend several hours on this task)

  2. I’m a bit disappointed at how easy it is to do this in Julia :laughing: (I’m not sure I actually gained anything from this coding exercise because I literally just had to copy a formula from the excellent resource I was following.)

There’s obviously a lot more to regression analysis that’s not included here (e.g. determining the fit of the model, ANOVA, etc.) so I may add to this post at a later date if I code some of those things.

Nonetheless, here it is. The link to the data I used is here.

using DelimitedFiles

data = readdlm(raw"C:\Users\mthel\Downloads\Fish.csv", ',', header=true)

X = Array{Float64, 2}(hcat(ones(size(data[1], 1)), data[1][:, 3:7]))  # intercept column of ones + the five measurement columns
Y = convert.(Float64, data[1][:, 2])                                  # response column

β = inv(X'X)X'Y

Isn’t that ridiculous? In 5 lines we have our model:

julia> β
6-element Array{Float64,1}:
 -499.5869553569656
   62.35521443224927
   -6.5267524918569135
  -29.026218612712704
   28.29735132227026
   22.473306652229603

# y = -499.59 + 62.36x1 - 6.53x2 - 29.03x3 + 28.3x4 + 22.47x5
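As a taste of the model-fit checks mentioned at the top, R² falls straight out of the fitted coefficients. Here’s a minimal sketch with made-up data standing in for the fish measurements (the names X, Y, β just mirror the roles above):

```julia
# A sketch of computing R² for a fitted model, with made-up data
# standing in for the fish measurements (same roles as X, Y, β above).
X = hcat(ones(10), collect(1.0:10.0), collect(1.0:10.0).^2)          # intercept + 2 predictors
Y = 5.0 .+ 2.0 .* (1:10) .- 0.3 .* (1:10).^2 .+ 0.01 .* sin.(1:10)   # nearly exact quadratic response
β = X \ Y

Ŷ      = X * β                                # fitted values
SS_res = sum((Y .- Ŷ).^2)                     # residual sum of squares
SS_tot = sum((Y .- sum(Y)/length(Y)).^2)      # total sum of squares
R²     = 1 - SS_res/SS_tot                    # close to 1 for a good fit
```

With data this close to the model, R² comes out essentially 1; real data like the fish measurements would land somewhere below that.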


As @stevengj kindly points out, β = X \ Y is preferable to β = inv(X'X)X'Y


Don’t do this. This is the textbook “normal equations” formulation of linear least-squares problems, and is often a bad idea in a practical numerical context because it squares the condition number (basically: it doubles the number of digits that you lose due to roundoff or other errors).

Just do β = X \ Y and it will use a more reliable approach.

(If you want to repeat this for multiple right-hand sides Y, set QR = qr(X) and then do QR \ Y. In numerics, we rarely compute matrix inverses explicitly.)

(If you google “normal equations least squares condition number” you will find lots of references. See also any textbook on numerical linear algebra, e.g. Trefethen and Bau.)
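The QR reuse suggested above looks like this in practice (a sketch with random stand-in data):

```julia
# A sketch of reusing one QR factorization for several right-hand sides.
using LinearAlgebra, Random

Random.seed!(1)                        # reproducible toy data
X  = hcat(ones(100), randn(100, 3))    # intercept + 3 predictors, 100 rows
Y1 = randn(100)                        # first right-hand side
Y2 = randn(100)                        # second right-hand side

F  = qr(X)     # factor X once...
β1 = F \ Y1    # ...then each additional solve reuses the factorization
β2 = F \ Y2

# Both agree with the one-shot solves:
β1 ≈ X \ Y1 && β2 ≈ X \ Y2
```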


See also Efficient way of doing linear regression


Here’s another approach (because I’m a glutton for punishment :rofl:):

using Random
using SymPy

X = 1:4
Y = rand(collect(-1:0.1:1), 4) .+ collect(1:2:8)

# We want to find values for β₁ and β₂
ŷ(β₁,β₂,x) = β₁ + β₂*x

@vars β₁ β₂ x y real=true

# A function to calculate the residual at a given point
S(β₁,β₂,x,y) = (y - (β₁ + x*β₂))^2

S_components = [subs(S(β₁,β₂,x,y), x=>xᵢ, y=>yᵢ) for (xᵢ, yᵢ) in zip(X, Y)]

# Expand S based on our 4 observations
S(β₁,β₂) = S_components[1] + S_components[2] + S_components[3] + S_components[4]

julia> S(β₁,β₂)
(-β₁ - β₂ + 0.4)² + 10.24⋅(-0.3125⋅β₁ - 0.625⋅β₂ + 1)² + 27.04⋅(-0.192307692307692⋅β₁ - 0.576923076923077⋅β₂ + 1)² + 46.24⋅(-0.147058823529412⋅β₁ - 0.588235…

# Setting the partial derivatives to zero gives "the normal equations"
julia> ∂s∂β₁ = diff(S(β₁,β₂),β₁)
8.0⋅β₁ + 20.0⋅β₂ - 31.2

julia> ∂s∂β₂ = diff(S(β₁,β₂),β₂)
20.0⋅β₁ + 60.0⋅β₂ - 99.2

# Let SymPy solve this system for us
julia> β = solve([∂s∂β₁,∂s∂β₂],[β₁,β₂])
Dict{Any,Any} with 2 entries:
  β₁ => -1.40000000000000
  β₂ => 2.12000000000000

# Check that our answer is right
julia> hcat(ones(length(X)), X) \ Y
2-element Array{Float64,1}:
 -1.4
  2.12
# Get an estimate at x = 4
julia> ŷ(β[β₁],β[β₂],4)
7.08000000000000
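For comparison, here’s the same straight-line fit done numerically via the classic closed-form slope/intercept formulas (with fixed stand-in values, since the Y above is random):

```julia
# Closed-form simple linear regression, mirroring what SymPy derived above.
X = [1.0, 2.0, 3.0, 4.0]
Y = [1.2, 2.9, 5.1, 7.0]   # fixed stand-ins for the random Y above

x̄  = sum(X) / length(X)
ȳ  = sum(Y) / length(Y)
β₂ = sum((X .- x̄) .* (Y .- ȳ)) / sum((X .- x̄).^2)  # slope
β₁ = ȳ - β₂ * x̄                                     # intercept

(β₁, β₂)  # ≈ (-0.85, 1.96)
```

These are exactly the values that hcat(ones(4), X) \ Y produces for the same data.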