Matrix formulation of multiple regression

I’ve been studying linear algebra and decided to see what it takes to code a multiple regression model from scratch, to deepen my understanding of some of the things I’ve been learning. I couldn’t really find any Julia examples, so I’m posting this here in case others come searching for one, as I did.

I must say:

  1. I cannot believe how easy it is to do this in Julia (I expected to spend several hours on this task).

  2. I’m a bit disappointed at how easy it is to do this in Julia :laughing: (I’m not sure I actually gained anything from this coding exercise because I literally just had to copy a formula from the excellent resource I was following).

There’s obviously a lot more to regression analysis that’s not included here (e.g. determining the fit of the model, ANOVA, etc.), so I may add to this post at a later date if I code some of those things.

Nonetheless, here it is. The link to the data I used is here.

using DelimitedFiles

# readdlm with header=true returns a (data, header) tuple
data = readdlm(raw"C:\Users\mthel\Downloads\Fish.csv", ',', header=true)

# Design matrix: a column of ones for the intercept, plus predictor columns 3–7
X = Array{Float64, 2}(hcat(ones(size(data[1], 1)), data[1][:, 3:7]))
# Response variable: column 2
Y = convert.(Float64, data[1][:, 2])

# The normal-equations formula: β = (XᵀX)⁻¹XᵀY
β = inv(X'X)*X'*Y

Isn’t that ridiculous? In 5 lines we have our model:

julia> β
6-element Array{Float64,1}:
 -499.5869553569656
   62.35521443224927
   -6.5267524918569135
  -29.026218612712704
   28.29735132227026
   22.473306652229603

# y = -499.59 + 62.36x1 - 6.53x2 - 29.03x3 + 28.3x4 + 22.47x5
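
As a quick taste of the model-fit checks mentioned above, here’s a minimal R² sketch (my names, assuming the X, Y, and β computed above):

Ŷ = X * β                                  # fitted values
SS_res = sum(abs2, Y .- Ŷ)                 # residual sum of squares
SS_tot = sum(abs2, Y .- sum(Y)/length(Y))  # total sum of squares
R² = 1 - SS_res / SS_tot                   # closer to 1 means a better fit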

EDIT:

As @stevengj kindly points out below, β = X \ Y is preferable to β = inv(X'X)*X'*Y.


Don’t do this. This is the textbook “normal equations” formulation of linear least-squares problems, and is often a bad idea in a practical numerical context because it squares the condition number (basically: it doubles the number of digits that you lose due to roundoff or other errors).
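
To see the squaring concretely, here’s a tiny illustration (with a made-up 3×2 design matrix):

using LinearAlgebra

A = [1.0 1.0; 1.0 2.0; 1.0 3.0]
cond(A)       # ≈ 6.79
cond(A' * A)  # ≈ 46.1, i.e. cond(A)^2 — forming X'X doubles the digits lost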

Just do β = X \ Y and it will use a more reliable approach (a pivoted QR factorization).

(If you want to repeat this for multiple Y right hand sides, set QR = qr(X) and then do QR \ Y. In numerics, we rarely compute matrix inverses explicitly.)
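
A sketch of that pattern (Y1 and Y2 stand in for your right-hand sides):

using LinearAlgebra

QR = qr(X)    # factor X once — the expensive step
β1 = QR \ Y1  # each additional solve reuses the factorization
β2 = QR \ Y2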

(If you google “normal equations least squares condition number” you will find lots of references. See also any textbook on numerical linear algebra, e.g. Trefethen and Bau.)


See also Efficient way of doing linear regression - #27 by markushhh


Here’s another approach (because I’m a glutton for punishment :rofl:):

using Random
using SymPy

# Four points on the line y = 2x - 1, plus noise drawn from -1:0.1:1
# (no seed is set, so the outputs below show one particular random draw)
X = 1:4
Y = rand(collect(-1:0.1:1), 4) .+ collect(1:2:8)

# We want the values of β₁ and β₂ that minimize the sum of squared residuals
ŷ(β₁,β₂,x) = β₁ + β₂*x

@vars β₁ β₂ x y real=true

# A function to calculate the squared residual at a given point
S(β₁,β₂,x,y) = (y - (β₁ + x*β₂))^2

# Substitute each observation (xᵢ, yᵢ) into the symbolic squared residual
S_components = [subs(S(β₁,β₂,x,y), x=>xᵢ, y=>yᵢ) for (xᵢ, yᵢ) in zip(X,Y)]

# Expand S based on our 4 observations
S(β₁,β₂) = sum(S_components)

julia> S(β₁,β₂)
(-β₁ - β₂ + 0.4)² + 10.24⋅(-0.3125⋅β₁ - 0.625⋅β₂ + 1)² + 27.04⋅(-0.192307692307692⋅β₁ - 0.576923076923077⋅β₂ + 1)² + 46.24⋅(-0.147058823529412⋅β₁ - 0.588235294117647⋅β₂ + 1)²

# We need the partial derivatives of S; setting them to zero gives "the normal equations"
julia> ∂s∂β₁ = diff(S(β₁,β₂),β₁)
8.0⋅β₁ + 20.0⋅β₂ - 31.2

julia> ∂s∂β₂ = diff(S(β₁,β₂),β₂)
20.0⋅β₁ + 60.0⋅β₂ - 99.2

# Let SymPy solve this system for us
julia> β = solve([∂s∂β₁,∂s∂β₂],[β₁,β₂])
Dict{Any,Any} with 2 entries:
  β₁ => -1.40000000000000
  β₂ => 2.12000000000000

# Check that our answer is right
julia> hcat(ones(length(X)), X) \ Y
2-element Array{Float64,1}:
 -1.3999999999999997
  2.1199999999999997

# Get an estimate at x = 4
julia> ŷ(β[β₁],β[β₂],4)
7.08000000000000