The simplest linear fit with GLM

EDIT: In the spirit of TL;DR, here is what you need to make a simple linear fit:

julia> X=[1,2,3]; Y=[2,4,7]; data = (;X, Y)
(X = [1, 2, 3], Y = [2, 4, 7])

julia> lm(@formula(Y~X), data)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Y ~ 1 + X

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.666667    0.62361   -1.07    0.4788   -8.59038    7.25704
X             2.5         0.288675   8.66    0.0732   -1.16797    6.16797
─────────────────────────────────────────────────────────────────────────

Original post below:

I have tried to copy the first example from the documentation:

But without contructing a DataFrame, which seems unnecessary. My attempt is

julia> using GLM

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> y = [2, 4, 7]
3-element Vector{Int64}:
 2
 4
 7

julia> lm(x, y)
ERROR: MethodError: no method matching fit(::Type{LinearModel}, ::Vector{Int64}, ::Vector{Int64}, ::Nothing)
Closest candidates are:
  fit(::StatisticalModel, ::Any...) at C:\Users\densb\.julia\packages\StatsBase\Q76Ni\src\statmodels.jl:178
  fit(::Type{LinearModel}, ::AbstractMatrix{var"#s49"} where var"#s49"<:Real, ::AbstractVector{var"#s50"} where var"#s50"<:Real, ::Union{Nothing, Bool}; wts, dropcollinear) at C:\Users\densb\.julia\packages\GLM\5CcRd\src\lm.jl:156
  fit(::Type{T}, ::FormulaTerm, ::Any, ::Any...; contrasts, kwargs...) where T<:RegressionModel at C:\Users\densb\.julia\packages\StatsModels\m1jYD\src\statsmodel.jl:78
  ...
Stacktrace:
 [1] lm(X::Vector{Int64}, y::Vector{Int64}, allowrankdeficient_dep::Nothing; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ GLM C:\Users\densb\.julia\packages\GLM\5CcRd\src\lm.jl:179
 [2] lm(X::Vector{Int64}, y::Vector{Int64}, allowrankdeficient_dep::Nothing) (repeats 2 times)      
   @ GLM C:\Users\densb\.julia\packages\GLM\5CcRd\src\lm.jl:179
 [3] top-level scope
   @ REPL[12]:1

The problem seems to be that my input x is a 1D vector, instead of a 2D matrix. But I have one input for each output, and that is also valid, right? In fact, is this not the canonical example of a linear relationship? So why does this not work? Thanks.

And also, the GLM docs should have a section on making fits without dataframes - at least an example or two. I would be happy to make it once I figure out the basic functionality myself :sweat_smile:

I’m not sure there’s any deeper reasoning here - a Vector{Float64} is not a Matrix{Float64}, and fit only has a method for (::Matrix, ::Vector).

In practice I suppose using GLM without DataFrames with the lm function is pretty niche, and running univariate regressions is even more niche, which is why the method is missing. The usage of lm is documented here:

https://juliastats.org/GLM.jl/stable/api/#GLM.lm

In the second method, X must be a matrix holding values of the independent variable(s) in columns (including if appropriate the intercept)

If you feel strongly about this I guess you can make a PR to the Examples section of the docs including an example for the non-DataFrames usage.

Right, but the functionality is implemented with lm(@formula(y ~ x), data::DataFrame). So the question is more, why is it only defined for a Tables.jl data-format? It should be dead-simple to implement the method for lm(x::Vector, y::Vector), right?

Well of course using GLM without DataFrames is niche - the usage is not documented!
Not sure why using lm is relevant - it simply calls fit(LinearModel), right?

I would argue that univariate regression is the most normal, but that is simply a matter of different day-to-day encounters with regression. To me, fitting some mathematical equation to relate two physical sizes, say voltage and current, is all I do.

I would also say that by not prioritizing support for simple usage (univariate, no DataFrame), one creates the need for packages like EasyFit.jl, instead of having a single package that can do it all. If there is one thing the Julia ecosystem does not need, it is more splitting into packages that do similar things slightly differently. And it seems like GLM is more than capable, which is why I think it is a shame - GLM would simply need a little more catering towards simple usage, to make it the GOAT of linear fitting in the Julia ecosystem.

(I am rather new to linear fitting, so take my statements on the state of linear fitting in the Julia ecosystem with a grain of salt. But this is the way it currently seems to me.)

I’d say at a minimum univariate regression without intercept is something fairly non-standard in the world of regression modelling, and you might as well just do x\y without using any packages if that’s all you’re after? Or alternatively read this very detailed thread on all the different ways to do linear regression in Julia.

I’m not sure there’s a particular reason why all methods in GLM require a matrix for the covariates (other than I assume this is by far the most common use case for people using GLM, and even for a univariate regression people will generally hcat a vector of ones to x to include an intercept, creating a matrix), but I’m also not sure it needs new methods when you can just do lm(reshape(x, length(x), 1) if you absolutely want to use lm to do x\y. But I guess you could try a PR that defines lm(x::AbstractVector, y::AbstractVector) = lm(reshape(x, length(x), 1), y) and see what the maintainers think.

EDIT - the code is updated to include through_origin

Thank you - using hcat with a bunch of ones was what I was looking for.

I have read the thread you referenced before making this post. I must have been unclear - I would like to have an intercept (y = ax+b, as opposed to y = ax). I ran into the need to hcat a vector of ones in my class yesterday, where Matlab is used. I have to admit, that I don’t understand much of why, but it is coherent that the same has to be done in the Julia implementation of the same functionality.

Defining

using GLM
import GLM: fit, lm
function fit(::Type{LinearModel}, X::AbstractVector, y::AbstractVector, through_origin::Bool=false)
    if through_origin
        return fit(LinearModel, reshape(x, length(x), 1), y)
    else
        return fit(LinearModel, hcat(X, ones(eltype(X), length(X))), y)
    end
end
lm(X::AbstractVector, y::AbstractVector, through_origin::Bool=false) = fit(LinearModel, X, y, through_origin)

is plenty sufficient for what I was looking for, and it automates the need to hcat zeros, which Matlab was sorely lacking. I will go ahead and make a PR to GLM - thank you for pointing me in the right direction, and thanks for your time.

Here is the PR for anyone interested:

Ah sorry, I misunderstood you then - if you want a fit with an intercept, you have to do hcat anyway and there’s no need for an extra method, better to be explicit.

I don’t think this PR makes a lot of sense - it sneakily adds an intercept to a model which the user might expect not to have an intercept (as the current lm interface requires explicit addition of the ones), and makes it cumbersome to fit models without intercept (I guess one would have to hcat zeros here to get it to work?)

1 Like

I have now made a PR where one does not need to hcat anything. It makes sense to me, because I don’t really understand what the vector of zeros or ones really means , so that is only confuses me. So I find it very beneficial to be relieved of that mental overhead, and I think others (I know my fellow undergrad engineers taking Statistics for one) would also appriciate this.

So I don’t understand why things were the way they were, because I don’t have the knowlegde yet. But I still want to use GLM for thing that I understand, like fitting a line to two vectors of data, and then learn more about GLM over time. This is opposed to using another package, like EasyFit, and then having to change package as I get more competent. I really think that this PR makes GLM more usable to newcomers to statistics and linear fitting, such as myself.

I think that the argument through_origin, which I just added, fixes the “sneaky”-ness you are concerned with. I feel that the PR currently adresses what I was missing, without interfering with the functionality that was there in the first place - Please let me know if you have any other issues with the it.

Okay I’m not sure this forum is necessarily the best place to learn introductory regression modelling (for this I would recommend either “Introductory Econometrics” by Wooldrigde or Gelman et al’s latest “Regression and other stories”), but at least the zero one part should be straightforward:

Take the model

y_i = \alpha + \beta x_i

which includes a slope and intercept coefficient. You see that the second (slope) coefficient is multiplied by x_i, which is - when applied to a series of observations stacked in a vector - your x vector. The first (intercept) coefficient is multiplied (implicitly) by one - this is where the vector of ones comes from. In matrix notation if you have two observations y_1 and y_2, with corresponding covariates x_1 and x_2, your system of equations would be:

\begin{bmatrix}y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}

which if you multiply out the right hand side gives:

\begin{bmatrix}y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \alpha + \beta x_1 \\ \alpha + \beta x_2 \end{bmatrix}

Hopefully that clarifies why the X matrix has a column of ones as the first column if you’re fitting a model with a constant.

I think GLM is consistent and well-documented in what it does, and see no reason for changing anything, but reasonably people can disagree on this. If you often find yourself using the non-@formula way of fitting models in GLM but want and intercept, you could just have the function in your PR as your own lm_with_intercept function.

3 Likes

Thank you so much for the explanaition - it helps a lot.
Perhaps simply adding documentation for someone like me would suffice - I will give it a shot.

FWIW, the following are 2 shorter ways for converting an input vector to lm() into a matrix:

reshape(x,:,1)
view(x,:,:)
1 Like

I would argue the simplest way to do this is a NamedTuple of Vectors

julia> using GLM

julia> x = rand(10); y = rand(10);

julia> lm(@formula(y ~ x), (;x, y))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.554798     0.188588   2.94    0.0187   0.119913   0.989684
x            -0.0331544    0.340483  -0.10    0.9248  -0.81831    0.752001
──────────────────────────────────────────────────────────────────────────
5 Likes

Billiant! I think this is a very elegant solution, without depending on creating a DataFrame.

I have made #454, adding this to the documentation, and closed #453. 453 seems to be doing a little too much to make GLM work for me, and I think documenting this (as is done in 454) would have made me happy as a first time user, without compromising GLM at all.

Do you know how one would force the fit through (0, 0) with this interface? Or would one have to use the lm(X::Matrix, Y::Vector) interface, with the first column of X being zeros, to do that?

I found it, from looking at Modeling tabular data · StatsModels.jl.
For others, it is done by lm(@formula(Y~0+X), data)

EDIT: I have added this to PR #454. I am very happy with how it is turning out - I have learned a lot, and I think the docs (if the PR is merged) will be exacly what I would have needed when I first arrived at the GLM docs.