What shape should data be in to perform a linear fit with GLM?

I am trying to fit some data with a variable number of dimensions using a linear fit. I am currently trying to use GLM, but I am happy to use any other library.

I’ve been running into a lot of trouble getting the data into the right shape. This is the shape I’m currently trying to use. The example dataset below has 8 observations with 3 input dimensions each.

using DataFrames, GLM

input = [[587000.0, 16.5, 6.2],[713000.0, 17.2, 4.9],[749000.0, 14.3, 6.4],[7.895e6, 18.1, 6.0],[2.793e6, 19.1, 5.8],[716000.0, 17.9, 6.7],[595000.0, 20.2, 8.4],[3.353e6, 16.9, 6.7]]
output = [11.2, 8.7, 9.6, 14.5, 15.7, 14.9, 21.7, 25.7]

model = lm(@formula(Y ~ X), DataFrame(X=input, Y=output))

However when I print the coefficients of the above, I get this:

julia> coef(lm(@formula(Y ~ X), DataFrame(X=input, Y=output)))
8-element Array{Float64,1}:
 11.199999999999983
 10.50000000000002
 -2.4999999999999827
  3.700000000000017
 -1.5999999999999834
  4.500000000000016
 14.500000000000018
  3.300000000000017

This is quite weird: a dataset with 3 input dimensions should have four coefficients, one for the offset and one for each dimension. It seems like transposing the data is the answer, but then the data frame is unhappy with the dimensions.

# with transposed input
using DataFrames, GLM

input = [[587000.0, 16.5, 6.2],[713000.0, 17.2, 4.9],[749000.0, 14.3, 6.4],[7.895e6, 18.1, 6.0],[2.793e6, 19.1, 5.8],[716000.0, 17.9, 6.7],[595000.0, 20.2, 8.4],[3.353e6, 16.9, 6.7]]
output = [11.2, 8.7, 9.6, 14.5, 15.7, 14.9, 21.7, 25.7]

model = lm(@formula(Y ~ X), DataFrame(X=transpose(input), Y=output))
ERROR: ArgumentError: adding AbstractArray other than AbstractVector as a column of a data frame is not allowed
Stacktrace:
 [1] DataFrame(::Array{Any,1}, ::DataFrames.Index; copycols::Bool) at /home/alice/.julia/packages/DataFrames/yqToF/src/dataframe/dataframe.jl:201
 [2] DataFrame(; kwargs::Base.Iterators.Pairs{Symbol,AbstractArray,Tuple{Symbol,Symbol},NamedTuple{(:X, :Y),Tuple{LinearAlgebra.Transpose{LinearAlgebra.Transpose{Float64,Array{Float64,1}},Array{Array{Float64,1},1}},Array{Float64,1}}}}) at /home/alice/.julia/packages/DataFrames/yqToF/src/dataframe/dataframe.jl:284
 [3] top-level scope at REPL[7]:1
 [4] run_repl(::REPL.AbstractREPL, ::Any) at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288

I have also tried a matrix instead of an array of arrays, but the data frame did not allow that either.

So how do I do this correctly?

Note that the first model, the one with the extra coefficients, also throws the following error when the REPL tries to display it, which seems worrying:

julia> lm(@formula(Y ~ X), DataFrame(X=input, Y=output))
Error showing value of type StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}:
ERROR: ArgumentError: FDist: the condition ν1 > zero(ν1) && ν2 > zero(ν2) is not satisfied.
Stacktrace:
 [1] macro expansion at /home/alice/.julia/packages/Distributions/7Ifyw/src/utils.jl:6 [inlined]
 [2] FDist at /home/alice/.julia/packages/Distributions/7Ifyw/src/univariate/continuous/fdist.jl:30 [inlined]
 [3] FDist at /home/alice/.julia/packages/Distributions/7Ifyw/src/univariate/continuous/fdist.jl:35 [inlined]
 [4] FDist at /home/alice/.julia/packages/Distributions/7Ifyw/src/univariate/continuous/fdist.jl:37 [inlined]
 [5] coeftable(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}; level::Float64) at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:185
 [6] coeftable(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}) at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:182
 [7] coeftable(::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{,Tuple{}}}) at /home/alice/.julia/packages/StatsModels/Oreif/src/statsmodel.jl:173
 [8] coeftable at /home/alice/.julia/packages/StatsModels/Oreif/src/statsmodel.jl:173 [inlined]
 [9] show(::IOContext{REPL.Terminals.TTYTerminal}, ::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}) at /home/alice/.julia/packages/StatsModels/Oreif/src/statsmodel.jl:184
 [10] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}) at ./multimedia.jl:47
 [11] display(::REPL.REPLDisplay, ::Any) at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:218

Additionally, I assume the model will yell at me if the data does not have enough data points to uniquely determine the fit, but in this case I’d really prefer to just get any fit that minimizes the squared error.

I am new to Julia, but I am not new at programming.

using DataFrames, GLM

input = [[587000.0, 16.5, 6.2],[713000.0, 17.2, 4.9],[749000.0, 14.3, 6.4],[7.895e6, 18.1, 6.0],[2.793e6, 19.1, 5.8],[716000.0, 17.9, 6.7],[595000.0, 20.2, 8.4],[3.353e6, 16.9, 6.7]]
output = [11.2, 8.7, 9.6, 14.5, 15.7, 14.9, 21.7, 25.7]

model = lm(@formula(Y ~ X), DataFrame(X=input, Y=output))

I get an error with that, since the DataFrame created has input as a vector of vectors, so lm treats each distinct vector as one level of a categorical variable. That is also why you got 8 coefficients, one per observation: the model fits the data perfectly, leaving zero residual degrees of freedom, which is what breaks the FDist computation when the model is displayed.

You ideally want the following, right?

julia> df = DataFrame();

julia> df.x1 = getindex.(input, 1);

julia> df.x2 = getindex.(input, 2);

julia> df.x3 = getindex.(input, 3);

julia> df.output = output;

julia> lm(@formula(output ~ x1 + x2 + x3), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

output ~ 1 + x1 + x2 + x3

Coefficients:
──────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error      t  Pr(>|t|)     Lower 95%    Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept)  -20.9353      20.0733      -1.04    0.3559  -76.6678      34.7972
x1             6.95998e-7   8.05544e-7   0.86    0.4363   -1.54055e-6   2.93255e-6
x2             0.624308     1.23562      0.51    0.6399   -2.80631      4.05493
x3             3.71514      2.20241      1.69    0.1669   -2.39972      9.83
──────────────────────────────────────────────────────────────────────────────────

You need columns of plain numbers, not columns of vectors, in your data frame.


That seems reasonable, but I have a variable number of dimensions, so I need some way to construct the DataFrame and the @formula without hardcoding the dimensions in the code.

For constructing the data frame you can:

  1. push! a NamedTuple in a loop, so you work with named tuples instead of Vectors in input.
  2. Make a Matrix via mapreduce(permutedims, vcat, input) and construct a DataFrame from that matrix via DataFrame(mat, :auto).
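As a minimal sketch of option 2 in plain Base (reusing the first three observations from above):

```julia
input = [[587000.0, 16.5, 6.2], [713000.0, 17.2, 4.9], [749000.0, 14.3, 6.4]]

# permutedims turns each length-3 Vector into a 1×3 row matrix,
# and vcat stacks those rows into a 3×3 Matrix{Float64}
mat = mapreduce(permutedims, vcat, input)

size(mat)  # (3, 3): one row per observation, one column per input dimension
```

DataFrame(mat, :auto) then auto-names the columns x1, x2, x3, ready for the formula interface.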

For using lm you can:

  1. Use lm(X, y) instead of @formula.
  2. Construct a formula programmatically in the method described here.
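A sketch of option 2: the column names can be generated in plain Base, and the formula-building lines (which assume the term-based API from StatsModels, the package GLM builds its formulas on) are left as comments here:

```julia
dims = 3
xnames = Symbol.("x", 1:dims)  # Symbol concatenates its arguments: [:x1, :x2, :x3]

# With StatsModels loaded, these symbols become a formula equivalent
# to @formula(output ~ x1 + x2 + x3):
#   f = term(:output) ~ sum(term.(xnames))
#   lm(f, df)
```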

Using lm(X, y) instead of @formula seems like it makes my life simpler. What would the types of X and y be in this case?


y is a Vector and X is a Matrix (which can be created with mapreduce(permutedims, vcat, input)).

Just to confirm, the Matrix is also what Array{Float64}(undef, rows, columns) produces? There’s no reason to first construct the vectors in-place and use mapreduce if I can just construct the matrix directly in-place.
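A quick Base-only check confirms this: Matrix{Float64} is literally an alias for Array{Float64,2}, so the matrix can be allocated and filled directly.

```julia
Matrix{Float64} === Array{Float64,2}  # true: Matrix{T} is an alias for Array{T,2}

X = Array{Float64}(undef, 4, 3)  # e.g. 4 observations, 3 columns

# fill it in place, one entry at a time
for i in 1:4, j in 1:3
    X[i, j] = 10.0 * i + j
end
```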

Okay I did something like this:

input = Array{Float64}(undef, num_points, 1 + dims)
output = Array{Float64}(undef, num_points)

# ... write data to input and output

model = lm(input, output)

The extra column (1 + dims rather than dims) is set to 1 for every point, since lm called with a matrix does not otherwise include an offset in the linear fit.
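In sketch form, with made-up small sizes and random stand-in data (only the shape matters here):

```julia
num_points, dims = 4, 2

input = Array{Float64}(undef, num_points, 1 + dims)
input[:, 1] .= 1.0                        # intercept column: all ones
input[:, 2:end] = rand(num_points, dims)  # stand-in for the real predictors
output = rand(num_points)                 # stand-in for the real responses

# model = lm(input, output)   # with GLM loaded: X first, then y
```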

It sometimes failed with this error:

ERROR: PosDefException: matrix is not positive definite; Cholesky factorization failed.
Stacktrace:
 [1] checkpositivedefinite at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/factorization.jl:18 [inlined]
 [2] cholesky!(::LinearAlgebra.Hermitian{Float64,Array{Float64,2}}, ::Val{false}; check::Bool) at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/cholesky.jl:219
 [3] cholesky! at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/cholesky.jl:218 [inlined] (repeats 2 times)
 [4] GLM.DensePredChol(::Array{Float64,2}, ::Bool) at /home/alice/.julia/packages/GLM/dbJFe/src/linpred.jl:107
 [5] cholpred at /home/alice/.julia/packages/GLM/dbJFe/src/linpred.jl:117 [inlined]
 [6] fit(::Type{GLM.LinearModel}, ::Array{Float64,2}, ::Array{Float64,1}, ::Bool; wts::Array{Float64,1}) at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:127
 [7] fit at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:127 [inlined]
 [8] #lm#2 at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:144 [inlined]
 [9] lm at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:144 [inlined] (repeats 2 times)
 [10] multistart_compute_clusters(::Main.DatasetMod.Dataset, ::Int64) at /home/alice/src/VCLR/src/clr.jl:527
 [11] optimize!(::Main.CLRSolver.CLRBranchAndBound) at /home/alice/src/VCLR/src/clr.jl:79
 [12] top-level scope at REPL[13]:1
 [13] run_repl(::REPL.AbstractREPL, ::Any) at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288

This happens in the case I also mentioned in my original post:

Additionally, I assume the model will yell at me if the data does not have enough data points to uniquely determine the fit, but in this case I’d really prefer to just get any fit that minimizes the squared error.

However, it appears to be fixed by passing true for the allowrankdeficient argument:

model = lm(input, output, true)
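For what it's worth, if all that is wanted is some least-squares fit even for a rank-deficient system, plain Base's backslash also does that (for non-square matrices it uses a pivoted QR factorization, which tolerates rank deficiency), no GLM needed:

```julia
# rank-deficient design matrix: the second column is 2 × the first
X = [1.0 2.0; 1.0 2.0; 1.0 2.0]
y = [3.0, 3.0, 3.0]

beta = X \ y   # a least-squares solution; no PosDefException here

X * beta ≈ y   # true: y lies in the column space, so the residual is ~0
```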

Thanks for the help!
