What shape should data be in to perform a linear fit with GLM?

I am trying to fit some data with a variable number of dimensions using a linear fit. I am currently trying to use GLM, but I am happy to use any other library.

I’ve been running into a lot of trouble getting the data into the right shape. This is the shape I’m currently trying to use. The example dataset below has 8 observations with 3 input dimensions each.

using DataFrames, GLM

input = [[587000.0, 16.5, 6.2],[713000.0, 17.2, 4.9],[749000.0, 14.3, 6.4],[7.895e6, 18.1, 6.0],[2.793e6, 19.1, 5.8],[716000.0, 17.9, 6.7],[595000.0, 20.2, 8.4],[3.353e6, 16.9, 6.7]]
output = [11.2, 8.7, 9.6, 14.5, 15.7, 14.9, 21.7, 25.7]

model = lm(@formula(Y ~ X), DataFrame(X=input, Y=output))

However when I print the coefficients of the above, I get this:

julia> coef(lm(@formula(Y ~ X), DataFrame(X=input, Y=output)))
8-element Array{Float64,1}:
 11.199999999999983
 10.50000000000002
 -2.4999999999999827
  3.700000000000017
 -1.5999999999999834
  4.500000000000016
 14.500000000000018
  3.300000000000017

This is quite weird: a dataset with 3 input dimensions should have four coefficients, one for the offset and one for each dimension. It seems like transposing the data is the answer, but then the data frame is unhappy with the dimensions.

# with transposed input
using DataFrames, GLM

input = [[587000.0, 16.5, 6.2],[713000.0, 17.2, 4.9],[749000.0, 14.3, 6.4],[7.895e6, 18.1, 6.0],[2.793e6, 19.1, 5.8],[716000.0, 17.9, 6.7],[595000.0, 20.2, 8.4],[3.353e6, 16.9, 6.7]]
output = [11.2, 8.7, 9.6, 14.5, 15.7, 14.9, 21.7, 25.7]

model = lm(@formula(Y ~ X), DataFrame(X=transpose(input), Y=output))
ERROR: ArgumentError: adding AbstractArray other than AbstractVector as a column of a data frame is not allowed
Stacktrace:
 [1] DataFrame(::Array{Any,1}, ::DataFrames.Index; copycols::Bool) at /home/alice/.julia/packages/DataFrames/yqToF/src/dataframe/dataframe.jl:201
 [2] DataFrame(; kwargs::Base.Iterators.Pairs{Symbol,AbstractArray,Tuple{Symbol,Symbol},NamedTuple{(:X, :Y),Tuple{LinearAlgebra.Transpose{LinearAlgebra.Transpose{Float64,Array{Float64,1}},Array{Array{Float64,1},1}},Array{Float64,1}}}}) at /home/alice/.julia/packages/DataFrames/yqToF/src/dataframe/dataframe.jl:284
 [3] top-level scope at REPL[7]:1
 [4] run_repl(::REPL.AbstractREPL, ::Any) at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288

I have also tried a matrix instead of an array of arrays, but the data frame did not allow that either.

So how do I do this correctly?

Note that the first model, the one with the extra coefficients, also throws the following error when the REPL tries to display it, which seems worrying:

julia> lm(@formula(Y ~ X), DataFrame(X=input, Y=output))
Error showing value of type StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}:
ERROR: ArgumentError: FDist: the condition ν1 > zero(ν1) && ν2 > zero(ν2) is not satisfied.
Stacktrace:
 [1] macro expansion at /home/alice/.julia/packages/Distributions/7Ifyw/src/utils.jl:6 [inlined]
 [2] FDist at /home/alice/.julia/packages/Distributions/7Ifyw/src/univariate/continuous/fdist.jl:30 [inlined]
 [3] FDist at /home/alice/.julia/packages/Distributions/7Ifyw/src/univariate/continuous/fdist.jl:35 [inlined]
 [4] FDist at /home/alice/.julia/packages/Distributions/7Ifyw/src/univariate/continuous/fdist.jl:37 [inlined]
 [5] coeftable(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}; level::Float64) at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:185
 [6] coeftable(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}) at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:182
 [7] coeftable(::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{,Tuple{}}}) at /home/alice/.julia/packages/StatsModels/Oreif/src/statsmodel.jl:173
 [8] coeftable at /home/alice/.julia/packages/StatsModels/Oreif/src/statsmodel.jl:173 [inlined]
 [9] show(::IOContext{REPL.Terminals.TTYTerminal}, ::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}) at /home/alice/.julia/packages/StatsModels/Oreif/src/statsmodel.jl:184
 [10] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}) at ./multimedia.jl:47
 [11] display(::REPL.REPLDisplay, ::Any) at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:218

Additionally, I assume the model will yell at me if the data does not have enough data points to uniquely determine the fit, but in this case I’d really prefer to just get any fit that minimizes the squared error.

I am new to Julia, but I am not new at programming.

using DataFrames, GLM

input = [[587000.0, 16.5, 6.2],[713000.0, 17.2, 4.9],[749000.0, 14.3, 6.4],[7.895e6, 18.1, 6.0],[2.793e6, 19.1, 5.8],[716000.0, 17.9, 6.7],[595000.0, 20.2, 8.4],[3.353e6, 16.9, 6.7]]
output = [11.2, 8.7, 9.6, 14.5, 15.7, 14.9, 21.7, 25.7]

model = lm(@formula(Y ~ X), DataFrame(X=input, Y=output))

I get an error with that, since the DataFrame created has input as a vector of vectors, so lm treats each distinct vector as one level of a categorical variable. That is also why you got 8 coefficients, one per observation: the model fits the data perfectly, leaving zero residual degrees of freedom, which is what breaks the FDist computation when the model is displayed.

You ideally want the following, right?

julia> df = DataFrame();

julia> df.x1 = getindex.(input, 1);

julia> df.x2 = getindex.(input, 2);

julia> df.x3 = getindex.(input, 3);

julia> df.output = output;

julia> lm(@formula(output ~ x1 + x2 + x3), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

output ~ 1 + x1 + x2 + x3

Coefficients:
──────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error      t  Pr(>|t|)     Lower 95%    Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept)  -20.9353      20.0733      -1.04    0.3559  -76.6678      34.7972
x1             6.95998e-7   8.05544e-7   0.86    0.4363   -1.54055e-6   2.93255e-6
x2             0.624308     1.23562      0.51    0.6399   -2.80631      4.05493
x3             3.71514      2.20241      1.69    0.1669   -2.39972      9.83
──────────────────────────────────────────────────────────────────────────────────

You need columns of plain numbers, not columns of vectors, in your data frame.


That seems reasonable, but I have a variable number of dimensions, so I need some way to construct the DataFrame and the @formula without hardcoding the dimensions in the code.

For constructing the data frame you can:

  1. push! a NamedTuple in a loop, so you work with named tuples instead of Vectors in input.
  2. Make a Matrix via mapreduce(permutedims, vcat, input) and construct a DataFrame from that matrix via DataFrame(mat, :auto).
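As a minimal sketch of option 2 in plain Base (reusing the first three observations from above):

```julia
input = [[587000.0, 16.5, 6.2], [713000.0, 17.2, 4.9], [749000.0, 14.3, 6.4]]

# permutedims turns each length-3 Vector into a 1×3 row matrix,
# and vcat stacks those rows into a 3×3 Matrix{Float64}
mat = mapreduce(permutedims, vcat, input)

size(mat)  # (3, 3): one row per observation, one column per input dimension
```

DataFrame(mat, :auto) then auto-names the columns x1, x2, x3, ready for the formula interface.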

For using lm you can:

  1. Use lm(X, y) instead of @formula.
  2. Construct a formula programmatically in the method described here.
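A sketch of option 2: the column names can be generated in plain Base, and the formula-building lines (which assume the term-based API from StatsModels, the package GLM builds its formulas on) are left as comments here:

```julia
dims = 3
xnames = Symbol.("x", 1:dims)  # Symbol concatenates its arguments: [:x1, :x2, :x3]

# With StatsModels loaded, these symbols become a formula equivalent
# to @formula(output ~ x1 + x2 + x3):
#   f = term(:output) ~ sum(term.(xnames))
#   lm(f, df)
```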

Using lm(X, y) instead of @formula seems like it makes my life simpler. What would the types of X and y be in this case?


y is a Vector and X is a Matrix (which can be created with mapreduce(permutedims, vcat, input)).

Just to confirm, the Matrix is also what Array{Float64}(undef, rows, columns) produces? There’s no reason to first construct the vectors in-place and use mapreduce if I can just construct the matrix directly in-place.
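A quick Base-only check confirms this: Matrix{Float64} is literally an alias for Array{Float64,2}, so the matrix can be allocated and filled directly.

```julia
Matrix{Float64} === Array{Float64,2}  # true: Matrix{T} is an alias for Array{T,2}

X = Array{Float64}(undef, 4, 3)  # e.g. 4 observations, 3 columns

# fill it in place, one entry at a time
for i in 1:4, j in 1:3
    X[i, j] = 10.0 * i + j
end
```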

Okay I did something like this:

input = Array{Float64}(undef, num_points, 1 + dims)
output = Array{Float64}(undef, num_points)

# ... write data to input and output

model = lm(input, output)

The extra column (1 + dims rather than dims) is set to 1 for every point, since lm called with a matrix does not otherwise include an offset in the linear fit.
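In sketch form, with made-up small sizes and random stand-in data (only the shape matters here):

```julia
num_points, dims = 4, 2

input = Array{Float64}(undef, num_points, 1 + dims)
input[:, 1] .= 1.0                        # intercept column: all ones
input[:, 2:end] = rand(num_points, dims)  # stand-in for the real predictors
output = rand(num_points)                 # stand-in for the real responses

# model = lm(input, output)   # with GLM loaded: X first, then y
```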

It sometimes failed with this error:

ERROR: PosDefException: matrix is not positive definite; Cholesky factorization failed.
Stacktrace:
 [1] checkpositivedefinite at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/factorization.jl:18 [inlined]
 [2] cholesky!(::LinearAlgebra.Hermitian{Float64,Array{Float64,2}}, ::Val{false}; check::Bool) at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/cholesky.jl:219
 [3] cholesky! at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/cholesky.jl:218 [inlined] (repeats 2 times)
 [4] GLM.DensePredChol(::Array{Float64,2}, ::Bool) at /home/alice/.julia/packages/GLM/dbJFe/src/linpred.jl:107
 [5] cholpred at /home/alice/.julia/packages/GLM/dbJFe/src/linpred.jl:117 [inlined]
 [6] fit(::Type{GLM.LinearModel}, ::Array{Float64,2}, ::Array{Float64,1}, ::Bool; wts::Array{Float64,1}) at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:127
 [7] fit at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:127 [inlined]
 [8] #lm#2 at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:144 [inlined]
 [9] lm at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:144 [inlined] (repeats 2 times)
 [10] multistart_compute_clusters(::Main.DatasetMod.Dataset, ::Int64) at /home/alice/src/VCLR/src/clr.jl:527
 [11] optimize!(::Main.CLRSolver.CLRBranchAndBound) at /home/alice/src/VCLR/src/clr.jl:79
 [12] top-level scope at REPL[13]:1
 [13] run_repl(::REPL.AbstractREPL, ::Any) at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288

This happens in the case I also mentioned in my original post:

Additionally, I assume the model will yell at me if the data does not have enough data points to uniquely determine the fit, but in this case I’d really prefer to just get any fit that minimizes the squared error.

However, it appears to be fixed by passing true for the allowrankdeficient argument:

model = lm(input, output, true)
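For what it's worth, if all that is wanted is some least-squares fit even for a rank-deficient system, plain Base's backslash also does that (for non-square matrices it uses a pivoted QR factorization, which tolerates rank deficiency), no GLM needed:

```julia
# rank-deficient design matrix: the second column is 2 × the first
X = [1.0 2.0; 1.0 2.0; 1.0 2.0]
y = [3.0, 3.0, 3.0]

beta = X \ y   # a least-squares solution; no PosDefException here

X * beta ≈ y   # true: y lies in the column space, so the residual is ~0
```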

Thanks for the help!
