I am trying to fit some data with a variable number of dimensions using a linear fit. I am currently trying to use GLM, but I am happy to use any other library.
I’ve been running in to a lot of trouble getting the data into the right shape. This is the shape I’m currently trying to use. The example dataset below has 8 observations with 3 input dimensions each.
using DataFrames, GLM
input = [[587000.0, 16.5, 6.2],[713000.0, 17.2, 4.9],[749000.0, 14.3, 6.4],[7.895e6, 18.1, 6.0],[2.793e6, 19.1, 5.8],[716000.0, 17.9, 6.7],[595000.0, 20.2, 8.4],[3.353e6, 16.9, 6.7]]
output = [11.2, 8.7, 9.6, 14.5, 15.7, 14.9, 21.7, 25.7]
model = lm(@formula(Y ~ X), DataFrame(X=input, Y=output))
However when I print the coefficients of the above, I get this:
julia> coef(lm(@formula(Y ~ X), DataFrame(X=input, Y=output)))
8-element Array{Float64,1}:
11.199999999999983
10.50000000000002
-2.4999999999999827
3.700000000000017
-1.5999999999999834
4.500000000000016
14.500000000000018
3.300000000000017
This is quite weird — a dataset with 3 input dimensions should have four coefficients. One for the offset and one for each dimension. It seems like transposing the data is the answer, but then the data frame is unhappy with the dimensions.
# with transposed input
using DataFrames, GLM
input = [[587000.0, 16.5, 6.2],[713000.0, 17.2, 4.9],[749000.0, 14.3, 6.4],[7.895e6, 18.1, 6.0],[2.793e6, 19.1, 5.8],[716000.0, 17.9, 6.7],[595000.0, 20.2, 8.4],[3.353e6, 16.9, 6.7]]
output = [11.2, 8.7, 9.6, 14.5, 15.7, 14.9, 21.7, 25.7]
model = lm(@formula(Y ~ X), DataFrame(X=transpose(input), Y=output))
ERROR: ArgumentError: adding AbstractArray other than AbstractVector as a column of a data frame is not allowed
Stacktrace:
[1] DataFrame(::Array{Any,1}, ::DataFrames.Index; copycols::Bool) at /home/alice/.julia/packages/DataFrames/yqToF/src/dataframe/dataframe.jl:201
[2] DataFrame(; kwargs::Base.Iterators.Pairs{Symbol,AbstractArray,Tuple{Symbol,Symbol},NamedTuple{(:X, :Y),Tuple{LinearAlgebra.Transpose{LinearAlgebra.Transpose{Float64,Array{Float64,1}},Array{Array{Float64,1},1}},Array{Float64,1}}}}) at /home/alice/.julia/packages/DataFrames/yqToF/src/dataframe/dataframe.jl:284
[3] top-level scope at REPL[7]:1
[4] run_repl(::REPL.AbstractREPL, ::Any) at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288
I have also tried a matrix instead of an array of arrays, but the data frame also did not allow that.
So how do I do this correctly?
Note that the first model with the extra coefficients also prints the following error if I try to println
the model object, which seems worrying:
julia> lm(@formula(Y ~ X), DataFrame(X=input, Y=output))
Error showing value of type StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}:
ERROR: ArgumentError: FDist: the condition ν1 > zero(ν1) && ν2 > zero(ν2) is not satisfied.
Stacktrace:
[1] macro expansion at /home/alice/.julia/packages/Distributions/7Ifyw/src/utils.jl:6 [inlined]
[2] FDist at /home/alice/.julia/packages/Distributions/7Ifyw/src/univariate/continuous/fdist.jl:30 [inlined]
[3] FDist at /home/alice/.julia/packages/Distributions/7Ifyw/src/univariate/continuous/fdist.jl:35 [inlined]
[4] FDist at /home/alice/.julia/packages/Distributions/7Ifyw/src/univariate/continuous/fdist.jl:37 [inlined]
[5] coeftable(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}; level::Float64) at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:185
[6] coeftable(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}) at /home/alice/.julia/packages/GLM/dbJFe/src/lm.jl:182
[7] coeftable(::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{,Tuple{}}}) at /home/alice/.julia/packages/StatsModels/Oreif/src/statsmodel.jl:173
[8] coeftable at /home/alice/.julia/packages/StatsModels/Oreif/src/statsmodel.jl:173 [inlined]
[9] show(::IOContext{REPL.Terminals.TTYTerminal}, ::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}) at /home/alice/.julia/packages/StatsModels/Oreif/src/statsmodel.jl:184
[10] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}) at ./multimedia.jl:47
[11] display(::REPL.REPLDisplay, ::Any) at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:218
Additionally, I assume the model will yell at me if the data does not have enough data points to uniquely determine the fit, but in this case I’d really prefer to just get any fit that minimizes the squared error.
I am new to Julia, but I am not new at programming.