GLM with weights

#1

I’m getting the following error when trying to run GLM with a weight vector for my data:

Julia Client – Internal Error
DomainError(-1.867506166794264e-7, "sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).")
throw_complex_domainerror(::Symbol, ::Float64) at math.jl:31
sqrt at math.jl:492 [inlined]
_broadcast_getindex_evalf at broadcast.jl:574 [inlined]
_broadcast_getindex at broadcast.jl:547 [inlined]
getindex at broadcast.jl:507 [inlined]
macro expansion at broadcast.jl:838 [inlined]
macro expansion at simdloop.jl:73 [inlined]
copyto! at broadcast.jl:837 [inlined]
copyto! at broadcast.jl:792 [inlined]
copy at broadcast.jl:768 [inlined]
materialize at broadcast.jl:748 [inlined]
stderror(::GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}) at linpred.jl:223
coeftable(::GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}) at glmfit.jl:163
coeftable at statsmodel.jl:110 [inlined]
show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}) at statsmodel.jl:121
show at sysimg.jl:194 [inlined]
(::getfield(Atom, Symbol("##27#28")){StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}})(::Base.GenericIOBuffer{Array{UInt8,1}}) at display.jl:17
#sprint#325(::Nothing, ::Int64, ::Function, ::Function) at io.jl:101
sprint at io.jl:97 [inlined]
render at display.jl:16 [inlined]
Type at types.jl:39 [inlined]
Type at types.jl:40 [inlined]
render at display.jl:19 [inlined]
displayandrender(::StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}) at showdisplay.jl:127
(::getfield(Atom, Symbol("##115#120")){String})() at eval.jl:102
macro expansion at essentials.jl:697 [inlined]
(::getfield(Atom, Symbol("##111#116")))(::Dict{String,Any}) at eval.jl:86
handlemsg(::Dict{String,Any}, ::Dict{String,Any}) at comm.jl:164
(::getfield(Atom, Symbol("##19#21")){Array{Any,1}})() at task.jl:259

The code and data are shown below. The weight vector sums to one. Any ideas why this error is being thrown?

ols = glm(@formula(Y ~ X), data_profit, Normal(), IdentityLink(), wts=reg_weights)

Data:

julia> show(data_profit,allrows=true)
37×2 DataFrame
│ Row │ X     │ Y            │
│     │ Int64 │ Float64      │
├─────┼───────┼──────────────┤
│ 1   │ 1     │ 0.000313178  │
│ 2   │ 1     │ 0.000228847  │
│ 3   │ 1     │ 0.000230182  │
│ 4   │ 1     │ 0.00023272   │
│ 5   │ 1     │ 0.000235561  │
│ 6   │ 1     │ 0.000238828  │
│ 7   │ 1     │ 0.00024262   │
│ 8   │ 1     │ 0.000246501  │
│ 9   │ 1     │ 0.000250485  │
│ 10  │ 1     │ 0.00019644   │
│ 11  │ 0     │ 0.00011969   │
│ 12  │ 0     │ 0.00010914   │
│ 13  │ 0     │ 0.000104962  │
│ 14  │ 0     │ 0.000102631  │
│ 15  │ 0     │ 0.000101194  │
│ 16  │ 0     │ 0.00010041   │
│ 17  │ 0     │ 9.98285e-5   │
│ 18  │ 0     │ 9.93539e-5   │
│ 19  │ 0     │ 9.89764e-5   │
│ 20  │ 0     │ 9.86519e-5   │
│ 21  │ 0     │ 9.83628e-5   │
│ 22  │ 0     │ 9.80978e-5   │
│ 23  │ 0     │ 9.78451e-5   │
│ 24  │ 0     │ 0.00012086   │
│ 25  │ 0     │ 2.24932e-5   │
│ 26  │ 0     │ -1.12741e-5  │
│ 27  │ 0     │ 4.30009e-6   │
│ 28  │ 1     │ 5.51942e-5   │
│ 29  │ 1     │ 5.29287e-5   │
│ 30  │ 1     │ 5.06829e-5   │
│ 31  │ 0     │ -0.000835512 │
│ 32  │ 0     │ -0.000938462 │
│ 33  │ 0     │ -0.00108555  │
│ 34  │ 0     │ -7.13685e-5  │
│ 35  │ 0     │ -4.40181e-6  │
│ 36  │ 0     │ 1.32171e-5   │
│ 37  │ 0     │ 2.06316e-5   │

julia> show(reg_weights)
[0.0695251, 0.0455689, 0.0426326, 0.0400022, 0.03757, 0.0352973, 0.0331633, 0.0311543, 0.0292603, 0.027473, 0.0257856, 0.024192, 0.0226868, 0.021265, 0.0199222, 0.0186542, 0.017457, 0.0163271, 0.0152608, 0.0142551, 0.0133068, 0.012413, 0.011571, 0.0107781, 0.0100319, 0.00933, 0.0186224, 0.0552164, 0.0491013, 0.0431579, 0.0377491, 0.0329251, 0.0287195, 0.0244011, 0.0211243, 0.0182855, 0.0158137]

#2

AFAIK weights passed to GLM are interpreted as frequency weights, so they should generally be integers. I think non-integer values should also work in principle, but maybe estimation doesn’t behave properly when the sum of the weights is too small (a sum of 1 is as if you had a single observation).
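To illustrate the frequency-weight interpretation (a toy example I made up, not the original data): a weight of 2 on a row should behave like including that row twice.

using GLM, DataFrames

df  = DataFrame(x = [0.0, 1.0, 2.0], y = [1.0, 2.0, 2.5])
# The same data with the first row physically duplicated.
df2 = DataFrame(x = [0.0, 0.0, 1.0, 2.0], y = [1.0, 1.0, 2.0, 2.5])

m1 = glm(@formula(y ~ x), df,  Normal(), IdentityLink(), wts=[2.0, 1.0, 1.0])
m2 = glm(@formula(y ~ x), df2, Normal(), IdentityLink())

coef(m1) ≈ coef(m2)  # point estimates should agree under frequency weights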

If these weights are sampling/probability weights, you should rescale them to sum to the number of observations to get correct inference. Then maybe that will also work around the bug?
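The rescaling could look like this (an untested sketch; scaled_weights is just an illustrative name, and data_profit / reg_weights are the objects from the original post):

using GLM

# Rescale the weights so they sum to the number of observations
# rather than to 1.
n = length(reg_weights)
scaled_weights = reg_weights .* (n / sum(reg_weights))

ols = glm(@formula(Y ~ X), data_profit, Normal(), IdentityLink(), wts=scaled_weights)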


#3

You also have some negative weights, which I don’t think are allowed.


#4

The negative values are in the response, not in the weights, right?


#5

Ah sorry.

PS: the docs for GLM could use some love; I couldn’t find any discussion of this in them. OP, if you have any ideas on how to improve the documentation for this, you should submit a PR!


#6

I think the issue here is that X'X becomes numerically singular.
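One way to check that claim (a rough sketch, assuming the data_profit and reg_weights from the original post; X below is a design matrix with an intercept column that I’m constructing here for illustration):

using DataFrames, LinearAlgebra

# Weighted cross-product of the design matrix; a huge condition
# number would indicate it is close to numerically singular.
X = [ones(nrow(data_profit)) Float64.(data_profit.X)]
W = Diagonal(reg_weights)
cond(X' * W * X)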


#7

Thanks, all. I’ll put in a PR adding more explanation to the docs about which options can be specified and what format they should be in.