GLM with weights

I’m getting the following error when trying to run GLM with a weight vector for my data:

Julia Client – Internal Error
DomainError(-1.867506166794264e-7, "sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).")
throw_complex_domainerror(::Symbol, ::Float64) at math.jl:31
sqrt at math.jl:492 [inlined]
_broadcast_getindex_evalf at broadcast.jl:574 [inlined]
_broadcast_getindex at broadcast.jl:547 [inlined]
getindex at broadcast.jl:507 [inlined]
macro expansion at broadcast.jl:838 [inlined]
macro expansion at simdloop.jl:73 [inlined]
copyto! at broadcast.jl:837 [inlined]
copyto! at broadcast.jl:792 [inlined]
copy at broadcast.jl:768 [inlined]
materialize at broadcast.jl:748 [inlined]
stderror(::GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}) at linpred.jl:223
coeftable(::GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}) at glmfit.jl:163
coeftable at statsmodel.jl:110 [inlined]
show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}) at statsmodel.jl:121
show at sysimg.jl:194 [inlined]
(::getfield(Atom, Symbol("##27#28")){StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}})(::Base.GenericIOBuffer{Array{UInt8,1}}) at display.jl:17
#sprint#325(::Nothing, ::Int64, ::Function, ::Function) at io.jl:101
sprint at io.jl:97 [inlined]
render at display.jl:16 [inlined]
Type at types.jl:39 [inlined]
Type at types.jl:40 [inlined]
render at display.jl:19 [inlined]
displayandrender(::StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}) at showdisplay.jl:127
(::getfield(Atom, Symbol("##115#120")){String})() at eval.jl:102
macro expansion at essentials.jl:697 [inlined]
(::getfield(Atom, Symbol("##111#116")))(::Dict{String,Any}) at eval.jl:86
handlemsg(::Dict{String,Any}, ::Dict{String,Any}) at comm.jl:164
(::getfield(Atom, Symbol("##19#21")){Array{Any,1}})() at task.jl:259

The code and data are shown below. The weight vector sums to one. Any ideas why this error is being thrown?

ols = glm(@formula(Y ~ X), data_profit, Normal(), IdentityLink(), wts = reg_weights)

Data:

julia> show(data_profit,allrows=true)
37×2 DataFrame
│ Row │ X     │ Y            │
│     │ Int64 │ Float64      │
├─────┼───────┼──────────────┤
│ 1   │ 1     │ 0.000313178  │
│ 2   │ 1     │ 0.000228847  │
│ 3   │ 1     │ 0.000230182  │
│ 4   │ 1     │ 0.00023272   │
│ 5   │ 1     │ 0.000235561  │
│ 6   │ 1     │ 0.000238828  │
│ 7   │ 1     │ 0.00024262   │
│ 8   │ 1     │ 0.000246501  │
│ 9   │ 1     │ 0.000250485  │
│ 10  │ 1     │ 0.00019644   │
│ 11  │ 0     │ 0.00011969   │
│ 12  │ 0     │ 0.00010914   │
│ 13  │ 0     │ 0.000104962  │
│ 14  │ 0     │ 0.000102631  │
│ 15  │ 0     │ 0.000101194  │
│ 16  │ 0     │ 0.00010041   │
│ 17  │ 0     │ 9.98285e-5   │
│ 18  │ 0     │ 9.93539e-5   │
│ 19  │ 0     │ 9.89764e-5   │
│ 20  │ 0     │ 9.86519e-5   │
│ 21  │ 0     │ 9.83628e-5   │
│ 22  │ 0     │ 9.80978e-5   │
│ 23  │ 0     │ 9.78451e-5   │
│ 24  │ 0     │ 0.00012086   │
│ 25  │ 0     │ 2.24932e-5   │
│ 26  │ 0     │ -1.12741e-5  │
│ 27  │ 0     │ 4.30009e-6   │
│ 28  │ 1     │ 5.51942e-5   │
│ 29  │ 1     │ 5.29287e-5   │
│ 30  │ 1     │ 5.06829e-5   │
│ 31  │ 0     │ -0.000835512 │
│ 32  │ 0     │ -0.000938462 │
│ 33  │ 0     │ -0.00108555  │
│ 34  │ 0     │ -7.13685e-5  │
│ 35  │ 0     │ -4.40181e-6  │
│ 36  │ 0     │ 1.32171e-5   │
│ 37  │ 0     │ 2.06316e-5   │

julia> show(reg_weights)
[0.0695251, 0.0455689, 0.0426326, 0.0400022, 0.03757, 0.0352973, 0.0331633, 0.0311543, 0.0292603,
 0.027473, 0.0257856, 0.024192, 0.0226868, 0.021265, 0.0199222, 0.0186542, 0.017457, 0.0163271,
 0.0152608, 0.0142551, 0.0133068, 0.012413, 0.011571, 0.0107781, 0.0100319, 0.00933, 0.0186224,
 0.0552164, 0.0491013, 0.0431579, 0.0377491, 0.0329251, 0.0287195, 0.0244011, 0.0211243, 0.0182855,
 0.0158137]

AFAIK weights passed to GLM are interpreted as frequency weights, so they should generally be integers. Non-integer values should also work in principle, but estimation may not behave properly when the sum of the weights is too small (a sum of 1 is as if you had a single observation).
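
For illustration (a toy sketch of my reading of the frequency-weight convention, not something the GLM docs spell out): an integer weight of k should give the same coefficients as repeating that row k times:

using DataFrames, GLM

# Toy data (not the OP's). A weight of 2 on the first row should match
# fitting an unweighted model with that row duplicated.
df  = DataFrame(X = [0, 0, 1, 1], Y = [1.0, 1.2, 2.0, 2.1])
m1  = glm(@formula(Y ~ X), df, Normal(), IdentityLink(), wts = [2.0, 1.0, 1.0, 1.0])
df2 = DataFrame(X = [0, 0, 0, 1, 1], Y = [1.0, 1.0, 1.2, 2.0, 2.1])  # row 1 twice
m2  = glm(@formula(Y ~ X), df2, Normal(), IdentityLink())
coef(m1) ≈ coef(m2)  # true under the frequency-weight reading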

If these weights are sampling/probability weights, you should rescale them to sum to the number of observations to get correct inference. Then maybe that will also work around the bug?
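
Something like this (untested sketch, using the data_profit and reg_weights from the original post; since the weights sum to 1, the rescaling just multiplies them by 37):

n = nrow(data_profit)                                  # 37 observations
wts_rescaled = reg_weights .* (n / sum(reg_weights))   # rescale so the weights sum to n
ols = glm(@formula(Y ~ X), data_profit, Normal(), IdentityLink(), wts = wts_rescaled)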

You also have some negative weights, which I don’t think are allowed.

Negative values are in the response, not in weights, right?

Ah sorry.

PS the docs for GLM could use some love; I couldn’t find any discussion of this in the docs. OP, if you have any ideas on how to improve the documentation for this, you should submit a PR!

I think the issue here is that X'X becomes numerically singular.
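
For what it’s worth, the stack trace shows the DomainError surfacing in stderror, which broadcasts sqrt over the diagonal of the estimated covariance matrix, so you can inspect that diagonal without triggering the error (sketch, using the fitted model from the original post):

using LinearAlgebra

diag(vcov(ols))  # expect a tiny negative entry, matching the -1.867e-7 in the DomainError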

Thanks all. I’ll put in a PR to add more explanation to the docs about which options can be specified and what format they should be in.