GLM - Questions

DaKlingons · March 21, 2022, 4:35am

Hi All.
I have 2 questions about implementing GLM.jl

1. Weighting the dependent variable - I'm looking at using the Poisson distribution, where the dependent variable is weighted by a unit of exposure (DaysActive/365).
In SAS you use the Weight function; in R it's the Offset function. I haven't found the equivalent in Julia in any example or document. Am I missing the obvious here?
2. I have a categorical variable with ~180 levels. When I fit, some levels are statistically significant and others are not. I'd like to keep only those which are significant (parsimony). In R one uses (var=="value1")+(var=="value2"), which reads like an IF statement creating a dummy variable; SAS is similar. Looking online, I see some information about contrasts = Dict(:var => DummyCoding()), but I'm unsure how to use this in what I am doing.

I would assume these are really common use cases, so hopefully someone can provide clarity (or a reference to an example I may have missed, or to another package which may be appropriate).

Thanks

huang_min · March 21, 2022, 4:47am

I see both wts and offset arguments in GLM.jl

I do not see why a person needs a categorical variable with 180 levels. Why not make it continuous?

DaKlingons · March 21, 2022, 6:01am

Hi huang_min,
Do you have an example using wts and/or offset with a Poisson regression? In my tests they did not work (though user error is possible, even likely).

With respect to Categorical Variables, the Dataset I’m using is ~2.5m rows. The categorical variable is geographic areas (think suburb_town as groups of addresses).
I generally:

1. Start by putting all levels in for any categorical variable.
2. Remove grossly insignificant ones (t-test) and retest as I remove.
3. Group similar coefficients, especially if the geographic areas are close (use AIC/BIC).
4. Finally, settle on a selection of single and grouped variables for the final iteration.

From experience, it's fairly dangerous to treat, say, an integer representation of a categorical variable as continuous. Our use cases may well be different.
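
The pruning and grouping steps above can be sketched in plain Julia before any model is fitted. This is only an illustration under stated assumptions: the suburb labels, the `keep` set and the names `area_grp` and `is_A` are all hypothetical, and which levels to keep would come from the significance and AIC/BIC checks described in the list.

```julia
# Hypothetical suburb labels standing in for the ~180-level variable
area = ["A", "B", "C", "A", "D", "B", "C", "E"]

# Hypothetical set of levels judged worth keeping as separate effects
keep = Set(["A", "C"])

# Pool every level outside `keep` into a single "other" group
area_grp = [a in keep ? a : "other" for a in area]

# Rough equivalent of R's (var == "A"): an explicit 0/1 indicator
is_A = Int.(area .== "A")

println(area_grp)
println(is_A)
```

Either vector can then be stored as a data frame column and used in the model formula in place of the full categorical variable.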

Thanks

huang_min · March 21, 2022, 6:47am

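using GLM, DataFrames
# y: counts, x: predictor, d: per-row weights
df = DataFrame(:y => rand(1:20, 100), :x => rand(100), :d => rand(100))
# Poisson regression with d passed through the wts keyword
glm(@formula(y ~ x), df, Poisson(), wts = df[!, :d])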

I think this is a minimal example you need.
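
For the exposure case in the original question, the closer analogue of R's `offset(log(exposure))` is probably the `offset` keyword rather than `wts`: an offset is added to the linear predictor with a fixed coefficient of 1, whereas `wts` reweights each observation's contribution to the fit. A minimal sketch, assuming simulated data and hypothetical column names (`days_active`, `exposure`):

```julia
using GLM, DataFrames, Random

Random.seed!(1)
n = 200
df = DataFrame(x = rand(n), days_active = rand(30:365, n))
df.exposure = df.days_active ./ 365   # fraction of a year observed

# Simulated counts whose mean scales with exposure (illustrative data only)
df.y = [rand(Poisson(exp(0.5 + 0.3 * xi) * e)) for (xi, e) in zip(df.x, df.exposure)]

# log(exposure) enters the linear predictor with its coefficient fixed at 1
m = glm(@formula(y ~ x), df, Poisson(), LogLink(); offset = log.(df.exposure))
println(coef(m))
```

With a log link this models the event rate per unit of exposure, so the coefficients describe rates rather than raw counts.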

As for the second one, maybe you can try MixedModels.jl.
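
If staying within GLM.jl, the `contrasts` keyword mentioned in the question can be passed straight to `glm`. The sketch below assumes a hypothetical column `area_grp` in which the unwanted levels have already been pooled into "other", and uses made-up counts; `DummyCoding(base = "other")` makes the pooled group the reference level, so only the retained levels get their own coefficient.

```julia
using GLM, DataFrames

# Made-up counts with a three-level grouping variable (illustrative only)
df = DataFrame(
    y = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8],
    area_grp = repeat(["A", "C", "other"], 4),
)

# Treat "other" as the baseline; A and C each get a dummy variable
m = glm(@formula(y ~ area_grp), df, Poisson();
        contrasts = Dict(:area_grp => DummyCoding(base = "other")))

println(coefnames(m))
```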

DaKlingons · March 21, 2022, 10:56pm

Hi huang_min.

Many thanks for your example. It was very instructive. What I found was:

Your example worked perfectly, yet mine did not.
I examined wts = df[!,:d] in your example and in my equivalent, and noticed that mine was described as Vector{Union{Missing, Float64}} whereas yours was Vector{Float64}.

Whilst there were no Missing values in the underlying dataset (explicitly removed prior to creating the data frame), the Vector has made an allowance for Missing Values.

I changed the code slightly to wts=coalesce.(df[!,:d], 0), which replaces any Missing values with 0. This changed the type to Vector{Float64}, and the glm function then ran without issue.

A good learning, as the error message I was getting, "TypeError: in keyword argument wts, expected AbstractArray{#s37,1} where #s37<:Real, got Array{Union{Missing, Float64},1}", did not make sense to me.
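
The element-type issue can be reproduced without GLM at all. A small sketch with a made-up vector `v`: the values contain no missing entries, but the declared element type still allows them, which is what the `wts` keyword rejected.

```julia
# No missing values present, but the element type still permits them
v = Union{Missing, Float64}[0.5, 1.0, 0.25]

# Option used in the thread: substitute a value for any missing entry
w1 = coalesce.(v, 0.0)

# Alternative: a strict conversion, which throws if a missing is actually present
w2 = convert(Vector{Float64}, v)

println(eltype(v))
println(eltype(w1))
println(eltype(w2))
```

The strict conversion may be the safer choice for weights, since silently turning a missing weight into 0 would drop that row's contribution. DataFrames.jl also provides `disallowmissing!` for doing the same to whole columns.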

I will review MixedModels.jl to see if this assists in specifying categorical variables.

huang_min · March 22, 2022, 1:31am

It’s great to know that.
