GLM - Questions

Hi All.
I have two questions about using GLM.jl.

  1. Weighting the dependent variable - I’m looking at using the Poisson distribution, where the dependent variable is weighted by a unit of exposure (DaysActive/365).
    In SAS you use the Weight function; in R it’s the offset function. I haven’t found the equivalent in Julia in any example or document. Am I missing the obvious here?

  2. I have a categorical variable with ~180 levels. When I fit the model, some levels are statistically significant and others are not. I’d like to keep only those that are significant (parsimony). In R one uses (var=="value1")+(var=="value2"), which reads like an IF statement creating a dummy variable; SAS is similar. Again, when looking online I see some information about contrasts = Dict(:var => DummyCoding()), but I’m unsure how to use this in what I am doing.

I would assume these are really common use cases, so hopefully someone can provide clarity (or a reference to an example I may have missed, or another package which may be appropriate).

Thanks


I see both wts and offset arguments in GLM.jl.

I do not see why a person needs a categorical variable with 180 levels. Why not make it continuous?

Hi huang_min,
Do you have an example using wts and/or offset with a Poisson regression? In my tests it did not work (though user error is possible, even likely).

With respect to categorical variables, the dataset I’m using is ~2.5m rows. The categorical variable is geographic areas (think suburb_town as groups of addresses).
I generally:

  1. Start by putting all levels in for any categorical variable
  2. Remove grossly insignificant ones (t-test) & retest as I remove.
  3. Group similar coefficients, especially if the geographic areas are close (use AIC/BIC)
  4. Finally settle on a selection of single & grouped variables for the final iteration.

From experience, it’s fairly dangerous to treat, say, an integer representation of a categorical variable as continuous. Our use cases may well be different.

Thanks

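using GLM, DataFrames
# Toy data: y is a count response, x a predictor, d the per-row weights
df = DataFrame(:y => rand(1:20, 100), :x => rand(100), :d => rand(100))
# Poisson regression with the weights passed through the wts keyword
glm(@formula(y ~ x), df, Poisson(), wts = df[!, :d])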

I think this is a minimal example you need.
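
For the exposure case in the first question, a minimal sketch using the offset keyword rather than wts might look like the following. It assumes a log link, in which case the offset would be log(exposure) so that the model describes a rate per unit of exposure. The column name exposure and the simulated data are made up for illustration; they stand in for DaysActive/365.

```julia
using GLM, DataFrames, Random

Random.seed!(1)
n = 1_000

# Hypothetical data: `exposure` plays the role of DaysActive/365
df = DataFrame(x = rand(n), exposure = rand(n) .* 0.9 .+ 0.1)

# Simulate counts whose expected value scales with exposure
df.y = [rand(Poisson(e * exp(0.5 + 1.0 * x))) for (x, e) in zip(df.x, df.exposure)]

# The offset enters the linear predictor with its coefficient fixed at 1,
# so with a log link it is passed as log(exposure)
m = glm(@formula(y ~ x), df, Poisson(), LogLink(); offset = log.(df.exposure))
coef(m)
```

Note that wts and offset are not interchangeable: wts weights each observation's contribution to the fit, while offset shifts the linear predictor. Which one matches the SAS setup depends on how the response was defined there.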

As for the second one, maybe you can try MixedModels.jl.
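
Separately from MixedModels.jl, one way to mimic the R idiom (var=="value1")+(var=="value2") is to build the indicator columns by hand and put them in the formula. This is only a sketch under assumed names: the column area, its levels, and the new columns area_a and area_bc are hypothetical.

```julia
using GLM, DataFrames, Random

Random.seed!(2)
n = 500

# Hypothetical column names and levels, for illustration only
df = DataFrame(y = rand(1:20, n), x = rand(n),
               area = rand(["a", "b", "c", "d"], n))

# Single-level indicator, like (var == "value1") in R
df.area_a = Float64.(df.area .== "a")

# Grouped indicator: 1.0 when the level is in the chosen set, else 0.0
df.area_bc = Float64.(in.(df.area, Ref(["b", "c"])))

# Levels not covered by any indicator (here "d") fall into the baseline
m = glm(@formula(y ~ x + area_a + area_bc), df, Poisson())
coefnames(m)
```

The indicators are stored as Float64 so that they are treated as ordinary numeric 0/1 predictors. Dropping or merging a level then amounts to editing the set passed to in.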

Hi huang_min.

Many thanks for your example. It was very instructive. What I found was:

  1. Your example worked perfectly, yet mine did not.

  2. I examined wts = df[!,:d] in your example and in my equivalent, and noticed that mine was described as
    Vector{Union{Missing, Float64}} whereas yours was Vector{Float64}.

Whilst there were no missing values in the underlying dataset (they were explicitly removed prior to creating the data frame), the element type of the vector still allowed for missing values.

  3. I changed the code slightly to wts=coalesce.(df[!,:d], 0), which replaces any missing values with 0 and changed the type to Vector{Float64}; this worked without issue in the glm function.

A good lesson, as the error message I was getting,
"TypeError: in keyword argument wts, expected AbstractArray{#s37,1} where #s37<:Real, got Array{Union{Missing, Float64},1}", did not make sense to me.
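
A small sketch of the same situation, in case it helps others: the column's element type allows missing even though no value is missing. Here the type is narrowed with disallowmissing! from DataFrames as an alternative to coalesce; it raises an error if a missing value is actually present, rather than silently replacing it with a weight of 0. The data is the toy data frame from the example above.

```julia
using GLM, DataFrames

df = DataFrame(y = rand(1:20, 100), x = rand(100), d = rand(100))

# Reproduce the situation: the column allows missing but contains none
allowmissing!(df, :d)
eltype(df.d)   # Union{Missing, Float64}

# Narrow the element type; this throws if a missing value is present
disallowmissing!(df, :d)
eltype(df.d)   # Float64

# The weights now satisfy the element type that wts expects
m = glm(@formula(y ~ x), df, Poisson(); wts = df.d)
```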

I will review MixedModels.jl to see if this assists in specifying categorical variables.


It’s great to know that.