Hi All.
I have 2 questions about implementing GLM.jl
- Weighting the Dependent Variable - I’m looking at using the Poisson distribution, where the dependent variable is weighted by a unit of exposure (DaysActive/365). In SAS you use the Weight function; in R it’s the Offset function. I haven’t found the equivalent in Julia in any example or document. Am I missing the obvious here?
- I have a categorical variable with ~180 levels. When I fit the model, some levels are statistically significant and others are not. I’d like to keep only those which are significant (parsimony). In R one uses (var=="value1")+(var=="value2"), which reads like an IF statement creating a dummy variable; SAS is similar. Again, when looking online I see some information about contrasts = Dict(:var => DummyCoding()), but I’m unsure how to use this in what I am doing.
I would assume these are really common use cases, so hopefully someone can provide clarity (or a reference to an example I may have missed, or to another package which may be appropriate).
Thanks
I see both wts and offset arguments in GLM.jl
I do not see why a person needs a categorical variable with 180 levels. Why not make it continuous?
Hi huang_min,
Do you have an example using wts and/or offset with a Poisson regression? In my tests it did not work (but user error is possible, even likely).
With respect to Categorical Variables, the Dataset I’m using is ~2.5m rows. The categorical variable is geographic areas (think suburb_town as groups of addresses).
I generally:
- Start by putting all levels in for any categorical variable
- Remove grossly insignificant ones (t-test) & retest as I remove.
- Group similar coefficients, especially if the geographic areas are close (use AIC/BIC)
- Finally settle on a selection of single & grouped variables for the final iteration.
From experience, it’s fairly dangerous to treat, say, an integer representation of a categorical variable as continuous. Our use cases may well be different.
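For what it’s worth, the workflow above (keep a few single levels, group some others, let the rest fall into the reference group) can be sketched by building the indicator columns by hand before fitting. This is only a minimal sketch on synthetic data: the column names area, is_north and is_south_east and the level names are made up for illustration, and it is not necessarily the most idiomatic route in GLM.jl.

```julia
using GLM, DataFrames, Random

Random.seed!(1)

# Synthetic stand-in for the real data; `area` plays the role of the
# many-level geographic variable (hypothetical names throughout).
n = 1_000
df = DataFrame(area = rand(["north", "south", "east", "west", "central"], n),
               x    = rand(n))
df.y = rand(0:5, n)

# A single-level indicator, the analogue of R's (var == "value1").
df.is_north = Float64.(df.area .== "north")

# One indicator for a group of levels that should share a coefficient.
df.is_south_east = Float64.(in.(df.area, Ref(("south", "east"))))

# Every level without an indicator is absorbed into the intercept.
m = glm(@formula(y ~ x + is_north + is_south_east), df, Poisson())

println(coeftable(m))
```

Dropping or merging a level is then just a matter of editing the indicator columns and refitting, so the models can be compared step by step.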
Thanks
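using GLM, DataFrames
# Toy data: y is a count response, x a predictor, d the per-row weights
df = DataFrame(:y => rand(1:20, 100), :x => rand(100), :d => rand(100))
# Poisson regression with the weights passed through the wts keyword
glm(@formula(y ~ x), df, Poisson(), wts = df[!, :d])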
I think this is a minimal example you need.
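Since the question is about exposure, here is a sketch of the offset keyword as well. With a Poisson model and a log link, exposure is conventionally entered as an offset of log(exposure), which fixes its coefficient at 1. Whether wts or offset matches what the SAS or R model was doing is a modelling decision, so treat this as an illustration on simulated data; the column names days_active and exposure are made up.

```julia
using GLM, DataFrames, Random

Random.seed!(42)

# Simulated data (hypothetical columns): exposure = days_active / 365.
n = 500
df = DataFrame(x = rand(n), days_active = rand(30:365, n))
df.exposure = df.days_active ./ 365

# Counts whose mean is proportional to exposure.
df.y = [rand(Poisson(e * exp(0.5 + 1.0 * x))) for (e, x) in zip(df.exposure, df.x)]

# log(exposure) is added to the linear predictor with its coefficient fixed at 1.
m_offset = glm(@formula(y ~ x), df, Poisson(), LogLink();
               offset = log.(df.exposure))

println(coeftable(m_offset))
```

The offset vector needs one entry per row and must be on the scale of the linear predictor, hence the log.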
As for the second one, maybe you can try MixedModels.jl.
Hi huang_min.
Many thanks for your example. It was very instructive. What I found was:
- Your example worked perfectly, yet mine did not.
- I examined the wts = df[!,:d] argument in your example and in my equivalent, and noticed that mine was of type Vector{Union{Missing, Float64}} whereas yours was Vector{Float64}. Whilst there were no missing values in the underlying dataset (they were explicitly removed prior to creating the data frame), the column type still allowed for missing values.
- I changed the code slightly to wts=coalesce.(df[!,:d], 0), which replaces any missing values with 0. This changed the type to Vector{Float64}, and it worked without issue in the glm function.
A good lesson, as the error message I was getting, "TypeError: in keyword argument wts, expected AbstractArray{#s37,1} where #s37<:Real, got Array{Union{Missing, Float64},1}", did not make sense to me.
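For anyone hitting the same error: when the column really contains no missing values, the element type can also be narrowed without substituting a value. A small sketch using a toy data frame (the column name d follows the example above):

```julia
using DataFrames

# A column that allows missing values but does not contain any.
df = DataFrame(d = Union{Missing, Float64}[0.2, 0.5, 0.9])
@show eltype(df.d)   # Union{Missing, Float64}

# Option 1: a narrowed copy of the vector, leaving the data frame untouched.
# This throws an error if a missing value is actually present.
w = disallowmissing(df.d)
@show eltype(w)      # Float64

# Option 2: narrow the column type in the data frame itself.
disallowmissing!(df, :d)
@show eltype(df.d)   # Float64
```

Unlike coalesce, this fails loudly if a missing value slipped through, rather than silently giving that row a weight of 0.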
I will review MixedModels.jl to see if this assists in specifying categorical variables.