Analysis of binomial counts using glm

It seems as though the glm() function in Julia does not always yield the maximum likelihood estimates of the parameters when fitting logistic regression models to binomial count data. Moreover, the uncertainties of the parameter estimates can be much higher than those obtained by numerical optimization of the likelihood function or by using the glm() function in R (see results below).

I have code that provides a reproducible comparison of analyses using Julia’s glm(), Julia’s optimize(), and R’s glm(), but I am unable to upload it here. My results were obtained using Julia version 1.11.6+0.x64.linux.gnu

Is there an alternative way to fit this class of models using Julia’s glm()? When calling Julia’s glm(), I assigned the response as the proportion of successes and the wts variable as the total number of trials (successes plus failures).

julia> include("glm_BinomialCounts.jl")

Analysis using GLM in Julia language

Beta = [-0.48785092134958846, 1.8060232813792254]
se = [0.27280263377390546, 0.34845194964021076]

Analysis using numerical optimization

Beta = [-0.49814787518412995, 1.7812620322369088]
se = [0.07709903822971975, 0.09545673648885034]

Analysis using GLM in R language

Beta = [-0.4981478751358396, 1.7812620318091115]
se = [0.0770985367034779, 0.09545543332580649]

Why are you unable to “upload” the code? You can just post it here?

I am not able to upload the code because I am new to this mailing list. Evidently, list policy precludes new members from uploading files.

Therefore, I uploaded the code (glm_BinomialCounts.jl) to one of my GitHub repositories:

where you can see how Julia’s glm() function is used as follows:

fit_glm = glm(@formula(proportion ~ x), data, Binomial(), LogitLink())

My point was more that a cleaned-up MWE of your code is only around 20 lines, so you can just post it:

using DataFrames, Distributions, GLM, Random, RCall, StatsBase

### Simulate data
n = 100
x = randn(n)
true_logit = -0.5 .+ 1.8 .* x  # true linear predictor: intercept + slope * x
p = cdf.(Logistic(), true_logit)  # broadcast cdf over the vector of linear predictors
N = sample(5:25, n, replace=true)
y = rand.(Binomial.(N, p))   # Binomial counts
data = DataFrame(success = y, failure = N .- y, x = x)
data.wts = data.success .+ data.failure
data.proportion = data.success ./ (data.success .+ data.failure)

### Fit the model using glm() and its @formula macro
glm(@formula(proportion ~ x), data, Binomial(), LogitLink())
#proportion ~ 1 + x
#
#Coefficients:
#──────────────────────────────────────────────────────────────────────────
#                 Coef.  Std. Error      z  Pr(>|z|)  Lower 95%   Upper 95%
#──────────────────────────────────────────────────────────────────────────
#(Intercept)  -0.530051    0.246506  -2.15    0.0315  -1.01319   -0.0469075
#x             1.75963     0.411472   4.28    <1e-04   0.953157   2.5661
#──────────────────────────────────────────────────────────────────────────

### Fit the model using R language's glm() function
R"""
glm(cbind(success, failure) ~ x, $data, family=binomial)
"""
#Call:  glm(formula = cbind(success, failure) ~ x, family = binomial,
#    data = `#JL`$data)
#
#Coefficients:
#(Intercept)            x
#    -0.5283       1.7423
#
#Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
#Null Deviance:      619.3
#Residual Deviance: 110.3        AIC: 361.3

But I assume that this question is more of an issue for the GLM repo.

Thanks, but your post fails to include a comparison of the uncertainty in the parameter estimates. The standard errors of the estimates are MUCH higher using Julia’s glm() than those obtained using R’s glm() or those obtained using Julia’s Optim package.

So my original question remains: Is there an alternative way to fit this class of models using Julia’s glm()?

Seems like the weights can be passed via the wts argument:

glm(@formula(proportion ~ x), data, Binomial(), LogitLink(), wts = data.wts)

This seems to match the results from R – also regarding the uncertainties. Unfortunately, there does not seem to be an extension to the formula to pass the number of trials directly, e.g., success | trial ~ x where trial = success .+ failure or something.

Yes, and to add to this: columns named weights or wts in the data frame are not used automatically. wts is a keyword argument, and per the documentation it is for “case weights”, i.e. it is essentially equivalent to repeating each observation wts times.
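That “case weights” interpretation can be checked directly: fitting with wts should match fitting the same rows literally repeated. A minimal sketch (the data frame and its values here are my own illustration, not from the thread):

```julia
using DataFrames, GLM

# Toy binary data with frequency-style case weights (values are illustrative)
df = DataFrame(x   = [-1.0, -1.0, 0.0, 0.0, 1.0, 1.0],
               y   = [ 0.0,  1.0, 0.0, 1.0, 0.0, 1.0],
               wts = [ 3.0,  1.0, 2.0, 2.0, 1.0, 3.0])

# Fit with case weights...
m1 = glm(@formula(y ~ x), df, Binomial(), LogitLink(); wts = df.wts)

# ...and with each row literally repeated wts times
expanded = reduce(vcat, [repeat(df[i:i, :], Int(df.wts[i])) for i in 1:nrow(df)])
m2 = glm(@formula(y ~ x), expanded, Binomial(), LogitLink())

coef(m1), coef(m2)  # the point estimates should agree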

So it seems to me that the behavior of Julia’s glm() is a bit idiosyncratic. The simulated data frame (named data) has the following components:

propertynames(data)
5-element Vector{Symbol}:
:success
:failure
:x
:wts
:proportion

My first attempt to use Julia’s glm() resulted in the following failure:

fit_glm = glm(@formula(proportion ~ x), data, Binomial(), LogitLink() wts=wts)
ERROR: ParseError:
# Error @ none:1:71
fit_glm = glm(@formula(proportion ~ x), data, Binomial(), LogitLink() wts=wts)
#                                                                    └─────┘ ── Expected `)`

This parse error (caused by the missing comma before wts=wts) is why I had omitted the wts argument altogether, which still produced a result, but evidently with the wrong (unweighted) estimates.

Nils noted that the following call to glm() yields the correct estimates:

fit_glm = glm(@formula(proportion ~ x), data, Binomial(), LogitLink(), wts=data.wts)

Below are the results of this call using the simulated data:

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      z  Pr(>|z|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.550731   0.0658958  -8.36    <1e-16  -0.679884  -0.421577
x             1.74013    0.103634   16.79    <1e-62   1.53701    1.94325
─────────────────────────────────────────────────────────────────────────

This means that when the data object (named data) was passed to Julia’s glm(), it was able to interpret the response variable and the regressor by name (proportion and x, respectively), but it was not able to interpret the case weights variable by name (wts). Instead, Julia’s glm() only worked when the case weights were assigned explicitly, as wts = data.wts.

I recommend that the authors of Julia’s GLM package correct this idiosyncrasy.

Well, I would claim that R is the idiosyncratic one here. Due to its non-standard evaluation, any argument can be evaluated in some – possibly constructed at runtime – environment, and R’s glm function uses this to interpret names in the formula, as well as in some arguments (such as offset or weights), in the context of the data frame.
In Julia this is not possible – at least not with a function – as all function arguments are evaluated and only their values are passed to the function. To opt out of this behaviour, macros can be used to rewrite the code before it is evaluated – note that @formula is a macro, which is what allows names in the formula to not be evaluated directly but instead be looked up in the data argument.
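A minimal sketch of that difference (the formula object and names here are my own illustration):

```julia
using StatsModels

# @formula is a macro: it captures the names `proportion` and `x` symbolically
# and defers their lookup until a data source is supplied. No variables with
# those names need to exist in the current scope.
f = @formula(proportion ~ x)
f isa FormulaTerm  # true; nothing has been evaluated yet

# A keyword argument such as wts, by contrast, is evaluated immediately in the
# caller's scope: `wts = wts` fails unless a variable named wts exists there,
# while `wts = data.wts` passes an already-evaluated vector of values.
```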

Many thanks for clarifying. So problems, or at least confusion, can arise from the difference between macro behavior (such as @formula) and the ordinary pairing of arguments with values passed to a function.

This difference brings to mind a related question. If the response variable were binary-valued, then Julia’s glm() provides two possible ways to specify the same binary-regression model. One way is using the @formula macro as follows:

glm(@formula(y ~ x), data, Binomial(), LogitLink())

A second way is using the model matrix X and the vector y as follows:

glm(X, y, Binomial(), LogitLink())

Can the second approach be used to specify a model with binomial counts as the response? If so, how?
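For what it’s worth, a hedged sketch: assuming the matrix method accepts the same wts keyword as the formula method (both dispatch to fit(GeneralizedLinearModel, ...)), binomial counts can be expressed the same way, with the observed proportions as y and the trial counts as wts. The data values below are my own toy example:

```julia
using GLM

# Toy binomial-count data (purely illustrative)
xv      = [-1.0, -0.5, 0.0, 0.5, 1.0]
trials  = [10.0, 12.0, 8.0, 15.0, 9.0]
success = [ 2.0,  4.0, 4.0, 11.0, 8.0]

X = hcat(ones(length(xv)), xv)  # model matrix with an explicit intercept column
y = success ./ trials           # response as observed proportions
m = glm(X, y, Binomial(), LogitLink(); wts = trials)
coef(m)                         # [intercept, slope]
```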