Analysis of binomial counts using glm

It seems as though the glm() function in Julia does not always yield the maximum likelihood estimates of the parameters when fitting logistic regression models to binomial count data. Moreover, the uncertainties of the parameter estimates can be much higher than those obtained by numerical optimization of the likelihood function or by using the glm() function in R (see results below).

I have code that provides a reproducible comparison of analyses using Julia’s glm(), Julia’s optimize(), and R’s glm(), but I am unable to upload it here. My results were obtained using Julia version 1.11.6+0.x64.linux.gnu

Is there an alternative way to fit this class of models using Julia’s glm()? When calling Julia’s glm(), I assigned the response as the proportion of successes and the wts variable as the total number of trials (successes plus failures).

julia> include("glm_BinomialCounts.jl")

Analysis using GLM in Julia language

Beta = [-0.48785092134958846, 1.8060232813792254]
se = [0.27280263377390546, 0.34845194964021076]

Analysis using numerical optimization

Beta = [-0.49814787518412995, 1.7812620322369088]
se = [0.07709903822971975, 0.09545673648885034]

Analysis using GLM in R language

Beta = [-0.4981478751358396, 1.7812620318091115]
se = [0.0770985367034779, 0.09545543332580649]

Why are you unable to “upload” the code? You can just post it here?

I am not able to upload the code because I am new to this mailing list. Evidently, list policy precludes new members from uploading files.

Therefore, I uploaded the code (glm_BinomialCounts.jl) to one of my GitHub repositories:

where you can see how Julia’s glm() function is used as follows:

fit_glm = glm(@formula(proportion ~ x), data, Binomial(), LogitLink())

My point was more that a cleaned-up MWE of your code is only around 20 lines, so you can just post it:

using DataFrames, Distributions, GLM, Random, RCall, StatsBase

### Simulate data
n = 100
x = randn(n)
true_logit = -0.5 .+ 1.8 .* x  # true linear predictor: intercept + slope * x
p = cdf.(Logistic(), true_logit)  # broadcast cdf over the vector of linear predictors
N = sample(5:25, n, replace=true)
y = rand.(Binomial.(N, p))   # Binomial counts
data = DataFrame(success = y, failure = N .- y, x = x)
data.wts = data.success .+ data.failure
data.proportion = data.success ./ (data.success .+ data.failure)

### Fit the model using glm() and its @formula macro
glm(@formula(proportion ~ x), data, Binomial(), LogitLink())
#proportion ~ 1 + x
#
#Coefficients:
#──────────────────────────────────────────────────────────────────────────
#                 Coef.  Std. Error      z  Pr(>|z|)  Lower 95%   Upper 95%
#──────────────────────────────────────────────────────────────────────────
#(Intercept)  -0.530051    0.246506  -2.15    0.0315  -1.01319   -0.0469075
#x             1.75963     0.411472   4.28    <1e-04   0.953157   2.5661
#──────────────────────────────────────────────────────────────────────────

### Fit the model using R language's glm() function
R"""
glm(cbind(success, failure) ~ x, $data, family=binomial)
"""
#Call:  glm(formula = cbind(success, failure) ~ x, family = binomial,
#    data = `#JL`$data)
#
#Coefficients:
#(Intercept)            x
#    -0.5283       1.7423
#
#Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
#Null Deviance:      619.3
#Residual Deviance: 110.3        AIC: 361.3

But I assume that this question is more of an issue for the GLM repo.

Thanks, but your post fails to include a comparison of the uncertainty in the parameter estimates. The standard errors of the estimates are MUCH higher using Julia’s glm() than those obtained using R’s glm() or those obtained using Julia’s Optim package.

So my original question remains: Is there an alternative way to fit this class of models using Julia’s glm()?

Seems like the weights can be passed via the wts argument:

glm(@formula(proportion ~ x), data, Binomial(), LogitLink(), wts = data.wts)

This seems to match the results from R – also regarding the uncertainties. Unfortunately, there does not seem to be an extension to the formula to pass the number of trials directly, e.g., success | trial ~ x where trial = success .+ failure or something.

Yes, and to add to this: columns named weights or wts in the data frame are not used automatically. wts is a keyword argument, and per the documentation it is for “case weights”, i.e. it is essentially equivalent to repeating each observation wts times.
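That “case weights” interpretation can be checked directly: fitting with wts should match fitting the same rows literally repeated. A minimal sketch (the data frame and its values here are my own illustration, not from the thread):

```julia
using DataFrames, GLM

# Toy binary data with frequency-style case weights (values are illustrative)
df = DataFrame(x   = [-1.0, -1.0, 0.0, 0.0, 1.0, 1.0],
               y   = [ 0.0,  1.0, 0.0, 1.0, 0.0, 1.0],
               wts = [ 3.0,  1.0, 2.0, 2.0, 1.0, 3.0])

# Fit with case weights...
m1 = glm(@formula(y ~ x), df, Binomial(), LogitLink(); wts = df.wts)

# ...and with each row literally repeated wts times
expanded = reduce(vcat, [repeat(df[i:i, :], Int(df.wts[i])) for i in 1:nrow(df)])
m2 = glm(@formula(y ~ x), expanded, Binomial(), LogitLink())

coef(m1), coef(m2)  # the point estimates should agree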

So it seems to me that the behavior of Julia’s glm() is a bit idiosyncratic. The simulated data frame (named data) has the following components:

propertynames(data)
5-element Vector{Symbol}:
:success
:failure
:x
:wts
:proportion

My first attempt to use Julia’s glm() resulted in the following failure:

fit_glm = glm(@formula(proportion ~ x), data, Binomial(), LogitLink() wts=wts)
ERROR: ParseError:
# Error @ none:1:71
fit_glm = glm(@formula(proportion ~ x), data, Binomial(), LogitLink() wts=wts)
#                                                                    └─────┘ ── Expected `)`

This parse error (caused by the missing comma before wts=wts) is why I had omitted the wts argument altogether, which still produced a result, but evidently with the wrong (unweighted) estimates.

Nils noted that the following call to glm() yields the correct estimates:

fit_glm = glm(@formula(proportion ~ x), data, Binomial(), LogitLink(), wts=data.wts)

Below are the results of this call using the simulated data:

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      z  Pr(>|z|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.550731   0.0658958  -8.36    <1e-16  -0.679884  -0.421577
x             1.74013    0.103634   16.79    <1e-62   1.53701    1.94325
─────────────────────────────────────────────────────────────────────────

This means that when the data object (named data) was passed to Julia’s glm(), it was able to interpret the response variable and the regressor by name (proportion and x, respectively), but it was not able to interpret the case weights variable by name (wts). Instead, Julia’s glm() only worked when the case weights were assigned explicitly, as wts = data.wts.

I recommend that the authors of Julia’s GLM package correct this idiosyncrasy.

Well, I would claim that R is the idiosyncratic one here. Due to its non-standard evaluation, any argument can be evaluated in some – possibly constructed at runtime – environment, and R’s glm function uses this to interpret names in the formula, as well as in some arguments (such as offset or weights), in the context of the data frame.
In Julia this is not possible – at least not with a function – as all function arguments are evaluated and only their values are passed to the function. To opt out of this behaviour, macros can be used to rewrite the code before it is evaluated – note that @formula is a macro, which is what allows names in the formula to not be evaluated directly but instead be looked up in the data argument.
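A minimal sketch of that difference (the formula object and names here are my own illustration):

```julia
using StatsModels

# @formula is a macro: it captures the names `proportion` and `x` symbolically
# and defers their lookup until a data source is supplied. No variables with
# those names need to exist in the current scope.
f = @formula(proportion ~ x)
f isa FormulaTerm  # true; nothing has been evaluated yet

# A keyword argument such as wts, by contrast, is evaluated immediately in the
# caller's scope: `wts = wts` fails unless a variable named wts exists there,
# while `wts = data.wts` passes an already-evaluated vector of values.
```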

Many thanks for clarifying. So problems, or at least confusion, can arise from the difference between macro behavior (such as @formula) and the ordinary pairing of arguments with values passed to a function.

This difference brings to mind a related question. If the response variable were binary-valued, then Julia’s glm() provides two possible ways to specify the same binary-regression model. One way is using the @formula macro as follows:

glm(@formula(y ~ x), data, Binomial(), LogitLink())

A second way is using the model matrix X and the vector y as follows:

glm(X, y, Binomial(), LogitLink())

Can the second approach be used to specify a model with binomial counts as the response? If so, how?
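For what it’s worth, a hedged sketch: assuming the matrix method accepts the same wts keyword as the formula method (both dispatch to fit(GeneralizedLinearModel, ...)), binomial counts can be expressed the same way, with the observed proportions as y and the trial counts as wts. The data values below are my own toy example:

```julia
using GLM

# Toy binomial-count data (purely illustrative)
xv      = [-1.0, -0.5, 0.0, 0.5, 1.0]
trials  = [10.0, 12.0, 8.0, 15.0, 9.0]
success = [ 2.0,  4.0, 4.0, 11.0, 8.0]

X = hcat(ones(length(xv)), xv)  # model matrix with an explicit intercept column
y = success ./ trials           # response as observed proportions
m = glm(X, y, Binomial(), LogitLink(); wts = trials)
coef(m)                         # [intercept, slope]
```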