Is there any glmmTMB package for Julia?

Juan · January 21, 2019, 12:51pm

I’m trying to fit mixed-effect models to large dataset.
R’s lme4 is very slow and can’t deal with more than 50000 rows on my computer. The amount of memory seems to increase with the cube of the number of rows.
Julia’s Mixedmodels is approximatelly 5 times faster but has the same limit.

There are other R’s packages that sometimes can deal with much larger datasets. For example mgcv but it only works well if you have few levels of the random variable.

The other package I’ve found is glmmTMB (Generalized linear mixed models with various extensions, including zero-inflation. ), I’m trying but for now I’ve got to fit the model with 400000 rows, though it took 850 seconds. I think it will be able to fit larger problems.

Is there any “Template Model Builder” package for Julia? I guess it would be faster than its R counterpart.

Tamas_Papp · January 21, 2019, 12:57pm

Can you describe the concrete model (the forum supports LaTeX), and your data dimensions (is 400k rows the target, or was that just an experiment)?

Juan · January 21, 2019, 3:49pm

When I wrote he message I got to fit the model with 400000 rows, I’m trying with increasingly more rows and different packages, now I’ve tried again and got to do it with 1.5 million rows (needed 1 hour computation on a computer with 16GB RAM) with glmmTMB. If I try with lme4 and Mixedmodels.jl they just crash even with much smaller problems.

My whole problem has information (from hospitals) of 4 million people through 15 years, repeated measures.
When I reshape the data to long format I get a table with around 60 million row (if I just keep one measure per year).

I want to fit several models, some linear, some logistic, such as
(binaryoutcome ~ age + weight + year + sex + treatment1 + treatment2 + race + (1|ID) + (1|hospital), family=bernoulli)
or maybe a little bit more complex and with some interactions or even nesting the random effects.

I’ve found the main problem is the (1|ID) term because ID has too many levels.

I know I can try different tricks to simplify things but I’d also like to get the results of the whole model in order to compare it.
And of course I have already standarized the x variables, otherwise it never converges.

nilshg · January 21, 2019, 4:55pm

Sorry if this is slightly off-topic, but given that you’ve been asking quite a few similar questions over the past weeks - have you considered running your models on random samples?

Looking e.g. at the NHS England resource allocation formula, which uses data on 55 million people over 3-5 years, and fits similar models (albeit often with lots more covariates) they seem to get by perfectly fine on a small random sample - you might want to think about just sampling as much as your hardware can handle in a reasonable amount of time?

Juan · January 21, 2019, 6:02pm

Yes, this is one of the things I want to do, and I want to compare the results to the results obtained by using the whole dataset.

Tamas_Papp · January 21, 2019, 6:02pm

Written in this form, I think your main problem is the very weak or non-existent identification of ID (given that the majority of people die at most once).

pdeffebach · January 21, 2019, 6:07pm

Have you tried OLS? The assymptotic properties of the t-statistics are broadly similar. Plus you get the easy benefit of interpretable coefficients.

You should try fixedeffectmodels.jl which claims to have something similar to Stata’s reghdfe.

jkbest2 · January 21, 2019, 8:09pm

To address the original question, there isn’t anything equivalent to TMB in Julia yet. For those who aren’t familiar, TMB (https://arxiv.org/pdf/1509.00660.pdf; GitHub - kaskr/adcomp: AD computation with Template Model Builder (TMB)) is a software package that fits statistical models, typically using maximum likelihood (MCMC is also available). You write your model as a templated C++ function that returns a log-likelihood (log-posterior density if you’re Bayesian). in that sense, it is similar in spirit to @Tamas_Papp’s DynamicHMC.jl. TMB uses the CppAD package to take derivatives and can integrate out arbitrary sets of parameters (typically any random effects) using the Laplace approximation. The primary interface is through R. This function is then maximized, typically using R’s optim or similar. glmmTMB defines a few common models.

We have many of the pieces in place to do something similar in Julia. Distributions.jl defines most of the probability density functions you could ever want, Optim.jl has a robust set of optimization algorithms, including those that can use gradient and Hessian information. I’m not quite sure what the higher-order autodiff story is at this point. TMB uses up to third derivatives to get gradients of the Laplace-approximated expected densities. I think the biggest missing piece at this point is a package for doing Laplace approximations.

ohadle · January 21, 2019, 9:49pm

Going further in the direction of subsampling, I’d suggest plotting a “learning curve”, or some model convergence metric vs. the number of samples in a dataset you can easily fit. This would address the question “how much data does my model need?”, and whether more data is expected to help.
As someone who works with large medical data, I’d be surprised if your current numbers aren’t more than enough for the model you described, or even much, much more complex ones.

Though that TMB idea does sound pretty cool.

Juan · January 21, 2019, 11:15pm

I was doing that MixedModels.jl and lme4 can’t deal with that amount of data.

Sumbsampling sounds nice but sometimes it doesn’t work: when some combinations of the predictive variables happen very rarely.

Juan · January 21, 2019, 11:19pm

All right, I told death but I meant any binary outcome.

Tamas_Papp · January 22, 2019, 6:37am

From the paper it looks like they use CppAD and optimize for sparsity in the AD graph, which is a neat trick, but I am not aware of any explicit support for this in Julia’s AD packages.

I am curious about the accuracy of Laplace approximations, it is a pity that they only compare to another package using the same methodology.

jkbest2 · January 22, 2019, 3:25pm

Yeah, I think the sparsity optimization is key for performance.

There is a comparison of random effects estimated via the Laplace approximation vs. MCMC in this paper.

Juan · January 22, 2019, 3:42pm

Will this be integrated on MixedModels.jl?

jkbest2 · January 22, 2019, 5:27pm

That’s probably unlikely; these techniques are general and would fit better in a dedicated package. I also don’t think that MixedModels.jl depends on any autodiff packages, so this would be a large change with limited payoff.

I think that you should reconsider subsampling your data. Remember that if you have enough covariates, every covariate combination is rare. You want to think carefully about how you subsample; don’t take random rows, but subsample ID, then take the rows that correspond to those IDs. That way you keep entire patient histories. It’s unlikely that you’ll lose much.

palday · April 28, 2022, 4:44am

It’s a long time since, but dealing with blocking variables with many levels is one of the reasons we introduced the Grouping() pseudo-constrast. So you use e.g. contrasts=Dict(:ID => Grouping()) and this avoids constructing a real contrast matrix (which is what causes the memory problem) that’s never actually used when constructing the random effects matrices.

Topic		Replies	Views
Any Julia's equivalent to R's packages mcgv or mixed-effects models larger than memory? Statistics	9	2957	November 19, 2018
Laplace approximation for mixed models in Julia (TMB-like functionality) Statistics question	35	5196	August 10, 2021
Fastest way for GLMM Performance question , package	6	243	October 7, 2024
GLM is slow on large datasets. Using OnlineStats for regressions? MixedModels? Performance glm	25	5092	November 26, 2018
Fitting Mixed Effects Models - Python, Julia or R? Statistics blog-post	0	530	January 20, 2022

Is there any glmmTMB package for Julia?

Related topics