I’m trying to fit mixed-effect models to large dataset.
R’s lme4 is very slow and can’t deal with more than 50000 rows on my computer. The amount of memory seems to increase with the cube of the number of rows.
Julia’s Mixedmodels is approximatelly 5 times faster but has the same limit.
There are other R’s packages that sometimes can deal with much larger datasets. For example mgcv but it only works well if you have few levels of the random variable.
The other package I’ve found is glmmTMB (Generalized linear mixed models with various extensions, including zero-inflation. ), I’m trying but for now I’ve got to fit the model with 400000 rows, though it took 850 seconds. I think it will be able to fit larger problems.
Is there any “Template Model Builder” package for Julia? I guess it would be faster than its R counterpart.
Can you describe the concrete model (the forum supports LaTeX), and your data dimensions (is 400k rows the target, or was that just an experiment)?
When I wrote he message I got to fit the model with 400000 rows, I’m trying with increasingly more rows and different packages, now I’ve tried again and got to do it with 1.5 million rows (needed 1 hour computation on a computer with 16GB RAM) with glmmTMB. If I try with lme4 and Mixedmodels.jl they just crash even with much smaller problems.
My whole problem has information (from hospitals) of 4 million people through 15 years, repeated measures.
When I reshape the data to long format I get a table with around 60 million row (if I just keep one measure per year).
I want to fit several models, some linear, some logistic, such as
(binaryoutcome ~ age + weight + year + sex + treatment1 + treatment2 + race + (1|ID) + (1|hospital), family=bernoulli)
or maybe a little bit more complex and with some interactions or even nesting the random effects.
I’ve found the main problem is the (1|ID) term because ID has too many levels.
I know I can try different tricks to simplify things but I’d also like to get the results of the whole model in order to compare it.
And of course I have already standarized the x variables, otherwise it never converges.
Sorry if this is slightly off-topic, but given that you’ve been asking quite a few similar questions over the past weeks - have you considered running your models on random samples?
Looking e.g. at the NHS England resource allocation formula, which uses data on 55 million people over 3-5 years, and fits similar models (albeit often with lots more covariates) they seem to get by perfectly fine on a small random sample - you might want to think about just sampling as much as your hardware can handle in a reasonable amount of time?
Yes, this is one of the things I want to do, and I want to compare the results to the results obtained by using the whole dataset.
Written in this form, I think your main problem is the very weak or non-existent identification of
ID (given that the majority of people die at most once).
Have you tried OLS? The assymptotic properties of the t-statistics are broadly similar. Plus you get the easy benefit of interpretable coefficients.
You should try
fixedeffectmodels.jl which claims to have something similar to Stata’s
To address the original question, there isn’t anything equivalent to TMB in Julia yet. For those who aren’t familiar, TMB (https://arxiv.org/pdf/1509.00660.pdf; https://github.com/kaskr/adcomp) is a software package that fits statistical models, typically using maximum likelihood (MCMC is also available). You write your model as a templated C++ function that returns a log-likelihood (log-posterior density if you’re Bayesian). in that sense, it is similar in spirit to @Tamas_Papp’s
DynamicHMC.jl. TMB uses the CppAD package to take derivatives and can integrate out arbitrary sets of parameters (typically any random effects) using the Laplace approximation. The primary interface is through R. This function is then maximized, typically using
optim or similar.
glmmTMB defines a few common models.
We have many of the pieces in place to do something similar in Julia.
Distributions.jl defines most of the probability density functions you could ever want,
Optim.jl has a robust set of optimization algorithms, including those that can use gradient and Hessian information. I’m not quite sure what the higher-order autodiff story is at this point. TMB uses up to third derivatives to get gradients of the Laplace-approximated expected densities. I think the biggest missing piece at this point is a package for doing Laplace approximations.
Going further in the direction of subsampling, I’d suggest plotting a “learning curve”, or some model convergence metric vs. the number of samples in a dataset you can easily fit. This would address the question “how much data does my model need?”, and whether more data is expected to help.
As someone who works with large medical data, I’d be surprised if your current numbers aren’t more than enough for the model you described, or even much, much more complex ones.
Though that TMB idea does sound pretty cool.
I was doing that MixedModels.jl and lme4 can’t deal with that amount of data.
Sumbsampling sounds nice but sometimes it doesn’t work: when some combinations of the predictive variables happen very rarely.
All right, I told death but I meant any binary outcome.
From the paper it looks like they use CppAD and optimize for sparsity in the AD graph, which is a neat trick, but I am not aware of any explicit support for this in Julia’s AD packages.
I am curious about the accuracy of Laplace approximations, it is a pity that they only compare to another package using the same methodology.
Yeah, I think the sparsity optimization is key for performance.
There is a comparison of random effects estimated via the Laplace approximation vs. MCMC in this paper.
Will this be integrated on MixedModels.jl?
That’s probably unlikely; these techniques are general and would fit better in a dedicated package. I also don’t think that
MixedModels.jl depends on any autodiff packages, so this would be a large change with limited payoff.
I think that you should reconsider subsampling your data. Remember that if you have enough covariates, every covariate combination is rare. You want to think carefully about how you subsample; don’t take random rows, but subsample
ID, then take the rows that correspond to those
IDs. That way you keep entire patient histories. It’s unlikely that you’ll lose much.
It’s a long time since, but dealing with blocking variables with many levels is one of the reasons we introduced the
Grouping() pseudo-constrast. So you use e.g.
contrasts=Dict(:ID => Grouping()) and this avoids constructing a real contrast matrix (which is what causes the memory problem) that’s never actually used when constructing the random effects matrices.