The main reason is fitting a model with Laplace approximation uses to be much faster.
That’s true, it used to be much faster than MCMC, but improvements in MCMC samplers and better computers mean that you usually don’t need to use Laplace approximation anymore. Laplace approximation is going to fail very often, especially if you have more than a handful of variables.
I found the Lapplace approximation some time ago when I needed to fit a complex model with many variables (some were categorical with some levels occurring few times) to a very large dataset.
Libraries such as MixedModels or R’s lme4 just didn’t work because they use huge amounts of memory (much more than available on my computer) and can be extremely slow.
glmTMB (an R library that uses the Lapplace pproximation) allowed me to compute everything quickly, and the results were very accurate (I tested it with small samples).
Maybe this methods are not always good, but in general they are very useful.
What do you mean by small samples? If you’re using very small samples, that probably wouldn’t be enough to reveal an error in the distribution. In addition, if you tested it against lme4’s answers, that’s not going to reveal the problems because lme4 uses a method very similar to Laplace approximation (restricted maximum likelihood) and will therefore tend to fail in the same way and in the same cases as Laplace approximation. If you want, you can use Turing.jl’s No-U-Turn Sampler (NUTS) to check the fit; it’ll probably give you a more accurate answer faster.
@Lime I appreciate your thoughts and recognize that the LA is not a panacea. However, the Stan dev team sees value in it (see here and note their arguments for its value), and INLA and glmmTMB are used widely. It’s also not always an approximation, if your likelihood is normal then it’s accurate. So it is clearly a powerful tool for many applied statistics situations and I would argue a really valuable asset for the Julia language to have. In quantitative ecology the capabilities of TMB have transformed the field in my opinion. So I disagree there are “almost always far better ways to fit models.”
This post wasn’t about HMC per se but I’ll note that it is trivial to plug a TMB model into Stan (via tmbstan package) and do NUTS or ADVI. And note that TMB already has the capability to use NUTS on the marginal posterior after applying the LA to arbitrary subsets of parameters, see the brief demo in this paper, which is what they are trying to build into Stan with their adjoint method. In my experience NUTS is usually more than 100 times slower than the LA, which is why the LA is predominantly used rather than NUTS despite the availability of both. I haven’t tried ADVI mostly b/c the Stan team steers people away from it anyway (if I understand correctly). Unless I’m missing something and Julia has substantially better HMC capabilities than Stan? FWIW TMB also has an importance sampling option to check the accuracy of the approximation.
Having said that, I feel like this thread has wandered a bit off topic. If this is the right place to have this conversation for the Julia community, then that’s fine. But I prefer to keep this thread focused on implementation of the LA in Julia, rather than the pros/cons of the approach in general. What I hope to get out of this is a first-pass benchmark showing that Julia can replicate TMB for non-trivial models. I think that would be a compelling capability for Julia for a subset of applied statisticians.
I don’t have any problems with LA itself – it’s a beautiful method that works very well in some cases. I recognize that Laplace Approximation can be useful as part of a larger overall algorithm (and have used it in similar cases myself). Note that I’d consider LA+Importance Sampling to be one of the “Better methods” I described, since it does actually check whether an adequate fit has been achieved. However, there’s a time and a place for LA, and that time and place is not as a default choice for people trying to fit their data without carefully checking that it’s a good approximation just because “It’s a lot faster” (even though that speed comes at the cost of silently returning bad fits very often). This is the kind of thing packages like lme4 encourage, and it’s not a good habit.
That being said, if you’d like to use LA, it’s not difficult to implement. Build some function that evaluates the log-likelihood at a point (Soss and Turing can both do this for you). Use Optim to find your MAP estimate, then use Zygote to take the Hessian at that point. This should be pretty easy to do.
I think it would be a good idea to add LA+IS to Turing.jl, and if anyone in this thread would like to do that I’d welcome it. I just think that using LA without good diagnostics or a very strong understanding of how your posterior will be shaped is a very bad idea.
Adding to Cole’s explanation, the random effects we’re talking about here are typically high-dimensional latent spatial fields and/or time series, which are a challenge to fit with Turing/NUTS (speaking from personal experience). If the runtime difference is 1 second vs 2 minutes, sure, I’ll just have a cup of coffee while I wait for an exact posterior. But if it’s 1 day vs. 2 months, the accuracy/runtime tradeoff looks a lot different.
As you say, the LA itself is just a couple of lines of code in Julia, but for a big latent field the
Zygote.hessian might not be practical to compute, or even fit in memory (hence all the talk above about automatically sparse differentiation). I agree it would be great to integrate the LA with Turing. Once the remaining sparse Hessian issues get ironed out, that (hopefully) shouldn’t be too hard to do.
Yeah, that makes sense, and there’s nothing wrong with something like Laplace approximation or variational inference if you also add an importance sampling step. I will note that as a rule of thumb, variational inference is a better idea than Laplace approximation if you’re using a good method like SVGD or NFVI, and it gets you an answer on a roughly similar time scale. AFAIK the only thing that’s been implemented in Julia is ADVI, though, which is extremely restrictive (even moreso than LA if you don’t use full rank). The problem I have is more with doing LA without having any kind of check to make sure you got good convergence. I think building LA+an importance sampler would be a great idea and I’d love to see it implemented.
Would you be willing to provide more details on the types of models that used too much memory when you tried to fit them with MixedModels? We have been able to fit some very large models with this package and we are always interested in getting examples to show us what we need to work on.
There is a MixedModels-dev channel on julialang.zulipchat.com
Regarding lme4 and MixedModels, for a linear mixed-effects model the profiled log-likelihood is evaluated directly - there is no approximation involved. For a generalized linear mixed model the default method to evaluate the log-likelihood is the Laplace approximation. Adaptive Gauss-Hermite quadrature is also available for scalar random effects with a single grouping factor.
I did it a long time ago, but it fas something like this
binaryoutcome ~ age + weight + year + sex + weight + t1 + t2 + t3 +t4+ t5 + t6 + race + interaccions… + (1|hospital/ID) + (1|city), family=bernoulli)
Where ID has 10 million levels, and for each ID I have up to 100 rows.
t1, t2, t3, t4, t5, t6 are categorical variables with just two levels. City and hospital have 100 levels. Race and year have 10 levels. The other variables are continuous.
The dataset is very umbalenced.
If I just use a small sample of the data I miss some combinations of the covariables with low frequency levels.
I did compare different approaches, subsampling, partitionins and using metaanalysis and also with the whole dataset and a large SSD disk.
My experience was that glmmTMB was able to deal with datasets ten times larger than MixedModels without crashing.
That’s interesting. It may be possible to fit such a model more effectively in the development version of MixedModels. If you can point me to some data (without violating HIPPA guidelines) of this type I could try.
The way you have set up the random effects term (1|hospital/ID) treats a person within hospital A as unrelated to the same person within hospital B.
Given that you have so much data I might be tempted to treat hospital and city as fixed effects.