Laplace approximation for mixed models in Julia (TMB-like functionality)

Hello,

[Apologies for the lack of URLs; I’m a new user, so I’m limited to only 2. I can provide more in a follow-up on request.]

I’m writing to inquire about the current state of Julia with respect to the Laplace approximation for mixed-effects statistical models. I was recently turned on to Julia by a nice talk by Sam (which is posted on Discourse but not linked here). I think the biggest hurdle to recruiting new users to Julia in our field (quantitative ecology/fisheries science) is that it needs to provide functionality similar to, or better than, the software package Template Model Builder (TMB). This has been discussed briefly on forums here before (see especially this discussion).

From a user perspective, the beauty of TMB is that you simply write your objective function as a C++ template. TMB then does the Laplace approximation to integrate out arbitrary subsets of parameters (typically random effects), declared in R after compilation, while automatically detecting and exploiting sparsity in the calculations (unlike INLA, where sparsity must be known a priori). See the adcomp github repo tutorial for a flavor of the syntax and user experience.

Per the previously linked thread, it’s been used widely in glmmTMB. But more importantly, TMB scales incredibly well to diverse mixed models beyond regressions, and has led to a recent surge in methods in ecology and related fields (e.g., spatial factor analysis, state-space population dynamics, life history theory, and especially spatio-temporal models). TMB is great, but there would be an undeniable elegance to doing the whole statistical analysis in a single programming language. I haven’t used Julia, but I am certainly intrigued, and I think that if Julia had similar functionality it would make an attractive alternative for cutting-edge statistical model development. Not to mention it would be good to have an alternative to TMB just for benchmarking, checking, and testing.

It was clear in 2019 that “…many of the pieces are in place…” What is the current status of this kind of functionality in Julia? In particular, what are the existing capabilities for (higher-order) AD, Hessian sparsity detection, Laplace approximation, and sparse Cholesky decomposition? How active is development on these features, and how much work would it take to implement TMB’s functionality in Julia?

Thanks in advance!

6 Likes

As I see it,

Statistical tools:

  • Distributions.jl has extensive coverage of statistical distributions. It’s one of the older packages in the Julia ecosystem, so it’s well-tested, but it also has some rough edges around, e.g., working with autodiff. There are extension packages such as DistributionsAD.jl that alleviate this (a minimal sketch follows this list).

  • MeasureTheory.jl is currently under development, but it is exploring ways to improve on Distributions, including dropping normalizing constants when they aren’t needed in order to speed up computations.

  • For SPDE models, Sam’s SPDEPrecisionMatrices.jl and GaussianMarkovRandomFields.jl packages are working, but still have a ways to go to match some of the more advanced features available in INLA such as nonstationarity.

  • For model fitting, there is Optim.jl, but also a number of MCMC samplers including multiple implementations of NUTS/HMC.
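Here is that minimal sketch for Distributions.jl (the Normal model and the data are made up): logpdf-based likelihoods generally work with ForwardDiff out of the box.

using Distributions, ForwardDiff

# toy data and a Normal negative log-likelihood with θ = (μ, log σ)
y = randn(100)
nll(θ) = -sum(logpdf.(Normal(θ[1], exp(θ[2])), y))

ForwardDiff.gradient(nll, [0.0, 0.0])  # gradient via forward-mode AD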

Autodifferentiation:

  • There are many autodiff packages in Julia. Fortunately they are mostly interchangeable and composable (a quick sketch follows this list). The most commonly used packages seem to be ForwardDiff.jl for forward mode and ReverseDiff.jl or Zygote.jl for reverse mode.

  • There is some interesting (mostly theoretical?) work on higher-order autodiff that may develop into something useful (as differentiating through the Laplace approximation requires third-order derivatives).

  • I think that it should be possible to differentiate through a Laplace approximation with the tools currently available, but I haven’t implemented it.

  • Some of the autodiff ecosystem is moving towards using ChainRules.jl to define derivative rules. This means that including custom adjoints in other packages is pretty straightforward and only depends on the lightweight ChainRulesCore.jl package.

  • Sparsity detection is available through ModelingToolkit.jl (and previously as a standalone package in SparsityDetection.jl). Fill-reducing reorderings for the sparse Cholesky factorization are also available via Metis.jl, similar to what TMB’s runSymbolicAnalysis provides.
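Here is that quick sketch (the test function is made up): the same generic Julia function can be handed to either AD backend, and ForwardDiff also provides the Hessians needed for a Laplace approximation.

using ForwardDiff, ReverseDiff

# a simple nonlinear test function, written once in plain Julia
f(x) = sum(abs2, x) + 0.1 * x[1] * x[2]^2
x = randn(3)

ForwardDiff.gradient(f, x) ≈ ReverseDiff.gradient(f, x)  # two AD backends, same code
H = ForwardDiff.hessian(f, x)                            # Hessian, e.g. for a Laplace approximation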

Laplace approximation:

  • There isn’t currently a canonical Laplace approximation package in Julia.

  • Sam’s package MarginalLogDensities.jl includes a Laplace approximation.

  • My LaplaceApproximations.jl also has a version, including an attempt at implementing the adjoint method from the Stan folks, but it’s not currently complete.

Fisheries-specific functionality:

  • One of the primary advantages of Julia (if not the primary advantage) is composability, which makes modularity straightforward. It would not be difficult to develop different pieces of a population dynamics model in different modules and still have them work together nicely. It should also be possible to write these so that the parameters of each component could be either numbers (e.g. for optimization) or distributions (e.g. as priors); a toy sketch follows this list.

  • Another strength is the ability to write domain-specific languages (DSLs). This can make defining models convenient and clear, but can also be used to transform the code for a model so that a single definition can be used to generate code to calculate a likelihood/posterior density as well as to simulate fake data from the model.
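Here is that toy sketch of composability (the function and parameter names are entirely made up): a generically written population-dynamics component runs unchanged with plain numbers, AD dual numbers, or draws from prior distributions.

using Distributions

# toy Beverton-Holt recruitment function; the names are hypothetical
beverton_holt(S, a, b) = a * S / (1 + b * S)

beverton_holt(1000.0, 2.0, 0.001)            # point estimates, e.g. during optimization
a_prior = LogNormal(log(2.0), 0.5)
beverton_holt(1000.0, rand(a_prior), 0.001)  # a draw from a prior on a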

Essentially, I believe that all the pieces are in place to implement a TMB equivalent in Julia. The only place where we may be pushing the boundaries of current practice is in higher-order autodifferentiation, particularly in needing third-order derivatives.

5 Likes

Can you be a bit more explicit, or provide links, about what types of mixed-effects models you are interested in? The MixedModels package allows for fitting linear mixed models (LMMs) or generalized linear mixed models (GLMMs) in Julia. The scope of these models is similar to that of the lme4 package in R. The evaluation of the profiled log-likelihood for LMMs is direct, so there is no need for an approximation.

The Laplace approximation is the default for evaluating the deviance for GLMMs. It is also possible to use adaptive Gauss-Hermite quadrature for models with a single grouping factor for the random effects.
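For example, with a made-up simulated data set, fitting with both approaches looks something like this:

using MixedModels, DataFrames, Distributions

# made-up binary data with a single grouping factor g
df = DataFrame(g = repeat(string.(1:20), inner = 25), x = randn(500))
df.y = Float64.(rand.(Bernoulli.(1 ./ (1 .+ exp.(-(0.5 .+ df.x))))))

# Laplace approximation (the default for the GLMM deviance)
m_laplace = fit(MixedModel, @formula(y ~ 1 + x + (1 | g)), df, Bernoulli())

# adaptive Gauss-Hermite quadrature with 7 nodes (single grouping factor only)
m_agq = fit(MixedModel, @formula(y ~ 1 + x + (1 | g)), df, Bernoulli(); nAGQ = 7)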

2 Likes

Unfortunately many of the models we need to fit aren’t (to my knowledge) covered by lme4 or MixedModels.

We’re looking to fit models that may include:

  • spatial and/or spatiotemporal random effects (we usually use the SPDE approach to represent a GMRF for these)
  • observation likelihoods with both exact zeros and positive values, typically formulated as a delta model or a compound Poisson-gamma (a toy sketch of a delta likelihood follows this list)
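The delta (hurdle) part, at least, is simple to write out by hand. Here is that toy sketch (the parameterization is made up), with probability p of an exact zero and a Gamma density for the strictly positive observations:

using Distributions

# toy delta (hurdle) log-likelihood; parameter names are made up
function delta_loglik(y, p, shape, scale)
    ll = 0.0
    for yi in y
        ll += iszero(yi) ? log(p) : log1p(-p) + logpdf(Gamma(shape, scale), yi)
    end
    return ll
end

delta_loglik([0.0, 0.0, 1.3, 4.7, 0.0, 2.2], 0.4, 2.0, 1.5)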

More complicated models such as Stock Synthesis may include:

  • a population dynamics model and
  • multiple data sources with different likelihoods

Thanks for the clarification.

1 Like

I agree with @jkbest2: we’re talking about non-regression-type models. Here are some links I couldn’t post above.

  • spatial factor analysis code
  • state-space population dynamics code

Another big idea is that we can write out more complex models with non-linearities and flexible likelihoods like John mentioned, with hundreds of fixed effects and tens of thousands of random effects. Another key feature is building multiple capabilities into a single model: the same model might be used as a spatiotemporal model, a spatial model, or a non-spatial (GLMM-type) model just by fixing certain parameters to 0 and changing what is estimated. The same underlying code then serves multiple, diverse applications, which is why the model code linked above runs to hundreds or thousands of lines.

1 Like

Welcome Cole! I gave that talk, and can answer a few of your questions.

All the pieces mentioned in that previous TMB thread are still in place, and we have a few more as well. I’m not sure about the status of higher-order AD, but as to the other pieces:

  • Hessian sparsity detection and sparse differentiation: sort of there, but not quite. My impression (maybe @ChrisRackauckas can elaborate) is that it wouldn’t take too much effort to implement; it just needs attention from someone with the time and domain knowledge.

  • Laplace approximation: given an optimizer and an AD system, this is in principle just three lines of code (assuming f is your negative log-density function and N is the number of variables):

using Optim, ForwardDiff, LinearAlgebra
# minimize the negative log-density, with a forward-mode AD gradient
opt = optimize(f, zeros(N), LBFGS(), autodiff=:forward)
# Hessian of the negative log-density at the mode
H = ForwardDiff.hessian(f, opt.minimizer)
# Laplace approximation to the log of ∫ exp(-f(x)) dx
integral = -opt.minimum + 0.5 * (log((2π)^N) - logdet(H))

@jkbest and I already have working implementations here and here. The trickier parts are making sure the optimization is fast, and the Hessian is automatically sparse if it can be.

  • Sparse Cholesky decomposition in Julia is done via SuiteSparse/CHOLMOD, which I believe is the same library TMB uses. This PR should let those decompositions be used more easily for AD and Laplace approximations after the next Julia release (see also this issue).
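For a flavor of what that looks like today, here is a minimal sketch with a made-up 1-D random-walk-style precision matrix:

using SparseArrays, LinearAlgebra

# toy tridiagonal precision matrix, like a 1-D random-walk GMRF (made up here)
n = 1000
Q = spdiagm(-1 => fill(-1.0, n - 1), 0 => fill(2.01, n), 1 => fill(-1.0, n - 1))

F = cholesky(Q)   # sparse Cholesky via SuiteSparse/CHOLMOD, with a fill-reducing ordering
logdet(F)         # the log-determinant term needed for a Laplace approximation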

Basically, I agree with @jkbest that a TMB equivalent in Julia is very much within reach. My work-in-progress MarginalLogDensities package can already take an arbitrary log-density function and integrate out an arbitrary subset of its arguments using the Laplace approximation with automatic sparsity detection (though that part is not optimal). See this simple demo for an example. It will just take some sustained effort from a couple/few knowledgeable people to fill in the missing bits and plug all the pieces together.

3 Likes

Hessian sparsity detection is already done, along with the necessary acyclic coloring. What’s needed for sparse Hessian differentiation is the compression and decompression scheme.

2 Likes

Sorry, I should have said “sparse differentiation,” not “sparsity detection.” Do you have a sense of how big a lift implementing the compression/decompression scheme would be? Is it the sort of thing one of us quantitative ecologists could usefully contribute to?

Well this is quite promising. Thanks for the replies.

I feel like doing a benchmark comparison of a spatial factor analysis would be a good place to start. Maybe Sam could modify his example based on a simple TMB example I could provide?

It’s relatively straightforward. See https://epubs.siam.org/doi/10.1137/S0036144504444711 .

3 Likes

Sure, I’d be willing to try that! Just let me know what example you’ve got in mind.

A bigger-picture question to ponder as well is where and how to assemble the pieces of this hypothetical TMB-like functionality.

MarginalLogDensities is loosely modeled on @Tamas_Papp’s LogDensityProblems: it provides a relatively simple wrapper around a Julia log-likelihood/log-posterior function, which adds some useful extra information and capabilities. (In LogDensityProblems, those are variable transformations and gradients; in MarginalLogDensities they are a set of parameters to integrate out and a method for doing it.) Something along these lines, maybe integrating with the LogDensityProblems interface, is probably the simplest way to get up and running.
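Purely as a hypothetical sketch of that wrapper idea (this is not the actual MarginalLogDensities API), the extra information amounts to something like:

# hypothetical sketch only, not the real API: a log-density bundled with the
# indices of the variables to integrate out and a method for doing so
struct MarginalizedDensity{F,M}
    logdensity::F     # joint log-density over all variables
    iw::Vector{Int}   # indices of the variables to marginalize out
    method::M         # e.g. an object describing the Laplace approximation
end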

However, it would also be nice to have this kind of functionality for probabilistic programming frameworks like Turing and Soss (cf this issue, referencing @monnahc’s paper on tmbstan). It would also be nice to have this capability within the ModelingToolkit/GalacticOptim ecosystem.

I guess once the last few missing pieces are in place it shouldn’t be too hard to incorporate Laplace-approximation marginalization into various frameworks, but it’s worth mulling over what kind of interface would be best from a non-expert user’s perspective.

2 Likes

Okay, I spent a few hours yesterday evening working through the relevant sections of that paper and have the “direct” Hessian decompression method working:

Sparse Hessian MWE
using ForwardDiff
using SparsityDetection, SparseDiffTools
using LinearAlgebra, SparseArrays

# simple nonlinear test function and its AD gradient
f(x) = dot(x, x.^2) + 0.3x[2]*x[3]
g(x) = ForwardDiff.gradient(f, x)
x = randn(3)
# detect the Hessian sparsity pattern and color the lower triangle
hs = hessian_sparsity(f, x)
col = matrix_colors(tril(hs))
# seed matrix with one column per color
D = hcat([float.(i .== col) for i in 1:maximum(col)]...)
δ = 1e-8
ii, jj, vv = findnz(hs)
# compressed Hessian: finite differences of the gradient along each seed direction
Hc = hcat([(g(x + δ * D[:, j]) - g(x)) / δ for j in 1:maximum(col)]...)
# "direct" decompression: read each structural nonzero from the column of its color
H = sparse(ii, jj, zeros(length(vv)))
for (i, j) in zip(ii, jj)
    H[i, j] = Hc[i, col[j]]
end
# compare against the dense AD Hessian
maximum(abs.(H .- ForwardDiff.hessian(f, x)))

I just put a quick-and-dirty implementation of this in MarginalLogDensities, though it should/will eventually go in SparseDiffTools. It currently uses finite differencing of AD gradients (a numauto_* method in the SparseDiffTools nomenclature, I think?). Can probably be improved, but it’s faster than what I was using before, so we can provisionally cross this piece off the list.

1 Like

@ElOceanografo I just invited you to a github repo that has the SFA example. Feel free to push changes as you see fit.

This spatial factor analysis is a good first example I think. It’s relatively simple to code, but is challenging to fit. It’s also really easy to scale it up in dimensionality. It uses the SPDE approach from INLA, and I put a trivial delta-method example of a derived quantity (Range).

I used R to simulate data and fit the TMB model with some crude timing benchmarks. The TMB model is in this file, which should be pretty clear but I can help translate if you have any questions.

I’ll be very curious to see how the Julia implementation of this does. Thanks for looking into it.

2 Likes

That’s not optimal. You want to use the backtracking sequential coloring (which is implemented already: https://github.com/JuliaDiff/SparseDiffTools.jl/pull/52) and then implement the more specialized Hessian-based decompression, which would make use of the symmetry to reduce the calculations a bit more. What you showed is a nice quick hack though.

Is that the BacktrackingColor() method for matrix_colors? If so, it seems not to work:

BacktrackingColor error
using SparseDiffTools
hs = [1 0 0; 0 1 1; 0 1 1]
matrix_colors(hs, SparseDiffTools.BacktrackingColor())

ERROR: MethodError: no method matching free_colors(::Int64, ::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}, ::VertexSafeGraphs.VSafeGraph{Int64, LightGraphs.SimpleGraphs.SimpleGraph{Int64}}, ::Int64)
Closest candidates are:
  free_colors(::Integer, ::AbstractVector{var"#s14"} where var"#s14"<:Integer, ::AbstractVector{var"#s11"} where var"#s11"<:Integer, ::Vector{Integer}, ::LightGraphs.AbstractGraph, ::Integer) at C:\Users\sam.urmy\.julia\packages\SparseDiffTools\HWfTE\src\coloring\backtracking_coloring.jl:148
Stacktrace:
 [1] color_graph(g::VertexSafeGraphs.VSafeGraph{Int64, LightGraphs.SimpleGraphs.SimpleGraph{Int64}}, #unused#::SparseDiffTools.BacktrackingColor)
   @ SparseDiffTools ~\.julia\packages\SparseDiffTools\HWfTE\src\coloring\backtracking_coloring.jl:45
 [2] #matrix_colors#1
   @ ~\.julia\packages\SparseDiffTools\HWfTE\src\coloring\high_level.jl:19 [inlined]
 [3] matrix_colors(A::Matrix{Int64}, alg::SparseDiffTools.BacktrackingColor)
   @ SparseDiffTools ~\.julia\packages\SparseDiffTools\HWfTE\src\coloring\high_level.jl:17
 [4] top-level scope
   @ REPL[26]:1

Am I missing something, or is this a bug?

Also, just so I’m clear, is the more specialized Hessian decompression method you’re talking about the one Gebremedhin et al. call the “substitution method” (section 6 in the paper)?

Actually sorry, I think you’re close. IIRC from Section 4 of the paper, you just need to change your code to use the star coloring algorithm (greedy_star2_coloring.jl in SparseDiffTools.jl) and then call Symmetric on the result (because the coloring only guarantees calculating half of the values). Then, with a few choices for how the gradient is computed, I think it’s all done.

Ok, great! This is wandering off the original topic, so I opened an issue in SparseDiffTools. Just a note for posterity: the trick I was doing above, calling matrix_colors on the lower triangle of the sparsity pattern, doesn’t work for all Hessians. I did it because it made the coloring match for one of the small example matrices in the paper, but it doesn’t in general, and it can produce errors.

1 Like

So, as you can see, there are tons of ways to do a Laplace approximation in Julia.

But the bigger question is: why do you want to use Laplace approximations? There are almost always far better ways to fit models. Even just adding an importance sampling step will give you a method that is actually consistent (probably).

Turing.jl has methods for fitting models that are much better than Laplace approximations (e.g. HMC, Particle Sampling, and ADVI); plus it has an amazing modeling language that’s very easy to use to boot. If you want to build packages for fitting models more easily, I’d talk to the folks there and in the Probabilistic Programming section of the Discourse.
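For a flavor of the modeling language, a toy model (data and priors made up) fitted with NUTS looks like:

using Turing

# toy model: unknown mean and standard deviation, made-up priors and data
@model function toy_model(y)
    μ ~ Normal(0, 10)
    σ ~ truncated(Normal(0, 5), 0, Inf)
    for i in eachindex(y)
        y[i] ~ Normal(μ, σ)
    end
end

chain = sample(toy_model(randn(100)), NUTS(), 1_000)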