Pathfinder is a variational method designed to replace some or all phases of the warm-up phase of Markov Chain Monte Carlo. Its draws can also be used to diagnose problems with Bayesian models early in a model development cycle.
Pathfinder integrates with both packages that implement adaptive HMC (AdvancedHMC.jl and DynamicHMC.jl) and also with Turing.jl. Soss.jl and Tilde.hl integration is in the works. See the documentation for details about the method and examples.
Lu Zhang, Bob Carpenter, Andrew Gelman, Aki Vehtari (2021).
Pathfinder: Parallel quasi-Newton variational inference.
arXiv: 2108.03782 [stat.ML]
Great work! I do have nitpick though. Is there a concrete plan of action such that Pathfinder can “replace” any of the Stan adaptation schemes? Pathfinder would definitely be a cool way to search the typical set and initialize the covariance matrix adaptation phase, but that would come in front of the Stan adaptation procedure instead of replacing any of them.
Having Pathfinder as an adaptation procedure is very useful, but it’s also useful to be able to run it in isolation, use the Pathfinder results to diagnose any model problems without running HMC, and then afterward use Pathfinder to initialize HMC. This is already possible with both DynamicHMC and AdvancedHMC. See https://sethaxen.github.io/Pathfinder.jl/stable/examples/initializing-hmc/ for worked examples. With Soss (due to SampleChainsDynamicHMC), one could already do this. With Turing, one does not usually use AdvancedHMC directly but instead Turing.NUTS, which does not yet support enough of AdvancedHMC’s API to make this possible. Hopefully this will soon change, and running Pathfinder then NUTS would be 1 or 2 lines of code.
Very neat, I’ve been on the lookout for better MCMC init schemes for BAT.jl and wanted to do something Hessian-based. Pathfinder seems to be the perfect fit.
The lbfgs_inverse_hessian functionality might be very useful for other use cases as well - do you think that could live in a more lightweight place somewhere (since Pathfinder.jl itself is quite heavy, dependency-wise)?
As for dependencies, I’d like to reduce Pathfinder’s dependencies. Only PSIS.jl and Optim.jl are strictly necessary, but they comprise ~60% of the loadtime. The rest of the dependencies improve usability, efficiency, or extensibility. Since it’s a user-facing package, I felt this was warranted, but if a more reduced developer facing package would be useful, we can discuss that.
I haven’t used BAT, but let me know if you need any help setting that up.
I’ll need to refactor our chain init code a bit, so I take advantage of the MvNormal estimate for both the init point and the initial MCMC proposal/mass matrix (we use a homegrown Metropolis sampler and AdvancedHMC with an additional custom init scheme).
The plan currently is to upstream WoodburyPDMats to PDMatsExtras.jl
That would be neat! Maybe the inverse-hessian estimator could live there too, with more generic arguments (just points and gradients)?
The lbfgs_inverse_hessian functionality might be very useful for other use cases as well
it would be good to have some extra use cases in mind first.
Just from a viewpoint of BAT.jl: We already have a mode-finder capability and will add more VI/ELBO-optimization code for other samplers, so we already have (and will have more) some parts of Pathfinder in it, so to speak. Instead of using Pathfinder as a “black box” via Requires (may be too heavy as a direct dependency, have to check) we could integrate the pathfinder algorithm more deeply into the existing stuff by using lbfgs_inverse_hessian directly (which would be a fairly light dependency).
At the moment, BAT.jl itself is a heavy monolithic monster, but we’re about to split it up into several packages, so I’m starting to pay more attention to dependency load times.
I haven’t look too deeply into the pathfinder maths yet - does the inverse-Hessian estimate have to come from an LBFGS-optimizer “trace”, or would any gradient-based optimizer do? Could one use an SGD scheme, even?
I think that would be somewhat beyond the scope of that package. What might be more useful is an OptimBase-compatible implementation of L-BFGS that explicitly constructs the inverse Hessian matrices we need (as opposed to computing their action using the two-loop algorithm) and supports storing them in the extended trace. This would be more efficient, and maybe we could then drop the Optim dependency. But it seems easy to make mistakes here, so I’d rather use an existing robust implementation.
Yeah from looking at BAT’s current dependencies, I suspect many of Pathfinder’s are already indirect dependencies of BAT, so it may not be an issue.
Probably not, so long as we use Distributions. We want the matrix to be a PDMat type so we can use it as a covariance for a Distributions.MvNormal, which requires a PDMat type. This is the same reason for PDMatsExtras.WoodburyPDMat, hence why ours is named the same, but Pathfinder’s is more general, which is why we can upstream the code. However, if we instead upstream to WoodburyMatrices, we’d have the same issue with the resulting matrix not being suitable with Distributions.
PDMatsExtras also requires that their WoodburyPDMat is Zygote-friendly, which WoodburyMatrices isn’t I think. In Pathfinder we don’t test this, but ensuring Zygote-friendliness would be part of upstreaming.
The decomposition we use could be useful for WoodburyMatrices as well, but it only works for positive-definite matrices, and WoodburyMatrices.SymWoodbury does not have this constraint.
I have just tried Pathfinder.jl and it is very cool. In particular, I appreciate the possibility of using it to boost the efficiency of HMC (I tested it with AdvancedHMC.jl).
I have a question.
In the examples, you showed how to use Pathfinder results to boost HMC, both using one of the draws as the starting point or using the Pathfinder’s inverse metric estimate. While I have succesfully done it with single Pathfinder (using your examples) it looks to me less obvious how to do it using MultiPathfinder.
The solution I have found is to take one of the draws as the starting point (or the mean of the draws) and evaluate the covariance matrix of the draws using CovarianceEstimation.jl. Is there a more direct/clever way to obtain the same result? I prefer to use MultiPathfinder to warm-up HMC because it looks to be way more robust (some runs of Pathfinder may not end up in the correct region of parameter space).
The short answer is that this is an active area of research only mentioned by the Pathfinder paper, and I haven’t done enough tests on a wide enough range of models to give concrete recommendations. That being said, I can think of at least 2 ways to go, with pros and cons.
1. Initialize with a draw and the metric from the component that draw came from.
Pros:
If the distribution is multimodal with well-separated, approximately-normal “modes”, then this would promote good exploration of each mode represented by the initialization draws
the metric is a WoodburyPDMat, which is more efficient to use for high-dimensional distributions than a dense metric.
Cons:
If the "mode"s are very non-normal, then it’s possible the metric selected is far from the ideal metric one would use for mode exploration.
When multi-path Pathfinder catastrophically fails, many of the draws could be the exact same point; this would result in all chains initializing from the same point.
2. Initialize with a draw and use the covariance of all draws to set the starting metric
This is equivalent to your approach.
Pros:
If none of the individual runs are particularly good fits, then this might yield a better metric for exploration
This is closer to how HMC warm-up actually works, so I would imagine it could be more robust
Cons:
If the distribution is multimodal with well-separated modes, this could produce a very poor metric for sampling the modes
This would yield a diagonal or dense metric, not the efficient low-rank-plus-diagonal type returned by Pathfinder. For high-dimensional distributions, selecting a diagonal metric could be less effective, and selecting a dense one could be less efficient.
Note that it might make more sense to compute covariance with importance-weighting instead of importance-resampling. We currently don’t return the pre-resampling draws, but we should.
I suspect there isn’t a one-size-fits-all approach, but it would be good to identify classes of distributions for which one approach, the other, or neither works particularly well.