Yes, that’s one of the things that got my attention, I really like that modular approach - I’m also considering splitting off the (IMHO fairly good) MH sampler in BAT.jl as a separate package at some point. (HMC is obviously superior in most use cases if you have a gradient, but we often have to deal with likelihoods where a gradient is hard or impossible to obtain, and where good old MH will do well enough, since the number of parameters is fairly low.)
LogDensityProblems.jl […] You can use this interface to perform AD for you with any of the supported AD packages (and it is easy to change later).
Yes, I know about LogDensityProblems.jl. I actually have a similar thing in BAT.jl (minus AD at the moment), and I’ve been considering adopting the LogDensityProblems API. Do you expect it to be stable from now on (I know you made some changes recently)? We have been asked to release an API-stable version 1.0 of BAT.jl soon.
The only thing that is missing is allowing log density functions to return an extra payload (which is useful for eg indirect inference, containing simulation results). I did not do it yet because this was a feature request by someone who did not follow up on it, but I see how it can be useful and I may end up doing it in the long run if there is interest.
In any case, I can refactor the API to allow these kinds of extensions, and then it would be stable.
Note that for widely used packages of mine, I generally coordinate with all dependent packages, eg
so if I ever change the interface, there will be pre-release discussion, a major bump in SemVer, etc.
The only difference I see is AdvancedHMC supporting the slice sampler and the turning criterion of the original NUTS, but since both have been superseded (by the multinomial sampler and the generalized turning criterion), I don’t think this is relevant for most users, as these are less efficient variants, not something that you would ever want to go back to.
I agree the slice sampler outperforms the multinomial one practically.
For the generalised U-turn criterion, I believe it is just a numerical approximation of the original U-turn criterion, and they should be mathematically the same on Euclidean space; but Riemannian HMC does need the generalised one.
Also I think static HMC samplers are quite useful.
For AdvancedHMC.jl, the log density is passed in through two functions:
One for the log density, which takes in a parameter vector and returns the corresponding log density;
One for the gradient, which takes in a parameter vector and returns a tuple of the log density and the corresponding gradient. How you compute the gradient is your choice: using some AD library or a hand-written gradient.
I believe the example in the README.md of our repo is clear enough regarding this.
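For concreteness, a minimal sketch of those two callables (a standard-normal target with a hand-written gradient; this is not the README example itself, and the Hamiltonian constructor in the comment is the one shown there):

```julia
# Minimal sketch: standard-normal target with a hand-written gradient.
ℓπ(θ) = -sum(abs2, θ) / 2      # log density, up to an additive constant
∂ℓπ∂θ(θ) = (ℓπ(θ), -θ)         # returns a (log density, gradient) tuple

# Both callables, plus θ_init (which fixes the dimensionality), are then handed
# to AdvancedHMC, e.g. via Hamiltonian(metric, ℓπ, ∂ℓπ∂θ) as shown in the README.
```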
One thing I noticed is that DynamicHMC has more and heavier dependencies: it pulls in 37 packages (directly and indirectly) vs. 23 packages for AdvancedHMC. Package load times reflect this (about 4.0 s for DynamicHMC vs. 2.0 s for AdvancedHMC).
This is not meant to be a criticism, I assume there are probably very good reasons for DynamicHMC’s heavier footprint. But it’s something to consider when deciding whether to integrate a backend unconditionally via Project.toml or only optionally via Requires.jl.
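For reference, a rough sketch of the optional-dependency route with Requires.jl (hypothetical file layout; the UUID is a placeholder, not DynamicHMC’s actual one):

```julia
# In the package's top-level module. Sketch only.
using Requires

function __init__()
    # Replace the placeholder UUID with the one from DynamicHMC's Project.toml.
    @require DynamicHMC="00000000-0000-0000-0000-000000000000" begin
        include("backends/dynamichmc.jl")   # glue code, loaded only if DynamicHMC is present
    end
end
```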
It’s the other way round: the new multinomial sampler with the forward bias is better than the slice sampler.
My understanding is that the generalized criterion is the more fundamental one, and the original NUTS paper criterion approximates it (and yes, it came before, theory in this area usually follows practice). BTW, there was a nice discussion about extensions on the Stan forum recently:
I think I disagree here: except for some simple examples, varying curvature means that static HMC always over- or undershoots.
Yes, that’s basically Optim which pulls in everything but the kitchen sink. I will think about streamlining this, but generally finding a local optimum helps quite a bit for initial adaptation, and it is one of the nice native Julia packages.
So if I understand correctly, the APIs we currently have for log-density sampling problems are:
AdvancedHMC: There isn’t really an API or trait, the user code passes a function for the density, and another function for density plus gradient. The dimensionality of the problem is determined from θ_init (is that correct?).
DynamicHMC, via LogDensityProblems: There’s a trait-based API with the functions `dimension`, `logdensity` and `logdensity_and_gradient`. A density can be any type.
BAT (the package I’m involved with): Type-based API, densities must be subtypes of BAT.AbstractDensity and implement the functions nparams and density_logval. So very similar to LogDensityProblems, except for the abstract type. BAT is currently still Metropolis-only, so there’s no function for density-plus-gradient yet, but I always intended for it to be something ending with a !, to allow passing the gradient into a preallocated array - but since all current AD frameworks allocate arrays anyhow, maybe that’s a moot point. GC cost has gone down in recent Julia versions, too.
Originally, my API was much heavier, with functions to negotiate use of threads and a density_logval! that calculates multiple density values at once (e.g. for nested sampling / ensembles). But partr and custom broadcasting have made that obsolete.
I do have an additional function param_bounds in my API, though - parameter bounds are important for gradient-free samplers like MH, but I also want to use the bounds information to choose a good transformation for HMC automatically. nparams is inferred from the parameter bounds, by default.
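To make the BAT side of this concrete, a rough sketch based on the description above (the exact signatures in BAT.jl may differ):

```julia
# Rough sketch only - names follow the API described above; exact signatures
# in BAT.jl may differ. A toy 2-parameter density:
using BAT

struct MyDensity <: BAT.AbstractDensity end

BAT.nparams(density::MyDensity) = 2
BAT.density_logval(density::MyDensity, params::AbstractVector{<:Real}) =
    -sum(abs2, params) / 2
```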
I wonder if the time may have come for a more-or-less official Julia density-problems API, defined by a lightweight package that lives in JuliaStats organization (or some other central place)? What do you guys think?
It’s the other way round: the new multinomial sampler with the forward bias is better than the slice sampler.
Sorry, it was a typo in my original text. I meant the same thing as you: multinomial > slice empirically.
My understanding is that the generalized criterion is the more fundamental one, and the original NUTS paper criterion approximates it (and yes, it came before, theory in this area usually follows practice). BTW, there was a nice discussion about extensions on the Stan forum recently:
The generalized criterion approximates the original one by integrating the momentum variables through the trajectory; see pages 57-58 of Betancourt (2017). They compute the same thing (on Euclidean space), while the generalized version uses a numerical approximation.
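For reference, the two termination checks being compared (following Hoffman & Gelman (2014) and Betancourt (2017); sign conventions and the mass-matrix weighting are glossed over here), with leftmost/rightmost points (θ⁻, p⁻) and (θ⁺, p⁺) of the trajectory:

```latex
% Original NUTS U-turn check, using the end-to-end displacement:
(\theta^{+} - \theta^{-}) \cdot p^{-} < 0
  \quad\text{or}\quad
(\theta^{+} - \theta^{-}) \cdot p^{+} < 0

% Generalized check: replace the displacement by the summed momentum
% \rho = \sum_j p_j along the trajectory. On Euclidean space the two agree up to
% discretization, since
%   \theta^{+} - \theta^{-} = \int M^{-1} p \, dt \approx \epsilon \sum_j M^{-1} p_j .
\rho \cdot p^{-} < 0
  \quad\text{or}\quad
\rho \cdot p^{+} < 0
```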
I think I disagree here: except for some simple examples, varying curvature means that static HMC always over- or undershoots.
When HMC is used as one part of the inference step, the U-turn criterion is no longer wanted. This could happen when HMC is used as a Gibbs step together with other MCMC algorithms (e.g. to deal with discrete variables using PG), or when HMC is used with variational inference, e.g. Ruiz & Titsias (2019).
But I agree with you that most users who focus on modeling probably just want to use NUTS. And again, AHMC itself doesn’t aim to serve users of all levels; it’s rather for package developers and ML researchers.
Yes, that’s pretty much it, except there is also LogDensityProblems.capabilities, which returns a trait telling you if the callable supports logdensity_and_gradient. So eg if you use this API in BAT, you can check for that.
Yes, I like the capabilities trait. Maybe as a first step, I can make my AbstractDensity support the LogDensityProblems interface automatically.
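Something along these lines, presumably (a rough sketch using the function names from this thread; the specific LogDensityOrder trait value is an assumption about the current LogDensityProblems release):

```julia
# Sketch: forward the LogDensityProblems interface to BAT's own API.
import BAT, LogDensityProblems

LogDensityProblems.dimension(d::BAT.AbstractDensity) = BAT.nparams(d)
LogDensityProblems.logdensity(d::BAT.AbstractDensity, x) = BAT.density_logval(d, x)
LogDensityProblems.capabilities(::Type{<:BAT.AbstractDensity}) =
    LogDensityProblems.LogDensityOrder{0}()   # log density only, no gradient yet
```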
Long term, it would be really nice if we had a community consensus on this kind of API (I think LogDensityProblems goes in exactly the right direction).
There is a neat little function LogDensityProblems.benchmark_ForwardDiff_chunks which benchmarks various chunk sizes for you systematically. This can easily speed up ADing with ForwardDiff by a factor of 10. While forward AD is of course not ideal for larger problems, it is still quite robust.
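For illustration, what that boils down to is roughly benchmarking ForwardDiff.gradient with different Chunk sizes (a generic sketch, not the helper’s actual implementation):

```julia
# Generic chunk-size benchmark, not LogDensityProblems' implementation.
using ForwardDiff, BenchmarkTools

f(x) = -sum(abs2, x) / 2    # stand-in for a log density
x = randn(50)

for chunk in (1, 2, 4, 8, 16, 32)
    cfg = ForwardDiff.GradientConfig(f, x, ForwardDiff.Chunk{chunk}())
    t = @belapsed ForwardDiff.gradient($f, $x, $cfg)
    println("chunk size ", chunk, ": ", round(t * 1e6, digits = 1), " μs")
end
```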
If you need something specific, please let me know. Preferably in an issue so that it does not get lost.
AdvancedHMC: There isn’t really an API or trait, the user code passes a function for the density, and another function for density plus gradient. The dimensionality of the problem is determined from θ_init (is that correct?).
Correct!
I wonder if the time may have come for a more-or-less official Julia density-problems API, defined by a lightweight package that lives in JuliaStats organization (or some other central place)? What do you guys think?
I feel a Distributions.Distribution object itself can represent a density problem, and the Julia way would be to provide such an object together with functions to get its log density, gradient, support, etc., which is what Distributions.jl is currently doing. But ATM support doesn’t seem to be part of the common interface of MultivariateDistribution yet (which is what we are more interested in), only of UnivariateDistribution. Once the support is given, it’s possible to derive the transformed distribution automatically, which is what Bijectors.jl is doing.
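For example (Distributions.jl / Bijectors.jl calls as I remember them; exact return types may differ):

```julia
using Distributions, Bijectors

# support is part of the univariate interface ...
support(Gamma(2, 3))        # an interval: (0, Inf)
# ... but there is no equivalent in the common MultivariateDistribution interface yet.

# Given the support, the map to unconstrained space can be derived automatically:
b = bijector(Gamma(2, 3))   # a log-type bijection from (0, Inf) to the real line
```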
I’ve been thinking about that - but the typical assumption with a distribution is that it is normalized, and the typical likelihood densities that we deal with could only implement a small fraction of the Distribution API anyhow. I would rather say some densities are distributions - most common prior densities, for example. So what we’d need is something like Distribution <: StatisticalDensity. I wonder if that might be accepted into Distributions, or if the Distributions maintainers might be willing to take on such a dependency, defined in a lightweight package “StatisticalDensities” (or similarly named). What do you guys think?
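To make that concrete, a purely hypothetical sketch (none of these names exist as a package today):

```julia
# Hypothetical only - illustrating the proposed hierarchy, not an existing package.
abstract type StatisticalDensity end

# Distributions.Distribution would then become a subtype, roughly
#     abstract type Distribution{F<:VariateForm,S<:ValueSupport} <: StatisticalDensity end
# while likelihoods and posteriors subtype StatisticalDensity directly and only
# promise log-density evaluation (plus, ideally, parameter bounds / support).
struct MyLikelihood <: StatisticalDensity
    data::Vector{Float64}
end
```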
I’m quite happy with the API as it is at the moment - except for the fact that I need parameter bounds information. But a generic way to express that is a complex topic by itself. Would you consider a v1.0 for LogDensityProblems in the near future, though?
I think for now I’ll probably make BAT.jl compatible with LogDensityProblems, but keep BAT.AbstractDensity around for BAT.jl v1.0 (since I have to release that soon). But for a BAT.jl v2.0, I’d be very glad to switch to a lightweight generic API for statistical densities with community support (e.g. adopted by packages like DynamicHMC and AdvancedHMC).
Good point. I guess one can go the other way too: UnnormalisedDistribution <: Distribution? Not sure if this one is worse or better, but it is more straightforward to extend from Distributions.jl. We can still define logpdf and stuff on it (of course in this case logpdf will have to mean unnormalised log density, but this function is used to compute probability mass as well, so I guess it’s not a very rigorous name anyway).
The StatisticalDensity.jl idea really needs discussion on the JuliaStats side. I feel it’s not easy to add dependencies to the core packages, but who knows?
Speaking of distributions as priors, I’ve been exploring this in BAT.jl recently (https://github.com/bat/BAT.jl/blob/master/src/parameters/named_prior.jl). I have a NamedPrior type that is integrated with ShapesOfVariables.jl (and hopefully TransformVariables.jl too in the future). It lets you specify a structured prior like this (including pinned/constant parameters):
```julia
julia> prior = NamedPrior(
           a = 5,
           b = Normal(),
           c = -4..5,
           d = MvNormal([1.2 0.5; 0.5 2.1])
       );

julia> prior isa Distribution
true

julia> length(prior)
4

julia> prior.a
BAT.ConstValueDistribution{Univariate,Int64}(value=5)

julia> prior.c
Uniform{Float64}(a=-4.0, b=5.0)

julia> cov(prior)
4×4 Array{Float64,2}:
 1.0  0.0   0.0  0.0
 0.0  6.75  0.0  0.0
 0.0  0.0   1.2  0.5
 0.0  0.0   0.5  2.1
```
I’ve been thinking about releasing it as a standalone package “NamedPriors” or similar.
I would say no - except if there were also a NormalisedDistribution <: Distribution that all the current distributions in Distributions.jl would be subtypes of.
And even then, that might not be a good structure - it’s not just that our densities are often not normalized (e.g. posteriors): some of them (e.g. likelihoods) may well be normalized, but still unable to support most functions in the Distributions API, because the resulting values are very hard to calculate efficiently (that’s why we need MCMC, after all).
Not to go too far off topic, but I saw this post on the Stan Discourse which presents another way to get into the typical set initially. If I understand it correctly, running shorter trajectories (lower max treedepth), presumably with larger steps (lower target acceptance rate), will “bleed off” energy quickly and get you there.