I’ve been thinking about a front end like you’ve described for a long-term project.
I’m not a salesman, and my efforts to produce a notebook demonstrating the idea have been slow.
But a brief summary:
Given a D-dimensional probability distribution whose density and gradient we evaluate at N data points, we have N*(D+1) values for interpolation.
We could be extremely conservative, throw away the gradients, and approximate the CDF with a step function (a degree-0 spline). Or we could interpolate some function of the N*(D+1) values of the PDF using either polynomials (iffy) or higher-degree simplex splines that are easy to integrate, marginalize, etc.
At least for modest D, this should lead to dramatically faster convergence relative to N.
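A minimal sketch of the bookkeeping, assuming a toy 2-D log density and using ForwardDiff (the names `logdensity`, `points`, etc. are just placeholders):

```julia
using ForwardDiff   # any AD backend would do; forward mode keeps the example simple

# Toy target: an unnormalized 2-D Gaussian log density (placeholder).
logdensity(x) = -0.5 * (x[1]^2 + 4.0 * x[2]^2)

D = 2
N = 100
points = [randn(D) for _ in 1:N]    # evaluation points (random here; a QMC grid later)

# At each point we record the value and the D gradient components,
# i.e. D + 1 interpolation conditions per point, N*(D+1) in total.
fvals = [logdensity(p) for p in points]
grads = [ForwardDiff.gradient(logdensity, p) for p in points]

total_conditions = N * (D + 1)      # amount of data available to the spline fit
```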
Furthermore, MCMC can struggle with high degrees of correlation between parameters, such as in tall hierarchical models. You can see this with Stan: sampling can slow down dramatically as it cranks up the number of leapfrog steps, i.e. the work put into solving the Hamiltonian differential equations.
Many hierarchical models are easy to factor into DAGs. Components that can’t be factored or represented well by a DAG can exist as a single large, many-parameter node of the graph, while in many cases the remaining components can still be factored.
Each node can be evaluated independently, fit with a spline, and the pieces quickly marginalized to give updated interpolated values to pass on as messages to connected nodes. For example, if nodes `A[params: w, x, y, z]` and `B[t, u, v, w, x]` are connected, `y` and `z` can be marginalized out of `A`, and a message sent containing the density of `w, x` at each of the N points where `B` is evaluated, to update the interpolation conditions of the spline fit to node `B`.
Thus, large models can readily be factored into small pieces that can be solved independently, and this should overcome the problems associated with high correlations.
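To make the message-passing step concrete, here is a rough sketch on a plain tensor-product grid (a stand-in for the spline representation; all the sizes and names are made up for illustration):

```julia
# Node A: density over (w, x, y, z) tabulated on a small grid, one axis per parameter.
# In the real scheme this would be a simplex-spline fit, but a grid shows the idea.
nw, nx, ny, nz = 8, 8, 8, 8
A = rand(nw, nx, ny, nz)                 # placeholder for density values at grid points
Δy, Δz = 0.1, 0.1                        # grid spacings of the marginalized axes

# Marginalize y and z out of A (a simple quadrature over those axes) ...
message_wx = dropdims(sum(A; dims = (3, 4)); dims = (3, 4)) .* (Δy * Δz)

# ... and pass the resulting density over (w, x) to node B, where it multiplies
# into B's own values at the shared grid points.
nt, nu, nv = 6, 6, 6
B = rand(nt, nu, nv, nw, nx)             # node B over (t, u, v, w, x)
B_updated = B .* reshape(message_wx, 1, 1, 1, nw, nx)
```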
If one constructs grids a priori (e.g., quasi Monte Carlo), this can also all be run on a GPU, which should really help speed up extended simulations. But I’ve also wondered if some adaptation of HMC could be used to find a representative region of the posterior.
Right now my best idea is to first complete a maximization step and then use the maximum and the Hessian at the maximum to center and scale the grid.
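Roughly what I have in mind, sketched with Optim.jl, ForwardDiff, and a Sobol sequence (all of it illustrative, not a worked-out design; the target is a toy placeholder):

```julia
using Optim, ForwardDiff, Sobol, LinearAlgebra

# Toy negative log density (placeholder target).
negloglik(x) = 0.5 * (x[1]^2 + 4.0 * (x[2] - 1.0)^2) + 0.5 * x[1] * x[2]

D = 2
# 1. Maximization step: find the mode.
opt  = optimize(negloglik, zeros(D), BFGS(); autodiff = :forward)
mode = Optim.minimizer(opt)

# 2. Hessian at the mode gives the local scale and correlation.
H = ForwardDiff.hessian(negloglik, mode)
L = cholesky(Symmetric(inv(H))).L          # covariance-like factor

# 3. Center and scale an a-priori quasi Monte Carlo grid around the mode.
s = SobolSeq(D)
grid = [mode .+ 3.0 .* (L * (2.0 .* next!(s) .- 1.0)) for _ in 1:256]
```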
Higher-order derivatives also provide additional values to interpolate.
If you’re mildly interested, I’d like to ask whether anyone working on your project could make things modular enough that:
- You can easily switch back ends.
- Possibly work the factoring of probability densities into graphical models into the model interpretation, or keep everything modular enough that it won’t break too much for someone to go in and change things like that in the future. BUGS/JAGS use graphical models, but Stan does not, because it does not need to and graphs were restrictive. My proposed solution to the latter is to leave anything that can’t be broken down in one large node, so that “non-graphical” models are simply the one-node special case (as in the sketch after this list).
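One way that “one large node” fallback could look, purely as an interface sketch (all names hypothetical):

```julia
# Hypothetical interface: a model is a collection of factor nodes,
# each owning a log density over its own parameters.
struct FactorNode{F}
    params::Vector{Symbol}   # e.g. [:w, :x, :y, :z]
    logdensity::F            # callable over those parameters
end

# A model that factors nicely has many small nodes ...
factored = [FactorNode([:w, :x, :y, :z], θ -> -0.5 * sum(abs2, θ)),
            FactorNode([:t, :u, :v, :w, :x], θ -> -0.5 * sum(abs2, θ))]

# ... while a "non-graphical" model is simply the degenerate one-node case,
# so the same back end handles both.
monolithic = [FactorNode([:t, :u, :v, :w, :x, :y, :z], θ -> -0.5 * sum(abs2, θ))]
```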
Hamiltonian Monte Carlo requires gradients.
Stan (written in C++) uses reverse-mode AD and, according to Turing’s benchmarks, is about 10x faster when the algorithms are otherwise the same.
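For reference, the kind of gradient HMC needs, computed here with reverse-mode AD via ReverseDiff (just one possible backend; the log density is a toy placeholder):

```julia
using ReverseDiff

# Toy joint log density over a parameter vector (placeholder).
logjoint(θ) = -0.5 * sum(abs2, θ) - 0.25 * sum(θ .^ 4)

θ = randn(10)
# Reverse mode: the cost of the gradient is a small multiple of one
# log-density evaluation, regardless of the dimension of θ.
∇ = ReverseDiff.gradient(logjoint, θ)
```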
Out of curiosity, your `Dict{Any,Any}`s for caching don’t cause any problems with type inference?
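What I mean, in a tiny example (a hypothetical cache, not your actual code):

```julia
# With Dict{Any,Any}, lookups come back as Any, so downstream code
# generally needs a function barrier to be inferred.
cache_any = Dict{Any,Any}()
cache_any[:x] = rand(3)
use_any(c) = sum(c[:x])          # c[:x] is inferred as Any here

# With concrete key/value types the lookup infers cleanly.
cache_typed = Dict{Symbol,Vector{Float64}}()
cache_typed[:x] = rand(3)
use_typed(c) = sum(c[:x])        # c[:x] is inferred as Vector{Float64}

# Compare: @code_warntype use_any(cache_any) vs @code_warntype use_typed(cache_typed)
```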