I frequently run MCMC for Bayesian estimation which

- takes a long time (weeks),
- has a large dimension (10^5–10^6).

It would be great to have a means to

- save the results to disk while the sampler is running,
- “peek into” the progress of the chain while it is running (e.g. calculate ESS & \hat{R}).
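For the “peek” part, a monitoring script could compute these diagnostics on whatever draws are already on disk with MCMCDiagnosticTools.jl — a minimal sketch, where the draws array is faked with noise since the on-disk layout is still an open question:

```julia
using MCMCDiagnosticTools  # provides ess and rhat

# Assume `partial` is a (draws × chains × params) array loaded from the
# partially written output file; here we fake it for illustration.
partial = randn(1000, 4, 10)

ess_values  = ess(partial)   # effective sample size per parameter
rhat_values = rhat(partial)  # split-R̂ per parameter

println("min ESS: ", minimum(ess_values))
println("max R̂:  ", maximum(rhat_values))
```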
I am wondering if there is an existing solution, and if not, opening this discussion to brainstorm about it, so that various MCMC implementations could standardize on a common format.
The ideal solution would

- be economical with disk space (e.g. allow the user to store Float32 when that is enough, or automatically thin samples),
- be failure tolerant (e.g. a computation shut down in the middle for whatever reason would still leave partial results),
- have a “core” API just for retrieving the posterior results (analogous to an AbstractMatrix),
- but also a means to save occasional simple metadata (adaptation info, etc.),
- not be tied to a specific Julia version or data structure (so serialization and JLD2 are not ideal).
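To make the “core API” point concrete, here is one hypothetical shape it could take — every name below is invented for illustration, not an existing package:

```julia
# Hypothetical minimal interface for a chain store (all names made up).
abstract type AbstractChainStore end

# Required: read-only access to the posterior draws, like an
# AbstractMatrix of size (n_params, n_draws_so_far).
function draws end            # draws(store) -> AbstractMatrix
function ndraws end           # ndraws(store) -> Int

# Optional: append new draws and attach small metadata blobs.
function append_draws! end    # append_draws!(store, θ::AbstractVector)
function setmeta! end         # setmeta!(store, key::String, value)

# A trivial in-memory reference implementation.
struct MemoryStore <: AbstractChainStore
    data::Vector{Vector{Float64}}
end
MemoryStore() = MemoryStore(Vector{Vector{Float64}}())
append_draws!(s::MemoryStore, θ) = push!(s.data, collect(Float64, θ))
ndraws(s::MemoryStore) = length(s.data)
draws(s::MemoryStore) = reduce(hcat, s.data)  # params × draws matrix
```

A file-backed store would implement the same four functions on top of whatever format the discussion settles on.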
A NetCDF file (using NCDatasets.jl, for instance) could do the job? It’s mainly used for climate data, but it ticks all the boxes and can easily be read outside of Julia.
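As a rough sketch of what incremental writing could look like with NCDatasets.jl (file and variable names are made up; check the package docs for the exact API): an unlimited `draw` dimension lets the variable grow one draw at a time, and `sync` flushes to disk periodically.

```julia
using NCDatasets

ds = NCDataset("chain.nc", "c")          # create a new NetCDF file
defDim(ds, "param", 10)
defDim(ds, "draw", Inf)                  # unlimited dimension: grows as we sample
θvar = defVar(ds, "theta", Float32, ("param", "draw"))

for i in 1:100                           # stand-in for the sampling loop
    θ = randn(Float32, 10)               # one fake draw
    θvar[:, i] = θ
    i % 10 == 0 && sync(ds)              # flush to disk periodically
end
close(ds)
```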
Hey man, so funny you say this. I have had the same thought.
I’ve been working on a personal project for fun, a combination of wanting to move outside the bounds of the existing ecosystem in Julia and wanting to build a Bayesian “IDE”. It’s a TUI – so not quite what you want, but I think this is a clear demonstration that callbacks are fine with NUTS sampling.
I shared your frustration immensely that there was no out of the box solution for looking at chains in real time.
Here’s a clip of my Tachikoma live sampling viewer. Give it a moment while Enzyme compiles the gradient!
I have my own ad-hoc solution (dump to a CSV line by line, then a script that loads it and calculates what I want); I just thought I would invest in something less hacky.
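For reference, the line-by-line CSV dump is only a few lines of plain Base Julia (file name is arbitrary) — hacky, but it has the failure-tolerance property for free:

```julia
# Append each draw as one CSV line and flush immediately, so a crash
# loses at most the line currently being written.
function dump_draw(io::IO, θ::AbstractVector)
    println(io, join(θ, ','))
    flush(io)
end

open("chain.csv", "a") do io
    for _ in 1:5                 # stand-in for the sampling loop
        dump_draw(io, randn(3))  # one fake draw
    end
end
```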
A GUI/TUI for monitoring is one application, but I think it would make sense to figure out the storage format separately and build on that.
Some of the tooling for machine learning is pretty good for this. TensorBoardLogger.jl works if you integrate it into your solver to run every X updates. It just saves to disk, and you can launch a web interface with the tensorboard Python package to view the results in a browser. It also works if you run on a remote machine like an HPC, as long as you port forward (very easy with the VS Code SSH extension).
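A minimal sketch of hooking TensorBoardLogger.jl into a sampling loop (the directory name and the logged quantity are invented; see the package README for details) — each `@info` event inside `with_logger` is written to the TensorBoard event file:

```julia
using TensorBoardLogger, Logging

lg = TBLogger("tb_logs/mcmc_run")       # event files go under this directory

with_logger(lg) do
    for step in 1:100                   # stand-in for the sampling loop
        logp = -rand()                  # fake log-density value
        @info "chain" logp              # each @info advances the step by 1
    end
end
```

Then `tensorboard --logdir tb_logs` serves the plots in a browser.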
At work, I’ve set up an MLFlow server that’s publicly available, so any machine with the right credentials can save “experiments” (similar to TensorBoard, with scalar, picture, or matrix logging). This is a much more Python-centric solution, but there are client libraries in Julia like MLFlowClient.jl. It’s really useful for monitoring experiments that take forever and for sharing results with team members.
Note that the above tools are best for monitoring. For long-running jobs I tend to use something like HDF5 and save snapshots that can be restarted and run to completion. You can add extra storage to MLFlow to attach files to a particular “experiment”, but it’s not the most performant solution.
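A sketch of that snapshot pattern with HDF5.jl (function and dataset names are my own invention): write the draws so far plus whatever state is needed to resume, and closing the file flushes everything to disk.

```julia
using HDF5

# Save a restartable snapshot: draws so far plus the iteration counter.
function save_snapshot(path, draws::AbstractMatrix, iter::Int)
    h5open(path, "w") do f
        f["draws"] = draws
        f["iter"]  = iter
    end
end

# Load it back to resume the run.
function load_snapshot(path)
    h5open(path, "r") do f
        read(f, "draws"), read(f, "iter")
    end
end
```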
nuts.rs, the Rust implementation of NUTS in nutpie, which is the sampling backend recommended by PyMC devs, can write online to Stan CSV files: https://github.com/pymc-devs/nuts-rs/pull/39 . I don’t think this is exposed via the Python interface though; I think they stream instead to zarr or arrow.
I get that CSV files are way more accessible, but I find the reliance on them in this case a bit perplexing. I imagine most users needing to read their chains in real time are probably users for whom reading an HDF5 file is no big deal.
HDF5 offers way more flexibility, efficiency, compression, etc.
IIUC, HDF5 by itself does not satisfy the failure-tolerance goal above: HDF5 files must be explicitly closed to flush data (and metadata) to the filesystem, and a crash before closing may leave them much less useful than a truncated CSV file. So you might want to augment an HDF5 sink with some checkpointing and output-swapping for a long run.
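The write-then-swap part is easy to do from plain Julia with an atomic rename — a sketch, with hypothetical names; the point is that the previous good snapshot is never touched until the new one is fully on disk:

```julia
# Write the checkpoint to a temporary file, then rename it over the
# previous one. On POSIX filesystems rename is atomic (within one
# filesystem), so a crash mid-write never corrupts the last good copy.
function atomic_write(writer::Function, path::AbstractString)
    tmp = path * ".tmp"
    open(writer, tmp, "w")       # write (and close, flushing) the temp file
    mv(tmp, path; force=true)    # atomically replace the old snapshot
end

atomic_write("chain.snapshot") do io
    write(io, "checkpoint bytes would go here")
end
```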