On-line storage of MCMC output

I frequently run MCMC for Bayesian estimation which

  1. takes a long time (weeks),

  2. has a large dimension (10^5–10^6).

It would be great to have a means to

  1. save the results to disk while in progress,

  2. “peek into” the progress of the chain while it is running (e.g. calculate ESS & \hat{R}).
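For the second point, “peeking” could be as simple as loading whatever draws are on disk so far and running diagnostics, e.g. with MCMCDiagnosticTools.jl. A rough sketch (the 3-D draws × chains × params layout is that package's convention; the `partial` array here is a placeholder for draws read back from storage):

```julia
# Sketch: compute diagnostics on a partial chain with MCMCDiagnosticTools.jl.
using MCMCDiagnosticTools

partial = randn(500, 4, 10)   # placeholder: draws × chains × params read from disk
ess(partial)                  # effective sample size per parameter
rhat(partial)                 # split-R̂ per parameter
```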

I am wondering if there is an existing solution, and if not, opening this discussion to brainstorm about it, so that various MCMC implementations could standardize on a common format.

The ideal solution would be

  1. economical with disk space (e.g. let the user store draws as Float32 when that is enough, or thin samples automatically),
  2. failure tolerant (e.g. a computation shut down in the middle for whatever reason would still leave partial results),
  3. have a “core” API just for retrieving the posterior draws (analogous to an AbstractMatrix),
  4. but also provide a means to save occasional simple metadata (adaptation info, etc.),
  5. not be tied to a specific Julia version or data structure (so serialization and JLD2 are not ideal).
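To make point 3 concrete, here is a minimal sketch of what such a “core” retrieval API could look like. All names (`AbstractChainStore`, `ndraws`, `nparams`, etc.) are hypothetical, just to anchor the discussion:

```julia
# Hypothetical sketch of a "core" retrieval API — not an existing package.

abstract type AbstractChainStore end

# A toy in-memory implementation backed by a matrix of draws
# (rows = iterations, columns = parameters).
struct MatrixChainStore{T} <: AbstractChainStore
    draws::Matrix{T}
    meta::Dict{String,Any}   # occasional simple metadata (adaptation info, etc.)
end

# The AbstractMatrix-like part: sizes and indexing into the draws.
ndraws(s::MatrixChainStore) = size(s.draws, 1)
nparams(s::MatrixChainStore) = size(s.draws, 2)
Base.getindex(s::MatrixChainStore, i, j) = s.draws[i, j]

# usage
s = MatrixChainStore(randn(Float32, 1000, 5), Dict("thinning" => 1))
ndraws(s)      # 1000
s[1:10, 2]     # first ten draws of the second parameter
```

A disk-backed implementation would provide the same small surface, so downstream tools (diagnostics, plotting) would not care where the draws live.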

Suggestions welcome.

Could NetCDF (using NCDatasets.jl, for instance) do the job? It’s mainly used for climate data, but it ticks all the boxes and can easily be used outside of Julia.
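If I read the NCDatasets.jl docs right, an unlimited dimension gives exactly the grow-as-you-go behavior. A rough sketch (file and variable names are mine):

```julia
# Sketch: append draws to a NetCDF file with NCDatasets.jl, using an
# unlimited "draw" dimension so the file grows as sampling proceeds.
using NCDatasets

ds = NCDataset("chain.nc", "c")
defDim(ds, "param", 5)
defDim(ds, "draw", Inf)          # Inf marks the dimension as unlimited
v = defVar(ds, "theta", Float32, ("param", "draw"))

for i in 1:100
    draw = randn(Float32, 5)     # placeholder for one MCMC draw
    v[:, i] = draw
    sync(ds)                     # flush to disk so other processes can peek
end
close(ds)
```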


Thanks. My understanding is that NetCDF is practically HDF5 though, so I guess I could just use that.

Hey man, so funny you say this. I have had the same thought 🙂

I’ve been working on a personal project for fun, a combination of wanting to move outside the bounds of the existing ecosystem in Julia and wanting to build a Bayesian “IDE”. It’s a TUI – so not quite what you want, but I think this is a clear demonstration that callbacks are fine with NUTS sampling.

I shared your frustration immensely that there was no out of the box solution for looking at chains in real time.

Here’s a clip of my Tachikoma live sampling viewer. Give it a moment for when Enzyme is compiling the gradient!

https://asciinema.org/a/ZSBs3oqsjZntOHUx

edit: The traces/hists/stats don’t start showing until after warmup is completed


I have my own ad-hoc solution (dump to a CSV line by line, then have a script that loads each and calculates what I want), just thought I would invest in something less hacky.
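For concreteness, the line-by-line CSV dump can be as simple as this (paths and sizes are placeholders):

```julia
# Sketch: append one draw per CSV line and flush immediately,
# so a crash loses at most the line currently being written.
open("chain.csv", "a") do io
    for i in 1:100
        draw = randn(5)              # placeholder for one MCMC draw
        println(io, join(draw, ','))
        flush(io)                    # make the line visible on disk right away
    end
end
```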

A GUI/TUI for monitoring is one application, but I think it would make sense to figure out the storage format separately and build on that.

I am now drafting an API and will post it here.


Some of the tooling for machine learning is pretty good for this. TensorBoardLogger.jl works if you integrate it into your solver to run every X updates. It just saves to disk, and you can launch a web interface with the tensorboard Python package to view the results in a browser. It also works if you run on a remote machine such as an HPC cluster, provided you port forward (very easy with the VS Code SSH extension).
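The integration is only a few lines; a sketch with TensorBoardLogger.jl (the quantity being logged is a placeholder):

```julia
# Sketch: log a sampler diagnostic every 10 updates with TensorBoardLogger.jl.
# View afterwards with `tensorboard --logdir tb_logs` in a browser.
using TensorBoardLogger

lg = TBLogger("tb_logs")
for step in 1:1000
    logp = randn()                          # placeholder for the current log density
    if step % 10 == 0                       # log every 10 updates
        log_value(lg, "logp", logp; step=step)
    end
end
close(lg)
```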

At work, I’ve set up an MLflow server that’s publicly available, so any machine with the right credentials can save “experiments” (similar to TensorBoard, with scalar, picture, or matrix logging). This is a much more Python-centric solution, but there are client libraries in Julia such as MLFlowClient.jl. It is really useful for monitoring experiments that take forever and for sharing results with team members.

Note that the above tools are best for monitoring. For long-running jobs I tend to use something like HDF5 and save snapshots that can be restarted and run to completion. You can add extra storage to MLflow to attach files to a particular “experiment”, but it’s not the most performant solution.
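For the snapshot approach, a minimal sketch with HDF5.jl (the file layout and dataset names here are just illustrative):

```julia
# Sketch: snapshot-and-restart with HDF5.jl. Each checkpoint rewrites the
# file with the draws so far plus whatever state is needed to resume.
using HDF5

function checkpoint(path, draws, iter)
    h5open(path, "w") do f
        f["draws"] = draws      # draws accumulated so far
        f["iter"]  = iter       # iteration to resume from
    end
end

draws = randn(100, 5)           # placeholder for accumulated draws
checkpoint("snapshot.h5", draws, 100)

# restarting:
iter  = h5read("snapshot.h5", "iter")
draws = h5read("snapshot.h5", "draws")
```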


I believe CmdStan writes draws online to a CSV file in its own format, which has become a bit of a standard across PPLs, e.g.

I get that CSV files are way more accessible, but I find the reliance on them in this case a bit perplexing. I imagine most users needing to read their chains in real time are probably users for whom reading an HDF5 file is no big deal.

HDF5 offers way more flexibility, efficiency, compression, etc.


IIUC, HDF5 by itself does not satisfy goal property 2 (failure tolerance).

HDF “files” must be explicitly closed to flush data (and metadata) to the filesystem, and glitches before closing may leave them much less useful than a truncated CSV file. So you might want to augment an HDF5 sink with some checkpointing and output-swapping for a long run.
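One standard way to do that output-swapping in plain Julia is write-to-temp-then-rename, which is atomic on POSIX filesystems, so a glitch mid-write never corrupts the last good checkpoint. A sketch (the function name and payload are made up):

```julia
# Sketch: crash-safe checkpointing via write-to-temp-then-atomic-rename.
function atomic_checkpoint(write_fn, path)
    tmp = path * ".tmp"
    open(write_fn, tmp, "w")     # write the full snapshot to the temp file
    mv(tmp, path; force=true)    # atomically replace the previous snapshot
end

atomic_checkpoint("chain.bin") do io
    write(io, rand(Float64, 1000))   # placeholder payload
end
```

Readers only ever see either the old complete snapshot or the new complete one, never a half-written HDF5 file.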
