How do you save data in Monte Carlo simulations?


#1

Let’s say I have a prior made of continuous and discrete variables:

using Distributions

a = Uniform()
b = Normal()
c = Binomial()
...

I draw random numbers from the prior, feed them into an expensive model, and the model returns a list of arrays. I repeat the process with another draw of the input parameters and get a different set of arrays.

I have a routine that saves these “arrays” to disk, but the problem is that later on I want to track which parameters from the prior generated a particular array.

I am looking for a solution or file format that is portable enough. Do you have recommendations? JLD, JLD2, HDF5, …
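For concreteness, here is a minimal sketch of the pattern I have in mind: pair each draw with its output so the provenance is never lost. This uses stdlib Serialization, with plain rand/randn and a dummy model standing in for the Distributions draws and the real expensive model, just to keep it self-contained:

```julia
using Serialization, Random

# Hypothetical stand-in for the expensive model: it just returns a list of arrays.
expensive_model(a, b, c) = [fill(a, 3), fill(b, 2), fill(c, 4)]

rng = MersenneTwister(1)

open("draws.jls", "w") do io
    for i in 1:5
        # Plain rand/randn/rand(0:10) stand in for the Uniform/Normal/Binomial draws.
        θ = (a = rand(rng), b = randn(rng), c = rand(rng, 0:10))
        result = expensive_model(θ.a, θ.b, θ.c)
        serialize(io, (params = θ, arrays = result))  # parameters travel with the output
    end
end

# Reading back: each record pairs the arrays with the prior draw that produced them.
records = open("draws.jls") do io
    [deserialize(io) for _ in 1:5]
end
```

The same NamedTuple-per-record idea carries over to JLD/JLD2/HDF5; only the write call changes.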


#2

These all work. Or if you don’t care about version compatibility, you can just use serialization.


#3

Thanks @ChrisRackauckas, I will look further into the pros and cons of each format…


#4

JLD is pretty much standard. JLD2 can be faster but isn’t strictly compatible with JLD. Serialization is the fastest, but there’s no guarantee the format stays the same across Julia versions. HDF5 is the standard across programming languages, so files can easily be opened in, say, MATLAB or Python, but I don’t think it can handle arbitrary Julia types? The others can just save your type, and when you read it back you’ll get that type back.


#5

That is a very good overview, thanks. I think I will go with JLD(2) for now; I’ll see exactly which of the two. I will very likely delete these files anyway, as I’m still in the phase of trying things out.


#6

I’ve used HDF5 for this because it could memory-map the arrays and write them to disk on the fly. That way, if my run stopped early, I still got partial results, which was useful for debugging. JLD(2) may be able to do this by now as well; I did this a year and a half ago. I was just storing arrays of MCMC samples though, no Julia types.
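Setting the HDF5/JLD specifics aside, the crash-tolerant part of this pattern can be sketched with stdlib Serialization alone: write and flush each sample the moment it is computed, so an interrupted run still leaves readable partial results on disk. Everything here (the path, the fake draws) is illustrative:

```julia
using Serialization

path = tempname()

open(path, "w") do io
    for i in 1:100
        sample = randn(10)   # stand-in for one MCMC draw
        serialize(io, sample)
        flush(io)            # committed to disk; survives a crash after this point
    end
end

# Recover whatever made it to disk, even from a truncated file.
samples = open(path) do io
    out = Vector{Vector{Float64}}()
    while !eof(io)
        push!(out, deserialize(io))
    end
    out
end
length(samples)   # 100 for a complete run; fewer if the run died early
```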


#7

Generally, I use posterior data from MCMC in two stages:

  1. Convergence analysis, mostly Rhat (the potential scale reduction factor) and effective sample size. For this I need scalars, which may or may not correspond to variables (eg a scalar could be part of a symmetric matrix). I use vectors of matrices (one for each chain). I save these using HDF5, keeping the whole sample.

  2. For understanding the posterior, especially plotting and posterior predictive checks, I reconstitute the actual objects from the raw parameters (eg a matrix from a bunch of values). I frequently make a struct for this; otherwise a Tuple will do (if there are few values). I save the second half of each chain; if the mixing is good (mostly with a well-parametrized HMC) I save all values, otherwise I thin. This gives me a few thousand posterior draws. JLD handles this just fine.
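For step 1, a minimal split-R̂ for a single scalar parameter can be sketched in plain Julia; this is just the standard between/within-chain variance formula, not tied to any particular package, and the function name is my own:

```julia
using Statistics

# Minimal split-R̂ (Gelman–Rubin potential scale reduction) for one scalar.
# `chains` holds equal-length sample vectors, one per chain.
function rhat(chains::Vector{Vector{Float64}})
    # Split each chain in half to also detect non-stationarity within a chain.
    halves = [c[r] for c in chains for r in (1:length(c) ÷ 2, length(c) ÷ 2 + 1:length(c))]
    n = minimum(length.(halves))
    halves = [h[1:n] for h in halves]
    W = mean(var.(halves))          # within-chain variance
    B = n * var(mean.(halves))      # between-chain variance
    sqrt(((n - 1) / n * W + B / n) / W)
end

chains = [randn(1_000) for _ in 1:4]
rhat(chains)   # close to 1.0 for well-mixed chains
```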


#8

I have the impression JLD2 is meant to replace JLD, and so is the more forward-compatible option. It should also do memory-mapping.


#9

Seems like that might happen.