How do you save data in Monte Carlo simulations?


#1

Let’s say I have a prior made of continuous and discrete variables:

using Distributions

a = Uniform()
b = Normal()
c = Binomial()
...

I draw random numbers from the prior, feed them into an expensive model, and the model returns a list of arrays. I repeat the process with another draw of the input parameters and get a different set of arrays.

I have a routine that saves these “arrays” to disk, but the problem is that later on I want to track which parameters from the prior generated a particular array.

I am looking for a solution or file format that is portable enough. Do you have recommendations? JLD, JLD2, HDF5, …
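For concreteness, here is a minimal sketch of the pattern I have in mind: pair each draw with its output so the provenance is never lost. This uses stdlib Serialization, with plain rand/randn and a dummy model standing in for the Distributions draws and the real expensive model, just to keep it self-contained:

```julia
using Serialization, Random

# Hypothetical stand-in for the expensive model: it just returns a list of arrays.
expensive_model(a, b, c) = [fill(a, 3), fill(b, 2), fill(c, 4)]

rng = MersenneTwister(1)

open("draws.jls", "w") do io
    for i in 1:5
        # Plain rand/randn/rand(0:10) stand in for the Uniform/Normal/Binomial draws.
        θ = (a = rand(rng), b = randn(rng), c = rand(rng, 0:10))
        result = expensive_model(θ.a, θ.b, θ.c)
        serialize(io, (params = θ, arrays = result))  # parameters travel with the output
    end
end

# Reading back: each record pairs the arrays with the prior draw that produced them.
records = open("draws.jls") do io
    [deserialize(io) for _ in 1:5]
end
```

The same NamedTuple-per-record idea carries over to JLD/JLD2/HDF5; only the write call changes.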


#2

These all work. Or if you don’t care about version compatibility, you can just use serialization.


#3

Thanks @ChrisRackauckas, I will look further into the pros and cons of each format…


#4

JLD is pretty much standard. JLD2 can be faster but isn’t strictly compatible with JLD. Serialization is the fastest, but there’s no guarantee the format stays the same across Julia versions. HDF5 is the standard across programming languages, so files can easily be opened in, say, MATLAB or Python, but I don’t think it can handle arbitrary Julia types? The others can just save your type, and when you read it back you’ll get that type back.


#5

That is a very good overview, thanks. I think I will go with JLD(2) for now; I’ll see exactly which of the two. I will very likely delete these files anyway, as I’m still in the phase of trying things out.


#6

I’ve used HDF5 for this because it could memory-map the arrays and write them to disk on the fly. That way, if my run stopped early, I still got partial results, which was useful for debugging. JLD(2) may be able to do this by now as well; I did this a year and a half ago. I was just storing arrays of MCMC samples though, no Julia types.
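Setting the HDF5/JLD specifics aside, the crash-tolerant part of this pattern can be sketched with stdlib Serialization alone: write and flush each sample the moment it is computed, so an interrupted run still leaves readable partial results on disk. Everything here (the path, the fake draws) is illustrative:

```julia
using Serialization

path = tempname()

open(path, "w") do io
    for i in 1:100
        sample = randn(10)   # stand-in for one MCMC draw
        serialize(io, sample)
        flush(io)            # committed to disk; survives a crash after this point
    end
end

# Recover whatever made it to disk, even from a truncated file.
samples = open(path) do io
    out = Vector{Vector{Float64}}()
    while !eof(io)
        push!(out, deserialize(io))
    end
    out
end
length(samples)   # 100 for a complete run; fewer if the run died early
```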


#7

Generally, I use posterior data from MCMC in two stages:

  1. Convergence analysis, mostly Rhat (the potential scale reduction factor) and effective sample size. For this I need scalars, which may or may not correspond to variables (eg a scalar could be part of a symmetric matrix). I use vectors of matrices (one for each chain). I save these using HDF5, keeping the whole sample.

  2. For understanding the posterior, especially plotting and posterior predictive checks, I reconstitute the actual objects from the raw parameters (eg a matrix from a bunch of values). I frequently make a struct for this; otherwise a Tuple will do (if there are few values). I save the second half of each chain; if the mixing is good (mostly with a well-parametrized HMC) I save all values, otherwise I thin. This gives me a few thousand posterior draws. JLD handles this just fine.
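For step 1, a minimal split-R̂ for a single scalar parameter can be sketched in plain Julia; this is just the standard between/within-chain variance formula, not tied to any particular package, and the function name is my own:

```julia
using Statistics

# Minimal split-R̂ (Gelman–Rubin potential scale reduction) for one scalar.
# `chains` holds equal-length sample vectors, one per chain.
function rhat(chains::Vector{Vector{Float64}})
    # Split each chain in half to also detect non-stationarity within a chain.
    halves = [c[r] for c in chains for r in (1:length(c) ÷ 2, length(c) ÷ 2 + 1:length(c))]
    n = minimum(length.(halves))
    halves = [h[1:n] for h in halves]
    W = mean(var.(halves))          # within-chain variance
    B = n * var(mean.(halves))      # between-chain variance
    sqrt(((n - 1) / n * W + B / n) / W)
end

chains = [randn(1_000) for _ in 1:4]
rhat(chains)   # close to 1.0 for well-mixed chains
```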


#8

I have the impression JLD2 is meant to replace JLD, and so is the more forward-compatible option. It should also do memory-mapping.


#9

Seems like that might happen.