How to handle and store large amounts of (distributed) generated data?

ChrisRackauckas · April 15, 2017, 10:14am

A lot of (Monte Carlo) simulations can be run simultaneously and independently on many nodes of an HPC cluster, each generating a large solution for later analysis. However, a plain pmap collects every result on the calling process, so it tries to build one giant output that likely won’t fit in any node’s memory and then just crashes the job.

Are there distributed databases I could write the results to instead, or could I serialize each result separately and concatenate them into one big data file afterwards?

Related to this post is the other post, discussing what to actually save:

and the DiffEq issue monitoring the updates:

vchuravy · April 15, 2017, 11:21am

In the past I have used DistributedArrays (you can distribute more than arrays) and then written the data to a series of files on a distributed filesystem (Lustre). In my use cases I would most likely want to read the data back in distributed as well.

There is an MPI extension to HDF5 that might be usable, but I have never used it from within Julia.
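The file-per-task approach described above (not the MPI HDF5 route) could be sketched roughly as follows. This is only an illustration under assumptions: `simulate` and `run_and_save` are hypothetical names, the `.jls` file naming is made up, and a temporary directory stands in for a directory on the shared filesystem. It uses the Serialization standard library rather than DistributedArrays. The point is that each task writes its own file and hands back only a path, so pmap never has to collect the full solutions.

```julia
using Distributed
@everywhere using Serialization

# Hypothetical stand-in for one simulation; a real solver call would go here.
@everywhere simulate(seed) = cumsum(fill(Float64(seed), 1_000))

# Run one task, write its full solution to its own file, return only the path.
@everywhere function run_and_save(outdir, seed)
    path = joinpath(outdir, "traj_$(seed).jls")   # assumed naming scheme
    serialize(path, simulate(seed))
    return path
end

# Assumption: on a cluster this would be a directory every node can see.
outdir = mktempdir()

# pmap now collects a small vector of file paths instead of the solutions.
paths = pmap(seed -> run_and_save(outdir, seed), 1:8)

# Later analysis can load one file at a time.
first_traj = deserialize(paths[1])
```

With no extra workers added, this runs on the master process only; after `addprocs`, the same code should spread the tasks over the workers, provided `outdir` is visible to all of them.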

Keno · April 15, 2017, 12:01pm

If you know how much data you’ll be writing ahead of time, pre-creating the files and doing pwrites to such files works just fine generally.

ChrisRackauckas · April 15, 2017, 12:10pm

What’s a pwrite? A quick Google search (SO) says it’s a kind of file write that uses independent memory and can be done concurrently?

But I don’t see a pwrite in Julia, just in the Linux manual. Is there a standard way to do this in Julia?

Keno · April 15, 2017, 12:15pm

Yes, you basically give it an offset, and as long as you’re doing stripe-aligned writes (writes in multiples of the file system block size) and only write a given stripe from one process at a time, it’ll generally be pretty high performance. I just checked whether we have wrapped pwrite in filesystem.jl, but it doesn’t seem like it. For the quick and dirty solution see https://github.com/jeff-regier/Celeste.jl/blob/master/src/SDSSIO.jl#L736-L782, which is the same idea but for reads. At some point we should wrap it in Base.
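As a rough sketch of the same idea without calling pwrite itself: since the thread notes that Base does not wrap pwrite, the example below swaps in plain `seek` followed by `write`, with each task opening its own handle so no file position is shared. The names `write_chunk` and `simulate`, the file name, and the sizes are all hypothetical; the 4096-byte chunk is just an assumed block size and would need to match the actual filesystem.

```julia
using Distributed

# Write `data` into an existing file starting at byte `offset`.
# Each call opens its own handle, so tasks never share a file position.
@everywhere function write_chunk(path, offset, data::Vector{Float64})
    open(path, "r+") do io        # "r+" = read/write without truncating
        seek(io, offset)          # offsets are zero-based byte positions
        write(io, data)
    end
    return nothing
end

# Hypothetical stand-in for one simulation result of known, fixed length.
@everywhere simulate(task, n) = fill(Float64(task), n)

n_tasks = 4
n_per_task = 512                              # 512 Float64 values = 4096 bytes
chunk_bytes = n_per_task * sizeof(Float64)

# Pre-create the file at its final size (zero-filled).
path = joinpath(mktempdir(), "ensemble.bin")
open(path, "w") do io
    truncate(io, n_tasks * chunk_bytes)
end

# Each task owns exactly one non-overlapping region of the file.
pmap(1:n_tasks) do task
    write_chunk(path, (task - 1) * chunk_bytes, simulate(task, n_per_task))
end

# Read back only the third task's region.
third = open(path, "r") do io
    seek(io, 2 * chunk_bytes)
    read!(io, Vector{Float64}(undef, n_per_task))
end
```

Because the amount of data per task is fixed ahead of time, the offset of every region is known without any coordination between processes, which is what makes the pre-created file workable.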
