Serializing many different arrays incrementally over time

I have a simulation that will run for a long time and produce a lot of results, and those results need to go to disk while the sim is running rather than all at once at the end (there isn’t enough RAM to hold them all). The simulation’s outputs are essentially a bunch of different Vector{Any}, and elements are added to those arrays at different times. The elements are not primitive types but structured data, and the structure can even differ between elements. I’m looking for a way to store all of this, and I don’t feel like I’ve found the right thing yet, so I’m looking for suggestions.

Just to illustrate the point, here’s what I’m thinking of putting together. My primary hesitation is just that I feel like this is surely a solved problem, and I must not be searching for the right thing.

On disk, we’d end up with something like this:

[array ID, number of bytes, all the bytes for the serialized element, ...]

So I’d then say, “Hey, give me array ID 117,” and it would start with an empty Any[]. Then it would scan through the file: each time it found a record with array ID 117, it would read the byte count, deserialize that many bytes, push the resulting element into the array, and continue scanning for the next ID-117 record.

There’s a lot we could imagine doing here, but this feels like the essence.
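
For concreteness, here’s a rough Julia sketch of that scheme, using Serialization for the element bytes (the function names, and the choice of Int for the ID and length fields, are just for illustration):

```julia
using Serialization

# Append one record: [array ID, number of bytes, serialized element].
function append_element(io::IO, array_id::Int, element)
    buf = IOBuffer()
    serialize(buf, element)
    bytes = take!(buf)
    write(io, array_id)       # array ID (8 bytes)
    write(io, length(bytes))  # number of bytes (8 bytes)
    write(io, bytes)          # the serialized element itself
end

# Scan the whole file, collecting every element whose record matches wanted_id.
function read_array(io::IO, wanted_id::Int)
    out = Any[]
    while !eof(io)
        id = read(io, Int)
        nbytes = read(io, Int)
        if id == wanted_id
            push!(out, deserialize(IOBuffer(read(io, nbytes))))
        else
            skip(io, nbytes)  # not our array; jump over the payload
        end
    end
    return out
end
```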

One option I’ve considered is HDF5. I’d make a dataset for each array ID that’s an array of UInt8, and grow that dataset over time, dumping in serialized elements in chunks. But storing serialized blobs on the fly is not exactly what HDF5 is meant for, and I’m concerned this would be tedious, slow, and non-standard.
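
Roughly, I imagine it would look something like this with HDF5.jl’s extendible datasets (untested; I’m assuming create_dataset with an unlimited max dimension and HDF5.set_extent_dims behave as sketched):

```julia
using HDF5

bytes = rand(UInt8, 100)  # stand-in for one serialized element

h5open("results.h5", "cw") do fid
    name = "array_117"
    # ((0,), (-1,)) = start empty with unlimited max length; a chunked
    # layout is required for extendible datasets.
    dset = haskey(fid, name) ? fid[name] :
        create_dataset(fid, name, UInt8, ((0,), (-1,)); chunk = (1 << 16,))
    old = size(dset, 1)
    HDF5.set_extent_dims(dset, (old + length(bytes),))
    dset[old+1:old+length(bytes)] = bytes
end
```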

Another option is to literally implement the above from scratch, but if there’s a standard approach with existing support, I’d much prefer to use that instead.

Thanks for any tips!

Using HDF5 for this should work; I’ve used it in similar scenarios without trouble. IIRC the main thing to watch out for is making sure the HDF5 file doesn’t end up in an invalid state (e.g. when the compute job aborts while you’re writing to the file).

Apart from HDF5 you could consider JLD2, which is essentially HDF5 plus Serialization of arbitrary Julia objects. I must admit, though, that I’ve always preferred working with HDF5 directly for “serious” computations, mostly because HDF5 is an established standard.

Otherwise, you could perhaps even consider just using Serialization and the file system, i.e. creating a separate file for each array. In other words, why does it need to be a single file?
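
For example, a minimal sketch (the file name and element are placeholders):

```julia
using Serialization

element = (t = 1.0, state = [1, 2, 3])  # stand-in for one structured result

# Append one element; opening and closing per write means you never hold
# many file handles at once.
open("array_117.jls", "a") do io
    serialize(io, element)
end

# Read all of this array's elements back later.
elements = open("array_117.jls") do io
    out = Any[]
    while !eof(io)
        push!(out, deserialize(io))  # one deserialize per serialize call
    end
    out
end
```

Repeated serialize calls appended to the same file can be read back with repeated deserialize calls, so appending just works.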

I guess one could also mention ADIOS2.jl here. However, I haven’t used it much beyond simple tests.


Perhaps Zarr.jl is another option to consider.

Thanks for your reply!

Regarding JLD2, I didn’t see a way to append to an array stored inside a JLD2 file over time. Is there a way to do that that I missed?

Yes, I thought that would be a simple way to do it, but I’ll have thousands of arrays, which is more file handles than the OS will allow open at once.

Zarr.jl and ADIOS2.jl are both interesting. I’ll spend some time checking those out. Thanks!
