Serializing many different arrays incrementally over time

I have a simulation that will run for a long time and produce a lot of results, and those results need to go to disk while the sim is running rather than all at once at the end (there isn’t enough RAM to hold them all). The simulation’s outputs are essentially a bunch of different Vector{Any}, and elements are added to those arrays at different times. The elements are not primitive types but structured data, and the structure can even differ between elements. I’m looking for a way to store all of this, and I don’t feel like I’ve found the right thing yet, so I’m looking for suggestions.

Just to illustrate the point, here’s what I’m thinking of putting together. My primary hesitation is just that I feel like this is surely a solved problem, and I must not be searching for the right thing.

On disk, we’d end up with something like this:

[array ID, number of bytes, all the bytes for the serialized element, ...]

So I’d then say, “Hey, give me array ID 117,” and it would start with an empty Any[]. Then it would scan through the file: each time it found a record with array ID 117, it would read the byte count, deserialize that many bytes, push the resulting element into the array, and continue scanning for the next ID-117 record.

There’s a lot we could imagine doing here, but this feels like the essence.
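
For concreteness, here’s a rough Julia sketch of that scheme, using Serialization for the element bytes (the function names, and the choice of Int for the ID and length fields, are just for illustration):

```julia
using Serialization

# Append one record: [array ID, number of bytes, serialized element].
function append_element(io::IO, array_id::Int, element)
    buf = IOBuffer()
    serialize(buf, element)
    bytes = take!(buf)
    write(io, array_id)       # array ID (8 bytes)
    write(io, length(bytes))  # number of bytes (8 bytes)
    write(io, bytes)          # the serialized element itself
end

# Scan the whole file, collecting every element whose record matches wanted_id.
function read_array(io::IO, wanted_id::Int)
    out = Any[]
    while !eof(io)
        id = read(io, Int)
        nbytes = read(io, Int)
        if id == wanted_id
            push!(out, deserialize(IOBuffer(read(io, nbytes))))
        else
            skip(io, nbytes)  # not our array; jump over the payload
        end
    end
    return out
end
```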

One option I’ve considered is HDF5. I’d make a dataset for each array ID that’s an array of UInt8, and grow that dataset over time, dumping in serialized elements in chunks. But storing serialized blobs on the fly is not exactly what HDF5 is meant for, and I’m concerned this would be tedious, slow, and non-standard.
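
Roughly, I imagine it would look something like this with HDF5.jl’s extendible datasets (untested; I’m assuming create_dataset with an unlimited max dimension and HDF5.set_extent_dims behave as sketched):

```julia
using HDF5

bytes = rand(UInt8, 100)  # stand-in for one serialized element

h5open("results.h5", "cw") do fid
    name = "array_117"
    # ((0,), (-1,)) = start empty with unlimited max length; a chunked
    # layout is required for extendible datasets.
    dset = haskey(fid, name) ? fid[name] :
        create_dataset(fid, name, UInt8, ((0,), (-1,)); chunk = (1 << 16,))
    old = size(dset, 1)
    HDF5.set_extent_dims(dset, (old + length(bytes),))
    dset[old+1:old+length(bytes)] = bytes
end
```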

Another option is to literally implement the above from scratch, but if there’s a standard approach with existing support, I’d much prefer to use that instead.

Thanks for any tips!

Using HDF5 for this should work; I’ve used it in similar scenarios without trouble. IIRC the main thing to watch out for is making sure the HDF5 file doesn’t end up in an invalid state (e.g. when the compute job aborts while you’re writing to the file).

Apart from HDF5 you could consider JLD2, which is essentially HDF5 plus Serialization of arbitrary Julia objects. I must admit, though, that I’ve always preferred working with HDF5 directly for “serious” computations, mostly because HDF5 is an established standard.

Otherwise, you could perhaps even consider just using Serialization and the file system, i.e. creating a separate file for each array. In other words, why does it need to be a single file?
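
For example, a minimal sketch (the file name and element are placeholders):

```julia
using Serialization

element = (t = 1.0, state = [1, 2, 3])  # stand-in for one structured result

# Append one element; opening and closing per write means you never hold
# many file handles at once.
open("array_117.jls", "a") do io
    serialize(io, element)
end

# Read all of this array's elements back later.
elements = open("array_117.jls") do io
    out = Any[]
    while !eof(io)
        push!(out, deserialize(io))  # one deserialize per serialize call
    end
    out
end
```

Repeated serialize calls appended to the same file can be read back with repeated deserialize calls, so appending just works.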

I guess one could also mention ADIOS2.jl here. However, I haven’t used it much beyond simple tests.


Perhaps Zarr.jl is another option to consider.

Thanks for your reply!

Regarding JLD2, I didn’t see a way to append to an array stored inside a JLD2 file over time. Is there a way to do that that I missed?

Yes, I thought that would be a simple way to do it, but I’ll have thousands of arrays, which is more file handles than the OS will allow open at once.

Zarr.jl and ADIOS2.jl are both interesting. I’ll spend some time checking those out. Thanks!
