Stream into JLD2 file

I am trying to write a larger-than-memory dataset to a JLD2 file in a streaming fashion. The dataset is generated by an iterator of known length and element type, so I thought I could use JLD2.create_dataset() to create the dataset and then write the elements as they become available. But the following code gives me an ERROR: ArgumentError: Dataset is not an array.

using JLD2

struct MyType
    number::ComplexF64
    integer::UInt64
end

random_entry() = MyType(rand(ComplexF64), rand(UInt64))

jldopen("filename.jld2", "w") do file
    file["a_number"] = 5
    dataset = JLD2.create_dataset(file, "data", MyType, (10,))
    for i in 1:10
        dataset[i] = random_entry()
    end
end

What would be the correct syntax to create such an empty dataset and then incrementally add entries as they become available?

You could use HDF5.jl instead (a JLD2 file is an HDF5 file under the hood anyway): it allows you to create a dataset with “unlimited” dimensions and append to it incrementally, a feature implemented by the underlying HDF5 library.
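A minimal sketch of that pattern with HDF5.jl might look like the following. It streams plain Float64 values for illustration, since a custom struct like MyType would additionally need an HDF5 compound datatype; the filename and chunk size are placeholders, not recommendations.

```julia
using HDF5

h5open("streamed.h5", "w") do file
    # Current size 0, unlimited maximum size (-1).
    # Chunking is required by HDF5 for extensible datasets.
    dset = create_dataset(file, "data", Float64, ((0,), (-1,)); chunk=(1024,))
    n = 0
    for _ in 1:10
        n += 1
        HDF5.set_extent_dims(dset, (n,))  # grow the dataset by one element
        dset[n] = rand()
    end
end
```

Growing one element at a time keeps the sketch simple; in practice you would extend by a chunk's worth of elements at a time to avoid repeated metadata updates.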

I did consider that, and when I inspected the HDF5 files created by JLD2 I found them a bit confusing, in particular how it serializes my custom type MyType. But is there maybe a way to “mostly” use JLD2 and use HDF5 only for the streaming bit? I.e. do something along the lines of

using JLD2

struct MyType
    number::ComplexF64
    integer::UInt64
end

const HDF5MyType = # whatever the type would be that JLD2 serialises MyType to
convert_to_hdf5(x::MyType)::HDF5MyType = # logic to convert MyType to HDF5MyType 

random_entry() = MyType(rand(ComplexF64), rand(UInt64))

jldopen("filename.jld2", "w") do file
    file["a_number"] = 5
    hdf5_file = # somehow get the hdf5 file pointer wrapped by file
    dataset = HDF5.create_dataset(hdf5_file, "data", hdf_mytype_type, (10,))
    for i in 1:10
        dataset[i] = random_entry() |> convert_to_hdf5
    end
end

JLD2 re-implements just the portions of the HDF5 format needed for serializing Julia data, optimized for that purpose, so it’s probably not possible to use it and the HDF5 library on the same file at the same time. And I don’t think JLD2 implements HDF5’s extensible-dataset feature.

However, if you use the original JLD.jl library, its files should be compatible with JLD2, and since JLD.jl uses HDF5.jl directly you should be able to access the extensible-dataset support.

You can also split your data into large chunks and save each chunk when it is ready. HDF5 can do this automatically internally, and there is a vibe-coded exploration of what automatic chunking in JLD2 might look like: “Add writing chunked arrays to JLD2” by JonasIsensee, Pull Request #687, JuliaIO/JLD2.jl on GitHub.
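A minimal sketch of that manual-chunking approach with plain JLD2 (the key names like "data/chunk_$i" and the chunk size are conventions chosen here for illustration, not a JLD2 API):

```julia
using JLD2

struct MyType
    number::ComplexF64
    integer::UInt64
end

random_entry() = MyType(rand(ComplexF64), rand(UInt64))

chunk_size = 4
jldopen("chunked.jld2", "w") do file
    buffer = MyType[]
    nchunks = 0
    for i in 1:10
        push!(buffer, random_entry())
        if length(buffer) == chunk_size
            nchunks += 1
            file["data/chunk_$nchunks"] = buffer  # write a full chunk, free the buffer
            buffer = MyType[]
        end
    end
    if !isempty(buffer)  # flush the final partial chunk
        nchunks += 1
        file["data/chunk_$nchunks"] = buffer
    end
    file["data/nchunks"] = nchunks
end

# Reassemble the full vector later:
data = jldopen("chunked.jld2", "r") do file
    reduce(vcat, (file["data/chunk_$i"] for i in 1:file["data/nchunks"]))
end
```

Only one chunk's worth of data is in memory at a time during writing, and each chunk is an ordinary JLD2 dataset, so MyType is serialized exactly as JLD2 normally would.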

Something to look out for is that with both JLD2 and HDF5, if your program crashes mid-append, the entire file may become corrupted.

This should be possible!
This is relatively new and experimental, and hence missing a proper API…
Please open an issue about this.

julia> jldopen("filename.jld2", "w") do file
           dims = (10,)
           type = MyType
           # create the dataset
           dset = JLD2.create_dataset(file, "data", type, dims)
           # properly compute the datatype and dataspace
           dset.datatype = JLD2.h5fieldtype(file, type, type, Val{false})
           dset.dataspace = JLD2.WriteDataspace(JLD2.DS_SIMPLE, UInt64.(reverse(dims)), ())
           # allocate space in the file
           JLD2.allocate_early(dset, type)
           # Wrap in an ArrayDataset purely for enabling the `setindex!` syntax.
           arr_dset = JLD2.ArrayDataset(dset)

           # Now you can finally use `setindex!` and `getindex` (even with slicing) on your dataset
           for i in 1:10
               arr_dset[i] = random_entry()
           end
       end

julia> load("filename.jld2", "data")
10-element Vector{MyType}:
 MyType(0.766374308782779 + 0.11729864178339178im, 0x0626f98eebb7dfea)
 MyType(0.9064816833136069 + 0.09066160816855895im, 0x13c42e706b2d4d11)
 MyType(0.5688427673261466 + 0.24529031686957314im, 0xdac65158cf010e4a)
 MyType(0.5927902348006218 + 0.3875890551840936im, 0xffa5d651aa24d6a0)
 MyType(0.10769436915479025 + 0.4072549111427217im, 0x090e3b706d911ba6)
 MyType(0.23230564772855966 + 0.04665201130970498im, 0x0f331f14f34cd847)
 MyType(0.4687520222054148 + 0.9234981438350177im, 0x5a1941442978a130)
 MyType(0.9892746567394705 + 0.6058069267401043im, 0x68ffdfac0c6e2929)
 MyType(0.4207064651484389 + 0.7265640369544515im, 0xe18253ee50dfff21)
 MyType(0.49064364773021263 + 0.1922831135335542im, 0xca0188a4aa6830c8)

All of this should be wrapped in a proper API, but if you want to use it now, it already works :wink:


Note that this still requires you to know the length of your data in advance. It would be nice to support HDF5’s “extensible” dataset feature, too. Not sure if the OP needs that feature, though.

That is correct.

There are open PRs for adding chunking features to JLD2 as well. However, these were heavily AI assisted and need further review and real-world testing.