I am trying to write a larger-than-memory dataset to a JLD2 file in a streaming way. The dataset is generated by an iterator of known length and element type, so I thought I could use `JLD2.create_dataset()` to create the dataset and then write the elements as they become available. But the following code gives me an `ERROR: ArgumentError: Dataset is not an array`:
```julia
using JLD2

struct MyType
    number::ComplexF64
    integer::UInt64
end

random_entry() = MyType(rand(ComplexF64), rand(UInt64))

jldopen("filename.jld2", "w") do file
    file["a_number"] = 5
    dataset = JLD2.create_dataset(file, "data", MyType, (10,))
    for i in 1:10
        dataset[i] = random_entry()
    end
end
```
What would be the correct syntax to create such an empty dataset and then incrementally add entries as they become available?
I did consider that and inspected the HDF5 files created by JLD2, but found them a bit confusing, in particular how JLD2 serialises my custom type `MyType`. But is there maybe a way to "mostly" use JLD2, and drop down to HDF5 only for the streaming bit? I.e. something along the lines of:
```julia
using HDF5
using JLD2

struct MyType
    number::ComplexF64
    integer::UInt64
end

const HDF5MyType = # whatever the type would be that JLD2 serialises MyType to
convert_to_hdf5(x::MyType)::HDF5MyType = # logic to convert MyType to HDF5MyType

random_entry() = MyType(rand(ComplexF64), rand(UInt64))

jldopen("filename.jld2", "w") do file
    file["a_number"] = 5
    hdf5_file = # somehow get the HDF5 file handle wrapped by file
    dataset = HDF5.create_dataset(hdf5_file, "data", HDF5MyType, (10,))
    for i in 1:10
        dataset[i] = random_entry() |> convert_to_hdf5
    end
end
```
JLD2 re-implements just the portions of the HDF5 format needed for serialising Julia data, in order to optimise performance, so it's probably not possible to use both it and the HDF5 library on the same file at the same time. And I don't think JLD2 implements HDF5's extensible-dataset feature.
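If mixing the two in one file is off the table, one option is to write that one dataset with plain HDF5.jl into a separate file. That also makes the manual `HDF5MyType`/`convert_to_hdf5` step unnecessary, because HDF5.jl can derive a compound datatype from an `isbits` struct on its own. An untested sketch (file name and one-element hyperslab writes are my choices, not anything JLD2-specific):

```julia
using HDF5  # writes a plain HDF5 file, separate from the JLD2 file

struct MyType
    number::ComplexF64
    integer::UInt64
end

random_entry() = MyType(rand(ComplexF64), rand(UInt64))

h5open("stream.h5", "w") do file
    # HDF5.jl builds a compound datatype from the isbits struct automatically
    dset = create_dataset(file, "data", MyType, (10,))
    for i in 1:10
        # write one element at a time as a length-1 hyperslab
        dset[i:i] = [random_entry()]
    end
end
```

If I remember correctly, `read(file["data"])` then gives you back a vector of `NamedTuple`s mirroring the struct fields.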
However, if you use the original JLD.jl library, its files should be compatible with JLD2, and since JLD.jl uses HDF5.jl directly, you should be able to access the extensible-dataset support through it.
Note that this still requires you to know the length of your data in advance. It would be nice to support HDF5’s “extensible” dataset feature, too. Not sure if the OP needs that feature, though.
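For reference, the extensible variant in HDF5.jl looks roughly like this (untested sketch; extensible datasets must be chunked, and `-1` in `max_dims` marks an unlimited dimension — file name, chunk size, and the `Float64` element type are just placeholders):

```julia
using HDF5

h5open("extensible.h5", "w") do file
    # start with length 0, allow unlimited growth; chunking is required
    dset = create_dataset(file, "data", Float64,
                          dataspace((0,); max_dims=(-1,)); chunk=(1024,))
    for batch in ([1.0, 2.0, 3.0], [4.0, 5.0])
        n = length(dset)
        # grow the dataset, then fill the newly added tail
        HDF5.set_extent_dims(dset, (n + length(batch),))
        dset[(n + 1):(n + length(batch))] = batch
    end
end
```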
There are open PRs for adding the chunking features to JLD2 as well. However, these were heavily AI-assisted and need further review and real-world testing.