Appending an element to a JLD2 file


#1

I’m curious what the right way is to append elements to a vector stored in a JLD2 file, assuming that I want to save the data to disk every time I add an element (because each element is expensive to compute).

The first thing I tried was:

using JLD2

jldopen("test.jld2", "a+") do file
    file["x"] = []
end

for i in 1:10
    jldopen("test.jld2", "a+") do file
        push!(file["x"], 1)
    end
end

but that leaves the stored vector empty, because file["x"] gives me a copy of the data, and appending to that copy does nothing to the data on disk.

So, instead, I could do:

results = []
for i in 1:10 
    push!(results, 1)
    jldopen("test.jld2", "w") do file
        file["x"] = results
    end
end

but this will write the entire results vector to disk at every iteration, which seems wasteful.

Am I missing something obvious? Is there a better way?


#2

HDF5 (on which JLD is based) is not very efficient with small writes and, by default, leaves you open to losing data if your program crashes before the file is flushed. The newer single-writer/multiple-reader (SWMR) feature lets you flush a specific dataset, and I believe that syncs the dataset and its metadata to disk. I think you will be safe if you use SWMR, or if you simply flush the file after every write.

Use HDF5.jl directly to create a dataset with extendible dimensions (search for extendible), and push to that. The SWMR API is not in the HDF5.jl docs, but the swmr.jl test file shows the usage.
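
A minimal sketch of that approach, not a definitive implementation. It assumes a recent HDF5.jl in which create_dataset accepts a (dims, max_dims) tuple and HDF5.set_extent_dims resizes a chunked dataset (older releases spelled the resize set_dims!). The file name, dataset name, chunk size, and the append_value! helper are made up for illustration, and it flushes the whole file rather than using SWMR:

```julia
using HDF5

# Hypothetical helper: grow a 1-D dataset by one element and write `value` there.
function append_value!(dset, value)
    n = size(dset)[1]                      # current length on disk
    HDF5.set_extent_dims(dset, (n + 1,))   # extend along the unlimited dimension
    dset[n+1:n+1] = [value]                # write only the new element
    return n + 1
end

h5open("test.h5", "w") do file
    # Start empty; -1 marks the dimension as unlimited.
    # Extendible datasets must use chunked storage.
    dset = create_dataset(file, "x", Float64, ((0,), (-1,)), chunk=(64,))
    for i in 1:10
        append_value!(dset, Float64(i))    # stand-in for an expensive computation
        flush(file)                        # flush buffered data after every element
    end
end

x = h5read("test.h5", "x")                 # read the whole vector back
```

The chunk size is a trade-off: small chunks waste less space for short vectors, while larger chunks reduce per-chunk overhead once the dataset grows.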


#3

Also, both of your examples open and close the JLD file on each iteration, which is expensive. You might consider writing to a flat binary file on each iteration as a backup, and just writing the complete results in JLD at the end. Or keep the file open and store each element under its own key:

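using JLD2

# Open the file once and store each element under its own key ("x1", "x2", ...).
jldopen("test.jld2", "w") do file
    results = []
    for i in 1:10
        push!(results, 1)          # stand-in for an expensive computation
        file["x$i"] = results[i]   # write only the new element
    end
end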
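
The flat binary backup mentioned above could look something like the following sketch, which uses only Julia's built-in I/O functions. It assumes every element is a Float64; the file name backup.bin is hypothetical. The file is opened once, and each value is flushed as soon as it is written:

```julia
path = "backup.bin"   # hypothetical backup file name

# Open once, then write and flush each value as soon as it is computed.
open(path, "w") do io
    for i in 1:10
        value = Float64(i)   # stand-in for an expensive computation
        write(io, value)     # 8 bytes in native binary format
        flush(io)            # hand the bytes to the operating system now
    end
end

# Recover everything written so far; the element count follows from the file size.
n = filesize(path) ÷ sizeof(Float64)
recovered = read!(path, Vector{Float64}(undef, n))
```

After a crash, the recovery step gives back the elements that were flushed, and the loop can resume from index n + 1. Once the loop finishes, recovered can be written to the JLD2 file in one go.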

#4

Ok, makes sense. Thanks!


#5

x-ref: HDF5 speed?