Reading the documentation of HDF5.jl (Home · HDF5.jl), I see that it is possible to have an in-memory HDF5 file. I was wondering:
Can I update a dataset in an HDF5 file in place?
Can I save a copy of the current in-memory hdf5 file to disk?
I have a use case in which, over N timesteps, I have a fixed number of data points I want to save. I thought that HDF5 would be perfect for this, especially if the updates could be done in place.
The example shows how to obtain a Vector{UInt8} of the file image. If you write those bytes out to disk, you can open the result just like a regular HDF5 file.
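For instance, assuming you already have the file image as a `Vector{UInt8}` (obtained as in the documentation example), persisting it is a plain byte write. This sketch fakes the in-memory step by round-tripping a small file through a temporary path; the dataset name `x` and the paths are illustrative:

```julia
using HDF5

# Stand-in for obtaining an in-memory file image: write a small HDF5
# file and read its raw bytes back.
tmp = tempname() * ".h5"
h5open(tmp, "w") do fid
    fid["x"] = collect(1.0:10.0)
end
buf = read(tmp)               # Vector{UInt8}: the raw HDF5 file image

# Writing the image bytes to disk yields a regular HDF5 file.
out = tempname() * ".h5"
write(out, buf)
x = h5open(out, "r") do fid
    read(fid["x"])
end
```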
This is a highly specialized operation, and I’m not sure about your specific use case. Even for an HDF5 file on disk, you can update a subsection of a dataset in place.
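A minimal sketch of such an in-place update of a dataset slice on disk (the file path, dataset name, and sizes are illustrative):

```julia
using HDF5

path = tempname() * ".h5"
h5open(path, "w") do fid
    # Preallocate a 100×10 Float64 dataset on disk.
    create_dataset(fid, "data", Float64, (100, 10))
end

# Reopen read-write and overwrite a single column in place,
# without rewriting the rest of the dataset.
h5open(path, "r+") do fid
    fid["data"][:, 1] = fill(2.5, 100)
end

col = h5open(path, "r") do fid
    read(fid["data"])[:, 1]
end
```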
Before we get into an XY problem situation, could you explain what you are trying to do or optimize?
What I want to do in reality is save data files at each output step of my simulation. Since I know the number of data points beforehand and it never changes, I thought I could “preallocate” an HDF5 file in memory, efficiently overwrite it in place, then make a copy and save it to disk. I thought that by doing so I could get an extra speed-up and reduce allocations.
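That pattern can be sketched with the Core (in-memory) file driver. I'm assuming the `Drivers.Core` keyword interface from recent HDF5.jl versions here, so treat this as a sketch under those assumptions rather than a tested recipe; with `backing_store=true`, the in-memory image is flushed to the named path when the file is closed:

```julia
using HDF5
using HDF5: Drivers

# Open an in-memory HDF5 file; backing_store=true means the image is
# written to `path` on close.
path = tempname() * ".h5"
fid = h5open(path, "w"; driver=Drivers.Core(; backing_store=true))

# Preallocate once, then overwrite in place at each timestep.
create_dataset(fid, "points", Float64, (100,))
for step in 1:5
    fid["points"][:] = fill(Float64(step), 100)   # in-place overwrite
end
close(fid)   # flushes the final image to disk

vals = h5open(path, "r") do f
    read(f["points"])
end
```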
I found that a simple solution for fast file writing with HDF5.jl, without resorting to the complexity above, is to save all data into one single file:
```julia
using HDF5

# Write each variable as a dataset under a new group in an open file.
function SaveHDF5!(fid::HDF5.File, group_name, variable_names, args...)
    g = create_group(fid, group_name)
    for (var_name, arg) in zip(variable_names, args)
        g[var_name] = arg
    end
end
```
Here I pass in a persistent file handle and update the group name and the variable inputs as needed. This gives me the following timings: