Reading and writing HDF5 compound-typed array datasets

I wanted to write vectors of LabelledArrays to HDF5 files, preserving the labels, and be able to read these back and reconstruct the original data. My data is pretty much entirely numeric.

I’ve found it hard to find any relevant documentation; this is mentioned in an issue (#819) in HDF5.jl:

We are certainly lacking examples on writing compound data types in the documentation.

Indeed - the only mention at all is the (read and write) support for Complex. But I found that you can write arrays of NamedTuple and they end up being HDF5 array datasets with compound datatype as you’d expect. There are some restrictions, e.g. the fields can’t be strings.

Conversely, you can read such a dataset, even with a string field, and you get out a vector of NamedTuple including the string field.

So for example:

# File downloaded from https://www.neonscience.org/resources/learning-hub/tutorials/hdf5-intro-python
julia> using HDF5

julia> fn = "/Users/patrick/Desktop/NEONDSTowerTemperatureData.hdf5";

julia> data = h5open(fn, "r") do h5f
           read(h5f, "Domain_03/OSBS/min_1/boom_1/temperature")
       end;

julia> typeof(data), size(data)
(Vector{NamedTuple{(:date, :numPts, :mean, :min, :max, :variance, :stdErr, :uncertainty), Tuple{String, Int32, Vararg{Float64, 6}}}}, (4323,))

But the reverse doesn’t work (unless the string-typed field, date, is removed):

julia> h5open("test.h5", "w") do h5f
           write_dataset(h5f, "test_dataset", data)
       end
ERROR: ArgumentError: Could not convert non-bitstype NamedTuple{(:date, :numPts, :mean, :min, :max, :variance, :stdErr, :uncertainty), Tuple{String, Int32, Vararg{Float64, 6}}} to NamedTuple{(:date, :numPts, :mean, :min, :max, :variance, :stdErr, :uncertainty), Tuple{HDF5.FixedString{1, 0}, Int32, Vararg{Float64, 6}}} for writing to HDF5. Consider implementing `convert(::Type{NamedTuple{(:date, :numPts, :mean, :min, :max, :variance, :stdErr, :uncertainty), Tuple{HDF5.FixedString{1, 0}, Int32, Vararg{Float64, 6}}}}, ::NamedTuple{(:date, :numPts, :mean, :min, :max, :variance, :stdErr, :uncertainty), Tuple{String, Int32, Vararg{Float64, 6}}})`
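The "unless the string-typed field is removed" workaround can be sketched as below. This is only an illustration: `drop_field` is a hypothetical helper (not part of HDF5.jl), and the rows are made-up values that mimic a few fields of the dataset above.

```julia
# Hypothetical helper (not part of HDF5.jl): return `nt` without the field `name`.
drop_field(nt::NamedTuple, name::Symbol) =
    NamedTuple{filter(!=(name), keys(nt))}(nt)

# Made-up rows with the same kinds of fields as the dataset above.
rows = [
    (date = "2014-04-01 00:00:00", numPts = Int32(60), mean = 15.1, min = 14.9, max = 15.3),
    (date = "2014-04-01 00:01:00", numPts = Int32(60), mean = 15.2, min = 15.0, max = 15.4),
]

# Strip the string field from every row.
numeric_rows = map(r -> drop_field(r, :date), rows)

# The error above complains about a "non-bitstype" element type, so this
# check is a rough predictor of whether the write will be accepted.
isbitstype(eltype(rows))          # false: String is not a bits type
isbitstype(eltype(numeric_rows))  # true: only Int32 and Float64 remain
```

A vector like `numeric_rows` is the kind of value that can then be passed to `write_dataset` in place of `data`.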

So for my use case (all the fields will be numeric types, all Int and Float) the functionality for read and write is there, but I’m hesitant to use it given that it’s not documented.

From what I can tell looking at the PRs, adding compound datatype reading and writing functionality was a deliberate thing, but what I’m not clear on is whether the lack of documentation is just an oversight, or perhaps it’s because the functionality isn’t meant for external usage (maybe because of the problem with writing strings?).

I’m hoping it’s the former and I can just start using this as-is (and contribute a documentation PR). Does anyone happen to know?

(Edit: cross-referenced this in a comment to the above-mentioned issue).


Maybe this is irrelevant, but do you know you can use JLD2.jl and it saves in HDF5 format? IMO the main reason to use HDF5.jl directly is for interoperability with other programming languages: JLD2 makes specific choices about storage format, and perhaps that’s not the format you’d prefer things be stored in. (HDF5 allows many different solutions to the same problem.)

Definitely relevant, because I wasn’t aware of that fact, thanks! I don’t think it’s the right tool for this particular job since we do want interoperability, but it looks like it could be useful in some contexts. Basically it seems to be a near-equivalent to pickling in Python (with similar caveats), is that about right?

Our last approach on this was

julia> using HDF5

julia> nt = (x=5, y=6)
(x = 5, y = 6)

julia> h5open("test.h5", "w") do h
           write_dataset(h, "test", [nt])
       end

julia> h5open("test.h5", "r") do h
           h["test"][]
       end
1-element Vector{NamedTuple{(:x, :y), Tuple{Int64, Int64}}}:
 (x = 5, y = 6)

One of the main issues you seem to be running into is that your NamedTuple contains a String. This creates some headaches, since variable-length strings are not great to work with in HDF5. Preferably one would use fixed-length strings.

There is HDF5.FixedString but this has limited utility. For this reason I started to work on StaticStrings.jl as an expanded version of this. An alternative would be using InlineStrings.jl but this does not allow arbitrary string sizes.
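When the string field is really a timestamp, as `date` appears to be in the file above, another way to sidestep strings is to store the value as a number, since all-numeric fields write without trouble. The sketch below is an assumption-laden illustration: `encode_date` is a hypothetical helper, and the timestamp layout in `date_format` is a guess that would need adjusting to the strings actually stored in the file.

```julia
using Dates

# Assumed timestamp layout; adjust to match the strings in the real file.
date_format = dateformat"yyyy-mm-dd HH:MM:SS"

# Hypothetical helper: swap the string `date` field for seconds since the
# Unix epoch (a Float64), keeping the field name and position unchanged.
encode_date(nt::NamedTuple) =
    merge(nt, (date = datetime2unix(DateTime(nt.date, date_format)),))

# Made-up row for illustration.
row = (date = "2014-04-01 00:01:00", numPts = Int32(60), mean = 15.2)
encoded = encode_date(row)

isbitstype(typeof(encoded))   # true: every field is now numeric
unix2datetime(encoded.date)   # converts the number back to a DateTime
```

The trade-off is that readers in other languages see a plain floating-point column and need to know it holds Unix seconds, so the convention is worth recording, for example in an attribute on the dataset.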

Regarding the “should you use this” question, the strongest affirmative support is that we are testing for some cases of NamedTuple.

The honest truth here is that there is a lot of work to do on this package, and we could use all of the help we can get. My attention is currently on the compilation side, reviewing and assisting with this pull request:

If you would like to document your use case, I would be happy to accept the pull request.
