Problems reading compressed HDF5 file created in Python

I’m having problems reading an HDF5 file that was saved in Python with the pytables library (www.pytables.org). The file I’m trying to read is available at:

https://www.cefala.org/~adriano/.shared/simple_mjpeg_frames_0_12_flow_field.hdf

The structure of the file is:

/header/header_table
/flow/x
/flow/y

where the /flow/x and /flow/y datasets should be vectors of strings (for this particular file, the vectors are 12 elements long). Each element is a string that was created from a numpy array using the tostring() method. The array of strings was compressed by pytables using the zlib library, with a compression level of 2 (I can give more details on that if needed). When reading the HDF file in Python, the strings are converted back to numpy arrays by using the numpy.fromstring() function.

The Julia code I’m using right now is:

using HDF5

file_name = "simple_mjpeg_frames_0_12_flow_field.hdf"

fid = h5open(file_name, "r")

# The header dataset
header = fid["/header/header_table"]

# The "flow_x" and "flow_y" data sets
flow_x = fid["/flow/x"]
flow_y = fid["/flow/y"]

# Read the array of strings
frames = read(flow_x)

In the code above, the variable frames comes up as an array of strings of the right length (12 elements), but the strings are all empty! I’ve read the docs for HDF5.jl more than once but I still don’t understand what I’m missing. Can anybody help me out? Any help is greatly appreciated.

I tried to explicitly set the compression information like this:

using HDF5

file_name = "simple_mjpeg_frames_0_12_flow_field.hdf"

frames = h5read(file_name, "/flow/x", "deflate", 0x2)

but the results are exactly the same: a vector of empty strings.

HDF5.jl may not yet have support for reading arrays of raw bytes. As a workaround, try going through PyTables with PyCall:

using PyCall

pt = pyimport_conda("tables", "pytables")
np = pyimport_conda("numpy", "numpy")

h5file = pt.open_file("simple_mjpeg_frames_0_12_flow_field.hdf", "r")

arr = h5file.root.flow.x.read()

[[reinterpret(Int32, c) for c in v] for v in arr]

The underlying problem is described in the "Strings in HDF5" section of the h5py 3.10.0 documentation:

You can’t store arbitrary binary data in HDF5 strings. Not only will this break, it will break in odd, hard-to-discover ways that will leave you confused and cursing.

If you have a non-text blob in a Python byte string (as opposed to ASCII or UTF-8 encoded text, which is fine), you should wrap it in a void type for storage. This will map to the HDF5 OPAQUE datatype, and will prevent your blob from getting mangled by the string machinery.
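
To see how this kind of mangling could produce the empty and shortened strings reported in this thread, here is a minimal Python sketch. It assumes (this is not confirmed for HDF5.jl or h5py) that the reader treats each element as a NUL-terminated C string and stops at the first zero byte. The helper `as_c_string` and the frame values are made up for illustration; the real frames came from numpy arrays whose dtype is not stated here.

```python
import struct


def as_c_string(blob: bytes) -> bytes:
    """Model a reader that stops at the first NUL byte (assumed behaviour)."""
    return blob.split(b"\x00", 1)[0]


# Hypothetical frames: four little-endian 32-bit integers each (16 bytes).
frame_a = struct.pack("<4i", 0, 5, -3, 7)    # starts with byte 0x00
frame_b = struct.pack("<4i", 258, 5, -3, 7)  # starts with bytes 02 01 00 00

print(len(frame_a), len(as_c_string(frame_a)))  # 16 0 -> looks like an empty string
print(len(frame_b), len(as_c_string(frame_b)))  # 16 2 -> shorter than it should be
```

Binary numeric data almost always contains zero bytes, so under this assumption nearly every element would come back truncated or empty.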

Base64 encoding into strings is also an option (at some storage/performance cost, of course).
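
A minimal sketch of the Base64 idea in Python, using only the standard library. The values and the `"<4i"` layout (four little-endian 32-bit integers) are placeholders for illustration, not the format of the file discussed here. Base64 output is plain ASCII with no zero bytes, so it can be stored as ordinary text, and it is about one third larger than the raw bytes.

```python
import base64
import struct

# Hypothetical data standing in for one frame.
values = (0, 258, -3, 7)
blob = struct.pack("<4i", *values)  # 16 raw bytes, including zero bytes

# Encode to ASCII text that is safe to store in a string dataset.
text = base64.b64encode(blob).decode("ascii")
print(len(blob), len(text))  # 16 24

# Decode back to the original integers.
restored = struct.unpack("<4i", base64.b64decode(text))
print(restored == values)  # True
```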

HDF5 is great at writing and reading numerical arrays across many platforms. It sounds like you may be going out of your way to store a numerical array as a compressed string, and your task could be much easier if you just store the numerical data directly.

Thanks for all the replies.

The file in question was created by an old piece of Python code that hasn’t been touched in many years. At the time this code was written, we decided to use compressed strings because they produced considerably smaller files, which mattered because the files being saved were huge. I appreciate the suggestions to avoid strings (especially @stillyslalom’s quote from the h5py project), but I have many old HDF5 files in this format that I need to read.

Interestingly, before I got any answers, I tried h5py directly in Python and the result was the same as Julia’s: a vector of empty matrices.

I then tried @stillyslalom’s code (using pytables through Conda) and it works. However, some of the strings it returns are corrupted (i.e., they’re shorter than they should be). Using pytables directly in Python works as expected.

So I think I’ll stick with Python for the time being. At some point I’ll try to find the time to migrate the whole thing to Julia.

Thanks to all.
