Question on Loading HDF Data Saved in Python

TI36XPro · May 22, 2024, 12:58pm

Hello,

Does anyone have experience with this?

My colleague has some large data saved on our cluster in the .h5 file format. He saved it in python using the h5py package.

I’d like to now open and analyze the same data in Julia. Do you know if that makes sense or is even possible if the data was saved in Python? I had assumed it should be since it is using a common file format (HDF).

I’ve tried the following

using HDF5
datapath = "/path/to/data/results.h5"
data = h5open(datapath, "r")

But I get the following error.

LoadError: unable to determine if /path/to/data/results.h5 is accessible in the HDF5 format (file may not exist)

The file definitely exists. Could it be that it is in the HDF4 format? (not sure how I could tell from the file extension).

Any ideas?

mkitti · May 22, 2024, 1:07pm

Double check yourself with isfile(datapath).

Otherwise, this suggests that the file is corrupt. Check the file length or checksum.
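
A minimal sketch of that length-and-checksum check, written in Python only because the files here were produced with h5py. The helper name `file_size_and_sha256` and the choice of SHA-256 are assumptions for illustration; use whatever hash the original source of the file publishes, if it publishes one.

```python
import hashlib
import os


def file_size_and_sha256(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, hex SHA-256 digest) without loading the whole file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so multi-gigabyte files do not exhaust memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

Running this on both the original copy and the copy on the cluster and comparing the two results should show whether the bytes changed in transit. A mismatch in size alone is already enough to suspect a truncated download.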

TI36XPro · May 22, 2024, 1:08pm

Thanks for the suggestion. That returns true. So I guess it must be corrupt.

mkitti · May 22, 2024, 1:09pm

Try

open(datapath, "r") do io
    read(io, 8)
end

Basically this should read out a signature as follows.

julia> open("test.h5", "r") do io
           String(read(io, 8))
       end
"\x89HDF\r\n\x1a\n"

TI36XPro · May 22, 2024, 1:14pm

Thank you - I appreciate the tip. I was wondering how to read out bytes from a file.

Here is what I get

"OK侤|\xea\xbe"

Does that confirm it is corrupt?

Edit: Had looked at the wrong file. Now correctly updated with what read returns.
Edit 2: Added string from the above example, and updated answer accordingly.

mkitti · May 22, 2024, 1:17pm

The first file you read seemed fine. The new file seems incorrect.

TI36XPro · May 22, 2024, 1:19pm

Here is the output from your code from another one of the files.

"\xd7#\x8a\xbeR\x18\xaf\xbe"

I don’t have your intuition on what would appear correct or not (if you don’t mind sharing any hints at how you determine that I’d appreciate it!).

mkitti · May 22, 2024, 1:22pm

How are you downloading these files?

mkitti · May 22, 2024, 1:22pm

Try

HDF5.ishdf5(datapath)

TI36XPro · May 22, 2024, 1:28pm

returns false.

I am not downloading them. My colleague did, no idea how they did it.

Not sure if it matters, but the files are about 6.5 GB each. I just tried the same strategy outlined above on a smaller .h5 file downloaded from the same source (about 5 MB) and I got no issues or errors at all.

Could it be related to file size? Maybe the bigger file took longer to download and some type of corruption happened while that was occurring? I’m just spitballing here so if that doesn’t make any sense please ignore.

mkitti · May 22, 2024, 1:30pm

"\x89HDF\r\n\x1a\n" is the only correct signature. HDF5 does have a mechanism that allows extra data at the beginning of the file, but this is rare.

Can you open the same files in h5py?

moble · May 23, 2024, 4:57am

Just to be clear, the answer to the original question is yes, you definitely should be able to open an .h5 file generated by h5py and read it in Julia with HDF5.jl.

@mkitti is correctly showing you how to read the “file signature” (aka, “magic bytes” or “magic number”). Most common file formats begin with several specific bytes to help identify the format. HDF5 uses \x89HDF\r\n\x1a\n. A gzipped file (even a .tar.gz) would start with \x1F\x8B. And so on.

The fact that your files don’t start with \x89HDF\r\n\x1a\n means they are not valid HDF5 files. HDF4’s magic number is \x0e\x03\x13\x01, so your files aren’t in HDF4. In fact, I can’t find any file format that starts with either of your strings.
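
As a rough illustration of how such a check can work when the format is not known in advance, here is a small Python sketch that compares the start of a file against a table of signatures. The HDF5, HDF4, and gzip entries are the ones quoted above; the PNG entry is added for comparison. The table and the helper name `guess_format` are hypothetical, and the sketch ignores HDF5 user blocks.

```python
# Signatures quoted in this thread, plus PNG for comparison.
SIGNATURES = {
    "HDF5": b"\x89HDF\r\n\x1a\n",
    "HDF4": b"\x0e\x03\x13\x01",
    "gzip": b"\x1f\x8b",
    "PNG": b"\x89PNG\r\n\x1a\n",
}


def guess_format(path):
    """Return the name of the first known signature that the file starts with, else None."""
    # Signatures differ in length, so read enough bytes for the longest one
    # and then compare prefixes.
    longest = max(len(sig) for sig in SIGNATURES.values())
    with open(path, "rb") as f:
        head = f.read(longest)
    for name, sig in SIGNATURES.items():
        if head.startswith(sig):
            return name
    return None
```

A file beginning with either of the byte strings posted earlier would make this return `None`, since neither matches any entry in the table.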

I suspect that your colleague failed to write the file correctly. For example, they might have forgotten to close the file. Note that it’s better to use context managers in python:

import h5py

with h5py.File("results.h5", "w") as f:
    dset = f.create_dataset("mydataset", (100,), dtype='i')

This automatically closes the file.

You could prove that it isn’t a problem on the Julia side by using h5py to try to read the file:

import h5py

with h5py.File("/path/to/data/results.h5", "r") as f:
    print(list(f))  # names of the top-level groups and datasets

I bet you get an error like

OSError: Unable to synchronously open file (file signature not found)

mkitti · May 23, 2024, 5:06am

There is one possibility of why the signature might not be the first 8 bytes: a userblock.

https://docs.h5py.org/en/stable/high/file.html#user-block
https://docs.hdfgroup.org/hdf5/develop/group___f_c_p_l.html#ga403bd982a2976c932237b186ed1cff4d

I suspect that is not what is happening here.

Another way you could check if the files are valid HDF5 files is to use utilities such as h5ls which may be installed on the cluster.

moble · May 23, 2024, 3:50pm

Ah, I forgot about user blocks. Good point. Though I can verify that HDF5.jl is able to deal with them correctly.

Just for fun, here’s a way to check for the magic number after a user block:

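open("results.h5", "r") do io
    fs = filesize(io)
    # The signature sits at offset 0 or, after a user block, at 512, 1024, 2048, ...
    i = 0
    while i+8 < fs
        seek(io, i)
        if String(read(io, 8)) == "\x89HDF\r\n\x1a\n"
            println("Found HDF5 magic number $i bytes into the file")
            break
        end
        # Next candidate offset: 512 first, then keep doubling
        i = i==0 ? 512 : 2i
    end
end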

TI36XPro · May 23, 2024, 6:30pm

Thank you, and @mkitti as well. This is fascinating and I really appreciate the explanations. I hope you don’t mind but I have a few clarifying questions.

I understand now that these magic bytes allow one to identify the type of file. My questions are:

1. How did you know to read the first 8 bytes? Are the signatures standardized so the first 8 bytes are always the signature? From here it seems like they have varying lengths depending on the file type. If they are used for identification, presumably you don’t have any hints on what type of file it might be. Obviously in this case we did, but I am wondering about the more general case.
2. When I use read to get the first 8 bytes I get an array of the following hex numbers: [0x89 0x48 0x44 0x46 0x0d 0x0a 0x1a 0x0a]. That makes sense: each pair of hex digits is a single byte, and I can see these hex numbers correspond to decimal numbers that can be looked up in an ASCII table to get the result returned by String(), that is, "\x89HDF\r\n\x1a\n". What I don’t understand is why 89 and 1a were not converted to ASCII characters by String. Also, why did it insert \ in front of the x of 89 and 1a? The rest makes sense.

When I get a chance in the next few days I will try out what you suggested above and report back.

mkitti · May 23, 2024, 6:41pm

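Julia uses UTF-8, or perhaps WTF-8. From 0x00 to 0x7f ASCII and UTF-8 are the same.

0x89 is not a valid UTF-8 byte where it is placed: it is a continuation byte with no lead byte before it, so it cannot be mapped to a character. 0x1a is valid, but it is a non-printable control character.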

The \x is an escape sequence to represent the otherwise unrepresentable byte.

See the following ascii and UTF-8 references:

moble · May 23, 2024, 7:03pm

I just want to emphasize again that your problem is almost certainly just a file corruption issue or something like that (and surely coming from outside of Julia); these are not issues that you should normally have to deal with. We’ve only gone into this detail because you have successfully nerd sniped!

Yes, magic numbers are just sequences of bytes that people make up when they’re creating a new file format. So they can — in principle — be any number of bytes. The people creating HDF5 just happened to choose those eight. They explain their motives in the format specification:

The first two bytes distinguish HDF5 files on systems that expect the first two bytes to identify the file type uniquely. The first byte is chosen as a non-ASCII value to reduce the probability that a text file may be misrecognized as an HDF5 file; also, it catches bad file transfers that clear bit 7. Bytes two through four name the format. The CR-LF sequence catches bad file transfers that alter newline sequences. The control-Z character stops file display under MS-DOS. The final line feed checks for the inverse of the CR-LF translation problem. (This is a direct descendent of the PNG file signature.)
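
To make that rationale concrete, here is a small Python sketch that applies each of the transfer problems named in the quote to the signature and shows that the result no longer matches. The three helper functions are hypothetical stand-ins for what a faulty transfer might do, not a model of any particular tool.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def clear_bit_7(data):
    """Simulate a 7-bit transfer that drops the high bit of every byte."""
    return bytes(b & 0x7F for b in data)


def crlf_to_lf(data):
    """Simulate a text-mode transfer that rewrites CR-LF as LF."""
    return data.replace(b"\r\n", b"\n")


def lf_to_crlf(data):
    """Simulate the inverse translation: every line ending becomes CR-LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


for damage in (clear_bit_7, crlf_to_lf, lf_to_crlf):
    damaged = damage(HDF5_SIGNATURE)
    # Each kind of damage changes at least one byte, so the check fails.
    print(damage.__name__, damaged, damaged == HDF5_SIGNATURE)
```

Clearing bit 7 turns the leading 0x89 into 0x09, the CR-LF to LF translation removes the 0x0d, and the inverse translation inserts a 0x0d before the final 0x0a, so each kind of damage is caught by a different part of the eight bytes.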

So the 0x89 was chosen because it’s not an ASCII value, and the 0x1a is control-Z, which happens to be a non-printable ASCII value. [Non-printables are displayed (but not printed…) as their corresponding C escape sequence, which can be \r, \n, etc., but can also just be \xhh for values that aren’t so special to C.]
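
The same display behaviour can be reproduced outside Julia. The following Python sketch is only an analogy (Python's `bytes` repr, not Julia's `String` display), but it uses the same escape conventions for these eight bytes; all variable names are made up for the example.

```python
signature = bytes([0x89, 0x48, 0x44, 0x46, 0x0D, 0x0A, 0x1A, 0x0A])

# repr() of a bytes object escapes everything that is not printable ASCII.
shown = repr(signature)

# 0x1a is a valid character in ASCII and UTF-8 (control-Z), just not printable.
ctrl_z = bytes([0x1A]).decode("utf-8")

# 0x89 on its own is a UTF-8 continuation byte, so decoding it fails.
try:
    bytes([0x89]).decode("utf-8")
    lone_0x89_is_valid = True
except UnicodeDecodeError:
    lone_0x89_is_valid = False

# The four characters "0x89" are different data from the single byte 0x89.
text_form = bytes([0x30, 0x78, 0x38, 0x39])

print(shown)
print(ctrl_z.isprintable(), lone_0x89_is_valid)
print(text_form, len(text_form), len(bytes([0x89])))
```

The backslash in `\x89` marks an escape for one byte, whereas `0x89` written inside a string would be four ordinary characters, which is why the two need different spellings.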

TI36XPro · May 24, 2024, 1:28am

But why doesn’t it just print it as 0x89 again then? I’m confused why the 0 is traded for a \. Sorry if I am being dense here.

mkitti · May 24, 2024, 1:37am

We asked to print it as a character that is part of a string. There needs to be a way to differentiate the following two strings.

julia> String([0x30, 0x78, 0x38, 0x39])
"0x89"

julia> String([0x89])
"\x89"

TI36XPro · May 28, 2024, 12:35am

My colleague double checked and confirmed the HDF5 files are indeed corrupted. We are going to re-download them and I’ll be double checking his script to make sure files are closed correctly after writing.

Thanks @moble and @mkitti for the help here, I really appreciate the quick responses and explanations.