Reading a large array from an HDF5 file

I’m trying to read a large 3D array from an HDF5 file. If I read it naively, every element access (e.g. in a for loop) triggers a lot of unexpected allocations and is therefore slow (see the read1 function below). To avoid the allocations, I have to first allocate an undef 3D array and then assign the values with .= (the read2 function), which is a bit clumsy. The code example:

using BenchmarkTools
using HDF5
const N = 1000000
A = rand(3,3,N);
h5write("data_3darray.h5", "group", A);

function read1()
    data1 = h5read("data_3darray.h5", "group")
    for n in 1:N, i in 1:3, j in 1:3
        data1[j,i,n] += 0.0
    end
    return data1
end

function read2()
    data2 = Array{Float64,3}(undef, 3, 3, N)
    data2 .= h5read("data_3darray.h5", "group")
    for n in 1:N, i in 1:3, j in 1:3
        data2[j,i,n] += 0.0
    end
    return data2
end
@btime data=read1()   #2.395 s (35990858 allocations: 617.84 MiB)
@btime data=read2()   #88.712 ms (60 allocations: 137.33 MiB)

Why does it allocate so much memory during the loop in the first case? Is there a better way to read a large array from an HDF5 file?

Yep, so @code_warntype read1() shows clearly that the inferred return type of h5read is Any, and so data1 is also inferred to be of type Any. You can either explicitly annotate the type (https://docs.julialang.org/en/v1/manual/performance-tips/#Annotate-values-taken-from-untyped-locations-1),

data1::Array{Float64, 3} = h5read("data_3darray.h5", "group")

or

data1 = convert(Array{Float64, 3}, h5read("data_3darray.h5", "group"))

or introduce a function barrier (https://docs.julialang.org/en/v1/manual/performance-tips/#kernel-functions-1).
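
As a rough sketch of the function-barrier option: the loop moves into its own kernel function, so Julia compiles it for the concrete array type it receives at run time. The names touch_all! and read3, the temporary file path, and the smaller array size are made up for illustration; only h5read and h5write come from HDF5.jl.

```julia
using HDF5

# Kernel function (hypothetical name): compiled for the concrete type of `data`,
# so the loop body is type-stable even if the caller could not infer the type.
function touch_all!(data::AbstractArray{Float64,3})
    for n in axes(data, 3), i in axes(data, 2), j in axes(data, 1)
        data[j, i, n] += 0.0
    end
    return data
end

# Outer function (hypothetical name): `h5read` is inferred as Any here, but the
# only cost is a single dynamic dispatch at the call to `touch_all!`.
function read3(filename, name)
    data = h5read(filename, name)
    return touch_all!(data)
end

# Small stand-in file so the sketch does not overwrite data_3darray.h5.
path = tempname() * ".h5"
h5write(path, "group", rand(3, 3, 1000))
data3 = read3(path, "group")
```

Compared with the type annotation, this does not hard-code Array{Float64, 3} at the read site, which may be convenient if the element type in the file can vary (the kernel signature would then need loosening as well).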

HDF5 is not designed to read data element by element. Internally, it chunks the data, and elements are read from or written to the file chunk-wise. See here for more explanation: https://portal.hdfgroup.org/display/HDF5/Chunking+in+HDF5.
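
For illustration, a minimal sketch of reading a dataset block by block rather than element by element, by indexing the dataset handle with ranges. The file path, the array size, and the block length of 1000 are arbitrary assumptions; whether this is faster than one h5read of the whole array depends on the file layout and available memory.

```julia
using HDF5

# Small stand-in file (hypothetical path and size).
path = tempname() * ".h5"
h5write(path, "group", rand(3, 3, 10_000))

total = h5open(path, "r") do fid
    dset = fid["group"]          # dataset handle; no data is read yet
    nlast = size(dset)[3]
    blocksize = 1_000            # assumed block length, tune as needed
    acc = 0.0
    for start in 1:blocksize:nlast
        stop = min(start + blocksize - 1, nlast)
        # Read one slab of the dataset into memory; the type assertion
        # keeps the rest of the loop type-stable.
        block = dset[:, :, start:stop]::Array{Float64,3}
        acc += sum(block)
    end
    acc
end
```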

If possible, I would suggest a different file format. We have spent quite a lot of time searching for a reliable and fast file format (you can find the discussions on this forum) and arrived at FlatBuffers. So if you can, I would suggest those. I have had good experiences with them.

Tomas

HDF5 is fast and reliable. One can even mmap the array in question with readmmap as is outlined in this example:

It is clear that read1 has two issues: type instability, and reading / writing scalar values from a file.
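
A minimal sketch of the memory-mapping idea, assuming a recent HDF5.jl where HDF5.ismmappable and HDF5.readmmap are available and a dataset stored contiguously and uncompressed (which is what a plain h5write produces). The file path and array size are placeholders, and this is not the linked example.

```julia
using HDF5

# Small stand-in file (hypothetical path and size).
path = tempname() * ".h5"
h5write(path, "group", rand(3, 3, 1000))

fid = h5open(path, "r")
dset = fid["group"]

# Memory-map when the dataset layout allows it; otherwise fall back to a normal read.
A = (HDF5.ismmappable(dset) ? HDF5.readmmap(dset) : read(dset))::Array{Float64,3}

s = sum(A)    # pages are loaded from disk on demand when memory-mapped
close(fid)
```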

Why do you both say this? The single h5read call in read1 reads the whole array, I’d think?

Hi,
I’m searching for a fast file format that HDF5 files can be converted to. Is it FlatBuffers? If yes, could you let me know where in this forum I can find out about it?

JK

The search engine provides:

As to whether it’s faster than HDF5, I don’t think that question has been answered yet; most likely it depends on precisely what you are doing.

I am reading HDF5 files in the range of 50 GB without any issues. In fact, I am reading 3D arrays of integers along the lines of:

using HDF5

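function loadData(filename)
    fid = HDF5.h5open(filename, "r")
    obj = fid["key_lvl_1"]
    # Read the whole group into memory as nested Dicts
    metadata = read(obj)
    close(fid)
    # Pull the central 3D array out of the nested metadata (this makes a copy)
    data = metadata["key_lvl_2"]["key_lvl_3"]["key_lvl_4"][:,:,:]
    return data, metadata
end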

@time (data,metadata) = loadData(filename); # 2.192142 seconds (586 allocations: 1018.752 MiB, 4.13% gc time)

Sure, the return of data & metadata is questionable. With a 54 GB file ready to use, I am very happy.

Could you expand on what you mean by that? My first reading of it made me think that the returned data was questionable, which would make HDF5 a poor choice indeed :laughing:

All of data is contained in metadata.
Would you really need both? Probably not. Anyhow, as it comes at very little extra cost, I enjoy being able to choose whether I just want my central Array{Int, 3} or all the metadata as well, by simply choosing between:

data, _ = loadData(filename)   # discard the metadata; a plain `data = ...` would bind the whole tuple

and

(data, metadata) = loadData(filename)