Reading a large array from an HDF5 file

I’m trying to read a large 3D array from an HDF5 file. If I just read the array naively and then access its elements (e.g. in a for-loop), I see a lot of unexpected allocations and correspondingly low speed (see the read1 function below). To avoid the allocations when referencing elements, I have to first allocate an undef 3D array and then assign into it with .= (the read2 function), which is a bit clumsy. A code example:

using BenchmarkTools
using HDF5
const N = 1000000
A = rand(3,3,N);
h5write("data_3darray.h5", "group", A);

function read1()
    data1 = h5read("data_3darray.h5", "group")
    for n in 1:N, i in 1:3, j in 1:3
        data1[j,i,n] += 0.0
    end
    return data1
end

function read2()
    data2 = Array{Float64,3}(undef, 3, 3, N)
    data2 .= h5read("data_3darray.h5", "group")
    for n in 1:N, i in 1:3, j in 1:3
        data2[j,i,n] += 0.0
    end
    return data2
end
@btime data=read1()   #2.395 s (35990858 allocations: 617.84 MiB)
@btime data=read2()   #88.712 ms (60 allocations: 137.33 MiB)

Why does it allocate so much memory during the loop in the first case? Is there a better way to read a large array from an HDF5 file?

Yep, @code_warntype read1() shows clearly that the inferred return type of h5read is Any, so data1 is also inferred to be of type Any. You can either explicitly annotate the type (https://docs.julialang.org/en/v1/manual/performance-tips/#Annotate-values-taken-from-untyped-locations-1),

data1::Array{Float64, 3} = h5read("data_3darray.h5", "group")

or

data1 = convert(Array{Float64, 3}, h5read("data_3darray.h5", "group"))

or introduce a function barrier (https://docs.julialang.org/en/v1/manual/performance-tips/#kernel-functions-1).
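A minimal sketch of the function-barrier approach, reusing the file name and dataset layout from the question: the untyped value returned by h5read is passed to a kernel function, and dispatch on the kernel's argument type makes the loop inside it type-stable.

```julia
using HDF5

# Kernel: once `data` arrives here, its concrete type is known to the
# compiler, so the loop allocates nothing.
function process!(data::Array{Float64,3})
    for n in axes(data, 3), i in axes(data, 2), j in axes(data, 1)
        data[j, i, n] += 0.0
    end
    return data
end

function read3()
    data = h5read("data_3darray.h5", "group")  # inferred as Any here
    return process!(data)                      # type resolved at the barrier
end
```

The annotation, convert, and barrier variants all achieve the same thing: the hot loop runs on a value whose concrete type the compiler can see.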


HDF5 is not designed to read data element by element. Internally, it does data chunking, and elements are read from/written to a file chunk-wise. See here for more explanation: https://portal.hdfgroup.org/display/HDF5/Chunking+in+HDF5.
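To work with the chunk-wise layout rather than against it, one option is to read the dataset in large slabs along the last dimension, so each I/O call covers many elements at once. A hedged sketch (the helper name and slab size are my own choices; the file layout follows the question):

```julia
using HDF5

# Sum a 3xMxN dataset by reading contiguous slabs along the last dimension,
# one h5 read call per slab instead of one per element.
function slab_sum(path::AbstractString, name::AbstractString; slab=10_000)
    h5open(path, "r") do file
        dset = file[name]
        total = 0.0
        for r in Iterators.partition(1:size(dset, 3), slab)
            total += sum(dset[:, :, r])   # one I/O call per slab
        end
        return total
    end
end
```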


If possible, I would suggest a different file format. We spent quite a lot of time searching for a reliable and fast file format (you can find the discussions on this forum) and arrived at FlatBuffers. So if you can, I would suggest those. I have had good experience with them.

Tomas


HDF5 is fast and reliable. One can even mmap the array in question with readmmap as is outlined in this example:
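A sketch of the readmmap route, under the assumption that the dataset was written contiguously (unchunked and uncompressed, as h5write does by default); the helper name is mine:

```julia
using HDF5

# Memory-map the dataset with HDF5.readmmap: element access then goes
# through the OS page cache, with no per-element allocation.
function mmap_sum(path::AbstractString, name::AbstractString)
    h5open(path, "r") do file
        A = HDF5.readmmap(file[name])   # Array backed directly by the file
        s = 0.0
        for n in axes(A, 3), i in axes(A, 2), j in axes(A, 1)
            s += A[j, i, n]
        end
        return s
    end
end
```

Note that readmmap only works for contiguous datasets; it will error on chunked or compressed ones.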

It is clear that read1 has two issues: type instability, and reading/writing scalar values from a file.


Why do you both say this? The single h5read call in read1 reads the whole array, I’d think?
