Reading a large array from an HDF5 file

huchiayu0517 · May 8, 2019, 3:45am

I’m trying to read a large 3D array from an HDF5 file. If I read the array naively, every element access (e.g. in a for-loop) triggers a lot of unexpected allocations and is therefore slow (see the read1 function below). To avoid allocations when indexing, I instead have to allocate an undef 3D array first and then fill it with .= (the read2 function), which is a bit clumsy. The code example:

using BenchmarkTools
using HDF5
const N = 1000000
A = rand(3,3,N);
h5write("data_3darray.h5", "group", A);

function read1()
    data1 = h5read("data_3darray.h5", "group")
    for n in 1:N, i in 1:3, j in 1:3
        data1[j,i,n] += 0.0
    end
    return data1
end

function read2()
    data2 = Array{Float64,3}(undef, 3, 3, N)
    data2 .= h5read("data_3darray.h5", "group")
    for n in 1:N, i in 1:3, j in 1:3
        data2[j,i,n] += 0.0
    end
    return data2
end
@btime data=read1()   #2.395 s (35990858 allocations: 617.84 MiB)
@btime data=read2()   #88.712 ms (60 allocations: 137.33 MiB)

Why does it allocate so much memory during the loop in the first case? Is there a better way to read a large array from an HDF5 file?

tkoolen · May 8, 2019, 4:13am

Yep, so @code_warntype read1() shows clearly that the inferred return type of h5read is Any, and so data1 is also inferred to be of type Any. You can either explicitly annotate the type (Performance Tips · The Julia Language),

data1::Array{Float64, 3} = h5read("data_3darray.h5", "group")

or

data1 = convert(Array{Float64, 3}, h5read("data_3darray.h5", "group"))

or introduce a function barrier (Performance Tips · The Julia Language).
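To illustrate the function-barrier option, here is a minimal sketch. The names touch_all! and read_with_barrier and the scratch file path are made up for this example. The idea is that h5read still returns a value whose type the compiler cannot infer, but the hot loop lives in a separate function that gets compiled for the concrete array type it receives.

```julia
using HDF5

# Kernel behind the barrier: when this is called, `data` has a concrete
# type, so the loop is compiled without boxing each element.
function touch_all!(data::AbstractArray{<:AbstractFloat,3})
    for n in axes(data, 3), i in axes(data, 2), j in axes(data, 1)
        data[j, i, n] += 0.0
    end
    return data
end

# Outer function: the result of h5read is not inferable here, but the only
# dynamic dispatch is the single call into the kernel.
function read_with_barrier(path, name)
    data = h5read(path, name)
    return touch_all!(data)
end

path = tempname() * ".h5"      # hypothetical scratch file for the sketch
A = rand(3, 3, 1000)
h5write(path, "group", A)
B = read_with_barrier(path, "group")
```

Compared with the type annotation, the barrier does not hard-code Array{Float64, 3}, so the same code should keep working if the stored element type changes.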

bicycle1885 · May 8, 2019, 4:24am

HDF5 is not designed to read data element by element. Internally, it chunks the data, and elements are read from and written to a file chunk-wise. See here for more explanation: https://portal.hdfgroup.org/display/HDF5/Chunking+in+HDF5.

Tomas_Pevny · May 8, 2019, 6:34am

If possible, I would suggest a different file format. We spent quite a lot of time searching for a reliable and fast file format (you can find the discussions on this forum) and arrived at FlatBuffers. So if you can, I would suggest those. I have had good experience with them.

Tomas

tobias.knopp · May 8, 2019, 8:08am

HDF5 is fast and reliable. One can even mmap the array in question with readmmap as is outlined in this example:

https://github.com/JuliaIO/HDF5.jl/blob/master/test/mmap.jl

It is clear that read1 has two issues: type instability and reading/writing scalar values from a file.
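A rough sketch of the memory-mapping route mentioned above, assuming a recent HDF5.jl where the functions are reachable as HDF5.readmmap and HDF5.ismmappable. The scratch file and the fallback branch are assumptions of this example, not something taken from the linked test file. Memory mapping only applies to contiguous, uncompressed datasets, hence the check before mapping.

```julia
using HDF5

path = tempname() * ".h5"          # hypothetical scratch file for the sketch
A = rand(3, 3, 1000)
h5write(path, "group", A)

partial_sum = h5open(path, "r") do f
    dset = f["group"]
    if HDF5.ismmappable(dset)
        # The mapped array is backed by the file; pages are loaded on access.
        M = HDF5.readmmap(dset)
        sum(view(M, :, :, 1:10))
    else
        # Fallback: read only the requested slab into memory.
        sum(dset[:, :, 1:10])
    end
end
```

The reduction is done inside the do-block so that the mapped array is not used after the file has been closed.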

tkoolen · May 8, 2019, 3:50pm

Why do you both say this? The single h5read call in read1 reads the whole array, I’d think?

JK_loo · March 4, 2020, 10:04pm

Hi,
I’m searching for a fast file format that an HDF5 file can be converted to. Is it FlatBuffers? If yes, could you let me know where on this forum I can find more about it?

JK

purplishrock · March 5, 2020, 12:34am

The search engine provides:

https://github.com/JuliaData/FlatBuffers.jl

As to whether it’s faster than HDF5, I don’t think that question has been answered yet; most likely it depends on precisely what you are doing.

met-j · March 5, 2020, 1:02pm

I am reading HDF5 files in the range of 50 GB without any issues. In fact, I am reading 3D arrays of integers along the lines of:

using HDF5

function loadData(filename)
    fid = HDF5.h5open(filename, "r")
    obj = fid["key_lvl_1"]
    metadata = read(obj)    # reads the whole group into memory
    close(fid)
    data = metadata["key_lvl_2"]["key_lvl_3"]["key_lvl_4"][:,:,:]
    return data, metadata
end

@time (data,metadata) = loadData(filename); # 2.192142 seconds (586 allocations: 1018.752 MiB, 4.13% gc time)

Sure, the return of data & metadata is questionable. With a 54 GB file ready to use, I am very happy.

purplishrock · March 6, 2020, 11:16pm

Could you expand on what you mean by that? My first reading of it makes me think that the returned data was questionable, which would make HDF5 a poor choice indeed.

met-j · March 7, 2020, 6:53am

All of data is contained in metadata.
Would you really need both? Probably not. Anyhow, as it comes at very little extra cost, I enjoy being able to choose whether I just want my central Array{Int, 3} or all the metadata as well by simply choosing between:

data, _ = loadData(filename)

and

(data, metadata) = loadData(filename)

mgiugliano · February 6, 2023, 5:54pm

hdf5 newbie here.

Is it correct that you are reading ALL the file into the memory? Is it desirable?
How could one only load a chunk of data?

fft · February 6, 2023, 8:22pm

You can load all of the data or a subset of the data. Check out the documentation: Home · HDF5.jl

mgiugliano · February 6, 2023, 9:26pm

Thank you. I personally find the documentation very poor, but it must be me, although people even pointed me to the documentation of the HDF5 Python library just a few weeks ago.

fft · February 6, 2023, 9:37pm

https://juliaio.github.io/HDF5.jl/stable/#Reading-and-writing-data

When you read the data, just add the slice for the subset of the data you want.


Asub = dset[2:3, 1:3]
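To put the slice above in context, here is a small self-contained sketch. The file path, the dataset name "group", and the array shape are made up for the example. Indexing an open dataset reads only the requested block from disk, and h5read accepts a tuple of ranges for the same purpose.

```julia
using HDF5

path = tempname() * ".h5"          # hypothetical scratch file for the sketch
A = rand(3, 3, 1000)
h5write(path, "group", A)

# Open the file, grab a handle to the dataset (no data is read yet),
# then read only the first 10 slices along the third dimension.
first_block = h5open(path, "r") do f
    dset = f["group"]
    dset[:, :, 1:10]
end

# One-call variant: h5read with a tuple of index ranges.
same_block = h5read(path, "group", (1:3, 1:3, 1:10))
```

Both results are ordinary in-memory arrays of size (3, 3, 10), so the rest of the file never has to fit in RAM.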
