I’m trying to read a large 3D array from an HDF5 file. If I read the array naively, every element access (e.g. in a for-loop) triggers a lot of unexpected allocations, which makes it very slow (see the read1 function below). Instead, I have to first allocate an undef 3D array and then fill it with .= (the read2 function) to avoid allocations when indexing the elements, which is a bit clumsy. A code example:
using BenchmarkTools
using HDF5
const N = 1000000
A = rand(3,3,N);
h5write("data_3darray.h5", "group", A);
function read1()
    data1 = h5read("data_3darray.h5", "group")
    for n in 1:N, i in 1:3, j in 1:3
        data1[j,i,n] += 0.0
    end
    return data1
end
function read2()
    data2 = Array{Float64,3}(undef, 3, 3, N)
    data2 .= h5read("data_3darray.h5", "group")
    for n in 1:N, i in 1:3, j in 1:3
        data2[j,i,n] += 0.0
    end
    return data2
end
@btime data=read1() #2.395 s (35990858 allocations: 617.84 MiB)
@btime data=read2() #88.712 ms (60 allocations: 137.33 MiB)
Why does it allocate so much memory during the loop in the first case? Is there a better way to read a large array from an HDF5 file?
Yep, so @code_warntype read1() shows clearly that the inferred return type of h5read is Any, and so data1 is also inferred to be of type Any. You can either explicitly annotate the type (Performance Tips · The Julia Language), or pass data1 to an inner function so the loop runs behind a function barrier.
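A minimal sketch of the type-annotation fix. The file and dataset names here are just a small self-contained demo, not the original data:

```julia
using HDF5

# Write a small sample array so the example is self-contained
A = rand(3, 3, 10)
h5write("demo_3darray.h5", "group", A)

# Annotating h5read's result with a concrete type removes the type
# instability, so indexing inside the loop no longer allocates:
function read_typed(path)
    data = h5read(path, "group")::Array{Float64,3}
    for n in axes(data, 3), i in 1:3, j in 1:3
        data[j, i, n] += 0.0
    end
    return data
end

data = read_typed("demo_3darray.h5")
@assert data == A
```

With the annotation, data is inferred as Array{Float64,3} and the loop compiles to plain array indexing, matching the performance of read2 without the extra copy.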
HDF5 is not designed to read data element by element. Internally, it chunks the data, and elements are read from/written to the file chunk-wise. See here for more explanation: https://portal.hdfgroup.org/display/HDF5/Chunking+in+HDF5.
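Because reads happen chunk-wise, it can help to index an open dataset slab by slab instead of materializing the whole array at once. A sketch with HDF5.jl, using a hypothetical demo file:

```julia
using HDF5

# Hypothetical demo file; in practice point this at your own data.
A = rand(3, 3, 100)
h5write("demo_slab.h5", "group", A)

# Indexing an open dataset reads only the requested hyperslab from disk,
# so large arrays can be processed in pieces without loading everything:
h5open("demo_slab.h5", "r") do f
    dset = f["group"]
    for r in (1:50, 51:100)
        slab = dset[:, :, r]   # chunk-wise read; slab is a plain Array{Float64,3}
        @assert slab == A[:, :, r]
    end
end
```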
If possible, I would suggest a different file format. We spent quite a lot of time searching for a reliable and fast file format (you can find the discussions on this forum) and arrived at FlatBuffers. So if you can, I would suggest those; I have had good experience with them.
Hi,
I’m searching for a fast file format that an HDF5 file can be converted to. Is FlatBuffers such a format? If yes, could you let me know where in this forum I can find more about it?
Could you expand on what you mean by that? My first reading of it makes me think that the returned data was questionable, which would make HDF5 a poor choice indeed.
All of the data is contained in metadata.
Would you really need both? Probably not. Anyhow, as it comes at very little extra cost, I enjoy being able to choose whether I want just my central Array{Int, 3} or all the metadata as well, simply by choosing between:
Thank you. I personally do find the documentation very poor, but it must be me, although people even pointed me to the documentation of the HDF5 Python library just a few weeks ago.