Load HDF5 file larger than memory

Hi,

I am trying to read data from an HDF5 file that is larger than my computer's memory. Each time, I only need part of the data in the file. Is there a way to read some of the groups/datasets in the file without loading the whole file? Thanks.


Yes. That’s how HDF5.jl usually works.

Could you share some example code demonstrating your problem?

Here’s a demonstration of creating an 8 GB file and then retrieving a single element.

julia> using HDF5

julia> h5open("bigfile.h5", "w") do h5f
           h5f["large_dataset"] = rand(1024, 1024, 1024)
       end;

julia> g() = h5open("bigfile.h5") do h5f
           h5f["large_dataset"][1024,512,256]
       end
g (generic function with 1 method)

julia> @time g()
  0.000527 seconds (51 allocations: 2.031 KiB)
0.5066790863746067

julia> @time g()
  0.001117 seconds (51 allocations: 2.031 KiB)
0.5066790863746067
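
Reading a block instead of a single element works the same way. Here is a minimal sketch, assuming a small temporary file (the file name `slice_demo.h5` and the dataset name `data` are made up for illustration). Indexing the dataset handle with ranges should read only the requested region from disk.

```julia
using HDF5

# Write a small file so the sketch is self-contained.
# The path and the dataset name "data" are hypothetical.
path = joinpath(mktempdir(), "slice_demo.h5")
h5open(path, "w") do h5f
    h5f["data"] = reshape(collect(1.0:20.0), 4, 5)
end

block = h5open(path, "r") do h5f
    dset = h5f["data"]   # dataset handle; no data has been read yet
    dset[2:3, 4:5]       # reads only this 2×2 region into memory
end
```

Here `block` is an ordinary in-memory `Matrix{Float64}`, so the file can be closed before it is used.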

You could also potentially memory-map the file; look for mmap in the HDF5.jl documentation.
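
A rough sketch of the memory-mapping route, assuming the dataset is stored contiguously and uncompressed (which `HDF5.ismmappable` checks). The file name `mmap_demo.h5` and the dataset name `data` are made up for illustration, and the fallback branch is just ordinary indexing.

```julia
using HDF5

# Small self-contained file; the path and the dataset name are hypothetical.
path = joinpath(mktempdir(), "mmap_demo.h5")
h5open(path, "w") do h5f
    h5f["data"] = collect(1.0:10.0)
end

value = h5open(path, "r") do h5f
    dset = h5f["data"]
    if HDF5.ismmappable(dset)
        A = HDF5.readmmap(dset)   # array backed by the file, paged in on demand
        A[7]
    else
        dset[7]                   # fall back to a regular partial read
    end
end
```

Chunked or compressed datasets cannot be memory-mapped this way, so plain indexing on the dataset is the more general option.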


Hi Mark:

Yes, this works. I had thought h5open would load the whole file, which is too large (the file I am using is tens of GB), but I tried what you suggested and everything is good!

Hi,

I have a further question about this. Suppose my file has many groups (say 10000) and I want to get just 1000 random groups each time. Is there a way to do this? Thanks.

For an array, I know I can do sample(data, 1000, replace = true) (I want replace to be true because I am trying to do something like bootstrapping). But I don’t know how to do this for groups in an HDF5 file.

You could do something like this, but it is not lazy anymore: every sampled dataset is read into memory.

julia> h5open("test.h5", "w") do h5f
           h5f["r/a"] = 1
           h5f["r/b"] = 2
           h5f["r/c"] = 3
           h5f["r/d"] = 4
       end
4

julia> using StatsBase  # provides `sample`

julia> h5open("test.h5", "r") do h5f
           _samples = sample(keys(h5f["r"]), 1000, replace = true)
           map(_samples) do _sample
               h5f["r"][_sample][]
           end
       end
1000-element Vector{Int64}:
 4
 2
 1
 1
 2
 3
 4
 1
 2
 2
 ⋮
 4
 3
 4
 1
 2
 3
 1
 3
 3
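
Since sampling with replacement repeats names, one possible variation is to read each distinct name from disk only once and reuse the value. This is a minimal sketch under the same layout as above (datasets `a` to `d` inside a group `r`); the file name `groups_demo.h5` and the `cache` dictionary are made up for illustration.

```julia
using HDF5
using StatsBase  # provides `sample`

# Small self-contained file; the path is hypothetical.
path = joinpath(mktempdir(), "groups_demo.h5")
h5open(path, "w") do h5f
    for (i, name) in enumerate(["a", "b", "c", "d"])
        h5f["r/$name"] = i
    end
end

drawn = h5open(path, "r") do h5f
    g = h5f["r"]
    names = sample(keys(g), 1000; replace = true)   # bootstrap draw of names
    # Read each distinct name from disk once, then reuse the cached value.
    cache = Dict(name => read(g[name]) for name in unique(names))
    [cache[name] for name in names]
end
```

With 10000 names and 1000 draws most names are drawn only once, so the saving is small there; the cache matters more when the number of draws is large relative to the number of names.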