Large memory consumption when reading multiple HDF5 files

I ran into large memory consumption while running the code below. Here is the measured usage:

Memory usage n=1 0.003026944 GB
Memory usage n=2 0.136589312 GB
Memory usage n=4 0.27860992 GB
Memory usage n=8 0.60479488 GB
Memory usage n=16 1.207263232 GB
Memory usage n=32 2.410348544 GB

What is strange is that the memory leak is not always reproducible. (Very) occasionally I get more reasonable memory usage, like this:

Memory usage n=1 -0.003125248 GB
Memory usage n=2 0.000270336 GB
Memory usage n=4 0.003125248 GB
Memory usage n=8 -0.001847296 GB
Memory usage n=16 0.003145728 GB
Memory usage n=32 0.008585216 GB

The code is:

using HDF5

a = zeros(256, 256, 256)

for n in [1, 2, 4, 8, 16, 32]
    memi = Int64(Sys.free_memory())   # free memory before the reads
    for i in 1:n
        fid = h5open("test$i.h5", "r")
        a .= read(fid, "a")           # read allocates a fresh array, then the broadcast copies it into a
        close(fid)
    end
    meme = Int64(Sys.free_memory())   # free memory after the reads
    println("Memory usage n=$n $((memi - meme) / 1e9) GB")
end
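For what it's worth, each read here allocates a fresh 256³ Float64 buffer (about 0.134 GB) before the broadcast copies it into a, which is the same scale as the growth measured above. A quick way to see that per-read allocation directly (my addition, not part of the original report) is @time, which sidesteps Sys.free_memory(); that call reports OS-level free memory and also moves with unrelated processes, which is probably why some runs show small or even negative numbers:

using HDF5

a = zeros(256, 256, 256)
fid = h5open("test1.h5", "r")
@time a .= read(fid, "a")   # expect roughly 134 MB of allocations for the temporary
close(fid)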

The test data files are generated using:

using HDF5

a = rand(256, 256, 256)
for i in 1:50
    fid = h5open("test$i.h5", "w")
    write(fid, "a", a)   # each file holds one 256^3 Float64 dataset (~0.134 GB)
    close(fid)
end

The Julia version I’m using is:

Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake-avx512)

Does anyone have any idea how this issue might be solved? Thanks!
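One possible workaround (a sketch on my part, not something confirmed in this thread): copy straight from a memory-mapped view of the dataset instead of letting read materialize a temporary. HDF5.jl exports readmmap for this; it requires the dataset to be contiguous and uncompressed, which is the default for files written as above:

using HDF5

a = zeros(256, 256, 256)
for i in 1:32
    h5open("test$i.h5", "r") do fid
        # readmmap returns an mmap-backed array over the raw dataset bytes,
        # so the broadcast copies into a without a ~0.134 GB temporary
        a .= readmmap(fid["a"])
    end
end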

BTW, simply adding GC.gc() at the end of each loop iteration sometimes fixes this issue, but not always.
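For concreteness, a sketch of that variant (where exactly the call belongs, and whether an incremental collection would suffice, is not settled here):

using HDF5

a = zeros(256, 256, 256)
for i in 1:32
    fid = h5open("test$i.h5", "r")
    a .= read(fid, "a")
    close(fid)
    GC.gc()   # GC.gc() defaults to a full collection; GC.gc(false) runs an incremental one
end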

This issue seems very similar to what I often see and recently investigated a bit more closely here.

That instance is with CSV.jl, but both are IO-related, and I’ve seen this problem in many other file-reading contexts as well.


Indeed. It looks like the GC isn’t smart enough in these cases.