I'm seeing this large memory consumption when running the following code:
```
Memory usage n=1 0.003026944 GB
Memory usage n=2 0.136589312 GB
Memory usage n=4 0.27860992 GB
Memory usage n=8 0.60479488 GB
Memory usage n=16 1.207263232 GB
Memory usage n=32 2.410348544 GB
```
What's odd is that the memory leak is not always reproducible. Very occasionally I get more reasonable memory usage, like this:
```
Memory usage n=1 -0.003125248 GB
Memory usage n=2 0.000270336 GB
Memory usage n=4 0.003125248 GB
Memory usage n=8 -0.001847296 GB
Memory usage n=16 0.003145728 GB
Memory usage n=32 0.008585216 GB
```
The code is:
```julia
using HDF5

a = zeros(256, 256, 256)  # 256^3 Float64s ≈ 0.134 GB, reused across reads
for n in [1, 2, 4, 8, 16, 32]
    memi = Int64(Sys.free_memory())  # free system memory before the reads
    for i = 1:n
        fid = h5open("test$i.h5", "r")
        a .= read(fid, "a")          # read the dataset into the preallocated array
        close(fid)
    end
    meme = Int64(Sys.free_memory())  # free system memory after the reads
    println("Memory usage n=$n $((memi - meme)/1e9) GB")
end
```
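Note that `Sys.free_memory()` measures OS-wide free memory, so anything else happening on the machine leaks into the numbers. Below is a minimal sketch of a hopefully more stable measurement, assuming the unexported `Base.gc_live_bytes()` is available on 1.6: it forces a full collection before each sample and also tracks Julia's own live heap, to separate Julia-heap growth from allocations made inside libhdf5.
```julia
using HDF5

a = zeros(256, 256, 256)
for n in [1, 2, 4, 8, 16, 32]
    GC.gc()                           # full collection so stale garbage doesn't skew the sample
    memi  = Int64(Sys.free_memory())
    live0 = Base.gc_live_bytes()      # live bytes on the Julia heap (unexported)
    for i = 1:n
        h5open("test$i.h5", "r") do fid
            a .= read(fid, "a")
        end
    end
    GC.gc()
    meme  = Int64(Sys.free_memory())
    live1 = Base.gc_live_bytes()
    println("n=$n OS: $((memi - meme)/1e9) GB, Julia heap: $((live1 - live0)/1e9) GB")
end
```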
The test data files are generated with:
```julia
using HDF5

a = rand(256, 256, 256)  # one ≈0.134 GB array, written to each of 50 files
for i = 1:50
    fid = h5open("test$i.h5", "w")
    write(fid, "a", a)
    close(fid)
end
```
The Julia version I'm using is:
```
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake-avx512)
```
Does anyone have an idea how this issue might be solved? Thanks!
BTW, simply adding `GC.gc()` at the end of each outer iteration sometimes solves the issue, but not always.
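To be explicit about the placement, here's a sketch of that workaround (the explicit full collection goes after each batch of reads):
```julia
for n in [1, 2, 4, 8, 16, 32]
    memi = Int64(Sys.free_memory())
    for i = 1:n
        fid = h5open("test$i.h5", "r")
        a .= read(fid, "a")
        close(fid)
    end
    GC.gc()  # explicit full collection; sometimes fixes the numbers, not always
    meme = Int64(Sys.free_memory())
    println("Memory usage n=$n $((memi - meme)/1e9) GB")
end
```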
This seems very similar to something I often see, and which I recently investigated a bit more closely in this issue (opened 1 Jul 2021, closed 16 Nov 2021, labeled "bug"):
I have long been trying to find the source of what I suspected was a memory leak somewhere in my data pipeline, and I think I have found a small MWE that reproduces my issue. If the following snippet is run multiple times, the memory use of the julia process increases steadily. If I call `GC.gc(); GC.gc(); GC.gc(); GC.gc(); GC.gc(); GC.gc();`, I get some of it back, but not all. If I then call `CSV.read` another, say, 10 times, the memory claimed by the julia process jumps up again, and when I trigger the GC once more I get even less memory back; the julia process now holds on to more of it. I can continue this process until I run out of RAM.
```julia
using CSV, DataFrames
logfile = "my_1GB_file.csv"
@time CSV.read(
logfile,
DataFrame;
header = 15,
datarow = 26,
drop = (i, name) ->
startswith(string(name), "Name") || startswith(string(name), "SymbolName"),
delim = '\t',
footerskip = 1,
);
```
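For reference, the repetition I describe is roughly the following loop (a sketch; `Sys.maxrss()` reports the process' peak resident set size, and I've omitted the `drop` argument here since, per Edit2 below, it doesn't affect the memory behaviour):
```julia
using CSV, DataFrames

logfile = "my_1GB_file.csv"
for rep in 1:10
    CSV.read(logfile, DataFrame;
             header = 15, datarow = 26, delim = '\t', footerskip = 1)
    foreach(_ -> GC.gc(), 1:6)  # the repeated full collections mentioned above
    println("rep=$rep  maxrss = $(Sys.maxrss() / 1e9) GB")
end
```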
#### Some concrete numbers:
With julia freshly started and CSV and DataFrames loaded, julia uses 167 MB; after the first read, the figure is 1.7 GB. Running the GC multiple times brings this down to 1.2 GB.
Repeating the read 10 times brings the memory usage to 5.7 GB, and triggering the GC brings it down to 2.1 GB. Why does it not go back down to 1.2 GB here?
The CSV file I'm using is about 1 GB (160 MB zipped); I'd be happy to share it if someone wants to reproduce the issue.
Edit: it's here: https://drive.google.com/file/d/1LQSKbDIYHb_N8NqnD40Xw-13V1uTCSOk/view?usp=sharing
I'm running
```
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

(@v1.6) pkg> st CSV
Status `~/.julia/environments/v1.6/Project.toml`
  [336ed68f] CSV v0.8.5
```
---
Edit: I've noticed that each read says something like
```
1.540763 seconds (793.73 k allocations: 1.238 GiB, 2.47% gc time, 19.43% compilation time)
```
i.e., every call reports a nonzero compilation time. Could it be the compiled code that eventually eats up the memory?
---
Edit2: The compilation time turns out to be due to my `drop` function. Removing it removes the compilation time, but does not change the memory issue.
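That recompilation makes sense: the anonymous `drop` closure is a brand-new function type on every call, so `CSV.read` has to specialize for it each time. A sketch of one way to avoid that (my own guess; I haven't verified it helps with the memory): define the predicate once at top level and pass it by name:
```julia
# Defined once at top level, so CSV.read only specializes on it the first time.
dropcol(i, name) =
    startswith(string(name), "Name") || startswith(string(name), "SymbolName")

CSV.read(logfile, DataFrame;
         header = 15, datarow = 26,
         drop = dropcol,
         delim = '\t', footerskip = 1)
```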
That instance is with CSV.jl rather than HDF5.jl, but both are IO-related, and I've seen this problem in many other file-reading contexts as well.
Indeed. It looks like the GC isn't smart enough in these cases.
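On Linux, part of what looks like a leak can also be glibc's allocator holding on to freed pages instead of returning them to the OS. A sketch of something worth trying (an assumption on my part, not verified for this case):
```julia
GC.gc()                                    # collect Julia garbage first
ccall(:malloc_trim, Cint, (Csize_t,), 0)   # ask glibc to return freed pages to the OS
```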