I'm seeing this large memory consumption when running the following code:
```
Memory usage n=1 0.003026944 GB
Memory usage n=2 0.136589312 GB
Memory usage n=4 0.27860992 GB
Memory usage n=8 0.60479488 GB
Memory usage n=16 1.207263232 GB
Memory usage n=32 2.410348544 GB
```
What's odd is that the memory leak is not always reproducible. Very occasionally I get more reasonable memory usage, like this:
```
Memory usage n=1 -0.003125248 GB
Memory usage n=2 0.000270336 GB
Memory usage n=4 0.003125248 GB
Memory usage n=8 -0.001847296 GB
Memory usage n=16 0.003145728 GB
Memory usage n=32 0.008585216 GB
```
The code is:
```julia
using HDF5

a = zeros(256, 256, 256)  # 256^3 Float64s ≈ 0.134 GB, reused across reads
for n in [1, 2, 4, 8, 16, 32]
    memi = Int64(Sys.free_memory())  # free system memory before the reads
    for i = 1:n
        fid = h5open("test$i.h5", "r")
        a .= read(fid, "a")          # read the dataset into the preallocated array
        close(fid)
    end
    meme = Int64(Sys.free_memory())  # free system memory after the reads
    println("Memory usage n=$n $((memi - meme)/1e9) GB")
end
```
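Note that `Sys.free_memory()` measures OS-wide free memory, so anything else happening on the machine leaks into the numbers. Below is a minimal sketch of a hopefully more stable measurement, assuming the unexported `Base.gc_live_bytes()` is available on 1.6: it forces a full collection before each sample and also tracks Julia's own live heap, to separate Julia-heap growth from allocations made inside libhdf5.
```julia
using HDF5

a = zeros(256, 256, 256)
for n in [1, 2, 4, 8, 16, 32]
    GC.gc()                           # full collection so stale garbage doesn't skew the sample
    memi  = Int64(Sys.free_memory())
    live0 = Base.gc_live_bytes()      # live bytes on the Julia heap (unexported)
    for i = 1:n
        h5open("test$i.h5", "r") do fid
            a .= read(fid, "a")
        end
    end
    GC.gc()
    meme  = Int64(Sys.free_memory())
    live1 = Base.gc_live_bytes()
    println("n=$n OS: $((memi - meme)/1e9) GB, Julia heap: $((live1 - live0)/1e9) GB")
end
```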
The test data files are generated with:
```julia
using HDF5

a = rand(256, 256, 256)  # one ≈0.134 GB array, written to each of 50 files
for i = 1:50
    fid = h5open("test$i.h5", "w")
    write(fid, "a", a)
    close(fid)
end
```
The Julia version I'm using is:
```
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake-avx512)
```
Does anyone have an idea how this issue might be solved? Thanks!
BTW, simply adding `GC.gc()` at the end of each outer iteration sometimes solves the issue, but not always.
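To be explicit about the placement, here's a sketch of that workaround (the explicit full collection goes after each batch of reads):
```julia
for n in [1, 2, 4, 8, 16, 32]
    memi = Int64(Sys.free_memory())
    for i = 1:n
        fid = h5open("test$i.h5", "r")
        a .= read(fid, "a")
        close(fid)
    end
    GC.gc()  # explicit full collection; sometimes fixes the numbers, not always
    meme = Int64(Sys.free_memory())
    println("Memory usage n=$n $((memi - meme)/1e9) GB")
end
```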
This seems very similar to something I often see, and which I recently investigated a bit more closely in this issue (opened 1 Jul 2021, closed 16 Nov 2021, labeled "bug"):
I have long been trying to find the source of what I suspected was a memory leak somewhere in my data pipeline, and I think I have found a small MWE that reproduces my issue. If the following snippet is run multiple times, the memory use of the julia process increases steadily. If I call `GC.gc(); GC.gc(); GC.gc(); GC.gc(); GC.gc(); GC.gc();`, I get some of it back, but not all. If I then call `CSV.read` another, say, 10 times, the memory claimed by the julia process jumps up again, and when I trigger the GC once more I get even less memory back; the julia process now holds on to more of it. I can continue this process until I run out of RAM.
```julia
using CSV, DataFrames
logfile = "my_1GB_file.csv"
@time CSV.read(
logfile,
DataFrame;
header = 15,
datarow = 26,
drop = (i, name) ->
startswith(string(name), "Name") || startswith(string(name), "SymbolName"),
delim = '\t',
footerskip = 1,
);
```
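For reference, the repetition I describe is roughly the following loop (a sketch; `Sys.maxrss()` reports the process' peak resident set size, and I've omitted the `drop` argument here since, per Edit2 below, it doesn't affect the memory behaviour):
```julia
using CSV, DataFrames

logfile = "my_1GB_file.csv"
for rep in 1:10
    CSV.read(logfile, DataFrame;
             header = 15, datarow = 26, delim = '\t', footerskip = 1)
    foreach(_ -> GC.gc(), 1:6)  # the repeated full collections mentioned above
    println("rep=$rep  maxrss = $(Sys.maxrss() / 1e9) GB")
end
```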
#### Some concrete numbers:
With julia freshly started and CSV and DataFrames loaded, julia uses 167 MB; after the first read, the figure is 1.7 GB. Running the GC multiple times brings this down to 1.2 GB.
Repeating the read 10 times brings the memory usage to 5.7 GB, and triggering the GC brings it down to 2.1 GB. Why does it not go back down to 1.2 GB here?
The CSV file I'm using is about 1 GB (160 MB zipped); I'd be happy to share it if someone wants to reproduce the issue.
Edit: it's here: https://drive.google.com/file/d/1LQSKbDIYHb_N8NqnD40Xw-13V1uTCSOk/view?usp=sharing
I'm running
```
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

(@v1.6) pkg> st CSV
Status `~/.julia/environments/v1.6/Project.toml`
  [336ed68f] CSV v0.8.5
```
---
Edit: I've noticed that each read says something like
```
1.540763 seconds (793.73 k allocations: 1.238 GiB, 2.47% gc time, 19.43% compilation time)
```
i.e., every call reports a nonzero compilation time. Could it be the compiled code that eventually eats up the memory?
---
Edit2: The compilation time turns out to be due to my `drop` function. Removing it removes the compilation time, but does not change the memory issue.
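That recompilation makes sense: the anonymous `drop` closure is a brand-new function type on every call, so `CSV.read` has to specialize for it each time. A sketch of one way to avoid that (my own guess; I haven't verified it helps with the memory): define the predicate once at top level and pass it by name:
```julia
# Defined once at top level, so CSV.read only specializes on it the first time.
dropcol(i, name) =
    startswith(string(name), "Name") || startswith(string(name), "SymbolName")

CSV.read(logfile, DataFrame;
         header = 15, datarow = 26,
         drop = dropcol,
         delim = '\t', footerskip = 1)
```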
That instance is with CSV.jl rather than HDF5.jl, but both are IO-related, and I've seen this problem in many other file-reading contexts as well.
Indeed. It looks like the GC isn't smart enough in these cases.
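On Linux, part of what looks like a leak can also be glibc's allocator holding on to freed pages instead of returning them to the OS. A sketch of something worth trying (an assumption on my part, not verified for this case):
```julia
GC.gc()                                    # collect Julia garbage first
ccall(:malloc_trim, Cint, (Csize_t,), 0)   # ask glibc to return freed pages to the OS
```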