Thanks for your reply!
I think I have narrowed it down to two problems. Let me share some data and code to elaborate on them.
Problem 1: Garbage collection, if activated, does not properly free memory!
As suggested, I added
Dagger.MemPool.MEM_RESERVED[] = 10^6 * 1000 # 1000 MB
Dagger.MemPool.MEM_RESERVE_SWEEPS[] = 3 # default is 3
In order to check the garbage collection behavior of MemPool.jl, I turned on Debug logging for MemPool.jl.
Running Julia (e.g. julia -t 4) with these settings gives the following output:
Dagger.MemPool.MEM_RESERVED[]= 1000000000
Dagger.MemPool.MEM_RESERVE_SWEEPS[]= 3
Memory available before computation: 1.512 GiB
Dagger, running for matrix size: (1000, 1000)
Iteration 10: Memory: 869.773 MiB | Free Memory: 986.031 MiB
Iteration 20: Memory: 1.665 GiB | Free Memory: 986.031 MiB
Iteration 30: Memory: 2.360 GiB | Free Memory: 986.031 MiB
┌ Debug: Not enough memory to continue! Sweeping up unused memory...
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:422
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Made too many sweeps, bailing out...
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:445
┌ Debug: Swept for 3 cycles
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:450
┌ Debug: Not enough memory to continue! Sweeping up unused memory...
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:422
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Made too many sweeps, bailing out...
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:445
┌ Debug: Swept for 3 cycles
# Code gets stuck in cleaning memory!
Increasing MEM_RESERVE_SWEEPS to 200 frees some memory (53.578 MiB in the first sweep), but it does not significantly increase the amount of memory available.
Dagger.MemPool.MEM_RESERVED[]= 1000000000
Dagger.MemPool.MEM_RESERVE_SWEEPS[]= 200
Memory available before computation: 4.114 GiB
Dagger, running for matrix size: (1000, 1000)
Iteration 10: Memory: 877.430 MiB | Free Memory: 3.255 GiB
Iteration 20: Memory: 1.582 GiB | Free Memory: 3.255 GiB
Iteration 30: Memory: 2.285 GiB | Free Memory: 3.255 GiB
Iteration 40: Memory: 3.078 GiB | Free Memory: 1.207 GiB
Iteration 50: Memory: 3.773 GiB | Free Memory: 1.207 GiB
┌ Debug: Not enough memory to continue! Sweeping up unused memory...
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:422
┌ Debug: Freed 53.578 MiB bytes, available: 68.500 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
# Code gets stuck in cleaning memory!
Code:
using Dagger
using Profile
ENV["JULIA_DEBUG"] = Dagger.MemPool
using Logging
Dagger.MemPool.MEM_RESERVED[] = 10^6 * 1000 # 1000 MB
Dagger.MemPool.MEM_RESERVE_SWEEPS[] = 3 # default is 3
println("Dagger.MemPool.MEM_RESERVED[]= ", Dagger.MemPool.MEM_RESERVED[])
println("Dagger.MemPool.MEM_RESERVE_SWEEPS[]= ", Dagger.MemPool.MEM_RESERVE_SWEEPS[])
function get_free_memory()
    return Base.format_bytes(Int(Sys.free_memory()))
end
function get_memory_usage_main()
    total_mem = Base.gc_live_bytes()
    return "\tMemory: $(Base.format_bytes(total_mem)) | Free Memory: $(get_free_memory())"
end
function heavy_computation()
    A = rand(1000,1000)
    return A
end
function compute()
    matrix = zeros(1000,1000)
    println("Memory available before computation: ", get_free_memory())
    println("Dagger, running for matrix size: ", size(matrix))
    for j in 1:100
        job_outputs = []
        @sync for i in 1:10
            job = Dagger.@spawn heavy_computation()
            push!(job_outputs, job)
        end
        results = Dagger.fetch.(job_outputs)
        matrix .+= sum(results)
        if j % 10 == 0
            println("Iteration $j: " * get_memory_usage_main())
        end
    end
end
compute()
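As a rough cross-check of the numbers in the logs: each task returns a 1000×1000 Float64 matrix, so if none of the results were ever released, live memory would grow by about the amount computed below between two printouts. This is only a back-of-the-envelope sketch using the sizes from the code above; it does not call Dagger, and the variable names are made up for illustration.

```julia
# Estimate how much memory the task results occupy if none of them are freed.
# All sizes are taken from the reproduction code above.
bytes_per_matrix   = sizeof(Float64) * 1000 * 1000   # one result of heavy_computation()
tasks_per_iter     = 10                               # inner loop runs 1:10
iters_per_printout = 10                               # status is printed every 10 iterations

bytes_per_printout = bytes_per_matrix * tasks_per_iter * iters_per_printout
println("Expected growth per printout: ", Base.format_bytes(bytes_per_printout))
```

That is roughly 763 MiB per 10 iterations, which is of the same order as the steps between printouts in the logs (for example 869.773 MiB to 1.665 GiB). The growth is therefore at least consistent with the result of every task being retained.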
As a sanity check, I also removed Dagger.jl entirely, which results in no memory issues:
No Dagger, running for matrix size: (1000, 1000)
Iteration 10: Memory: 208.951 MiB
Iteration 20: Memory: 231.920 MiB
Iteration 30: Memory: 231.920 MiB
Iteration 40: Memory: 185.982 MiB
Iteration 50: Memory: 109.418 MiB
Code:
function compute()
    matrix = zeros(1000,1000)
    println("No Dagger, running for matrix size: ", size(matrix))
    for j in 1:50
        job_outputs = []
        for i in 1:10
            job = heavy_computation()
            push!(job_outputs, job)
        end
        results = job_outputs
        matrix .+= sum(results)
        if j % 10 == 0
            println("Iteration $j: " * get_memory_usage_main())
        end
    end
end
compute()
Problem 2: Triggering garbage collection based on the amount of memory left on the device is unpredictable.
Setting Dagger.MemPool.MEM_RESERVED[] = 1024^3 will make Dagger.jl invoke the garbage collector when there is less than 1 GiB of memory left. From a quick look through the code of MemPool.jl, the amount of memory available is determined as follows:
...
if Sys.islinux()
    function free_memory()
        open("/proc/meminfo", "r") do io
            # TODO: Cache in TLS
            buf = zeros(UInt8, 128)
            readbytes!(io, buf)
            free = match(r"MemAvailable:\s*([0-9]*)\s.*", String(buf)).captures[1]
            return parse(UInt64, free) * 1024
        end
    end
else
    # FIXME: Sys.free_memory() includes OS caches
    free_memory() = Sys.free_memory()
end
storage_available(::CPURAMResource) = _query_mem_periodically(:available)
storage_capacity(::CPURAMResource) = _query_mem_periodically(:capacity)
...
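To make the Linux branch concrete, here is a small sketch that applies the same regular expression to a made-up /proc/meminfo excerpt. The sample values are invented for illustration and are not from my machine; only the regex and the multiplication by 1024 (the field is reported in kB) come from the excerpt above.

```julia
# Invented sample of the first lines of /proc/meminfo (values are placeholders).
sample = """
MemTotal:       16384000 kB
MemFree:         1024000 kB
MemAvailable:    8192000 kB
"""

# Same pattern as in the MemPool.jl excerpt: capture the number after "MemAvailable:".
m = match(r"MemAvailable:\s*([0-9]*)\s.*", sample)

# /proc/meminfo reports kB, so multiply by 1024 to get bytes.
available_bytes = parse(UInt64, m.captures[1]) * 1024
println("Available: ", Base.format_bytes(available_bytes))
```

Note that this value describes the whole machine. Nothing in this parsing step knows about limits imposed on the individual process.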
The free_memory() function behaves differently depending on the operating system: on Linux it reads MemAvailable from /proc/meminfo (trying to not include OS caches, if I understand correctly?). The code in the previous sections was executed on a machine running macOS.
I also tested the code on a Linux system running SLURM, reserving memory with e.g. srun/sbatch … --mem=4G. However, /proc/meminfo reports the memory available on the entire machine, not the amount allocated to the job, so the job ends up getting killed by the SLURM daemon. Using something like julia ... --heap-size-hint=2G, as suggested in another discussion, also does not limit allocations in this case (see "Interaction between `addprocs` and `--heap-size-hint`", Issue #50673, JuliaLang/julia on GitHub).
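One possible direction, sketched here under assumptions: derive the memory budget from the job allocation instead of from the machine. The sketch assumes that SLURM exports SLURM_MEM_PER_NODE (in megabytes, taken here as 1024^2 bytes) when --mem is given. The helper names job_memory_budget and over_budget are hypothetical and not part of MemPool.jl or Dagger.jl.

```julia
# Hypothetical helper: memory budget of the current job in bytes.
# Assumption: SLURM sets SLURM_MEM_PER_NODE (in MB) when --mem is requested.
# Falls back to the total machine memory when the variable is missing or malformed.
function job_memory_budget(env = ENV; fallback = Sys.total_memory())
    raw = get(env, "SLURM_MEM_PER_NODE", "")
    mb = tryparse(Int, raw)
    mb === nothing && return Int(fallback)
    return mb * 1024^2
end

# Hypothetical check: has this process used more than `fraction` of its budget?
over_budget(live_bytes, budget; fraction = 0.8) = live_bytes > fraction * budget

budget = job_memory_budget()
println("Budget: ", Base.format_bytes(budget),
        " | over budget: ", over_budget(Base.gc_live_bytes(), budget))
```

This only computes a number. MEM_RESERVED is compared against the free memory of the machine, so as far as I can tell a per-job budget like this cannot simply be plugged into it.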
Conclusion
If activated, garbage collection does not seem to free up memory effectively. In addition, the way free_memory() is determined can lead to unpredictable behavior on a SLURM node.
PS: I can work around this problem for now by using pmap from Distributed.jl; however, I like Dagger.jl’s flexibility to switch between workers and threads.
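For comparison, a minimal sketch of what a pmap-based version of the loop above could look like. The worker count of 2 and the name compute_pmap are placeholders chosen for illustration; the sizes match the reproduction code.

```julia
using Distributed

# Start two local worker processes (placeholder count).
addprocs(2)

# The task function has to be defined on every worker.
@everywhere heavy_computation() = rand(1000, 1000)

function compute_pmap(iterations = 100)
    matrix = zeros(1000, 1000)
    for j in 1:iterations
        # pmap runs the 10 jobs on the workers and returns their results as a Vector.
        results = pmap(_ -> heavy_computation(), 1:10)
        matrix .+= sum(results)
    end
    return matrix
end
```

Calling compute_pmap() then plays the role of compute() above, with the results handled by Distributed.jl instead of MemPool.jl.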