Memory usage & garbage collection of Dagger

Hi all,

Problem:

There is a heavy_computation() that returns a large Array. Its evaluation is parallelized with Dagger.jl: the DTasks are spawned and later fetched.

Running this code results in ever-growing memory usage (3-4 GB). It seems to me that the DTasks, or their results, are not freed by the garbage collector.

Question:

  • Should I expect this memory behavior here?
  • How can I reformulate my problem to avoid it?

I am very curious to hear your thoughts!

Dagger v0.19.2, Julia 1.11.5

Example code:

using Dagger

function compute()
    matrix = zeros(1000,1000)

    for j in 1:50
        println("Iteration $j")
        
        # Spawn 10 independent tasks, each producing a 1000x1000 matrix
        job_outputs = []
        @sync for i in 1:10
            job = Dagger.@spawn rand(1000,1000)
            push!(job_outputs, job)
        end

        # Fetch all results and accumulate them into matrix
        results = Dagger.fetch.(job_outputs)
        matrix .+= sum(results)
    end
end
compute()

With profiling:

using Dagger
using Profile

# Report the size of each binding in Main, plus the total GC-live bytes
function get_memory_usage_main()
    output = ""
    for var in names(Main)
        try
            obj = getfield(Main, var)
            mem = Base.summarysize(obj)
            output *= "Variable: $var, Size: $(Base.format_bytes(mem))\n"
        catch
            output *= "Could not get size for variable: $var\n"
        end
    end
    end
    total_mem = Base.gc_live_bytes()
    output *= "Total memory used in Main: $(Base.format_bytes(total_mem))\n"
    return output
end

function heavy_computation()
    A = rand(1000,1000)
    return A
end

function compute()
    matrix = zeros(1000,1000)

    for j in 1:50
        println("Iteration $j")
        
        job_outputs = []
        @sync for i in 1:10
            job = Dagger.@spawn heavy_computation()
            push!(job_outputs, job)
        end
        
        results = Dagger.fetch.(job_outputs)
        matrix .+= sum(results)
        
        println(get_memory_usage_main())
    end

    Profile.take_heap_snapshot("test.heapsnapshot")
end

compute()

Dagger’s usage of remote references (in the form of MemPool.DRef, which all tasks generate) causes the GC to be a bit slower to free memory than it normally would be. You can ask Dagger to free memory more aggressively by setting Dagger.MemPool.MEM_RESERVED[] = 1024^3 to reserve 1 GB of memory (the amount can be set to any desired value); Dagger will then invoke the GC manually whenever less than this amount of memory remains on the system. You can also increase Dagger.MemPool.MEM_RESERVE_SWEEPS[] to make each GC cycle more aggressive, ensuring memory really gets freed at a decent pace.
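For example (the reserve amount and sweep count below are illustrative, not recommendations):

using Dagger

# Trigger Dagger's manual GC whenever less than 1 GiB of system memory remains
Dagger.MemPool.MEM_RESERVED[] = 1024^3
# Allow more GC sweeps per cleanup attempt (the default is 3)
Dagger.MemPool.MEM_RESERVE_SWEEPS[] = 5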


Thanks for your reply!

I think I nailed it down to two problems, let me share some data and code to elaborate on them.

Problem 1: Garbage collection, if activated, does not properly clean up memory!

As suggested, I added

Dagger.MemPool.MEM_RESERVED[] = 10^6 * 1000   # 1000 MB
Dagger.MemPool.MEM_RESERVE_SWEEPS[] = 3 # default is 3

To check the garbage-collection behavior of MemPool.jl, I turned on debug logging for it.

Running Julia (e.g. julia -t 4) with these settings gives the following output:

Dagger.MemPool.MEM_RESERVED[]= 1000000000
Dagger.MemPool.MEM_RESERVE_SWEEPS[]= 3
Memory available before computation: 1.512 GiB
Dagger, running for matrix size: (1000, 1000)
Iteration 10:   Memory: 869.773 MiB | Free Memory: 986.031 MiB
Iteration 20:   Memory: 1.665 GiB | Free Memory: 986.031 MiB
Iteration 30:   Memory: 2.360 GiB | Free Memory: 986.031 MiB
┌ Debug: Not enough memory to continue! Sweeping up unused memory...
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:422
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Made too many sweeps, bailing out...
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:445
┌ Debug: Swept for 3 cycles
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:450
┌ Debug: Not enough memory to continue! Sweeping up unused memory...
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:422
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 82.000 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Made too many sweeps, bailing out...
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:445
┌ Debug: Swept for 3 cycles

# Code gets stuck in cleaning memory!

Increasing MEM_RESERVE_SWEEPS to 200 causes some memory to be freed, but it does not significantly increase the amount of memory available.

Dagger.MemPool.MEM_RESERVED[]= 1000000000
Dagger.MemPool.MEM_RESERVE_SWEEPS[]= 200
Memory available before computation: 4.114 GiB
Dagger, running for matrix size: (1000, 1000)
Iteration 10:   Memory: 877.430 MiB | Free Memory: 3.255 GiB
Iteration 20:   Memory: 1.582 GiB | Free Memory: 3.255 GiB
Iteration 30:   Memory: 2.285 GiB | Free Memory: 3.255 GiB
Iteration 40:   Memory: 3.078 GiB | Free Memory: 1.207 GiB
Iteration 50:   Memory: 3.773 GiB | Free Memory: 1.207 GiB
┌ Debug: Not enough memory to continue! Sweeping up unused memory...
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:422
┌ Debug: Freed 53.578 MiB bytes, available: 68.500 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB
└ @ MemPool ~/.julia/packages/MemPool/xbec0/src/datastore.jl:440
┌ Debug: Freed 0 bytes bytes, available: 65.922 MiB

# Code gets stuck in cleaning memory!

Code:

using Dagger
using Profile

ENV["JULIA_DEBUG"] = Dagger.MemPool

using Logging

Dagger.MemPool.MEM_RESERVED[] = 10^6 * 1000   # 1000 MB
Dagger.MemPool.MEM_RESERVE_SWEEPS[] = 3 # default is 3

println("Dagger.MemPool.MEM_RESERVED[]= ", Dagger.MemPool.MEM_RESERVED[])
println("Dagger.MemPool.MEM_RESERVE_SWEEPS[]= ", Dagger.MemPool.MEM_RESERVE_SWEEPS[])

function get_free_memory()
    return Base.format_bytes(Int(Sys.free_memory()))
end

function get_memory_usage_main()
    total_mem = Base.gc_live_bytes()
    return "\tMemory: $(Base.format_bytes(total_mem)) | Free Memory: $(get_free_memory())"
end

function heavy_computation()
    A = rand(1000,1000)
    return A
end

function compute()
    matrix = zeros(1000,1000)

    println("Memory available before computation: ", get_free_memory())

    println("Dagger, running for matrix size: ", size(matrix))

    for j in 1:100
        
        job_outputs = []
        @sync for i in 1:10
            job = Dagger.@spawn heavy_computation()
            push!(job_outputs, job)
        end
        
        results = Dagger.fetch.(job_outputs)
        matrix .+= sum(results)
        

        if j % 10 == 0
            println("Iteration $j: " * get_memory_usage_main())
        end
    end
end


compute()

As a sanity check, I also removed Dagger.jl entirely, which results in no memory issues:

No Dagger, running for matrix size: (1000, 1000)
Iteration 10:   Memory: 208.951 MiB
Iteration 20:   Memory: 231.920 MiB
Iteration 30:   Memory: 231.920 MiB
Iteration 40:   Memory: 185.982 MiB
Iteration 50:   Memory: 109.418 MiB

Code:

function compute()
    matrix = zeros(1000,1000)

    println("No Dagger, running for matrix size: ", size(matrix))

    for j in 1:50
        job_outputs = []
        for i in 1:10
            job = heavy_computation()
            push!(job_outputs, job)
        end

        results = job_outputs
        matrix .+= sum(results)

        if j % 10 == 0
            println("Iteration $j: " * get_memory_usage_main())
        end
    end
end

compute()

Problem 2: Garbage collection based on the amount of memory left on the device is unpredictable.

Setting Dagger.MemPool.MEM_RESERVED[] = 1024^3 will make Dagger.jl invoke the garbage collector when there is less than 1 GB of memory left. Looking through the code of MemPool.jl, the amount of available memory is determined by:

...
if Sys.islinux()
function free_memory()
    open("/proc/meminfo", "r") do io
        # TODO: Cache in TLS
        buf = zeros(UInt8, 128)
        readbytes!(io, buf)
        free = match(r"MemAvailable:\s*([0-9]*)\s.*", String(buf)).captures[1]
        return parse(UInt64, free) * 1024
    end
end
else
# FIXME: Sys.free_memory() includes OS caches
free_memory() = Sys.free_memory()
end
storage_available(::CPURAMResource) = _query_mem_periodically(:available)
storage_capacity(::CPURAMResource) = _query_mem_periodically(:capacity)
...

The free_memory() function behaves differently per operating system; on Linux, for example, it tries not to include OS caches (if I understand correctly?). The code in the previous sections was executed on a machine running macOS.

I also tested the code on a Linux system running SLURM, where I reserved memory via e.g. srun/sbatch … --mem=4G. However, cat /proc/meminfo reports the amount of memory available on the entire machine, not the amount allocated to the job, resulting in the job getting killed by the SLURM daemon. Using something like julia ... --heap-size-hint=2G, as suggested in other discussions, also does not limit allocations in this case (see Interaction between `addprocs` and `--heap-size-hint` · Issue #50673 · JuliaLang/julia · GitHub).
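A quick way to see the mismatch from inside a SLURM job (assuming the job was started with --mem, so SLURM exports SLURM_MEM_PER_NODE in MB):

# Compare the job's memory limit with what the whole machine reports
slurm_mb = get(ENV, "SLURM_MEM_PER_NODE", "unset")
println("SLURM job memory limit:    ", slurm_mb, " MB")
println("Sys.free_memory() reports: ", Base.format_bytes(Int(Sys.free_memory())))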


Conclusion

Even when activated, garbage collection does not seem to effectively free up memory. Additionally, the way free_memory() is determined can behave unpredictably on a SLURM node.

PS: I can work around this problem for now by using pmap from Distributed.jl; however, I like Dagger.jl’s flexibility to switch between workers and threads.
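For reference, the pmap-based workaround looks roughly like this (the worker count is illustrative):

using Distributed
addprocs(4)  # example worker count

@everywhere heavy_computation() = rand(1000, 1000)

function compute_pmap()
    matrix = zeros(1000, 1000)
    for j in 1:50
        # pmap ships the calls to the workers and returns plain Arrays,
        # leaving no remote references behind
        results = pmap(_ -> heavy_computation(), 1:10)
        matrix .+= sum(results)
    end
    return matrix
end

compute_pmap()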

If you think this is an issue with Dagger.jl, can you create an issue on the Dagger.jl GitHub repository?

Thanks for the excellent analysis!

As you’ve found, the process of detecting remaining memory is OS-specific, because the built-in method for detecting remaining memory (Sys.maxrss()) has not, in previous testing, reported a value appropriate for this purpose (I don’t recall the specifics, but I wouldn’t go to this amount of effort if it weren’t necessary :smiley: ). I’m very open to any contributions to improve this logic - in particular, better allocation-aware support on SLURM would be a big win! I don’t personally know how to do this, so if you get the chance to investigate and provide a working implementation, that would be a huge help :heart:

You’re welcome to file an issue, but to be honest, it’s not going to speed up the resolution. GC issues like these are very hard to investigate and permanently resolve, especially when dealing with hardware and scheduling software differences between systems. If you can investigate and find a workable solution for your system, however, I can definitely help you get those changes merged into MemPool/Dagger!


Thanks for the reply, and I am happy to investigate further. From what I can tell so far, the two problems likely require different solutions.

Problem 1: The garbage collector does not effectively clean up the objects.

What would you suggest for investigating this further?

  • Monitor refcounts in MemPool?
  • Test explicitly cleaning up objects (a minimal sketch of this experiment follows below)
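For the second bullet, a minimal experiment could look like this (a sketch using only Dagger’s public API; compute_with_explicit_cleanup is a hypothetical name):

using Dagger

function compute_with_explicit_cleanup()
    matrix = zeros(1000, 1000)
    for j in 1:50
        job_outputs = [Dagger.@spawn rand(1000, 1000) for _ in 1:10]
        matrix .+= sum(fetch.(job_outputs))
        empty!(job_outputs)  # explicitly drop the DTask references
        GC.gc()              # force a full collection
        if j % 10 == 0
            println("Iteration $j: ", Base.format_bytes(Base.gc_live_bytes()))
        end
    end
    return matrix
end

compute_with_explicit_cleanup()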

Problem 2: Triggering garbage collection within a SLURM job based on the job’s allocated memory.

This likely involves modifying the free_memory() function such that, when SLURM environment variables are detected, it calculates the available memory as:

free_memory() = (memory allocated by SLURM, e.g. SLURM_MEM_PER_CPU or SLURM_MEM_PER_NODE) - (memory currently used according to SLURM)

Perhaps not the neatest solution, but it could do the job.
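A rough sketch of what this could look like (a hypothetical function, not MemPool’s actual implementation; the variable semantics are my reading of the SLURM documentation):

# Hypothetical sketch: derive the job's memory budget from SLURM's
# environment variables instead of /proc/meminfo
function slurm_aware_free_memory()
    limit_mb = if haskey(ENV, "SLURM_MEM_PER_NODE")
        parse(UInt64, ENV["SLURM_MEM_PER_NODE"])        # set by --mem (in MB)
    elseif haskey(ENV, "SLURM_MEM_PER_CPU")
        cpus = parse(UInt64, get(ENV, "SLURM_CPUS_ON_NODE", "1"))
        parse(UInt64, ENV["SLURM_MEM_PER_CPU"]) * cpus  # --mem-per-cpu (in MB)
    else
        return UInt64(Sys.free_memory())                # not under SLURM
    end
    limit = limit_mb * UInt64(1024^2)
    # Sys.maxrss() is this process's peak RSS, so it is only a crude proxy
    # for "memory currently used according to SLURM"
    used = Sys.maxrss()
    return used >= limit ? UInt64(0) : limit - used
end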