We are having a fantastic experience with Julia, but then had a confusing experience. I am doing a multithreaded simulation study, and eventually Julia runs out of memory and crashes. This happens especially on Linux (on Intel) but can also happen on ARM (an M2 Mac). When I run the code below and monitor Julia's memory use, it keeps climbing until it crashes. Clearly, it is not doing garbage collection, especially on Linux. But depending on what else the code does, the same can happen on the M2. We are on the latest version.
best, d
using Pkg, Revise, StatsBase

function Simulate()
    Simulations = Int(1e7)
    Size = 1000
    result = Array{Float64}(undef, Simulations, 1)
    Threads.@threads for i = 1:Simulations
        x = randn(Size)
        s = sort(x)
        result[i, 1] = s[1]
    end
    println(median(result))
end

for i in 1:1000
    println(i)
    Simulate()
end
Not saying that there shouldn't be a built-in safeguard for this, but your code is allocating so much memory that it's not surprising it runs out. Each thread is allocating two 1000-element arrays per iteration.
I don't know if this is just meant to be an MWE, but in general it's a good idea to avoid this many allocations, especially in multithreaded code.
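To put a rough number on that (a back-of-the-envelope sketch; exact figures vary by Julia version and machine): each iteration allocates two 1000-element Float64 arrays of about 8 kB each, so one call to Simulate() produces on the order of 150 GB of allocation traffic for the GC to keep up with.

per_iter = @allocated begin
    x = randn(1000)   # one ~8 kB allocation
    s = sort(x)       # sort makes a copy, another ~8 kB
end
total_gb = per_iter * 1e7 / 2^30  # on the order of 150 GB per Simulate() call
println((per_iter, total_gb))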
I agree. That said, this code sample stripped out all the computationally intensive calculations, so the overhead is minimal in the actual code.
Even then, I just don't know how to preallocate in multithreaded code like this. One would need to preallocate one vector per core and use it correctly. Maybe it's possible, I just don't know how; I'll take a closer look at the docs.
And this problem is much worse on Linux than on a Mac under the same Julia version.
@rafael.guerra thanks, yes, I agree for this sample code, but in the actual code, which does a lot of calculations, it needs to be a matrix, and I just carried that over.
To avoid most of the allocations, the following should work (I haven't checked if it runs):
using Pkg, Revise, StatsBase
import ChunkSplitters

function Simulate()
    Simulations = round(Int, 1e7)
    Size = 1000
    buffers = [zeros(Size) for _ in 1:Threads.nthreads()]
    result = zeros(Simulations, 1)
    chunks = ChunkSplitters.chunk(1:Simulations, n=Threads.nthreads())
    Threads.@threads for (n_chunk, indices) in enumerate(chunks)
        x = buffers[n_chunk]
        for i = 1:Simulations
            randn!(x)
            sort!(x)
            result[i, 1] = x[1] # I guess in this case sorting is technically not needed and could be replaced by minimum, but perhaps you need the sorting for other reasons
        end
    end
    println(median(result))
end

for i in 1:1000
    println(i)
    Simulate()
end
PS: note I also replaced Int(1e7) by rounding. Probably not absolutely needed in your use case, but it seems less error-prone.
Maybe not to distract from the more fundamental problem, but does it work if you run Julia with the flag julia --heap-size-hint=8G (or whatever amount of memory you have available, say 80 percent of your RAM)?
In theory this should allow for more aggressive GC, though I'm not sure how well it works for multithreaded code.
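For example, on a 64 GB machine the invocation would be something like this (script.jl is just a placeholder for your simulation script, and the exact hint value is up to you):

julia --threads=8 --heap-size-hint=50G script.jl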
Thanks for this, I had not seen ChunkSplitters; so many wonderful packages to discover.
This code didn't quite solve it. Besides the small problems of needing using Random and ChunkSplitters.chunks, it kept on allocating more memory on both Linux and Mac (and was much slower on Linux, which is usually faster than the Mac). I think perhaps it repeats the inner loop over all Simulations for every thread, and one would need to split the inner loop into Simulations/number-of-threads chunks?
That did indeed force it to do GC, but did not solve it.
a) The aggressive GC really slows the code when it's close to the limit.
b) Previously, the inner simulation loop took memory allocation up to the limit (64 GB on these machines), which meant the post-simulation processing could not allocate, and the code crashed.
You could reuse the buffers and (at least in this example) result between different runs of Simulate by moving them into the arguments of the method:
function Simulate!(buffers, result)
    Simulations = size(result, 1)
    Size = length(first(buffers))
    chunks = ...
    ...
end

function SimulateLoop(runs=1000, Simulations=10^7, Size=1000)
    buffers = [zeros(Size) for _ in 1:Threads.nthreads()]
    result = zeros(Simulations, 1)
    for i in 1:runs
        println(i)
        Simulate!(buffers, result)
    end
end
using Pkg, Revise, StatsBase, Random
import ChunkSplitters

function Simulate()
    Simulations = round(Int, 1e7)
    Size = 1000
    result = zeros(Simulations)
    Threads.@threads for c in ChunkSplitters.chunks(1:Simulations, n=Threads.nthreads())
        x = zeros(Size)
        for i in c
            randn!(x)
            sort!(x)
            result[i] = x[1]
        end
    end
    println(median(result))
end

for i in 1:1000
    println(i)
    Simulate()
end
I think the above should work (I’m on the phone)
To be completely safe about allocations, you can preallocate the temporary arrays:
using Pkg, Revise, StatsBase, Random
import ChunkSplitters

function Simulate!(result, xt)
    result .= 0.0
    Threads.@threads for (ic, c) in enumerate(ChunkSplitters.chunks(eachindex(result), n=length(xt)))
        x = xt[ic]
        x .= 0.0
        for i in c
            randn!(x)
            sort!(x)
            result[i] = x[1]
        end
    end
    println(median(result))
end

function run(; ntasks=Threads.nthreads())
    Simulations = 10^7
    Size = 1000
    result = zeros(Simulations)
    xt = [zeros(Size) for _ in 1:ntasks]
    for i in 1:1000
        println(i)
        Simulate!(result, xt)
    end
end
@Davide98888 if your actual problem has this structure, the above can, in practice, solve the issues you are having. (It does not solve the GC bug.)
Or you could parallelize the loop over the simulations at a higher level.
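For example, a minimal sketch of that higher-level approach (names like simulate_serial and run_outer are illustrative, and I haven't benchmarked this): each task runs one full simulation serially with its own buffers, and it is the outer 1:1000 loop that gets threaded.

using Random, Statistics

# Run one full simulation serially, reusing the preallocated buffers.
function simulate_serial(x::Vector{Float64}, result::Vector{Float64})
    for i in eachindex(result)
        randn!(x)
        result[i] = minimum(x)  # same value as sort!(x); x[1]
    end
    return median(result)
end

# Thread over the outer runs instead of the inner simulation loop.
function run_outer(runs=1000, Simulations=10^7, Size=1000)
    medians = zeros(runs)
    Threads.@threads for r in 1:runs
        x = zeros(Size)              # per-run buffers, allocated once per run
        result = zeros(Simulations)
        medians[r] = simulate_serial(x, result)
        println(r)
    end
    return medians
end

With nthreads() tasks live at once, each holding one 10^7-element result vector (about 80 MB), the peak footprint stays bounded instead of growing with the number of inner iterations.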
Correct me if I am wrong, but it seems to me that most (all?) of the answers tend to improve the OP's code by reducing allocations, which is interesting per se, but do not address whether the original OP's MWE actually illustrates a genuine threading bug/problem with the GC (it looks like one to me). If that is the case, I guess an issue should be filed.
How are you measuring the leakage? Just looking at top? I'm curious how your code behaves on my machine, so I'd like to reproduce your methodology. The example you're showing does indeed allocate a lot of memory overall, but it shouldn't leak like this. There are plenty of opportunities here for the GC to run and eventually reuse those allocations internally.
I use btop on my Linux and Mac machines. My production code eventually crashes when it processes the results after the simulation loop, and btop shows me this happens when Julia has taken all of the 64 GB on my machines (each simulation loop needs about 0.5 GB). The Linux console also gives an out-of-memory message.
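One can also compare the GC's view of memory with the process footprint from inside Julia (a sketch: Base.gc_live_bytes and Sys.maxrss are internal/OS-level counters, and report_memory is just an illustrative helper name):

# Log memory once per outer run and compare against what btop reports.
function report_memory(i)
    live_gb = Base.gc_live_bytes() / 2^30  # bytes the GC considers live
    rss_gb  = Sys.maxrss() / 2^30          # peak resident set size of the process
    println("run $i: GC-live ≈ $(round(live_gb, digits=2)) GB, peak RSS ≈ $(round(rss_gb, digits=2)) GB")
end

Calling report_memory(i) inside the for i in 1:1000 loop shows whether the GC-live bytes stay flat while the process RSS keeps climbing, which would match what btop is showing.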