Don't understand why code runs out of memory and crashes

Hello all

We are having a fantastic experience with Julia, but then ran into something confusing. I am doing a multithreaded simulation study, and eventually Julia runs out of memory and crashes. This happens especially on Linux (on Intel) but can also happen on ARM (M2). When I run the code below and monitor Julia’s memory use, it keeps climbing until it crashes. Clearly, it is not doing garbage collection, especially on Linux. But depending on what else the code does, the same can happen on the M2. We are on the latest version.

best, d

using Pkg, Revise, StatsBase
function Simulate()
    Simulations = Int(1e7)
    Size = 1000
    result = Array{Float64}(undef, Simulations, 1)
    Threads.@threads for i = 1:Simulations
        x = randn(Size)
        s = sort(x)
        result[i, 1] = s[1]
    end
    println(median(result))
end
for i in 1:1000
    println(i)
    Simulate()
end

Do you also get the problem on 1.10?

Possibly related:

Thanks @Sukera

yes, just tried on 1.10 and it also leaks memory.

d

Use vectors instead of Nx1 matrices:

result = Array{Float64}(undef, Simulations)

and:

result[i] = s[1]

otherwise, keep the same code but compute the median along the long dimension of the matrix:

median(result, dims=1)
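
A small check of that suggestion, as a sketch only: the array below is made up for illustration and the example uses the Statistics standard library. For an N×1 matrix, the `dims=1` form returns a 1×1 matrix whose single entry matches the median over all elements.

```julia
using Statistics

# Hypothetical stand-in for the N×1 `result` matrix from the original post.
result = reshape(collect(1.0:9.0), 9, 1)

m_all  = median(result)           # median over all elements, returns a scalar
m_dim1 = median(result, dims=1)   # median along the long dimension, returns a 1×1 matrix

println(m_all)
println(size(m_dim1))
```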

Not saying that there shouldn’t be a built-in safeguard for this, but your code is allocating so much memory that it’s not surprising it runs out. Each thread is allocating two 1000-element arrays per iteration.

I don’t know if this is just meant to be an MWE, but in general it’s a good idea to avoid so many allocations, especially in multithreaded code.
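
To put a rough number on that, here is a minimal sketch that measures the bytes allocated by one iteration of the original loop body. `one_iteration` is a name made up for this illustration, and the exact figure will vary with the Julia version.

```julia
# One iteration of the original loop body: `randn` allocates a fresh array
# and `sort` allocates a second one, because it copies its input.
function one_iteration(len)
    x = randn(len)
    s = sort(x)
    return s[1]
end

one_iteration(1000)                       # warm up so compilation is not measured
bytes = @allocated one_iteration(1000)

# Lower bound: two Float64 arrays of 1000 elements each.
lower_bound = 2 * 1000 * sizeof(Float64)
total_gb = bytes * 1e7 / 1e9              # per call of Simulate() with 10^7 iterations

println("bytes per iteration: ", bytes, " (lower bound ", lower_bound, ")")
println("approx. GB allocated per Simulate() call: ", total_gb)
```

At 16,000 bytes or more per iteration, 10^7 iterations churn through at least about 160 GB of short-lived arrays per call of Simulate(). That is total allocation over time, not memory held at once, so the GC has to keep up with it.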

@Salmon thanks!

I agree. That said, this code sample took out all the computationally intensive calculations, so the allocation overhead is minimal in the actual code.

Even then, I just don’t know how to preallocate in multithreaded code like this. I would need to preallocate one vector per core and be able to use it correctly. Maybe that is possible, I just don’t know how. I’ll take a closer look at the docs.

Also, this problem is much worse on Linux than on a Mac under the same Julia version.
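
For the "one vector per core" idea, a minimal sketch using only Base and the Random standard library might look like the following. The function name `simulate_chunked!`, the `ntasks` keyword, and the small sizes are assumptions made for illustration, not code from the actual project.

```julia
using Random

# Fill `result[i]` with the smallest of `len` standard-normal draws.
# Each task owns one buffer, so the inner loop reuses it instead of
# allocating a new vector per iteration.
function simulate_chunked!(result::Vector{Float64}, len::Int; ntasks::Int = Threads.nthreads())
    n = length(result)
    chunk_len = cld(n, ntasks)                  # ceil(n / ntasks) indices per task
    @sync for idxs in Iterators.partition(1:n, chunk_len)
        Threads.@spawn begin
            x = Vector{Float64}(undef, len)     # task-local buffer, allocated once
            for i in idxs
                randn!(x)                       # overwrite the buffer in place
                sort!(x)                        # in-place sort (may still use internal scratch space)
                result[i] = x[1]
            end
        end
    end
    return result
end

result = zeros(10_000)
simulate_chunked!(result, 100)
println(extrema(result))
```

Because every task writes to a disjoint range of `result` and touches only its own buffer, no locking is needed.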

best, d

@rafael.guerra thanks, yes, agree in this sample code, but in the actual code, which does a lot of calculations, it needs to be a matrix, and I just carried it over.

best, d

To avoid most of the allocations, the following should work (I haven’t checked if it runs):

using Pkg, Revise,StatsBase
import ChunkSplitters
function Simulate()
    Simulations=round(Int,1e7)
    Size=1000
    buffers = [zeros(Size) for _ in 1:Threads.nthreads()]
    result = zeros(Simulations,1)
    chunks = ChunkSplitters.chunk(1:Simulations,n=Threads.nthreads())

    Threads.@threads for (n_chunk,indices) in enumerate(chunks)
         x = buffers[n_chunk]
         for i = 1:Simulations
            randn!(x)
            sort!(x)
            result[i,1] = s[1] # I guess in this case sorting is technically not needed and could be replaced by minimum, but perhaps you need the sorting for other reasons
         end
    end
    println(median(result))
end
for i in 1:1000
    println(i)
    Simulate()
end

PS: note I also replaced Int(1e7) by rounding. Probably not absolutely needed in your use-case but seems less error-prone
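
On the remark about sorting: an ascending sort puts the smallest element first, so the stored value is the minimum. A quick sketch to confirm that, with a made-up seeded sample:

```julia
using Random

x = randn(Xoshiro(1), 1000)      # hypothetical sample with a fixed seed

smallest_via_sort = sort(x)[1]   # what the loop stores after sorting
smallest_direct   = minimum(x)   # same value without sorting or copying

println(smallest_via_sort == smallest_direct)
```

If only the smallest value is needed, `minimum(x)` avoids both the sort and the extra array.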

Not to distract from the more fundamental problem, but does it work if you run Julia with the flag julia --heap-size-hint=8G (or whatever amount of memory you have available, say 80 percent of your RAM)?
In theory this should allow for more aggressive GC, though I’m not sure how well it works for multithreaded code.
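
As a rough sketch of how to pick that number, the snippet below computes about 80 percent of physical RAM and prints a matching command line. `script.jl` is a placeholder file name, and adding `--threads=auto` is an assumption about how the simulation is launched.

```julia
# Suggest a heap-size hint of roughly 80% of physical RAM, in whole gigabytes.
total_gib = Sys.total_memory() / 2^30
hint_gib  = max(1, floor(Int, 0.8 * total_gib))

# Command line to start Julia with that hint; `script.jl` is a placeholder name.
cmd = "julia --threads=auto --heap-size-hint=$(hint_gib)G script.jl"
println(cmd)
```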

You may get a better GC behavior if you put these inside a function.
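
A minimal sketch of that suggestion, with the driver loop moved from top level into a function. `simulate_once` and `main` are hypothetical stand-ins with small sizes, and the sketch is serial to keep it short.

```julia
using Statistics

# Stand-in for the real simulation; `n` and `len` are illustrative sizes.
function simulate_once(n, len)
    result = Vector{Float64}(undef, n)
    for i in 1:n
        result[i] = minimum(randn(len))
    end
    return median(result)
end

# The driver loop lives in a function instead of at top level,
# so its variables are locals rather than globals.
function main(runs)
    last_median = NaN
    for i in 1:runs
        last_median = simulate_once(1_000, 100)
    end
    return last_median
end

println(main(5))
```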

Hi @Salmon

Thanks for this, had not seen ChunkSplitters, so many wonderful packages to discover.

This code didn’t quite solve it. Besides the small problems of needing using Random and ChunkSplitters.chunks, it kept on allocating more memory on both Linux and Mac (and was much slower on Linux, which usually is faster than the Mac). I think it perhaps repeats the inner loop over all Simulations in every thread, and one would need to split the inner loop into chunks of Simulations / number of threads?

Not sure.

best, d

Hi @Salmon

That did indeed force it to do GC, but did not solve it.

a) The aggressive GC really slows the code down when it is close to the limit.

b) The code crashed previously because the inner simulation loop took memory allocation up to the limit (64G on these machines), which meant it could not allocate memory for the post-simulation processing.

You could reuse the buffers and (at least in this example) result between different runs of Simulate by moving them into the arguments of the method:

function Simulate!(buffers, result)
    Simulations = size(result, 1)
    Size = length(first(buffers))
    chunks = ...
    ...
end 

function SimulateLoop(runs=1000, Simulations=10^7, Size=1000)
    buffers = [zeros(Size) for _ in 1:Threads.nthreads()]
    result = zeros(Simulations, 1)
    for i in 1:runs
       println(i)
       Simulate!(buffers, result)
    end
end

Here s is not defined and it is running all simulations repeatedly for each chunk.

Hi @lmiq

thanks for that, did not work.

using Pkg, Revise,StatsBase
import ChunkSplitters
function Simulate()
    Simulations=round(Int,1e7)
    Size=1000
    result = zeros(Simulations)
    Threads.@threads for c in ChunkSplitters.chunks(1:Simulations,n=Threads.nthreads())
         x = zeros(Size)
         for i in c
            randn!(x)
            sort!(x)
            result[i,1] = x[1] 
         end
    end
    println(median(result))
end
for i in 1:1000
    println(i)
    Simulate()
end

I think the above should work (I’m on the phone)

To be completely safe about allocations, you can preallocate the temporary arrays:

using Pkg, Revise,StatsBase, Random
import ChunkSplitters
function Simulate!(result, xt)
    result .= 0.0
    Threads.@threads for (ic, c) in enumerate(ChunkSplitters.chunks(eachindex(result),n=length(xt)))
         x = xt[ic]
         x .= 0.0
         for i in c
            randn!(x)
            sort!(x)
            result[i] = x[1] 
         end
    end
    println(median(result))
end
function run(; ntasks=Threads.nthreads())
    Simulations=10^7
    Size=1000
    result = zeros(Simulations)
    xt = [ zeros(Size) for _ in 1:ntasks ] # one buffer per task (iterating over ntasks itself would create only one)
    for i in 1:1000
        println(i)
        Simulate!(result, xt)
    end
end

@Davide98888 if your actual problem has this structure, the above can solve, in practice, the issues you are having. (It does not solve the GC bug)

Or you could parallelize the loop over the simulations, at a higher level.
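
One possible reading of "at a higher level", as a hedged sketch: keep each simulation serial and run several complete simulations in parallel. The names `simulate_serial` and `run_all` and the small sizes are made up for illustration.

```julia
using Random, Statistics

# One complete, serial simulation that owns its buffers.
function simulate_serial(n, len)
    x = Vector{Float64}(undef, len)
    result = Vector{Float64}(undef, n)
    for i in 1:n
        randn!(x)
        result[i] = minimum(x)
    end
    return median(result)
end

# Parallelize over whole runs instead of over iterations inside a run.
function run_all(runs, n, len)
    medians = Vector{Float64}(undef, runs)
    Threads.@threads for r in 1:runs
        medians[r] = simulate_serial(n, len)
    end
    return medians
end

medians = run_all(8, 1_000, 100)
println(length(medians))
```

The trade-off is that every run in flight keeps its own `result` vector, so peak memory grows with the number of threads.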

Hi @lmiq

Wow, that is very nice! I would never have discovered that. It’s about 20% faster than my original code.

It needs also using Random since randn!() needs that but not randn()

But, it still leaks memory, albeit slower than in my code.

d

Correct me if I am wrong, but it seems to me that most (all?) of the answers improve the OP’s code by reducing allocations, which is interesting per se, but they do not address whether the OP’s original MWE actually illustrates a genuine threading bug or problem with the GC (it looks like one to me). If that is the case, I guess an issue should be filed.

How are you measuring the leakage? Just looking at top? I’m curious how your code behaves on my machine, so I’d like to reproduce your methodology. The example you’re showing does indeed allocate a lot of memory overall, but it shouldn’t leak it. There are plenty of opportunities here for the GC to run and eventually reuse those allocations internally.

@Sukera

I use btop on my Linux and Mac machines. My production code eventually crashes when it processes the results after the simulation loop, and btop shows me it happens when Julia has taken all of the 64G on my machines (each simulation loop needs about 0.5G). The Linux console also gives an out-of-memory message.
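
To complement an external monitor, a sketch like the one below could log memory from inside the process between runs. `report_memory` is a hypothetical helper; `Base.gc_live_bytes` is not exported, so it should be treated as an internal function that may change between Julia versions.

```julia
# Print memory figures as seen from inside the Julia process.
function report_memory(label)
    live_mib = Base.gc_live_bytes() / 2^20   # bytes the GC currently considers live
    peak_mib = Sys.maxrss() / 2^20           # peak resident set size of the process
    free_mib = Sys.free_memory() / 2^20      # free system memory
    println(label, ": live=", round(live_mib; digits=1), " MiB, peak RSS=",
            round(peak_mib; digits=1), " MiB, system free=", round(free_mib; digits=1), " MiB")
    return (live = live_mib, peak = peak_mib, free = free_mib)
end

before = report_memory("before")
data = [randn(1000) for _ in 1:10_000]       # roughly 80 MB of live data, for illustration
during = report_memory("while data is referenced")
data = nothing
GC.gc()                                      # force a full collection
after = report_memory("after GC.gc()")
```

If the live figure drops after a collection while the monitor still shows high usage, that would suggest freed memory is not being returned to the operating system, which is a different situation from objects that are still reachable.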

d