Calling all heavy multithreading users: memory fragmentation problem

Hi Guys,

I use Julia for computation-intensive tasks, and of course the best way to reduce my run time is to use multithreading and parallelize all my work.

Sometimes my work requires more memory per thread, so I wrote a utility to control dynamically how many threads I use. It looks like this. (I am aware that v1.5 added new functionality for launching tasks across threads without oversubscribing, but I still want dynamic control over the number of threads so that I do not overbook memory.)

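using Base.Threads  # provides @threads, atomic_add!, nthreads

function my_map(
    func, to_process, params...;
    num_thread=Threads.nthreads() - 1,
    hasRet=true,   # assumed keyword; its definition is missing from this snippet
    outType=Any,   # assumed keyword; its definition is missing from this snippet
)
    # shared cursor: each worker atomically claims the next unprocessed index
    pos = Threads.Atomic{Int64}(1)

    # prepare the output array
    out_array = Vector{outType}(undef, length(to_process))

    @threads for i = 1:num_thread
        while true
            # atomic_add! returns the value before the increment
            this_thread_pos = atomic_add!(pos, 1)
            if this_thread_pos <= length(to_process)
                if hasRet
                    out_array[this_thread_pos] =
                        func(to_process[this_thread_pos], params...)
                else
                    func(to_process[this_thread_pos], params...)
                end
            else
                break
            end
        end #while true loop
    end

    # return result
    if hasRet
        return out_array
    end
end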

From here you can see that each thread is reused multiple times, and this is where the problem emerges.
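For comparison, here is a minimal sketch of another way to cap how much work is in flight at once: spawn one task per item, but gate the tasks with a semaphore. The names `bounded_map` and `max_tasks` are hypothetical and only for illustration; `Base.Semaphore`, `Base.acquire`, `Base.release` and `Threads.@spawn` are standard Julia. This is not the util above, just one possible alternative under the assumption that limiting concurrent tasks is enough to limit peak memory.

```julia
using Base.Threads: @spawn

# Hypothetical helper: run func on every item, with at most
# max_tasks invocations running at the same time.
function bounded_map(func, items; max_tasks = 2)
    sem = Base.Semaphore(max_tasks)
    tasks = map(items) do item
        # Block here until one of the max_tasks slots is free.
        Base.acquire(sem)
        @spawn try
            func(item)
        finally
            # Always give the slot back, even if func throws.
            Base.release(sem)
        end
    end
    # Collect results in input order.
    return fetch.(tasks)
end

squares = bounded_map(x -> x^2, 1:10; max_tasks = 3)
```

Changing `max_tasks` between calls gives the same kind of dynamic control as `num_thread`, without tying the limit to the number of Julia threads.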

It looks like there is a memory fragmentation problem, even though I tried adding GC.safepoint() in my utility function as well as in my actual computation functions. Memory usage seems to grow over time and eventually leads to an out-of-memory error. I am using CentOS 7.6 with default kernel settings. It looks to me as if the growth comes from memory fragmentation caused by many allocations and deallocations within my computation function (I am also using some third-party libraries, so this is not avoidable).

I get this kind of error in the kernel log:

[2892174.756628] Out of memory: Kill process 77480 (julia) score 600 or sacrifice child
[2892174.756631] Killed process 77480 (julia) total-vm:149233800kB, anon-rss:78731484kB, file-rss:4kB, shmem-rss:1756kB
[2892174.851900] julia: page allocation failure: order:0, mode:0x280da
[2892174.851905] CPU: 5 PID: 77480 Comm: julia Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-957.21.3.el7.x86_64 #1
[2892174.851907] Hardware name: Gigabyte Technology Co., Ltd. X299 AORUS MASTER/X299 AORUS MASTER, BIOS F2 11/05/2018
[2892174.851909] Call Trace:
[2892174.851915]  [<ffffffff88f63107>] dump_stack+0x19/0x1b

I have even tried changing the MMAP thresholds, but that doesn’t help.

I have tried doing something similar with Distributed, where every time I need a worker I use addprocs() to add a process and rmprocs() to remove it once the task is finished. In that case, since the work runs in a separate process and the memory becomes clean again after each function completes, I don’t get the memory error. To me that shows my machine has enough memory for all this work. It is just that the garbage collector does not clean up all the memory over time.
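The addprocs()/rmprocs() pattern described here could be sketched roughly as follows. The helper name `run_in_fresh_process` is hypothetical; `addprocs`, `remotecall_fetch` and `rmprocs` are from the Distributed standard library. The sketch assumes the function passed in is available on the new worker (true for Base functions like `sum`; your own functions would need `@everywhere` or to be loaded on the worker).

```julia
using Distributed

# Hypothetical helper: run f(args...) in a brand-new worker process and
# remove that process afterwards, so all of its memory goes back to the OS.
function run_in_fresh_process(f, args...)
    pid = first(addprocs(1))
    try
        return remotecall_fetch(f, pid, args...)
    finally
        rmprocs(pid)
    end
end

result = run_in_fresh_process(sum, 1:100)
```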

I am not an expert on garbage collectors, but I think the Julia GC neither copies nor moves objects. It just marks the objects that are no longer in use and releases them. If the memory was allocated through MMAP, I think it will always be released back to the system. If it was not allocated through MMAP, then our only hope is that it sits at the top of the heap so that it can be trimmed. (I am not sure my understanding is right.)
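One way to test the "can it be trimmed" idea is to ask the C allocator explicitly to return free heap memory to the OS after a full collection. The sketch below assumes Linux with glibc, where `malloc_trim` is available; the helper name `trim_heap` is hypothetical. It is a diagnostic experiment, not a confirmed fix for the problem described here.

```julia
# Hypothetical helper, assuming Linux with glibc.
# Runs a full GC, then asks glibc to release free heap memory to the OS.
# Returns true if glibc reports that some memory was released.
function trim_heap()
    Sys.islinux() || return false
    GC.gc()
    # int malloc_trim(size_t pad); returns 1 if memory was released, else 0
    return ccall(:malloc_trim, Cint, (Csize_t,), 0) == 1
end

released = trim_heap()
```

If resident memory drops noticeably after calling this, that would point at free memory being held inside the allocator rather than at live objects.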

I guess I can try to make an MRE that illustrates the problem with random Vector allocation and deallocation of different sizes.

I think Julia is a wonderful language, but if we want it to make a significant contribution to the scientific world, this kind of problem has to be tackled. I would like to contribute to solving it.
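A first attempt at such an MRE might look like the sketch below. Every name and size here (`churn`, `run_mre`, `slots`, `max_len`, `n_rounds`) is a hypothetical choice for illustration, and it is not known whether this workload actually reproduces the growth seen in the real computation. `Sys.maxrss()` reports the peak resident set size of the process in bytes.

```julia
using Base.Threads: @threads, nthreads
using Random

# Keep a pool of live vectors and keep replacing random slots with
# vectors of random length, so old ones become garbage at mixed sizes.
function churn(n_rounds; slots = 64, max_len = 10_000, seed = 1)
    rng = MersenneTwister(seed)
    live = [zeros(rand(rng, 1:max_len)) for _ in 1:slots]
    for _ in 1:n_rounds
        live[rand(rng, 1:slots)] = zeros(rand(rng, 1:max_len))
    end
    return sum(length, live)
end

# Run the churn on several threads at once and report peak RSS.
function run_mre(; n_workers = nthreads(), n_rounds = 10_000)
    totals = zeros(Int, n_workers)
    @threads for w in 1:n_workers
        totals[w] = churn(n_rounds; seed = w)
    end
    GC.gc()
    return totals, Sys.maxrss()
end

totals, rss = run_mre(n_rounds = 1_000)
println("peak RSS (bytes): ", rss)
```

Calling `run_mre` repeatedly and watching whether the reported peak keeps climbing would be the actual experiment.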

Any thoughts, suggestions?


You aren’t doing that. The number of threads created is always the same. No new threads are created by your function.

That does nothing here. You already have safepoints.

No, that doesn’t prove much. The closest comparison is to force a GC at the equivalent point, i.e. every time you would have called rmprocs, while also making sure the third-party code leaks no memory.
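A rough sketch of that comparison: process the work in batches on threads and force a full collection after each batch, at the point where the Distributed version would have called rmprocs. The names `process_in_batches` and `batch_size` are hypothetical, and this only controls memory managed by Julia's GC; it says nothing about memory leaked by third-party native code.

```julia
# Hypothetical helper: threaded map in batches with a forced full GC
# after every batch.
function process_in_batches(func, items; batch_size = 4)
    results = Vector{Any}(undef, length(items))
    for batch in Iterators.partition(eachindex(items), batch_size)
        Threads.@threads for i in batch
            results[i] = func(items[i])
        end
        # Equivalent point to rmprocs in the Distributed version.
        GC.gc()
    end
    return results
end

out = process_in_batches(x -> x + 1, collect(1:10); batch_size = 3)
```

If memory still grows with this in place, the remaining growth is more likely to come from outside the Julia GC.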