[cluster] Understanding parallel performance. Is GC triggered too early?

Good morning, everyone!
After weeks of lurking in the forum to solve other problems of mine (thanks, it’s a precious resource), I decided to go ahead and ask my own question.
I am facing the following problem: I am writing a function that is particularly heavy and memory intensive. On a single run (or on a few non-parallelized runs) it usually stays below ~300 ms and ~200 MiB, but if I run it ~10k times in parallel, some of its executions go well beyond what’s considered reasonable (sometimes even ~30 seconds and ~12 GiB of allocations for calls that, on their own, would take well under 1 second and less than 200 MiB!).
A sketch of how I parallelize is the following:

@time @sync @distributed for _ in 1:nworkers()  # one task per worker process
    Threads.@threads for i in BIG_RANGE         # multithreaded within each worker
        stats = @timed my_function(array[i])    # here I use DifferentialEquations.jl hundreds of times
        println("(proc: $(myid()),\tthread: $(Threads.threadid()),\tcore: $(glibc_coreid()))\t--> $(pretty_stats(stats))")
    end
end
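(For completeness: `pretty_stats` is just a small formatting helper of mine. A rough sketch of it, built on the NamedTuple that `@timed` returns, looks something like this — the exact rounding/spacing differs slightly from the log below:)

```julia
# Sketch of the `pretty_stats` helper: formats the NamedTuple returned by
# `@timed` (fields: value, time, bytes, gctime, gcstats) into one log line.
function pretty_stats(stats)
    allocs = Base.gc_alloc_count(stats.gcstats)  # number of allocations
    mib = stats.bytes / 2^20                     # bytes allocated, in MiB
    s = "$(round(stats.time, digits = 6)) seconds " *
        "($(allocs) allocations: $(round(mib, digits = 2)) MiB"
    if stats.gctime > 0                          # only report GC time if any
        s *= ", $(round(100 * stats.gctime / stats.time, digits = 2)) % gc time"
    end
    return s * ")"
end
```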

In my log file I get output like this:

(proc: 3,	thread: 17,	core: 16)	-->  0.109603  seconds (874.72  k allocations: 256.30 MiB)
(proc: 3,	thread: 10,	core: 9)	-->  0.100472  seconds (807.37  k allocations: 234.26 MiB)
(proc: 3,	thread: 10,	core: 9)	-->  0.101613  seconds (823.02  k allocations: 239.82 MiB)
(proc: 3,	thread: 17,	core: 16)	-->  0.109390  seconds (868.61  k allocations: 251.80 MiB)
(proc: 2,	thread: 35,	core: 34)	--> 32.768049  seconds (41.95  M allocations: 7.97 GiB, 66.90 % gc time)
(proc: 3,	thread: 10,	core: 9)	-->  0.111960  seconds (872.61  k allocations: 260.20 MiB)
(proc: 3,	thread: 17,	core: 16)	-->  0.113729  seconds (892.02  k allocations: 263.60 MiB)
(proc: 3,	thread: 23,	core: 22)	--> 30.223146  seconds (38.25  M allocations: 7.44 GiB, 68.29 % gc time)
(proc: 2,	thread: 8,	core: 7)	--> 52.040675  seconds (105.42  M allocations: 18.57 GiB, 42.12 % gc time)
(proc: 3,	thread: 23,	core: 22)	-->  0.084965  seconds (919.86  k allocations: 141.25 MiB)
(proc: 2,	thread: 8,	core: 7)	-->  0.074312  seconds (448.49  k allocations: 99.94 MiB)
(proc: 3,	thread: 23,	core: 22)	-->  0.102682  seconds (1.19  M allocations: 167.60 MiB)
(proc: 2,	thread: 8,	core: 7)	-->  0.071680  seconds (878.05  k allocations: 114.30 MiB)
(proc: 3,	thread: 23,	core: 22)	-->  0.094416  seconds (406.47  k allocations: 121.14 MiB)

I initially thought it was due to the GC freeing some memory or something like that, but that doesn’t seem to be the case.
Looking at the cluster’s node during my computation I don’t see high memory usage (compared to what’s available), and I really don’t know what to do next… I’ve already optimized the code of the basic function as well as I could without rewriting it completely, but that’s where I’m stuck.
Do you have any idea? :\


P.S.: the stacked core utilisation tops out at 50% because I chose to disable hyperthreading.

P.P.S.: I am starting to change my mind again about the Garbage Collector. It could well be interfering with the computation times — and thank goodness, otherwise I’d probably have to go to Stack Overflow. The problem is that memory usage, as you can see from the graph above, always stays well below 50 GiB of the 375 GiB available per node. Can I somehow set a higher threshold at which the GC starts?
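To frame the question, here are the two workarounds I’ve been considering (a sketch, assuming Julia ≥ 1.9 for the `--heap-size-hint` startup flag; `GC.enable`/`GC.gc` are in Base). Is either of these the right approach?

```julia
# Option 1 (Julia >= 1.9): hint a larger heap at startup, so collections
# trigger later. For the main process:
#     julia --heap-size-hint=100G script.jl
# and for worker processes, forward the flag through addprocs:
#     using Distributed
#     addprocs(2; exeflags = "--heap-size-hint=100G")

# Option 2: suspend the GC around an allocation-heavy region and collect
# manually at a point of our choosing.
function run_without_gc(f)
    GC.enable(false)    # suspend collections (allocated memory still accumulates!)
    try
        return f()
    finally
        GC.enable(true) # re-enable the GC...
        GC.gc()         # ...and trigger one full collection ourselves
    end
end

result = run_without_gc(() -> sum(rand(10^6)))  # stand-in for my heavy call
```

Option 2 obviously trades latency for peak memory, which should be fine here given how much headroom the nodes have.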