Good morning, everyone!
After weeks of lurking on the forum to solve my other problems (thanks, it's a precious resource) I decided to go ahead and ask my own question.
I am facing the following problem: I am writing a function that is particularly heavy and memory intensive. On a single run (or on a few non-parallelized runs) it usually stays below 300 ms and 200 MiB, but if I run it ~10k times in parallel, some of its executions go well above what's considered reasonable (sometimes even ~30 seconds and ~12 GiB of memory allocation for calls that, on their own, would take well under 1 second and less than 200 MiB!).
A simplified version of how I parallelize is the following:
using Distributed   # provides @distributed, nworkers(), myid()

@time @sync @distributed for w in 1:nworkers()    # one task per worker process
    Threads.@threads for i in BIG_RANGE
        # my_function uses DifferentialEquations.jl hundreds of times
        stats = @timed my_function(array[i])
        # glibc_coreid() and pretty_stats() are small logging helpers defined elsewhere
        println("(proc: $(myid()),\tthread: $(Threads.threadid()),\tcore: $(glibc_coreid()))\t--> $(pretty_stats(stats))")
    end
end
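For reference, pretty_stats() and glibc_coreid() above are just small logging helpers of mine; pretty_stats() is roughly equivalent to the sketch below, which formats the NamedTuple returned by @timed (my real one is a bit longer, but the gctime field is also where the "% gc time" in the log comes from):

using Printf

# Rough sketch of the pretty_stats helper: formats the NamedTuple returned by @timed.
# Base.gc_alloc_count and Base.format_bytes are the (internal) helpers @time itself uses.
function pretty_stats(stats)
    allocs = Base.gc_alloc_count(stats.gcstats)   # total number of allocations
    mem    = Base.format_bytes(stats.bytes)       # e.g. "256.30 MiB"
    out    = @sprintf("%.6f seconds (%d allocations: %s", stats.time, allocs, mem)
    if stats.gctime > 0
        out *= @sprintf(", %.2f %% gc time", 100 * stats.gctime / stats.time)
    end
    return out * ")"
end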
In my log file I then get output such as this:
(proc: 3, thread: 17, core: 16) --> 0.109603 seconds (874.72 k allocations: 256.30 MiB)
(proc: 3, thread: 10, core: 9) --> 0.100472 seconds (807.37 k allocations: 234.26 MiB)
(proc: 3, thread: 10, core: 9) --> 0.101613 seconds (823.02 k allocations: 239.82 MiB)
(proc: 3, thread: 17, core: 16) --> 0.109390 seconds (868.61 k allocations: 251.80 MiB)
(proc: 2, thread: 35, core: 34) --> 32.768049 seconds (41.95 M allocations: 7.97 GiB, 66.90 % gc time)
(proc: 3, thread: 10, core: 9) --> 0.111960 seconds (872.61 k allocations: 260.20 MiB)
(proc: 3, thread: 17, core: 16) --> 0.113729 seconds (892.02 k allocations: 263.60 MiB)
(proc: 3, thread: 23, core: 22) --> 30.223146 seconds (38.25 M allocations: 7.44 GiB, 68.29 % gc time)
(proc: 2, thread: 8, core: 7) --> 52.040675 seconds (105.42 M allocations: 18.57 GiB, 42.12 % gc time)
(proc: 3, thread: 23, core: 22) --> 0.084965 seconds (919.86 k allocations: 141.25 MiB)
(proc: 2, thread: 8, core: 7) --> 0.074312 seconds (448.49 k allocations: 99.94 MiB)
(proc: 3, thread: 23, core: 22) --> 0.102682 seconds (1.19 M allocations: 167.60 MiB)
(proc: 2, thread: 8, core: 7) --> 0.071680 seconds (878.05 k allocations: 114.30 MiB)
(proc: 3, thread: 23, core: 22) --> 0.094416 seconds (406.47 k allocations: 121.14 MiB)
I initially thought it was due to the GC freeing memory or something like that, but it doesn't seem to be the case.
Looking at the cluster node during the computation I don't see high memory usage (compared to what's available), and I really don't know what to do next… I have already optimized the code of the underlying function as well as I could without rewriting it completely, but that's where I'm stuck.
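One thing I could still try, to rule the GC in or out more directly, is a minimal sketch like the one below (assuming Julia ≥ 1.8, where GC.enable_logging is available): turn on GC logging on every process before the run, so that long pauses on the workers show up in the log next to the slow calls.

using Distributed

# Make every process report each collection (pause length, memory freed) to stderr.
@everywhere GC.enable_logging(true)

# ... run the distributed/threaded loop from above ...

@everywhere GC.enable_logging(false)   # switch the logging back off afterwards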
Do you have any idea? :\
P.S.: the stacked core utilisation peaks at 50% because I chose to disable hyperthreading.
P.P.S.: I am starting to change my mind again about the garbage collector: it could be that it interferes with the computation times, and thank god, otherwise I would probably have to go to Stack Overflow. The problem is that the memory usage, as you can see from the graph above, always stays well below 50 GiB out of the 375 GiB available per node. Can I somehow set a higher threshold at which the GC kicks in?
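The only candidate I have found so far is the --heap-size-hint flag available since Julia 1.9, which, if I understand correctly, lets the GC aim for a larger heap before collecting aggressively, and it can be forwarded to the workers through the exeflags keyword of addprocs. I am not sure it is the right knob; the sketch below (hint value and thread count made up for illustration) is just what I have in mind:

using Distributed

# Start workers with a generous heap hint so the GC triggers later; the main
# process can be launched with the same flag, e.g. `julia --heap-size-hint=100G`.
addprocs(2; exeflags=`--heap-size-hint=100G --threads=36`)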