[cluster] Understanding parallel performance. Is GC triggered too early?

Good morning, everyone!
After weeks of lurking in the forum to solve other problems of mine (thanks, it’s a precious resource), I decided to go ahead and ask my own question.
I am facing the following problem: I am writing a function that is particularly heavy and memory intensive. On a single run (or on a few non-parallelized runs) it usually stays below ~300 ms and ~200 MiB, but if I run it ~10k times in parallel, some of its executions go well beyond what’s considered reasonable (sometimes even ~30 seconds and ~12 GiB of allocations for calls that, on their own, would take well under 1 second and less than 200 MiB!).
A sketch of how I parallelize is the following:

@time @sync @distributed for _ in 1:nworkers()  # one task per worker process
    Threads.@threads for i in BIG_RANGE         # multithreaded within each worker
        stats = @timed my_function(array[i])    # here I use DifferentialEquations.jl hundreds of times
        println("(proc: $(myid()),\tthread: $(Threads.threadid()),\tcore: $(glibc_coreid()))\t--> $(pretty_stats(stats))")
    end
end
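(For completeness: `pretty_stats` is just a small formatting helper of mine. A rough sketch of it, built on the NamedTuple that `@timed` returns, looks something like this — the exact rounding/spacing differs slightly from the log below:)

```julia
# Sketch of the `pretty_stats` helper: formats the NamedTuple returned by
# `@timed` (fields: value, time, bytes, gctime, gcstats) into one log line.
function pretty_stats(stats)
    allocs = Base.gc_alloc_count(stats.gcstats)  # number of allocations
    mib = stats.bytes / 2^20                     # bytes allocated, in MiB
    s = "$(round(stats.time, digits = 6)) seconds " *
        "($(allocs) allocations: $(round(mib, digits = 2)) MiB"
    if stats.gctime > 0                          # only report GC time if any
        s *= ", $(round(100 * stats.gctime / stats.time, digits = 2)) % gc time"
    end
    return s * ")"
end
```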

In my log file I get output like this:

(proc: 3,	thread: 17,	core: 16)	-->  0.109603  seconds (874.72  k allocations: 256.30 MiB)
(proc: 3,	thread: 10,	core: 9)	-->  0.100472  seconds (807.37  k allocations: 234.26 MiB)
(proc: 3,	thread: 10,	core: 9)	-->  0.101613  seconds (823.02  k allocations: 239.82 MiB)
(proc: 3,	thread: 17,	core: 16)	-->  0.109390  seconds (868.61  k allocations: 251.80 MiB)
(proc: 2,	thread: 35,	core: 34)	--> 32.768049  seconds (41.95  M allocations: 7.97 GiB, 66.90 % gc time)
(proc: 3,	thread: 10,	core: 9)	-->  0.111960  seconds (872.61  k allocations: 260.20 MiB)
(proc: 3,	thread: 17,	core: 16)	-->  0.113729  seconds (892.02  k allocations: 263.60 MiB)
(proc: 3,	thread: 23,	core: 22)	--> 30.223146  seconds (38.25  M allocations: 7.44 GiB, 68.29 % gc time)
(proc: 2,	thread: 8,	core: 7)	--> 52.040675  seconds (105.42  M allocations: 18.57 GiB, 42.12 % gc time)
(proc: 3,	thread: 23,	core: 22)	-->  0.084965  seconds (919.86  k allocations: 141.25 MiB)
(proc: 2,	thread: 8,	core: 7)	-->  0.074312  seconds (448.49  k allocations: 99.94 MiB)
(proc: 3,	thread: 23,	core: 22)	-->  0.102682  seconds (1.19  M allocations: 167.60 MiB)
(proc: 2,	thread: 8,	core: 7)	-->  0.071680  seconds (878.05  k allocations: 114.30 MiB)
(proc: 3,	thread: 23,	core: 22)	-->  0.094416  seconds (406.47  k allocations: 121.14 MiB)

I initially thought it was due to the GC freeing some memory or something like that, but that doesn’t seem to be the case.
Looking at the cluster’s node during my computation I don’t see high memory usage (compared to what’s available), and I really don’t know what to do next… I’ve already optimized the code of the basic function as well as I could without rewriting it completely, but that’s where I’m stuck.
Do you have any idea? :\


P.S.: the stacked core utilisation tops out at 50% because I chose to disable hyperthreading.

P.P.S.: I am starting to change my mind again about the Garbage Collector. It could well be interfering with the computation times — and thank goodness, otherwise I’d probably have to go to Stack Overflow. The problem is that memory usage, as you can see from the graph above, always stays well below 50 GiB of the 375 GiB available per node. Can I somehow set a higher threshold at which the GC starts?
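To frame the question, here are the two workarounds I’ve been considering (a sketch, assuming Julia ≥ 1.9 for the `--heap-size-hint` startup flag; `GC.enable`/`GC.gc` are in Base). Is either of these the right approach?

```julia
# Option 1 (Julia >= 1.9): hint a larger heap at startup, so collections
# trigger later. For the main process:
#     julia --heap-size-hint=100G script.jl
# and for worker processes, forward the flag through addprocs:
#     using Distributed
#     addprocs(2; exeflags = "--heap-size-hint=100G")

# Option 2: suspend the GC around an allocation-heavy region and collect
# manually at a point of our choosing.
function run_without_gc(f)
    GC.enable(false)    # suspend collections (allocated memory still accumulates!)
    try
        return f()
    finally
        GC.enable(true) # re-enable the GC...
        GC.gc()         # ...and trigger one full collection ourselves
    end
end

result = run_without_gc(() -> sum(rand(10^6)))  # stand-in for my heavy call
```

Option 2 obviously trades latency for peak memory, which should be fine here given how much headroom the nodes have.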